Taking AIMS at Digital Design: Analysis, Improvement, Modeling, and Synthesis 3031356047, 9783031356049

This is an introductory textbook for courses in Synchronous Digital Design that enables students to develop useful intui


English Pages 316 [311] Year 2023


Table of contents :
Preface
Contents
1 Introduction
1.1 AIMS Design Flow
1.2 Moore's Law
1.3 VLSI History
1.3.1 Very Large System Integration
1.3.2 Systems on Chip
1.3.3 Multi-Core Chips
1.3.4 The Semiconductor Technology Roadmap
2 CMOS Basics
2.1 MOSFET Transistor
2.2 CMOS Inverter
2.2.1 The Inverter Abstraction
2.2.2 Inverter Performance
2.2.3 Inverter Power Consumption
2.3 CMOS Gates
2.3.1 Arbitrary Boolean Functions with One Output
2.3.2 Transmission Gate
2.3.3 XOR
2.3.4 Tri-state Buffer
2.3.5 Latch and Flip-Flop
2.3.6 Static RAM Cells
Six-Transistor Memory Cell
Four-Transistor Memory Cell
2.4 Historic Development of CMOS Characteristics
2.5 Exercises
3 Modeling and Design Flow Basics
3.1 Design Flow
3.2 Basic Modeling Styles
3.2.1 Multiplexer
3.3 Test Benches
3.3.1 Simple Test Bench
3.3.2 Self-Checking Test Bench
3.3.3 Test Bench with Test Vectors
3.4 Adder Design in AIMS
3.4.1 Half Adder
Technology-Independent Optimization
Mapping onto an iCE40 FPGA
Placement and Routing
Timing Analysis
3.4.2 Full Adder
Simple Optimization
Logic Synthesis
Technology Mapping
Placement, Routing, and Timing Analysis
Power Analysis
3.5 Summary
3.6 Exercises
4 Modeling Latches, Flip-Flops, and Registers
4.1 Latch and Flip-Flop
4.1.1 RS-Latch
4.1.2 D-Latch
4.1.3 D-Flip-Flop
4.2 DFF Decision Window
4.3 Meta-Stability
4.4 DFF Variants
4.5 Verilog Always Block
4.6 Modeling Delays
4.6.1 Inertial and Transport Delay
4.6.2 Delay in Verilog
Delay in Blocking Assignments
Delay in Non-blocking Assignments
Delay in Continuous Assignments
Single Assignment Summary
Multiple Assignments
Recommendation
4.7 Exercises
5 Register Transfer Paradigm
5.1 The Principle of RTL
5.2 Pipelining
5.2.1 Pipelining Scheme
5.2.2 Array Multiplier
Array Multiplier with Input and Output Registers Only
Two-Stage Pipelined Multiplier
n-Stage Pipelined Multiplier
Evaluation
5.2.3 Carry-Save Multiplier
5.2.4 Behavioral Multiplier
5.3 Static Timing Analysis
5.3.1 Setup Time Violations
5.3.2 Hold Time Violations
5.3.3 Clock Skew
5.4 Simultaneous Events in HDL Simulations
5.4.1 Verilog Event Scheduler
5.4.2 VHDL Delta Delay Model
5.5 RTL Coding Style Guidelines for Verilog
5.6 Summary
5.7 Exercises
6 FPGA Architecture
6.1 Counter Design in AIMS
6.1.1 Synchronous Counter
6.1.2 Asynchronous Counter
6.2 The Lattice FPGA Architecture
6.2.1 Programmable Logic Block
Look-Up Table
FlipFlop
Carry Logic
6.2.2 Block RAM
6.2.3 Programmable I/O
6.2.4 Interconnect
6.2.5 Clocks
Phase Locked Loop
Phase Detector
Low Pass Filter
Voltage-Controlled Oscillator
Clock Divider
Clock Generator in the iCE40
6.3 Summary
6.4 Exercises
7 Logic Optimization
7.1 Data Representations
7.1.1 Terminology
7.1.2 Truth Table
7.1.3 Sum of Product
Conjunction
Disjunction
Negation
Satisfiability
Tautology
7.1.4 Product of Sum
Conjunction
Disjunction
Negation
Satisfiability
Tautology
7.1.5 Binary Decision Diagram
Negation
Binary Operations
Satisfiability
Tautology
7.1.6 Conversion
Conversion Between SoP and PoS
Conversion Between Boolean Formula and BDD
7.2 Two-Level Optimization
7.2.1 Karnaugh Diagrams
7.2.2 Terminology
7.2.3 Basic Principle
7.2.4 Don't Cares
7.2.5 Multiple Solutions
7.2.6 Dominating Rows and Columns
7.2.7 Branching Heuristics
7.2.8 Quine-McCluskey Summary
7.3 Multi-Level Optimization
7.3.1 Factoring
7.3.2 Decomposition
7.3.3 Elimination
7.3.4 Algebraic Division
7.4 Technology Mapping
7.5 Summary
7.6 Exercises
8 Datapath
8.1 Tri-state Bus
8.2 One-Bit and n Bit Adder
8.3 n Bit Adder Based on Full Expansion
8.4 n Bit Adder Based on Full Reuse
8.5 n Bit Adder Based on Factoring
8.6 Verilog Models and Implementations
8.6.1 Ripple Carry Adder
8.6.2 Carry Lookahead Adder
8.6.3 Behavioral Model
8.7 Summary
8.8 Exercises
9 Controller
9.1 Sequence Detector Example
9.2 Design Process
9.2.1 FSM Specification
9.2.2 State Space Minimization
9.2.3 State Encoding
9.2.4 Selection of Registers
9.3 Coin Changing Machine Example
9.3.1 Mealy GWS
9.3.2 Moore GWS
9.3.3 Implementation
9.4 Summary
9.5 Exercises
10 Synchronization and Communication
10.1 Clock Domain Crossing
10.1.1 Meta-stability Protection
10.1.2 Pulse Shortener
10.1.3 Pulse Stretcher
10.2 Synchronous Handshake
10.3 Asynchronous Handshake
10.4 Synchronous FIFO
10.5 Asynchronous FIFO
10.5.1 Gray Code
10.5.2 Binary to Gray Code Conversion
10.5.3 Gray Code to Binary Conversion
10.5.4 FIFO Design
10.5.5 Performance
10.6 Performance Evaluation
10.7 Summary
10.8 Exercises
11 Summary
References
Index

Axel Jantsch

Taking AIMS at Digital Design Analysis, Improvement, Modeling, and Synthesis


Axel Jantsch Institute of Computer Technology TU Wien Vienna, Austria

ISBN 978-3-031-35604-9
ISBN 978-3-031-35605-6 (eBook)
https://doi.org/10.1007/978-3-031-35605-6

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland Paper in this product is recyclable.

Preface

This textbook is motivated by an introductory course on digital design that I have been giving at TU Wien for the Embedded Systems master program since 2014. During the first years, I relied on standard textbooks and on-line material, but gradually I realized that something else was needed. Today there is a wealth of material available to the student of digital design. Good textbooks cover all relevant aspects of digital design in depth, and on-line tutorials elaborate on specific problems, design challenges, and peculiarities of the syntax and semantics of design languages and tools. However, since the field is covered so extensively, it is easy for students to get lost, and they have to learn what is important, what is essential, and what are mere details that can be looked up on demand.

Therefore, I have focused on building solid intuitions for core concepts in digital design with the hope that these intuitions will guide the students and future engineers in asking the right questions, finding the right information, and assessing new developments in tools, flows, and design languages. Building the right, enduring intuitions for synchronous design styles, for register transfer level, for synthesis and verification tools, and for all other key concepts in the design and the tool landscape was therefore the main objective in writing this introductory book. Once a good intuition is in place, it is straightforward to learn in depth about any particular question, given the vast amount of on-line material and the many excellent textbooks that are easily available.

Good intuitions are built by an interplay between trying to comprehend the theory and trying things out. Therefore, the book offers an introduction to the CMOS gate, logic optimization, and communication along an iterative exploration of designs. No design ever starts from scratch; rather, design is an iterative cycle of improvements going through the phases of Analysis, Improvement, Modeling, and Synthesis, which I call AIMS. This iterative cyclic activity, where analysis and performance evaluation are as important as architecture innovation and modeling, is so central to the engineering of integrated circuits that students should adopt this style as early as possible. Several chapters of this book try to illustrate it by asking questions like “What impact does a change of modeling style have on performance and structure of my implementation?”, and exploring the answers. Thus, the most benefit will be gained when the book is read from start to end. If a detailed solution to a specific design question is sought, this book is not the place to look it up. But if a good intuition of the role of the CMOS gate in digital design or of synthesis tools in the design flow is desired, the student will hopefully be well rewarded when reading through the relevant chapters.

I would like to thank all the students who gave feedback on earlier versions, helping to improve the book and correct mistakes. In particular, I would like to thank Nefise Ezgi Ozer, who prepared first versions of several of the exercises. Above all, I am utterly grateful to the many curious students who asked interesting questions, because these questions have defined the flow and contents of the book. I appreciate and greatly enjoy this interaction and hope this edition of the book will trigger even more questions and contemplation.

Vienna, Austria
May, 2023

Axel Jantsch


Chapter 1

Introduction

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Jantsch, Taking AIMS at Digital Design, https://doi.org/10.1007/978-3-031-35605-6_1

This book introduces the basics of digital design with emphasis on analysis and continuous improvement of a given solution. In the course of this book, we introduce all the basic elements of digital design starting with the Complementary Metal-Oxide Semiconductor (CMOS) transistor to provide an abstraction upon which everything else is built. A good intuition of the CMOS gate is necessary because it is the foundation for billion-transistor chips and multi-million Euro design projects. The abstraction of the CMOS gate is a component with a certain functionality and a set of assumptions about its timing and electrical behavior. These assumptions are wisely chosen because they hold whatever the context of the component, in whatever way we connect the CMOS gate to other gates. This property, that neither the function nor the assumptions about time and electrical behavior of the gate are affected by the environment of the gate, is fundamental, because it allows us to build chips of any size. It is the reason why we can today design and manufacture billion-transistor chips and perhaps soon chips with a trillion transistors. Chapter 2 introduces the abstraction of the CMOS gate and the genius behind it.

As important as the manufacturing technology are the design methodology and the design process. For Application Specific Integrated Circuits (ASICs), the design flow is of mind-boggling complexity and requires tools that cost hundreds of thousands of Euro. Field Programmable Gate Arrays (FPGAs) need only half the design flow with tools that are often freely available. Chapter 3 gives an overview of the full ASIC flow but focuses on FPGA designs, as does the rest of the book.

Chapters 4 and 5 deepen the discussion of modeling concepts. The Register Transfer Level (RTL) paradigm is the basis of all synchronous digital design, facilitating powerful synthesis and verification tools. The RTL methodology is as important as the CMOS gate abstraction, and it is the design process counterpart to the latter.

Our main vehicle for discussing design principles is the FPGA. Chapter 6 introduces the basics of FPGA architectures. We use a specific FPGA family of devices, the Lattice iCE40, but the main elements and the architecture templates are the same in all FPGAs. More advanced and complex FPGAs use the same basic elements and add more specialized circuits such as Digital Signal Processors (DSPs) and special-purpose accelerators.

Chapter 7 takes a step back and discusses the basic theory of logic optimization. Although designers rarely use this theory directly, it is under the hood of all synthesis tools. In order to obtain an understanding and an intuition of what happens during synthesis, we need a basic understanding of the data structures and algorithms used. Chapter 7 provides this understanding.

The remaining three chapters, Chaps. 8, 9, and 10, introduce the three main design elements: datapath, controller, and communication. Every design consists of these three elements to varying degrees.

Throughout the book, we introduce and use the design of specific components or modeling techniques for illustrating the topic at hand. For instance, in Chap. 8, we study various architectures of n-bit adders, how to model them in Verilog, how synthesis tools interpret and implement these models, and how their performance scales. This aids the understanding of the design process and the interrelation of models, architectures, and synthesis tools. Similarly, all other chapters also exemplify the chapter’s topic with the design and study of specific design components. On the way, we learn a number of techniques and get insight into synthesis tools. Table 1.1 gives an overview of the book’s chapters and what they cover.

Table 1.1 Each chapter introduces and uses important design components, design patterns, and design methods

Chapter                  Component                Method
1  Introduction
2  CMOS basics           Gates                    Delay and power models
3  Modeling basics       Half and full adder      Test bench-based validation
4  Latches and FFs       Flip-flops               Meta-stability, delay modeling
5  RTL                   Pipelining, multiplier   Static timing analysis, event-driven simulation
6  FPGA                  Counter                  Synthesis
7  Logic optimization                             Data representation, optimization algorithms
8  Datapath              n-bit adder              Design space exploration
9  Controller            FSM                      FSM design space exploration
10 Communication         Protocols                Design space exploration

This book is meant as a tutorial, not as a reference. It can be used as a textbook in a course, which was the purpose when writing it, but it will not be very useful to consult for specific questions during a design project. When the main concepts are absorbed and understood, and you are unsure about the semantics of a specific Verilog construct, how to deal with meta-stability in a specific context, or what exactly a logic synthesis tool is doing, it is better to use specific manuals, comprehensive articles, or books about a specific topic or, probably fastest, to search the Internet. In contrast, this book is meant to help the reader to construct good intuitions of important, basic concepts, upon which to build when digging deeper.


Each chapter can be read separately and is to a great extent independent of other chapters, but most will be gained if a chapter is read completely from beginning to end.

1.1 AIMS Design Flow

No design, except the most trivial, is ever created from scratch in one straight progression from conception to implementation. Designing a digital circuit is always an iterative procedure. We start from a pre-existing model, which we either designed ourselves or obtained from somewhere. We analyze its various aspects: its function, its timing characteristics, its power and energy consumption, its resource usage, and many others. Usually, we are not content and improve the model or add to it, be it some small bit or a huge amount of code. While doing so, we keep checking and analyzing the design in an extensive, iterative process until we gradually converge on a design that we proudly hand over to someone else for further processing. We call this iterative design activity AIMS for Analyze-Improve-Model-Synthesize. Its main idea is illustrated in Fig. 1.1. The various chapters of this book introduce these four aspects in quite some detail, but always in the context of the iteration of the AIMS flow. Thus, when we introduce a modeling concept, we follow it through the synthesis step to see how model variants influence the timing and the power consumption and what resources are recruited to implement the modeled behavior. Each chapter exemplifies this general approach by designing and studying specific design components, like arithmetic units, controllers, First In First Outs (FIFOs), etc.

Fig. 1.1 The AIMS design flow (blocks: Required product, Specification, Model, Synthesis, Analysis, OK?, Improve, Finished product)

1.2 Moore’s Law

Before we delve deeper into the subject matter, let’s review the main trends of the last half-century that led us to where we stand today. The history of Integrated Circuits (ICs) began in the late 1950s when Jack Kilby, Robert Noyce, and others developed the first Integrated Circuit (Berlin, 2005; Lécuyer, 2006). In 1958, at Texas Instruments, Jack Kilby developed a circuit with a transistor integrated in the semiconductor germanium. It is called a hybrid IC, because the wires connecting the transistor terminals are not integrated. A few months later, Robert Noyce independently developed a monolithic IC in silicon, with the wires also integrated.

After the initial development of ICs, the scaling potential of this technology soon became apparent. The benefits of integrating more transistors into a single monolithic semiconductor were manifold. Among them were improved reliability due to fewer mechanical connections, reduced latency, and reduced power consumption. Furthermore, cost per transistor went down because the manufacturing cost was approximately constant for a single IC, independent of the number of transistors integrated. In 1965, Gordon E. Moore published his influential paper “Cramming More Components onto Integrated Circuits” (Moore, 1965). At Fairchild Semiconductor, which was founded in 1957, G. Moore observed four generations of IC design and manufacturing and noted the trend that steady progress in manufacturing technology made it possible to integrate twice as many transistors in each generation as in the generation before. Figure 1.2 shows the data points available to G. Moore in 1965 and the forecast he made. His prediction was based on the insight that steady improvement of the purity of the material and control over the geometries of the manufacturing steps would lead to this exponential growth of transistors per unit area. Furthermore, he assumed that no major obstacle to this steady progress was in sight for the next few years.

Fig. 1.2 Plot by Moore (1965) showing the estimated growth of transistors per integrated circuit (x-axis: year, 1958–1975; y-axis: log2 of number of components, 0–16)

Time proved him correct, and, past his and anybody’s expectations, no showstopper to this steady progress appeared during the next 55 years up to now. Figure 1.3 shows the transistor count of selected processor ICs up to 2022. In the first decade, the transistor count doubled every year; in the 1970s, the rate slowed to a doubling every 2 years. Over the whole period since 1959, the transistor count doubled every 1.75 years.

Fig. 1.3 Transistor count of processor ICs 1959–2022. The data is from https://en.wikipedia.org/wiki/Transistor_count, except the first data points until 1965, which are from Gordon Moore’s paper. As reference, the line shows the doubling of transistors every 1.75 years (21 months) starting in 1959

We have to keep in mind though that G. Moore formulated only a capacity trend, the exponential growth of the number of transistors in an IC. Later on, it was observed that performance in terms of transistor delay and chip frequency also improved exponentially. The same was true for power consumption per transistor, which conveniently fell at the same rate as transistor count increased, keeping the power density constant. However, all these other trends have been broken at some point, leaving only the capacity law still intact. It is worth taking a short tour through the history of the last 60 years to understand the major trends, not only of IC manufacturing technology but also of the accompanying development of design tools and methodologies.
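As a rough sanity check of the doubling rate quoted in this section, the following Python sketch computes the growth factor implied by one doubling every 1.75 years between 1959 and 2022. The function names are illustrative, and starting from a single component in 1959 is an assumption made only to keep the arithmetic simple.

```python
import math

def growth_factor(years, period_years=1.75):
    """Factor by which the transistor count grows over `years`
    if it doubles once every `period_years`."""
    return 2.0 ** (years / period_years)

def implied_doubling_period(count_start, count_end, years):
    """Doubling period (in years) implied by two transistor counts."""
    return years / math.log2(count_end / count_start)

span = 2022 - 1959                 # 63 years
doublings = span / 1.75            # number of doublings in that span
factor = growth_factor(span)       # 2**36

print(f"{doublings:.0f} doublings, factor {factor:.2e}")
# prints "36 doublings, factor 6.87e+10"
```

Starting from a single component, 36 doublings give roughly 7 × 10^10 transistors, which is the order of magnitude at the top of the scale in Fig. 1.3.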


1.3 VLSI History

The continuous increase of the number of transistors per chip caused disruptions every decade or so. Broadly speaking, we see two types of implications: First, scaling of transistor size has not been aligned with scaling in delay and power consumption. Second, design methodologies and tools have not scaled easily and therefore had to change radically when designs entered new territories of unprecedented complexity. We take a closer look at three inflection points in the history of IC design: the Mead-Conway revolution in the 1980s, the emergence of Systems on Chip (SoCs) in the 2000s, and the multi-core revolution in the 2010s.

1.3.1 Very Large System Integration

1980: 20,000 transistors

Twenty years after the commencement of the IC integration race, 20,000 transistors could be packed onto one silicon die. Up to that time, the layout was manually arranged with the help of layout editing tools, and each and every transistor had to be manually placed and routed. Designers were responsible not only for the functionality but also for all electrical aspects, the VDD and VSS distribution network, the power delivery, the geometrical design rules, and the delay on all paths, which was affected not only by the length of connections but also by the activities in neighboring wires. By the end of the 1970s, this burden became overwhelming, and the productivity of designers could in no way keep pace with the increase of available transistors.

The solution was to structure the layout, automate the layout generation, and develop tools for design rule checking and estimation of delay and electrical properties. This approach, which became known as standard cell design, was pioneered by Carver Mead at Caltech and Lynn Conway at Xerox PARC and MIT. In 1980, they together published the book Introduction to VLSI Systems, which sparked a new design methodology in university courses around the world and then revolutionized the industrial practice. The main idea was to use only gates with standardized layout and arrange them in rows of standard height separated by routing channels. All permitted logic gates are collected in a cell library. Each gate, called a cell, has the same height with VDD lines on top and VSS lines at the bottom. Cells have different widths to accommodate different functionalities and different numbers of inputs and outputs. Figure 1.4 shows the layout of three such gates, a NAND gate, a NOR gate, and an inverter. When they are placed beside each other in a row of the standard cell chip, their power lines VDD and VSS are connected by abutment and become part of the power delivery rails. In a standard cell design, the rows with logic gates are separated from each other by rows reserved for interconnect wires, as illustrated in Fig. 1.5b.

Fig. 1.4 Standard cell gates have equal height and varying widths. When connected by abutment into one row, VDD and VSS are connected into power supply rails. (a) NAND gate. (b) NOR gate. (c) Inverter (layers shown: Poly, N Diffusion, P Diffusion, N-Well, Contact, Metal1)

The cell library contains all required information about each cell. Specifically, it contains the logic function, the cell width, the geometric locations of all inputs and outputs, and the delays from inputs to outputs. Based on the schematic netlist provided by the designer (Fig. 1.5a), a placement and routing tool can generate the layout. First, all gates from the schematic netlist are placed in one of the rows. Then the routing tool connects the outputs of cells to the inputs of other cells using the routing channels. A connection between two different routing channels can be established either by using empty logic cells, which are reserved for wiring, or by routing on top of the logic cells in a different metal interconnect layer.

Fig. 1.5 A schematic netlist is transformed into a standard cell layout. Placement and routing are the main steps. The gates are placed in the rows for logic, which are separated by routing channels. After placement of the logic gates, the interconnect between the gates is generated. (a) Schematic netlist of gates. (b) Standard cell layout after gate placement and routing. The wires above the cells are not shown

By way of automation, the standard cell approach simplified the design of ICs significantly. Designers could focus on the functionality of the design using schematic entry tools. Tools would take care of the generation of the layout data and make sure that all the geometric and electric design rules were adhered to. The productivity of designers was boosted from a few transistors to a few thousand transistors per day, which was badly needed to keep up with the continuously increasing transistor count.

As the back-end design flow regarding placement, routing, and layout generation was increasingly automated, the problem of logic optimization came into focus. Designers could quickly write down the desired functionality in terms of a netlist of gates or a set of Boolean equations, but finding an optimum solution with respect to the number of required gates and the computation delay is a hard problem, called logic optimization. The fact that a netlist of logic gates can be considered equivalent to a set of Boolean equations had been established 40 years earlier by Claude Shannon in his master’s thesis (Shannon, 1940). Shannon’s work made it possible to use the mathematical theory of truth values, which had been developed by Boole and others in the second half of the nineteenth century (Boole, 2017). This theory, which is now known as the theory of Boolean equations, came in handy for formulating the optimization problem that digital designers faced in the 1970s and 1980s. As a result, logic optimization algorithms, together with novel data structures, were developed and implemented in tools. A milestone book that marked the fact that this problem had essentially been solved is Logic Minimization Algorithms for VLSI Synthesis by Brayton et al. (1984). At about the same time and partially based on the work described in this book, the company Synopsys was founded, which dominated the logic synthesis business for many years to come.

Having tools available that took care of finding optimal implementations of a given specification expressed as logic functions facilitated the design and fabrication of IC devices with ever more transistors. In addition to logic synthesis, placement, and routing tools, a large set of other algorithms and tools were developed for functional simulation; formal verification; equivalence checking; design rule checking; circuit, electrical, and thermal simulation; test pattern generation; and many other purposes. The rapid development of design methodologies and tools made it possible for a reasonably small number of engineers to design and implement increasingly complex ICs, of which surprisingly many worked correctly and as expected after they came back from fabrication. Since team sizes could not grow exponentially to match transistor count growth, the productivity of engineers increased dramatically, aided by all these tools.

1.3.2 Systems on Chip

2000: 20,000,000 transistors

By the end of the century, designs with several million transistors could be squeezed into one silicon die, which continued to put strain on designers to increase productivity. However, the nature of the challenges started to change compared to the time of the VLSI revolution. While placement, routing, and logic optimization algorithms could be applied uniformly to the whole design, no such general tool was in sight at the end of the 1990s because the ICs became more heterogeneous. Researchers developed high-level synthesis algorithms, which synthesized a sequential algorithm into a parallel hardware implementation, but this turned out to be useful only for a small part of the design, the datapath. The controller parts required Finite State Machine (FSM) optimization, and the global problem of minimizing data storage and movement was not amenable to a single, well-defined algorithm but required handling many different situations and constraints. Moreover, analog parts also had to be implemented in the same silicon die, which raised challenges of electric noise and protection between the very different designs on the same chip.

It began to dawn on researchers and engineers that the recipe for progress, which had been successful during the second half of the twentieth century, would not scale much further. The dominant idea had been to move the design entry to a higher level of abstraction and then have tools generate all the implementation details. The progression from designing the layout to structural netlists to Boolean equations and sequential algorithms had followed this idea, but it seemed to have come to a dead end by 2000. Gradually, an alternative approach gained traction: design by reuse. New designs are built with components from the previous generation. The designer’s job is now to make sure that the new system, composed of well-proven parts, works correctly. The book Surviving the SoC Revolution by Chang et al. (1999) marked the emergence of this new approach under the term platform-based design. SoCs, as these designs were named at that time, consisted of parts like processor cores, DSP cores, caches, bus systems, protocols for external communication, analog components for wired and wireless interfaces, etc. Figure 1.6 shows a typical block diagram of a SoC that included many different components, some of them analog designs. A generation earlier, all these components occupied an IC of their own, but now they were integrated in one SoC.

Fig. 1.6 SoC block diagram with an ARM processor, an AHS, an APB, and a range of reused and custom blocks including analog designs (based on a block diagram by Colin M.L. Burnett in 2007)

This reuse-based approach was a big change in methodology and marked the end of one era and the beginning of a new era in IC design. Before, from 1960 to 2000, ICs were implemented from scratch based on a single specification; after that, ICs are assembled from well-proven components, and the task of the system designer is to select the right components and make sure they work with each other. Before the SoC revolution, reuse also took place and was in fact very common, but it was informal and unsystematic. Now it became a central part of the design methodology.

1.3.3 Multi-Core Chips

2010: 400,000,000 transistors

Worryingly, around 2000, other trends seemed to run out of steam. For a long time, it was well understood that performance parameters for transistors and wires scale differently as technology feature sizes shrink. When transistors shrink, they become smaller and faster and consume less power. This is an engineer’s paradise because no trade-offs have to be contemplated. Making a transistor smaller makes everything better. At first sight, that also seems to be true for wires. The delay and the power consumption of a wire are determined by its capacitance and resistance.


Given a wire material, both properties depend on the wire geometry. Making a wire shorter decreases both capacitance and resistance. However, making it thinner is more ambiguous. A thinner wire has less capacitance but higher resistance. It turns out that, if a wire’s cross section is reduced by half, the delay of the electrical signal does not change. So making wires shorter is great, but making them thinner only reduces the required area and does not improve performance. During the first 40 years of IC scaling, this phenomenon was not an issue, because the transistors dominated IC performance. But by 2000, it became apparent that these different scaling trends would pose a challenge. Not all wires could be made shorter, and researchers started to distinguish local and global wires. Local wires connect transistors and gates locally. They also get shorter as transistors shrink. Global wires, however, have to connect the different components of the SoC across the chip, and dies did not become smaller over the decades; on the contrary, dies have continuously grown from a few mm² to a few cm². Consequently, global wires were as fast (or slow) as they were in 1970.

In his Ph.D. thesis in 2003, Ron Ho from Stanford University analyzed this trend thoroughly, with a main conclusion summarized in Fig. 1.7. It shows that the distance reachable within one clock cycle (blue box) would shrink dramatically as technology scales from 180 nm down to 18 nm. The clock period is assumed to be 16 FO4 delays, where 1 FO4 is the delay of one inverter driving four equally sized inverters at its output. FO4 or FO3 delays are often used to denote a technology-independent delay. 1 FO4 delay is assumed to be 90 ps in the 180 nm technology and decreases to 9 ps in the 18 nm node. The distance corresponding to 16 FO4, or 1 clock cycle, is 17.6 mm in the 180 nm node and decreases to 1.5 mm in the 18 nm node. Ron Ho concluded that within a few years, global signals could not be sent across the chip within one clock cycle anymore, unless the clock frequency was proportionally decreased. Thus, either a significant performance loss was accepted or the chip would break up into multiple clock domains. Chip-global synchronization would not be possible anymore. This was deeply worrying because a major part of the performance gain between 1960 and 2000 came from an ever-increasing clock frequency.

Fig. 1.7 The area reachable within one clock cycle compared to the size of the chip decreases with each technology generation. The gray box represents a chip of size 1.76 cm × 1.76 cm in each technology. The yellow box is a design with approximately 30 M gates which becomes smaller and smaller as technology shrinks. The blue bounded box is the distance from the chip center that can be reached within one clock cycle, when the clock period is assumed to be 16 FO4 delays. Routing is done along a Manhattan grid, motivating the shape of the blue box (from Ho, 2003) (panels: 180 nm, 130 nm, 100 nm, 70 nm, 50 nm, 35 nm, 25 nm, 18 nm)

Even worse, in the first decade of this century, it also became clear that further increasing the clock frequency was hitting a power wall. Everything else equal, increasing the frequency means increasing power density and power consumption. Until then, this had successfully been compensated by decreasing the voltage. Power consumption is proportional to the square of the voltage, which means that reducing the voltage from 10 V in 1960 to 1 V in 2010 has to a great extent neutralized the increase in frequency. However, the voltage decrease leveled off during the first decade after 2000. This has resulted in a leveling off of the frequency increase, illustrated in Fig. 1.8.

Fig. 1.8 The increase of frequency and power consumption is leveling off between 2000 and 2022. (Data from Rupp, 2022) (x-axis: year, 1970–2020; y-axis: frequency in MHz and power in W, logarithmic from 0.1 to 10000)

Putting everything together, this meant that in the beginning of the 2000s, we had the following situation: (i) the number of transistors continued to follow Moore’s law; (ii) global synchrony on chip was not viable anymore; and (iii) frequency no longer increased. As a consequence, the computation on chip became parallel, slowly and gradually at first and then massively parallel at the end of the 2010s. During the 2000s, Network on Chip (NoC) technology was developed, a much more scalable communication scheme than the traditional bus-based communication. Once scalable on-chip communication with high bandwidth and reasonable latency was available, the additional transistors provided by Moore’s law were invested in additional processing units. At first, it was unclear if applications could make use of the massive on-chip parallelism, because many traditional applications from word processing to event-driven simulation to web browsing seemed hard to parallelize. However, gaming, graphics processing, video processing, and machine learning turned out to be able to absorb any level of parallelism that is on offer. Thus, today’s Graphics Processing Units (GPUs) and machine learning accelerators exhibit hundreds of relatively simple and partially specialized processing elements, which can make use of the vast number of transistors afforded by IC manufacturing technology without the need for global synchrony or ever-increasing clock frequency. A nice summary of the trends for microprocessor ICs was prepared by Rupp (2022) and is shown in Fig. 1.9.

Fig. 1.9 Trends for key characteristics for microprocessor ICs between 1970 and 2021. (Data from Rupp, 2022)
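Two of the back-of-the-envelope arguments in this section can be checked numerically. The Python sketch below uses the standard first-order model of switching power, P = activity · C · V² · f, which the text only states as "proportional to the square of the voltage". The capacitance and the two frequencies are hypothetical values chosen for illustration, and the FO4 delays are taken to be in picoseconds.

```python
import math

def dynamic_power(c_farad, v_volt, f_hertz, activity=1.0):
    """First-order switching power: P = activity * C * V^2 * f."""
    return activity * c_farad * v_volt ** 2 * f_hertz

def clock_frequency_hz(fo4_seconds, fo4_per_cycle=16):
    """Clock frequency when one cycle lasts `fo4_per_cycle` FO4 delays."""
    return 1.0 / (fo4_per_cycle * fo4_seconds)

C = 1e-12  # switched capacitance in farads (hypothetical value)

# Lowering the supply from 10 V to 1 V cuts power by 10^2 = 100x,
# which offsets a 100x higher clock frequency at the same capacitance.
p_old = dynamic_power(C, 10.0, 1e6)    # 10 V at 1 MHz (illustrative)
p_new = dynamic_power(C, 1.0, 100e6)   # 1 V at 100 MHz (illustrative)
print(math.isclose(p_old, p_new))      # prints True

# Clock rates implied by a period of 16 FO4 delays
f_180nm = clock_frequency_hz(90e-12)   # FO4 = 90 ps -> period 1.44 ns
f_18nm = clock_frequency_hz(9e-12)     # FO4 = 9 ps  -> period 144 ps
print(round(f_180nm / 1e6), "MHz;", round(f_18nm / 1e9, 1), "GHz")
```

Once the voltage stops falling, the V² term no longer absorbs any further increase in f, which is the power wall described above.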

1.3.4 The Semiconductor Technology Roadmap

By the 1990s, the semiconductor industry had become diverse and an important international player. Designing and manufacturing an IC requires investment in the manufacturing site, in highly specialized manufacturing tools for processing the silicon material, for the masks, for lithography, for wafer and mask handling, for the chemical processes in the manufacturing line, for test equipment, for bonding and packaging, and for many other steps. Since many different companies had specialized in some part of the overall process, there was a growing need for an industry-wide roadmap, which would set expectations about what kind of technology would be needed in 2, 5, or 7 years. Hence, in the 1990s, national roadmaps were written in the USA, Europe, Japan, South Korea, and Taiwan. Under the coordination of the Semiconductor Industry Association (SIA), a first international roadmap was produced in 1998, which came to be known as the International Technology Roadmap for Semiconductors (ITRS). Until 2016, the SIA published a new edition of the roadmap every even year, with updates in between. Under the umbrella of IEEE, a successor organization was founded which has published the International Roadmap for Devices and Systems (IRDS) since then. The goals of these roadmaps are (IEEE, 2016):
• Identifying key trends related to devices, systems, and all related technologies by generating a roadmap with a 15-year horizon
• Determining generic devices and systems’ needs, challenges, potential solutions, and opportunities for innovation
• Encouraging related activities worldwide through collaborative events such as related IEEE conferences and roadmap workshops

The IRDS documents offer insight into how the experts of the field perceive the technology development and what is to be expected in the coming years. Figure 1.10 shows how the transistor structure is expected to evolve. The current FinFET device is shown in Fig. 1.10b. Gradually, gate-all-around (GAA) structures (illustrated in Fig. 1.10c) become dominant, because when the channel is completely surrounded by the gate, the control over the channel is much better, which helps to avoid many undesired effects of parasitic capacitance and leakage currents. To increase the device density further, transistors will be vertically stacked (Fig. 1.10d).

Fig. 1.10 The evolution of the CMOS transistor structure as expected in the IRDS (2021). (a) Planar transistor, –2015. (b) FinFET, 2011–2025. (c) Lateral GAA, 2022–2034. (d) Stacked GAA devices, 2030–2034

Table 1.2 lists key figures of the predictions from the IRDS (2021) edition. The node labels in terms of minimal feature size, which have been used since the 1980s, do not really make much sense today, but they are kept as labels because everybody in the community and many outside have become used to them. For instance, the state-of-the-art 3 nm node exhibits a minimal gate-to-gate pitch of 51 nm and an M0 pitch of 30 nm. M0 is the metal wiring layer closest to the transistor and has the smallest pitch. The metal layers above have wider wires and larger pitches. The wiring layers highest up in the wiring stack are the widest, perhaps double the size of M0, and usually host the VDD and VSS power supply.


Table 1.2 Projection of key technology parameters until 2034, from IRDS (2021). GxxMyy/Tz notation refers to xx nm contacted gate pitch, yy nm tightest metal pitch, and z number of tiers

As we can see from the table, it is hard to shrink the transistors further in the horizontal plane. Thus, researchers have started to stack the devices on top of each other to further increase density and the integration level. Advanced transistor architectures together with advances in materials continue to improve the electrical properties. In particular, wrapping the gate around the channel gives much better control over the channel, which brings a number of benefits. Therefore, we can continue to expect voltage, delay, and energy usage to decrease slowly. In summary, the roadmap expects the following improvements over the next 10 years:
• Performance: >30% more maximum operating frequency at constant energy
• Power: >50% less energy per switching at a given performance
• Area: >50% area reduction
• Cost:

Fig. 2.3 The transistor is open, and the channel between S and D does conduct because there is a sufficient voltage between the gate and bulk

Table 2.1 The conditions for NMOS and PMOS transistors to connect the source and drain

          NMOS         PMOS
Open      VGS > VT     VGS < −VT
Closed    VGS ≤ VT     VGS ≥ −VT

is applied, VGS > VT, negative charge accumulates below the gate and forms a conducting region together with the n+ areas at the source and drain (Fig. 2.3). Thus, the transistor is open and allows a current to flow between the drain and source, if a drain-source voltage is applied. Similarly, the PMOS transistor opens when a negative gate-source voltage VGS < −VT is applied by attracting positive charges (holes) to the channel below the gate. These conditions are summarized in Table 2.1.

2.2 CMOS Inverter

The salient feature of CMOS circuits is the mirror-like combination of NMOS and PMOS transistors, thus forming CMOS gates. The simplest of these gates is the inverter, which is illustrated in Fig. 2.4. NMOS transistors are placed in the slightly positively doped substrate, while PMOS transistors are placed in a well with slight negative doping.

Fig. 2.4 An NMOS and a PMOS transistor connected to form a CMOS inverter. Ln and Lp denote the channel lengths, and Wn and Wp denote the channel widths of the two transistors

Let’s assume that the ground voltage level GND = 0 V and the supply voltage is positive and VDD > VT. Consider the case when a positive voltage VDD > VT is applied to the input of the inverter. The NMOS transistor opens because of VGS = VDD > VT, thus connecting its drain terminal to its source and pulling the inverter output to GND. Conversely, at the PMOS transistor, there is no voltage between its gate and source since both are at VDD, and we have VGS = VDD − VDD = 0 V ≥ −VT (compare with Table 2.1), so the PMOS transistor is closed. Now consider the case that GND is applied to the inverter input. The NMOS transistor closes because there is no voltage difference between the gate and the source, thus disconnecting the output from GND. At the PMOS transistor, we have a negative gate-source voltage VGS = GND − VDD < −VT, which opens the transistor, allowing for a connection between VDD and its drain, thus pulling the inverter output to VDD.

Figure 2.5 shows the circuit diagram, where the output load is indicated as a capacitor. The right-hand graph shows the output voltage and the current through the inverter as a function of the input voltage. At the point Vin = 0 V, the output is pulled up to VDD by the PMOS transistor, and no current is flowing through the inverter because the NMOS transistor is closed. When the input voltage increases, the NMOS transistor gradually opens, and the PMOS transistor gradually closes until the point Vin = VDD, where again no current flows. However, in the region in between (VT < Vin < VDD − VT), both transistors are partially open, an electric current flows, and power is consumed. The magnitude of power consumption depends on the characteristics of the two transistors, on the length of the transition period, and also on the load C at the output. C represents the wires and the inputs of connected gates driven by this inverter. C has to be charged or discharged during each transition. Figure 2.5 also depicts the regions where the transistors are conducting and where they are closed. The conditions for these states are derived from the conditions for the NMOS and PMOS transistors in Table 2.1 based on the observation that the NMOS VGS = Vin and the PMOS VGS = Vin − VDD. For the inverter, these conditions are summarized in Table 2.2.
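The derivation above can be traced in a few lines of Python. This is only a sketch of the threshold conditions from Table 2.1 applied to the inverter; it treats the transistors as ideal switches and ignores the gradual opening and closing described in the text. The function names and the numeric values VDD = 1.8 V and VT = 0.5 V are assumptions chosen for illustration. As in this chapter, "open" means conducting.

```python
def nmos_open(v_gs, v_t):
    """NMOS connects source and drain when VGS > VT (Table 2.1)."""
    return v_gs > v_t

def pmos_open(v_gs, v_t):
    """PMOS connects source and drain when VGS < -VT (Table 2.1)."""
    return v_gs < -v_t

def inverter_state(v_in, v_dd, v_t):
    """Classify the inverter for a given input voltage."""
    n = nmos_open(v_in, v_t)          # NMOS: VGS = Vin
    p = pmos_open(v_in - v_dd, v_t)   # PMOS: VGS = Vin - VDD
    if n and p:
        return "both open: current flows through the inverter"
    if p:
        return "PMOS open: output pulled to VDD"
    if n:
        return "NMOS open: output pulled to GND"
    # Reached only if VDD <= 2*VT, when the two regions do not overlap
    return "both closed"

V_DD, V_T = 1.8, 0.5   # hypothetical supply and threshold voltages
for v_in in (0.0, 0.9, 1.8):
    print(f"Vin = {v_in} V: {inverter_state(v_in, V_DD, V_T)}")
```

For these values, the input sweep visits the three regions marked in Fig. 2.5: PMOS only, both transistors, and NMOS only.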

Fig. 2.5 The characteristics of the CMOS inverter. The output voltage and the current are shown as a function of the input voltage (x-axis: Vin with marks at 0 V, VT, VDD − VT, and VDD; regions: NMOS closed, NMOS open, PMOS open, PMOS closed)

Table 2.2 The conditions for the transistors of the inverter for being open or closed

          NMOS         PMOS
Open      Vin > VT     Vin < VDD − VT
Closed    Vin ≤ VT     Vin ≥ VDD − VT
s) ? r-s : s-r; // condition
endmodule

Similarly, conditions can be used in if statements or loops inside procedural blocks.

3.3 Test Benches

Having a model of a desired behavior is only the first step. Next, we need to make sure that it actually does what it is supposed to do. There are several well-established techniques, and supporting tools, that are concerned with functional validation. The two main directions are simulation and formal verification. Since they complement each other, they are often combined in practice. Formal verification is independent of input values and can establish that a design “is correct” for all possible inputs. However, different formal verification techniques define differently what it means “to be correct.” Equivalence checking compares two designs, for instance, a high-level behavioral model and a low-level structural model, and, if successful, proves that both behave the same way for all possible inputs. This is useful, and often used, if we have worked hard on a behavioral model and are pretty confident that it exhibits the right behavior. If we then synthesize a low-level model with a long chain of synthesis tools, it is reassuring if we can prove the equivalence of both models. Thus, equivalence checking defines correctness of a design as “being functionally equivalent to another model.”
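To make the notion of "the same behavior for all possible inputs" concrete, here is a minimal Python sketch that compares two descriptions of a 2-1 multiplexer by enumerating every input combination. Real equivalence checkers do not enumerate inputs; they rely on symbolic techniques that scale to much larger designs, so this brute-force loop is only a stand-in. The function names and the convention that sel = 1 selects input b are assumptions.

```python
from itertools import product

def mux_behavioral(a, b, sel):
    """Behavioral view: output b when sel is 1, otherwise a."""
    return b if sel else a

def mux_gates(a, b, sel):
    """Structural view: (a AND NOT sel) OR (b AND sel) on 1-bit values."""
    return (a & (1 - sel)) | (b & sel)

def mux_wrong(a, b, sel):
    """Deliberately faulty variant that ignores the select input."""
    return a | b

def equivalent(f, g, n_inputs):
    """True if f and g agree on all 2**n_inputs input combinations."""
    return all(f(*bits) == g(*bits)
               for bits in product((0, 1), repeat=n_inputs))

print(equivalent(mux_behavioral, mux_gates, 3))   # prints True
print(equivalent(mux_behavioral, mux_wrong, 3))   # prints False
```

With three 1-bit inputs there are only eight cases to compare; the number of cases doubles with every additional input bit, which is why enumeration stops being practical quickly.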


Property checking can prove that a model always and for all possible inputs exhibits certain properties. For instance, a traffic light controller should never give green lights simultaneously to both directions of a crossroad. We can formulate this condition as a property and then prove that this property is always true for our controller design, whatever the current and future inputs to our controller. Hence, property checking defines correctness as “having a set of given, formally well-defined properties.”

While formal verification methods are very powerful, they have limitations. Equivalence checkers need a reference model that has to be verified in a different way. For property checking to work, we need to find and model all relevant properties, which is a non-trivial task of its own. Another limitation is complexity: formal methods cannot deal with very complex designs. For these and other reasons, simulation is indispensable and still the designer’s workhorse for functional validation.

In this book, we use mostly simulation for establishing the functional correctness of a model. Test benches are central to functional validation through simulation, and in this section, we discuss three types of test benches: simple test benches, self-checking test benches, and test benches with test vectors.
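The traffic light property mentioned above can also be watched during simulation. The following is only a sketch under assumptions: the signal names green_ns and green_ew and the module name are hypothetical, and a monitor like this reports violations only for the stimuli that are actually simulated. It does not prove the property for all inputs, which is what a property checker would do.

```verilog
// Hypothetical simulation monitor; green_ns and green_ew are assumed
// outputs of a traffic light controller (north-south and east-west).
module traffic_property_monitor (
  input wire clk,
  input wire green_ns,
  input wire green_ew
);
  // Check the property "never both green" at every rising clock edge
  always @(posedge clk)
    if (green_ns && green_ew)
      $display("%0t: property violated, both directions are green", $time);
endmodule
```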

3.3.1 Simple Test Bench

Using our 2-1 multiplexer models as examples, we want to see what they actually do by applying input stimuli and observing the reaction of the models at the output. This is exactly what a test bench does, as illustrated in Fig. 3.5. Listing 3.1 shows a simple test bench for the multiplexer (see footnote 1). In line 8, the Design under Test (DuT) is instantiated, and its ports are connected to the nets of the test bench. The $dumpfile statement names the file in which all simulation activities are recorded. The file is a Value Change Dump (VCD), a common text format that records all changes of values for nets as requested by the $dumpvars statement in line 14. The statements in lines 16–20 constitute the stimuli generator: in steps of 1 ns, new values are assigned.

Fig. 3.5 A test bench consists of a stimuli generator, the design under test (DuT), and the response checker
[Figure: a Testbench block containing the Stimuli Generator, which drives the Inputs of the DuT, and the Checker, which observes the Outputs of the DuT]

Footnote 1: The Verilog code of this and the other test benches discussed below is available at Ch03-ModelingBasics/Mux on GitHub (Jantsch, 2022).

Listing 3.1 Simple test bench for the multiplexer
 1  `timescale 100ps/10ps // Unit of time is 100ps
 2
 3  module multiplexer_tbsi(); // A testbench has no inputs or outputs
 4    // Local nets:
 5    reg d0, d1, sel; // Input stimuli
 6    wire ybh; // Output of DuT
 7    // Instantiate design under test:
 8    mux21_bh dut_bh (.Y(ybh), .D0(d0), .D1(d1), .S(sel));
 9
10    // The initial block is evaluated only once:
11    initial
12    begin
13      $dumpfile("multiplexer_tbsi.vcd"); // File with simulation results
14      $dumpvars(0,multiplexer_tbsi); // which variables are written to file
15
16      #10 d0=1'b0; d1=1'b1; // 10 time units after beginning
17      sel=1'b0; // 10 time units later...
18      #10 sel=1'b1;
19      #10 sel=1'b0;
20      #10 sel=1'b1;
21      #10 $finish; // Finish simulation
22    end
23  endmodule

Box 3.5 Simulation with iverilog

Throughout this book, we use the Icarus Verilog Compiler for simulation of our Verilog designs. Icarus Verilog is open source and was developed for Unix platforms, but it can also be installed on other platforms (Williams, 1998). To simulate the multiplexer test bench, you execute the following commands from the shell.

sh> # Compile it:
sh> iverilog -o multiplexer_tb multiplexer.v multiplexer_tb.v

sh> # Simulate it:
sh> vvp multiplexer_tb

sh> # View the waveforms:
sh> gtkwave multiplexer.vcd

To view the waveforms generated during simulation, we use gtkwave (Wave, 2005), also an open-source tool that can be installed on various platforms.

Since this simple test bench has no response checker, we need another way to observe the outputs of the DuT. Since we have dumped all the value changes during the simulation to a VCD file, we can use a waveform viewer. An example of such a tool is gtkwave, which reads the VCD file and displays the effects of the simulation along the time axis, as is shown for the multiplexer example in Fig. 3.6. There we see the assignments done in the test bench and also the response of the DuT on net y_bh, which selects input d0 when sel=0 and d1 when sel=1, just as a multiplexer is expected to do. For simulating the test bench and viewing the waveforms, see the sidebox Simulation with Iverilog.

Fig. 3.6 The waveforms as a result of simulating the simple test bench of the multiplexer


3.3.2 Self-Checking Test Bench

A simple test bench is relatively quick to write, at least for a few obvious input conditions. But outputs have to be inspected manually, which can be tedious. Also, when you make small or big changes, you have to re-run the simulation and go through the waveforms again. This procedure is not only cumbersome; it is also worrisome because the design changes may have unexpected consequences which the designer may not suspect and does not bother to check.

Listing 3.2 Self-checking test bench for the multiplexer
 1  // Instantiate designs under test:
 2  mux21_bh dut_bh (.Y(y_bh), .D0(d0), .D1(d1), .S(sel));
 3  mux21_st dut_st (.Y(y_st), .D0(d0), .D1(d1), .S(sel));
 4  mux21_df dut_df (.Y(y_df), .D0(d0), .D1(d1), .S(sel));
 5
 6  // Response checker:
 7  always @(y_bh or y_st or y_df) // Evaluate upon signal changes
 8  begin
 9    if (y_bh !== y_st || y_bh !== y_df)
10      $display($time, ": Mux models not equal");
11  end

To address this issue, self-checking test benches are commonly used. Consider our multiplexer models again, and assume we want to check if they behave in the same way. Listing 3.2 shows code snippets from a test bench that fits the bill. In lines 2–4, all three muxes are instantiated. The stimuli generator can be the same as before, but now we have added another process, an always block, that serves as the response checker (lines 7–11). It is evaluated whenever any of the multiplexer outputs changes, and it compares those outputs. If they are not equal, a message is printed during the simulation.

Self-checking test benches are particularly useful for regression tests. If the checks cover all relevant conditions and functions of the DuT, the test bench can be re-run whenever a design change is committed. If no errors are detected by the response checker, the design change did not have an adverse effect.

The downside of self-checking test benches is that they are very time-consuming to write. In fact, if the response checker indeed covers all relevant functions of the DuT, it represents a specification model of the design. Consequently, the effort to write the checker is of the same order of magnitude as the effort to model the DuT itself. And, even worse, obtaining an error-free checker model is as hard as writing an error-free design model. Still, self-checking test benches are highly useful and substantially increase the quality of the models, mainly because they force designers to approach the design from a very different angle, i.e., the perspective of a response checker. In addition, it is exceedingly unlikely that the same error appears in both the DuT and the checker at the same time. An error report is about equally likely to be caused by a bug in the design as by a bug in the checker, and the number of errors will roughly double because the size of the code base doubles, but the probability of missing an error in the DuT decreases sharply. If the checker does not report an error, the chances are high that the design is correct, and the confidence in the design is greatly improved.
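For readers who want something complete to run, here is a minimal self-checking test bench in the spirit of this section. It is a sketch under assumptions and not the code of Listing 3.2: the two multiplexer models mux_df and mux_gates are hypothetical stand-ins, the stimuli are generated by a loop over all eight input combinations, and the comparison is done in the stimulus loop after a settling delay rather than in a separate always block.

```verilog
`timescale 1ns/1ps

// Hypothetical dataflow model of a 2-1 multiplexer
module mux_df (output wire Y, input wire D0, D1, S);
  assign Y = S ? D1 : D0;
endmodule

// Hypothetical gate-level model of the same function
module mux_gates (output wire Y, input wire D0, D1, S);
  wire ns, a0, a1;
  not (ns, S);
  and (a0, D0, ns);
  and (a1, D1, S);
  or  (Y, a0, a1);
endmodule

module mux_selfcheck_tb;
  reg  d0, d1, sel;
  wire y_df, y_gates;
  integer i, errors;

  mux_df    dut_df    (.Y(y_df),    .D0(d0), .D1(d1), .S(sel));
  mux_gates dut_gates (.Y(y_gates), .D0(d0), .D1(d1), .S(sel));

  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      {d0, d1, sel} = i;   // the lower three bits of i drive the inputs
      #10;                 // give both models time to settle
      if (y_df !== y_gates) begin
        $display("%0t: mismatch for d0=%b d1=%b sel=%b", $time, d0, d1, sel);
        errors = errors + 1;
      end
    end
    $display("%0d mismatches", errors);
    $finish;
  end
endmodule
```

The case inequality operator !== is used, as in Listing 3.2, so that an x or z on one output is also reported as a mismatch.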

3.3.3 Test Bench with Test Vectors

A variant and further systematization of self-checking test benches are test benches where both input stimuli and expected responses are stored in separate files. Obviously, these test vector files are non-trivial to generate for non-trivial designs, and their quality determines the quality of the test benches. But assuming a test vector file is available, the test bench can be designed as a generic machinery that reads input stimuli from the file, applies them to the DuT, and compares the response to the expected values from the test vector file.

Listing 3.3 Test vector file multiplexer.tv
// Test vector for 2-1 multiplexer
// Format: d0 d1 sel y
000_0
001_0
010_0
011_1
100_1
101_0
110_1
111_1

Listing 3.3 shows a test vector file in a format that can be easily read by the Verilog system task $readmemb. Except for the comment lines, each line constitutes a vector of binary numbers; the underscore character “_” is ignored and used here to delineate inputs from expected outputs. The first three digits are the binary numbers to be applied to d0, d1, and sel of the multiplexer; the last digit is the expected output. As we can see, all possible combinations of the inputs are present, which means the test vector file can be used to run an exhaustive functional test.

The test bench uses a clock to govern the timing when input stimuli are applied to the DuT and when output responses are sampled and compared. In our example, we use a 1 GHz clock signal and apply inputs at the rising clock edge and sample the response at the falling clock edge. It is important to sample outputs some time after inputs are applied to allow sufficient time for the DuT to compute its function.

Listing 3.4 Test vector-based test bench
 1  initial
 2  begin
 3    $readmemb("multiplexer.tv", testvectors); // Read vectors
 4    vectornum = 0; errors = 0; // Initialize
 5  end
 6
 7  // Stimuli generation:
 8  // apply test vectors on rising edge of clk
 9  always @(posedge clk)
10  begin
11    #1; {d0,d1,sel,yexpct} = testvectors[vectornum];
12  end
13
14  // Response checker:
15  // check results on falling edge of clk
16  always @(negedge clk)
17    if (~reset) // skip during reset
18    begin
19      if (y !== yexpct)
20      begin
21        $display("Error : inputs = %b", {d0,d1,sel});
22        $display(" outputs = %b (%b exp)", y, yexpct);
23        errors = errors + 1;
24      end
25      // increment array index and read next testvector
26      vectornum = vectornum + 1;
27      if (testvectors[vectornum] === 4'bx)
28      begin
29        $display("%d tests completed with %d errors",
30                 vectornum, errors);
31        $finish; // End simulation
32      end
33    end

Listing 3.4 shows relevant code snippets of the test bench that uses test vectors. In the initial block, the test vector file is read into the vector variable testvectors which is indexed by vectornum. The stimuli generator is an always block (lines 9–12). The input stimuli are applied 1 time unit, i.e., 100ps after the rising clock edge. The checker not only compares expected with observed outputs (lines 18– 23) but also increments the vector index (line 26) and checks when the input file is exhausted (lines 27–31). Figure 3.7 shows the simulation results of this test bench.
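Listing 3.4 leaves out the declarations, the clock, and the reset. The sketch below shows one way the missing scaffolding could look, so that the whole test bench can be simulated. Everything outside the three blocks of Listing 3.4 is an assumption: the module names, the inline DuT mux21_sketch, the array depth of 16, and the reset timing are made up for illustration. The file multiplexer.tv is the one from Listing 3.3.

```verilog
`timescale 100ps/10ps

// Hypothetical DuT so that the sketch is self-contained
module mux21_sketch (output wire Y, input wire D0, D1, S);
  assign Y = S ? D1 : D0;
endmodule

module multiplexer_tbtv_sketch;
  reg        clk, reset;
  reg        d0, d1, sel, yexpct;
  wire       y;
  reg  [3:0] testvectors [0:15];  // deeper than the file; unused entries stay x
  integer    vectornum, errors;

  mux21_sketch dut (.Y(y), .D0(d0), .D1(d1), .S(sel));

  // 1 GHz clock: a period of 1 ns is 10 time units of 100 ps
  always begin
    clk = 1'b1; #5;
    clk = 1'b0; #5;
  end

  initial begin
    $readmemb("multiplexer.tv", testvectors);  // read vectors
    vectornum = 0; errors = 0;
    reset = 1'b1; #12; reset = 1'b0;           // assumed reset duration
  end

  // Stimuli generation: apply a vector shortly after the rising edge
  always @(posedge clk) begin
    #1; {d0, d1, sel, yexpct} = testvectors[vectornum];
  end

  // Response checker: compare on the falling edge
  always @(negedge clk)
    if (~reset) begin
      if (y !== yexpct) begin
        $display("Error : inputs = %b", {d0, d1, sel});
        $display(" outputs = %b (%b exp)", y, yexpct);
        errors = errors + 1;
      end
      vectornum = vectornum + 1;
      if (testvectors[vectornum] === 4'bx) begin  // first unused entry reached
        $display("%0d tests completed with %0d errors", vectornum, errors);
        $finish;
      end
    end
endmodule
```

The end-of-file test relies on the array being deeper than the number of vectors in the file, because only then is the entry after the last vector still all x.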


Fig. 3.7 The simulation output of the test vector-based test bench from Listing 3.4

The test bench of Listing 3.4 is a generic template that can be easily adapted for any DuT, provided that a meaningful test vector file can be generated. Hence, it is recommended to use it because it simplifies the validation setup and guides the designer to focus on the generation of meaningful test patterns and expected responses. Note that validation with exhaustive test vectors is not feasible even for small or moderate designs. Even a 32-bit adder, which is a common block even in smaller designs, cannot be exhaustively validated because it would take years to simulate.
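To put a rough number on this claim, consider the two 32-bit operands of the adder and ignore the carry input. The simulation rate of $10^9$ vectors per second used below is an assumption chosen to be optimistic; event-driven simulators are typically much slower.

```latex
\begin{align*}
N_{\text{vectors}} &= 2^{32} \cdot 2^{32} = 2^{64} \approx 1.8 \times 10^{19} \\
t_{\text{sim}} &\approx \frac{1.8 \times 10^{19}\ \text{vectors}}{10^{9}\ \text{vectors/s}}
               = 1.8 \times 10^{10}\ \text{s}
               \approx 580\ \text{years}
\end{align*}
```

Even under this assumption the exhaustive test would run for centuries, which is why the quality of selected test vectors matters so much.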

3.4 Adder Design in AIMS

3.4.1 Half Adder

In this section, we use first the half adder and then the full adder to go through an AIMS cycle, and on the way, we learn more about modeling in Verilog, and we introduce synthesis with the Yosys Synthesis Suite (Yosys) (Wolf, 2020a). As target technology, we use and introduce Lattice FPGAs and tools for placement, routing, and timing analysis that can deal with this technology. Figure 3.8 details the steps and the decisions we take for both examples. For the half adder, we consider only timing properties and pin constraints, but for the full adder, we will also explore power consumption and how different models impact the metrics of interest.

Technology-Independent Optimization

Figure 3.9a shows a Verilog model of a half adder in dataflow style. There are two output functions, sum and carry, which are Boolean functions of the two inputs a and b. Assuming we have carefully validated our half adder, it is time to synthesize it.
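A dataflow half adder along the lines described here might look as follows. This is a sketch rather than the code of Fig. 3.9a: the module name halfadder and the port names a, b, sum, and carry are taken from the figures in this section, and the exact formatting is an assumption.

```verilog
// Sketch of a half adder in dataflow style (assumed to resemble Fig. 3.9a)
module halfadder (
  input  wire a, b,
  output wire sum, carry
);
  assign sum   = a ^ b;   // XOR: 1 if exactly one input is 1
  assign carry = a & b;   // AND: 1 if both inputs are 1
endmodule
```

The two continuous assignments correspond to the generic $xor and $and cells that Yosys shows for this design.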

Fig. 3.8 The AIMS flow for the adder design. The square boxes inside the synthesis step denote design decisions
[Figure labels: Verilog Model; Synthesis with Technology Independent Optimization, Technology Mapping, and Placement & Routing; design decisions Target Technology, Device family, Target Device, and Pin Constraints; Analysis with Timing Analysis, Power Analysis, and Resource Usage; decision OK?; feedback path Improve]

Fig. 3.9 Half adder model and synthesis. (a) Dataflow model. (b) Yosys synthesis script

Fig. 3.10 The half adder from Fig. 3.9a as processed by Yosys. Two generic logic functions, denoted as $and and $xor, connect the input to the output ports

Fig. 3.11 The half adder synthesized by Yosys for two different FPGA targets. (a) Synthesized with synth_xilinx. (b) Synthesized with synth_ice40
[Figure: (a) two LUT2 cells with inputs a and b on I0 and I1, driving sum and carry; (b) two SB_LUT4 cells with a and b on I0 and I1, the unused inputs I2 and I3 tied to constant 0, driving sum and carry]

For that, we use Yosys, which is an open-source, command line-based synthesis tool. After we have installed Yosys, we can invoke it from a shell and execute the commands shown in Fig. 3.9b. The opt command makes simple optimizations and cleans up the design to remove redundant structures. The show command generates the visualization of the design depicted in Fig. 3.10.

Mapping onto an iCE40 FPGA

opt performs technology-independent design transformations. In our example, there is not much to do, and we easily recognize the resulting design graph from the source code. But opt does not know about any target technology, and therefore, it cannot establish a connection to real hardware. There is a family of Yosys commands that know about specific target technologies. Two of these commands are synth_xilinx and synth_ice40, which target two different FPGA device families. The result of applying these commands to our half adder is shown in Fig. 3.11. As can be seen, these two technologies use two different elementary components, LUT2 and LUT4, which are Lookup Tables (LUTs) with two and four inputs, respectively.


Table 3.1 The iCE40 low power and high performance FPGA family

                                   LP384  LP640  LP1K   LP4K   LP8K   HX1K   HX4K   HX8K
Logic cells (LUT + flip-flop)      384    640    1,280  3,520  7,680  1,280  3,520  7,680
RAM4K memory blocks                0      8      16     20     32     16     20     32
RAM4K RAM bits                     0      32K    64K    80K    128K   64K    80K    128K
PLLs                               0      0      1      2      2      1      2      2
Maximum programmable I/O pins      63     25     95     167    178    95     95     206
Maximum differential input pairs   8      3      12     20     23     11     12     26

In the following, we use the Lattice iCE40 family of FPGA devices to illustrate the design. iCE40 devices are relatively simple FPGAs that nonetheless have many of the features that more complex and larger devices also have. Table 3.1 lists all the family members with their names and main characteristics. There are two groups: Low Power, labeled with LP, and High Performance, labeled with HX. The device name also hints at the size of the device by giving the approximate number of elementary cells, which are called Logic Cells in the iCE40 terminology. Logic cells are used to realize the logic functions and some storage capability. In addition to logic cells, the table lists memory resources, PLLs, and I/O pins. We will gradually discuss all these features one by one.

The basic architecture, shown in Fig. 3.12, is the same for all devices in the iCE40 family. It consists of a 2-D array of Programmable Logic Blocks (PLBs), interspersed with memory blocks and surrounded by I/O resources. Each PLB consists of eight logic cells, each of which contains one four-input LUT, one flip-flop, and carry logic. We will discuss the details of PLBs and logic cells in a later chapter. For now, it suffices to know that a LUT4 can be used to implement any logic function with at most four inputs; in particular, it can also be used for two-input AND and XOR functions. That is the reason we see two LUT4 cells in Fig. 3.11b that implement the functions for the sum and the carry outputs.

Examining the architecture in Fig. 3.12 more closely, we note that the type of logic resources is uniform; there are only LUT4 cells, no LUT2 or other types of LUTs. For our half adder, this means that we are wasting some resources, because we have to use an oversized logic block for implementing a rather simple function. This is a trade-off that FPGA designers and users face all the time. FPGA designers can equip their device with many different types of LUTs: LUT2, LUT3, LUT4, etc. Then any given function could be implemented on the type of logic resource that most closely matches its complexity. However, the downside is that we may not have enough of a particular type of resource at the location where we need it. For instance, if we provide two types of LUTs, LUT2 and LUT4, we have to distribute them in some regular fashion throughout the device. During the synthesis process, it could happen that we run out of LUT4 cells in a particular region of the FPGA. Then, to implement a four-input function, we would either have to use a LUT4 from a faraway region, or we would have to use three LUT2 cells. In both cases, we would require more routing resources, and the design would become slower. Thus, providing many different types of resources on an FPGA can lead to more or less wasted resources, and to slower or faster designs, depending on the specific application that we want to implement. For this reason, FPGAs are usually very uniform and come with few different types of resources. The iCE40 FPGA has only one type of LUT, one type of memory block, and one type of I/O cell.

Fig. 3.12 iCE40 architecture
[Figure: a 2-D array of PLBs with two columns of 4 kbit RAM blocks, embedded in Programmable Interconnect and surrounded by I/O Bank 0 to I/O Bank 3, the SPI Bank (Serial Peripheral Interface), a PLL (Phase Locked Loop), and the NVCM (Non-volatile Configurable Memory); each logic cell is drawn with Carry Logic, a Lookup Table, a Flip Flop, and a Mux]
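The statement that a LUT4 can implement any function of up to four inputs becomes concrete when the LUT is viewed as a 16-entry truth table that is indexed by its inputs. The following behavioral sketch illustrates this. The module lut4_sketch and its parameter INIT are hypothetical and are not the vendor's primitive; the two instances mirror Fig. 3.11b, where the unused inputs I2 and I3 are tied to 0.

```verilog
// Hypothetical behavioral model of a four-input LUT: the 16-bit parameter
// INIT is the truth table, and the inputs select one of its bits.
module lut4_sketch #(parameter [15:0] INIT = 16'h0000) (
  output wire O,
  input  wire I0, I1, I2, I3
);
  wire [15:0] truth_table = INIT;
  assign O = truth_table[{I3, I2, I1, I0}];
endmodule

// Half adder built from two such LUTs. With I2 = I3 = 0, only the
// table entries 0 to 3 are ever selected.
module halfadder_lut (
  input  wire a, b,
  output wire sum, carry
);
  // XOR: entries 1 (a=1,b=0) and 2 (a=0,b=1) are 1, so bits 3..0 = 0110
  lut4_sketch #(.INIT(16'h0006)) lut_sum
    (.O(sum),   .I0(a), .I1(b), .I2(1'b0), .I3(1'b0));
  // AND: only entry 3 (a=1,b=1) is 1, so bits 3..0 = 1000
  lut4_sketch #(.INIT(16'h0008)) lut_carry
    (.O(carry), .I0(a), .I1(b), .I2(1'b0), .I3(1'b0));
endmodule
```

Twelve of the sixteen table entries in each LUT are never used here, which is the waste of resources that the text refers to.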

Placement and Routing

With the Yosys synth_ice40 command, we have made the first step toward mapping the half adder on an FPGA. synth_ice40 is knowledgeable of the type of resources available in the iCE40 family of devices (hence its usage of LUT4 cells), but it is not aware of the specific number and locations of resources in a specific iCE40 device. Thus, the next step is to do the actual placement and routing for one of the available devices in the iCE40 family. As the Place & Route (P&R) tool, we use nextpnr (Shah et al., 2019), which is also open source and script based.


The following commands do the synthesis with Yosys and generate a JSON file which serves as input to the P&R tool.

sh> yosys
yosys> read_verilog halfadder.v
yosys> opt
yosys> write_json halfadder.json
yosys> exit

sh> nextpnr-ice40 --hx1k --json halfadder.json --pcf halfadder.pcf --asc halfadder.asc

sh> cat halfadder.pcf
set_io a 99
set_io b 98
set_io sum 7
set_io carry 8

nextpnr-ice40 is the P&R tool which knows the details of the iCE40 devices. In this case, the HX1K device is chosen as target, and we also use a Pin Constraint File (PCF) to specify to which pins the inputs and outputs of the half adder should be routed. Figure 3.13a shows the result of the P&R step. Since the given pin constraints have demanded that the inputs and outputs are located on opposite sides of the device, the half adder is spread out over the chip with the logic placed in the middle.

As this example illustrates, the P&R binds all operations from the synthesized design graph (Fig. 3.11b) to specific resource instances on the FPGA. In addition to binding computing operations (such as LUTs), it also binds memory resources such as individual flip-flops, registers or block RAMs, and I/O pins. After the binding of resources, the routing phase connects all computing, memory, and I/O resources with each other by allocating specific wiring and interconnect resources. As we will see later, FPGAs have a tremendous amount of interconnect resources to facilitate routing even if many of the compute and memory resources are densely used. Because they may still not suffice if most of the LUTs of the FPGA are used up, the capacity limit of an FPGA is typically reached when the LUT resource usage reaches 80–90% and sometimes as low as 70%. Thus, it turns out that in most situations, interconnect resources are the limiting factor for FPGAs.

In summary, the synthesis steps that we have seen so far (also seen in Fig. 3.8) can be described as follows.

Optimization: Technology-independent logic optimization uses Boolean algebra to transform the logic functions extracted from the Verilog model into an optimized set of logic functions.
Mapping: Technology-dependent synthesis maps the logic gates from the previous step to elementary gates from a design library that exist in the target technology. The mapping only considers the type of elementary gates available, but not the number. It does not know about the number of gates, flip-flops, or I/O pins available in a specific device.
Placement: Device-specific placement considers the number of gate types, registers, and I/O pins available and also the topological relations of those resources. Thus, it knows which resources are close to each other and which are far away based on a 2-D footprint representation of the target device, illustrated by Fig. 3.13a.
Routing: Routing allocates interconnect resources to connect the placed computation, memory, and I/O resources.

Fig. 3.13 The half adder has been placed and routed on the HX1K FPGA. One can see the individual PLBs, which are organized in a 12 × 16 grid. Two of the columns constitute in fact RAM blocks, which leaves the device with 160 PLBs. Since each PLB contains 8 LUTs, we count 10 × 16 × 8 = 1280 logic cells, motivating the corresponding entry in Table 3.1. (a) Placement of the whole HX1K device. (b) Placement of the specific resources used by the half adder
[Figure labels: ICE40 HX1K FPGA; IO Pads: a, b; 2 4-input LUTs; IO Pads: sum, carry]


All four steps are implemented as heuristic search algorithms, because an optimal solution cannot be found except in the simplest cases. These algorithms are guided by objective functions that represent the design goals. The most important objectives are to minimize resource usage, delay, and power consumption.

Objectives come in one of two ways: either we want to minimize or maximize a quantity, or we have a limit that must not be exceeded or underrun. For instance, the HX1K device has 95 I/O pins. Using 20, 50, or 95 of them does not make a difference, but 96 is not permitted. Hence, the number of I/O pins should be given as an upper limit, not a quantity to be minimized. Delay is often to be minimized, but not always. If the designer gives a target frequency of 100 MHz, the delay between two clocked registers must be lower than 10 ns, but if it is 2 ns or 8 ns in some parts, it makes no difference. The situation is similar with power consumption. Almost always we want to minimize it, but we also have to deal with strict upper limits, because there are usually hard upper limits on the power consumption that a specific device can deal with.

Designers can provide preferences and constraints to the synthesis process. We have seen one example of a constraint above when we provided hard constraints on the mapping of inputs and outputs to I/O pins.

Timing Analysis

In addition to functionality, resource usage, delay, and power consumption are first-order design goals. In our simple half adder example, there are no good options to play with resource usage, but propagation delay can be analyzed. We use the icetime tool from the IceStorm project (Wolf, 2020b) and obtain the timing report shown in Listing 3.5. In Sect. 5.3, we take a close look at how Static Timing Analysis (STA) works. As the term static in its name suggests, it does not depend on the specific input patterns and tries to identify the worst-case delays by inspecting all possible paths in the design. Paths examined by STA go from register to register, from primary input to register, from registers to primary outputs, and from primary input to primary output. Only the latter case appears in our example, and we see the timing for the path from a to carry, which is the longest path. In total, it will take 2.93 ns, meaning that we can run our circuit at about 340 MHz.

On closer examination of the report, we learn that the path delay is broken down into the following components:

Pre IO is the delay from the external pin to, and including, the circuitry of the input pad for a: 240 ps.
Connect 1 is the delay from the input pad to the logic cells in the center of the chip: 1003 ps.
Logic Cells is the delay for the two LUTs in one PLB: 660 ps.
Connect 2 is the delay from the PLB to the output pad: 702 ps.
IO Cell is the delay in the circuitry of the I/O cell to select the correct output pad: 260 ps.
Pre IO is the delay from the output pad to the external pin: 70 ps.
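As a quick cross-check of the breakdown above, the six components can be added up and converted into a maximum frequency. The symbols $t_{\text{path}}$ and $f_{\max}$ are introduced here only for this calculation.

```latex
\begin{align*}
t_{\text{path}} &= 240 + 1003 + 660 + 702 + 260 + 70 = 2935\ \text{ps} \\
f_{\max} &= \frac{1}{t_{\text{path}}} = \frac{1}{2.935\ \text{ns}} \approx 340.7\ \text{MHz}
\end{align*}
```

The report in Listing 3.5 gives 2.933 ns and 340.90 MHz. The small difference is consistent with the individual delays being printed rounded to three decimals while the tool accumulates them at higher precision.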


Listing 3.5 Timing report for the half adder

icetime topological timing analysis report
==========================================

Report for critical path:
-------------------------

        pre_io_0_11_1 (PRE_IO) [clk] -> DIN0: 0.240 ns
     0.240 ns net_1178 (a$SB_IO_IN)
        odrv_0_11_1178_1228 (Odrv12) I -> O: 0.540 ns
        t2 (Span12Mux_h1) I -> O: 0.133 ns
        t1 (LocalMux) I -> O: 0.330 ns
        inmux_11_11_23865_23894 (InMux) I -> O: 0.260 ns
        lc40_11_11_1 (LogicCell40) in1 -> lcout: 0.400 ns
     1.902 ns net_21801 (carry$SB_IO_OUT)
        odrv_11_11_21801_23952 (Odrv4) I -> O: 0.372 ns
        t4 (LocalMux) I -> O: 0.330 ns
        inmux_13_11_27543_27519 (IoInMux) I -> O: 0.260 ns
     2.863 ns net_27519 (carry$SB_IO_OUT)
        pre_io_13_11_0 (PRE_IO) DOUT0 [setup]: 0.070 ns
     2.933 ns io_pad_13_11_0_din

Resolvable net names on path:
     0.240 ns ..  1.503 ns a$SB_IO_IN
     1.902 ns ..  2.863 ns carry$SB_IO_OUT

Total number of logic levels: 2
Total path delay: 2.93 ns (340.90 MHz)

Together the delay on the entire path amounts to 2935 ps. Is this good enough? What are our design objectives? If we do not know them exactly yet because they also depend on other parts of the design or the application, it is a good strategy to find the fastest and the cheapest solutions, because these two will show us the scope of the design space and we will know what is feasible. Since the half adder model itself is so simple, there is nothing we can tweak here. But we have different devices to pick, and we can impose mappings of I/Os to pins. Since all the tools are command line based, a script can easily be written that runs P&R and timing analysis for all the devices and different pin constraints.


Table 3.2 Half adder delay for different devices and pin constraints

Starting with the Yosys optimized half adder design (the JSON file), we do P&R for the available devices and two different pin constraint files, one where inputs and outputs are placed on opposite edges of the chip, which is denoted as “opposite” in the analysis below, and one where all I/Os are placed on the same side close to each other, denoted as “close.” The results, except for the LP640 device, are shown in Table 3.2. Studying the table, a few things can be observed.

First, the cheapest solution is certainly the LP384 device, but the fastest is one of the high-performance chips when the pins are properly selected. The LP384 with “close” pin constraints can run at 300 MHz, while either of the two HX devices can go as fast as 500 MHz.
Second, pin constraints make a significant difference, and the larger the devices, the bigger the difference. For the 8K devices, the “opposite” solution is 2.4 times slower than the “close” solution, while for the 1K devices, the ratio is about 1.5.
Third, the HX devices are significantly faster than the LP variants; the difference is visible in all components: I/O circuitry, logic, and interconnect. The difference is about 50%.
Fourth, there are some variations with non-obvious causes. For instance, the logic of the LP1K device is a bit faster than the logic in both the LP384 and the LP8K chips. Or long-range interconnect in the LP384 is slower than in the 1K device.

The reason for these variations is that P&R is a heuristic search algorithm; in this case, it is a simulated annealing optimizer. Heuristic methods usually do not find the optimal solution but stop their search when a reasonable solution is found. The search algorithm has to make decisions in which direction to search or which part of the design space not to investigate, without knowing if this allows or prohibits finding the optimal solution. Mostly, these decisions are based on heuristic rules that lead to good solutions most of the time, but without guarantees.

Some of these decisions are also based on generated random numbers. For instance, the search algorithm has to decide whether a worse solution should be investigated in the hope that it leads out of a local optimum to a globally better part of the design space. Algorithms that never take this risk are called greedy. Greedy heuristics are prone to be trapped in local optima and are poor in finding globally good solutions. Therefore, most heuristics pursue worse solutions with a certain probability, which usually decreases the longer the search goes on.

Often designers can provide seed values to initialize the random number generator. Different seed values will then lead to different search decisions and to different design solutions. Hence, it is good practice to re-run optimizers with different seed values to understand the variations of the results and the impact of the random decisions during search. nextpnr also has a --seed option. Executing nextpnr with different seed values shows that the abovementioned variations actually disappear or are reversed. Thus, they are not due to differences in the architecture, but caused by the specific solution that the P&R tools happen to find in a particular optimization run.
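The idea of accepting worse solutions with a probability that shrinks over time is commonly formalized in simulated annealing by the Metropolis acceptance rule shown below. This is the textbook form of the rule and is given here as an illustration; it is an assumption that the placer in nextpnr follows it in exactly this form. The symbols are introduced for this sketch: $\Delta C$ is the change in cost caused by a proposed move and $T$ is the temperature, which is lowered step by step during the search.

```latex
\[
P(\text{accept move}) =
\begin{cases}
1 & \text{if } \Delta C \le 0 \quad \text{(the move improves the solution)} \\[4pt]
e^{-\Delta C / T} & \text{if } \Delta C > 0 \quad \text{(the move makes it worse)}
\end{cases}
\]
```

For a large $T$ almost every move is accepted, so the search can escape local optima. As $T$ approaches zero, the probability of accepting a worse move vanishes and the search behaves like a greedy algorithm. The random numbers that decide individual moves are the ones influenced by the seed value.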

3.4.2 Full Adder With the half adder, we have taken our first steps through an AIMS cycle studying the effects of design decisions. Next, we use the full adder we go through the same flow, as illustrated in Fig. 3.8, but in addition, we also study the effects of changes in the modeling style, and we take a first look at power consumption. The full adder is similar to the half adder but has one more input, the carry in signal. Listing Fig. 3.14 shows three different models of the full adder: a dataflow model, a structural model that consists of two half adders, and a behavioral model based on the truth table. The structural model instantiates three components, two half adders and one OR gate, and connects them to form a netlist. The truth table model uses a Verilog process with only one case statement inside. One can see the left-hand and the right-hand sides of a truth table expressed as bit vectors. It can serve as a template for modeling logic functions for which the truth table is known. These three models are functionally equivalent as we should verify by using a good test bench, simulating them, and comparing the results. But now our question

Fig. 3.14 Three models of the full adder. (a) Dataflow model. (b) Structural model. (c) Truth table
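Since the figure itself is not reproduced here, the following is a minimal sketch of what three such models could look like. It is not the author's listing: all module names are hypothetical, and only the signal names a, b, ci, sum, co, w1 to w3, and ovec are borrowed from the surrounding text and figures. A small self-checking test bench compares all three models against a + b + ci.

```verilog
// Sketch of the three modeling styles; all module names are hypothetical.

// Half adder used as a building block by the structural model.
module halfadder (input a, b, output sum, carry);
  assign sum   = a ^ b;
  assign carry = a & b;
endmodule

// (a) Dataflow model: Boolean equations as continuous assignments.
module fulladder_df (input a, b, ci, output sum, co);
  assign sum = a ^ b ^ ci;
  assign co  = (a & b) | (a & ci) | (b & ci);
endmodule

// (b) Structural model: two half adders and one OR gate form a netlist.
module fulladder_st (input a, b, ci, output sum, co);
  wire w1, w2, w3;
  halfadder ha1 (.a(a),  .b(b),  .sum(w1),  .carry(w2));
  halfadder ha2 (.a(w1), .b(ci), .sum(sum), .carry(w3));
  or (co, w2, w3);
endmodule

// (c) Truth table model: a single case statement inside one process.
module fulladder_tt (input a, b, ci, output sum, co);
  reg [1:0] ovec;                 // right-hand side of the table: {co, sum}
  assign {co, sum} = ovec;
  always @(*) begin
    case ({a, b, ci})             // left-hand side of the table
      3'b000:  ovec = 2'b00;
      3'b001:  ovec = 2'b01;
      3'b010:  ovec = 2'b01;
      3'b011:  ovec = 2'b10;
      3'b100:  ovec = 2'b01;
      3'b101:  ovec = 2'b10;
      3'b110:  ovec = 2'b10;
      3'b111:  ovec = 2'b11;
      default: ovec = 2'bxx;
    endcase
  end
endmodule

// Self-checking test bench: all three models must agree with a + b + ci.
module fulladder_models_tb;
  reg  a, b, ci;
  reg  [1:0] expected;
  wire [1:0] r_df, r_st, r_tt;    // each result is {co, sum}
  integer i, errors;

  fulladder_df u_df (.a(a), .b(b), .ci(ci), .sum(r_df[0]), .co(r_df[1]));
  fulladder_st u_st (.a(a), .b(b), .ci(ci), .sum(r_st[0]), .co(r_st[1]));
  fulladder_tt u_tt (.a(a), .b(b), .ci(ci), .sum(r_tt[0]), .co(r_tt[1]));

  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      {a, b, ci} = i[2:0];
      expected   = i[0] + i[1] + i[2];
      #1;
      if (r_df !== expected || r_st !== expected || r_tt !== expected) begin
        errors = errors + 1;
        $display("MISMATCH a=%b b=%b ci=%b: df=%b st=%b tt=%b expected=%b",
                 a, b, ci, r_df, r_st, r_tt, expected);
      end
    end
    if (errors == 0)
      $display("All three models agree on all 8 input patterns.");
    $finish;
  end
endmodule
```

With only three inputs, exhaustive simulation of all eight patterns is cheap, so the equivalence check does not need hand-picked test vectors.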


Fig. 3.15 Dataflow model after simple optimization

Fig. 3.16 Structural model after simple optimizations. (a) Before flattening. (b) After flattening

Fig. 3.17 Truth table model after simple optimization

is, how do implementations of these models differ in terms of non-functional characteristics?

Simple Optimization

To find out, we use Yosys to apply a sequence of transformations. The results of a first set of transformations, which contain technology-independent simple optimizations, are shown in Figs. 3.15, 3.16 and 3.17. In Fig. 3.15, we immediately see all the logic gates in the dataflow equations of the Verilog code. The interesting


aspect of the structural model is the hierarchy involved. Figure 3.16 shows the result before and after flattening the design. Hierarchical models pose a challenge for optimizers because there is no heuristic that works best in all cases. In principle, there are two possible strategies: the optimizer can keep the hierarchy and optimize each block separately, or it can flatten the design, treat it as one large netlist, and optimize it with a global perspective. Keeping the hierarchy is faster because the search space is significantly constrained and only optimizations within each block are possible. But that also means missing some opportunities that can be found only if the design is flattened before optimization. The simple design of the full adder does not really illustrate this trade-off, but comparing Figs. 3.15 and 3.16 highlights another important point. The flattened and optimized design in Fig. 3.16b uses fewer gates than the optimized design in Fig. 3.15 derived from the dataflow model. In fact, it has two gates fewer (five instead of seven), which means about 29% fewer gates. The reason is that the hierarchical structure contains information about reuse of parts of the design, which is often very challenging to extract from a flattened design. Indeed, if you investigate the netlist of Fig. 3.15 in detail, it will take quite a while to find out that it can be rearranged to save two gates. The netlist in Fig. 3.16b is smaller because it has used the information present in the hierarchical structure, namely, that part of the computation for the sum output can also be used for co. Figure 3.17 shows that the design extracted from the truth table model is quite different, using a set of comparators and a generic n-input multiplexer, $pmux. This structure reflects the truth table semantics nicely, but with 11 gates, it is alarmingly large.
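The two-gate saving can be checked algebraically. The derivation below is a sketch added for illustration. It assumes that the dataflow model uses the three-product carry equation and that the structural model computes the carry from the two half adders; $\oplus$ denotes XOR and $c_i$ the carry-in.

```latex
\begin{aligned}
co_{\text{struct}} &= ab + c_i\,(a \oplus b)
                    = ab + c_i\,(a\bar{b} + \bar{a}b) \\
                   &= a\,(b + \bar{b}c_i) + \bar{a}bc_i
                    = a\,(b + c_i) + \bar{a}bc_i \\
                   &= ab + c_i\,(a + \bar{a}b)
                    = ab + c_i\,(a + b) \\
                   &= ab + ac_i + bc_i = co_{\text{dataflow}}
\end{aligned}
```

Because $a \oplus b$ is already needed for the sum, the structural form requires 2 XOR, 2 AND, and 1 OR gate (5 gates), whereas the dataflow form requires 2 XOR, 3 AND, and 2 OR gates (7 gates).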

Logic Synthesis

In the next step, the full power of logic optimization is enlisted; its results are shown in Fig. 3.18. This synthesis step is technology independent and uses a set of generic gates as target library. In addition to the common logic functions, it also contains 3-input AND-OR-Inverter and OR-AND-Inverter gates, denoted as AOI3 and OAI3, respectively. The three synthesized designs in Fig. 3.18 are similar but not structurally identical. The design synthesized from the truth table model is even one gate bigger than the other two. Yosys includes a flexible and powerful

Fig. 3.18 Full adder models after synthesis. (a) Behavioral model. (b) Structural model. (c) Truth table model


logic synthesis toolbox allowing a designer to tweak the synthesis procedure such that in all three cases, the minimum implementation would be generated. However, the fact that the default settings of the heuristics can lead to differences that are intuitively hard to understand suggests that differences in the initial models can lead to unexpected differences in the implementations. If this is true for extremely simple designs, one wonders how designers can come up with models that lead to minimal, fast, and power-efficient designs. It turns out that the main burden is on the tools, but it requires significant experience for a designer to come up with appropriate models and to steer the tools with sensible constraints and directives. But because the relation between input models, constraints, and implementation characteristics is highly non-linear and sometimes counterintuitive, an indispensable method is the systematic measuring of the effects of design decisions with available analysis tools. This method is exemplified in this chapter with the adder design and is at the heart of the AIMS methodology.
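To make the generic AOI3 and OAI3 gates of the synthesis step concrete, here is a minimal sketch. It assumes the usual definitions AOI3 = NOT((A AND B) OR C) and OAI3 = NOT((A OR B) AND C). The five-gate full adder is one plausible netlist built from the cell types named in Fig. 3.18 (XOR, NAND, OAI3); the exact wiring of the figure is not recoverable here, so module and wire names are hypothetical.

```verilog
// Assumed gate functions of the generic 3-input compound gates.
module aoi3 (input a, b, c, output y);
  assign y = ~((a & b) | c);      // AND-OR-Invert
endmodule

module oai3 (input a, b, c, output y);
  assign y = ~((a | b) & c);      // OR-AND-Invert
endmodule

// One possible 5-gate full adder: 2 XOR, 2 NAND, 1 OAI3.
module fulladder_oai (input a, b, ci, output sum, co);
  wire axb, n_ab, n_prop;
  xor  g1 (axb, a, b);
  xor  g2 (sum, axb, ci);
  nand g3 (n_ab, a, b);                          // ~(a & b)
  oai3 g4 (.a(a), .b(b), .c(ci), .y(n_prop));    // ~((a | b) & ci)
  nand g5 (co, n_ab, n_prop);                    // (a & b) | ((a | b) & ci)
endmodule

// Self-checking test bench over all 8 input patterns.
module fulladder_oai_tb;
  reg  a, b, ci;
  reg  [1:0] expected;
  wire sum, co, y_aoi;
  integer i, errors;

  fulladder_oai dut (.a(a), .b(b), .ci(ci), .sum(sum), .co(co));
  aoi3          chk (.a(a), .b(b), .c(ci), .y(y_aoi));

  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      {a, b, ci} = i[2:0];
      expected   = i[0] + i[1] + i[2];
      #1;
      if ({co, sum} !== expected) errors = errors + 1;
      if (y_aoi !== !((a & b) | ci)) errors = errors + 1;
    end
    if (errors == 0) $display("OAI3-based full adder passed.");
    else             $display("%0d mismatches found.", errors);
    $finish;
  end
endmodule
```

The compound gate replaces an OR, an AND, and an inverter at once, which is why such cells are attractive targets for a technology-independent library.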

Technology Mapping

The next step in the full adder design flow is mapping onto atomic gates of the target device, which are in our case four-input LUTs. The results of this step for our three models are shown in Fig. 3.19. They are identical in all three cases. Thus, the tool finds the optimal implementation for the given function even though the initial models are fairly different. However, this cannot be taken for granted for more complex designs. The intermediate results in Fig. 3.18 in fact suggest that for a different target technology, even in this simple case, we might see different non-functional characteristics.
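The idea of mapping onto four-input LUTs can be sketched with a generic LUT model whose 16-bit parameter holds the truth table. This is an illustration under stated assumptions, not the vendor primitive: the module lut4, its parameter INIT, and the pin assignment (a, b, ci on i0 to i2, i3 tied to 0) are hypothetical, and the actual pin assignment chosen by the tool in Fig. 3.19 may differ.

```verilog
// Generic 4-input LUT: INIT is the truth table, indexed by {i3, i2, i1, i0}.
// Hypothetical stand-in for a device LUT, for simulation only.
module lut4 #(parameter [15:0] INIT = 16'h0000)
             (input i0, i1, i2, i3, output o);
  wire [15:0] table_bits = INIT;
  assign o = table_bits[{i3, i2, i1, i0}];
endmodule

// Full adder in two LUTs. i3 is tied to 0, so only entries 0..7 are read;
// the upper byte simply repeats the lower one.
module fulladder_lut (input a, b, ci, output sum, co);
  lut4 #(.INIT(16'h9696)) lut_sum                 // a ^ b ^ ci
       (.i0(a), .i1(b), .i2(ci), .i3(1'b0), .o(sum));
  lut4 #(.INIT(16'hE8E8)) lut_co                  // majority(a, b, ci)
       (.i0(a), .i1(b), .i2(ci), .i3(1'b0), .o(co));
endmodule

// Self-checking test bench over all 8 input patterns.
module fulladder_lut_tb;
  reg  a, b, ci;
  reg  [1:0] expected;
  wire sum, co;
  integer i, errors;

  fulladder_lut dut (.a(a), .b(b), .ci(ci), .sum(sum), .co(co));

  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      {ci, b, a} = i[2:0];          // same bit order as the LUT index
      expected   = i[0] + i[1] + i[2];
      #1;
      if ({co, sum} !== expected) begin
        errors = errors + 1;
        $display("MISMATCH at index %0d: got %b, expected %b",
                 i, {co, sum}, expected);
      end
    end
    if (errors == 0) $display("LUT-based full adder passed.");
    $finish;
  end
endmodule
```

Any function of up to four inputs fits into one such table, which is why three quite different source models can collapse to the same two LUTs.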

Placement, Routing, and Timing Analysis

The P&R for the full adder is pretty much the same as for the half adder, because it uses the same number of logic cells, two LUTs. The only difference is that it has one more input. We again use nextpnr-ice40 for placement and routing and

Fig. 3.19 Full adder models after technology mapping. (a) Behavioral model. (b) Structural model. (c) Truth table model


icetime for timing analysis. Also, we look at two different pin constraints to study their impact. The “close” pin constraints are identical to those of the half adder, but the “opposite” pin constraints are changed such that a, b, and co are on one side and the other pins are on the opposing edge, as can be seen here for the HX1K device:

    sh> nextpnr-ice40 --hx1k --json fulladder.json --pcf fulladder-hx1k-opposite.pcf --asc fulladder.asc

    sh> cat fulladder-hx1k-opposite.pcf
    set_io a 9
    set_io b 10
    set_io ci 96
    set_io sum 97
    set_io co 11

For latency, the “opposite” constraints are indeed more severe because the result is very sensitive to the placement of the logic. If it is not exactly in the middle, both some of the inputs and some of the outputs have to traverse more than half the chip. In the half adder case, the pin constraints were less constraining because wherever the logic was placed, the sum of input signal length and output signal length was one chip traversal. Now for the full adder case, it is at least one chip traversal and can be in the worst case two chip traversals if the logic is placed close to the edge of the chip. Hence, sweeping over all the five devices with two different sets of pin constraints, we obtain the results of the timing analysis shown in Table 3.3. Taking a closer look at this table and comparing it with the half adder timing analysis in Table 3.2, we make the following observations.

• The latency for I/O circuitry and the logic are the same or similar in both tables. The small differences for the logic are due to different placements. This can make a difference because part of the delay in the “Logic” column is due to the path from the surrounding interconnect to the LUT. If the input comes from a

Table 3.3 Full adder delay for different devices and pin constraints


neighboring PLB, this path differs from the one taken when the signal comes from a faraway PLB. We will discuss the details of the interconnect architecture of the FPGA in a later chapter, which will make this difference apparent.

• The interconnect delays in both “Connect” columns are clearly higher for the full adder. This is true for both types of pin constraints. It is caused by the additional input pin. Taking another look at Fig. 3.13, we see that there are always two pins per pad. In the half adder case, there are two input pins which fit on one pad, but this is not the case for the full adder. Thus, one of the three input signals has to take a longer path than the other two, which then is reflected in the higher delays of the table.

• The increased severity of the “opposite” pin constraints does not result in higher delays. This can be seen by comparing the average delay of the “close” cases of half and full adder with the “opposite” cases. We see that the increase for the “close” cases is 28% and for the “opposite” cases it is 23%. Thus, the increased delay due to more input pins has a higher effect on the “close” cases, because the long-range interconnect delays are shorter. It seems that the P&R algorithm is smart enough, and has sufficient freedom, to find good solutions even for the tighter pin constraints for the full adders.
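The chip-traversal argument for the “opposite” pin constraints can be written down with a simple one-dimensional model. This is an illustrative sketch only: it assumes a chip of width $W$, the pins a, b, and co at position $0$, the pins ci and sum at position $W$, the logic at position $x$, and wire length $\ell$ as a proxy for interconnect delay.

```latex
\begin{aligned}
\ell(a \to sum) &= x + (W - x) = W, &
\ell(ci \to co) &= (W - x) + x = W,\\
\ell(a \to co)  &= 2x, &
\ell(ci \to sum) &= 2\,(W - x),\\
\ell_{\max}(x) &= \max\bigl(W,\; 2x,\; 2(W - x)\bigr), &
W \le \ell_{\max}(x) &\le 2W \quad \text{for } 0 \le x \le W .
\end{aligned}
```

In this model, the lower bound $W$ is reached only for $x = W/2$, and the upper bound $2W$ is reached when the logic sits at either edge, which matches the statement that the longest path spans between one and two chip traversals.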

Power Analysis

We have seen that the LPXX devices of the iCE40 family are clearly slower. What do we get for this disadvantage? We use Lattice's proprietary power analysis tools to find out.

Table 3.4 Power consumption for the full adder design with the following assumptions: core $V_{DD}$, 1.2 V; I/O $V_{DD}$, 2.5 V; process is typical; $T$, 25 °C; core clock frequency, 100 MHz; and output load capacity, 10 pF


Table 3.4 shows the results, and we can observe the following points:

• Power consumption is divided into different categories. Static power consumption is independent of any activity in the design, and it is also independent of the size of the design. Hence, the static power consumption that we see in Table 3.4 is a characteristic of the specific device, whatever we map onto it. It is due to the constant leakage current that dissipates power as soon as we turn on the power supply.

• Usually, I/O and core logic are quite different with respect to power consumption. I/O pads and their circuitry consume significant amounts of power because they have to drive large capacitive loads such as external pins and wires. They are often operated at a higher voltage than the core logic. This distinction is also visible in the table, and we observe that the I/O circuitry dominates the overall power consumption, accounting for between 49% (HX8K) and 88% (LP1K). It is interesting that the I/O power consumption is independent of the device type and size, which suggests that the same I/O circuitry is used for all devices in the family. This is a design choice for that particular family and cannot be observed in general. The I/O power is proportional to the number of pins in use. For designs with more I/O, we will see higher power consumption. In a later chapter, when we analyze an n-bit adder, we will be able to verify that the I/O power consumption is $p_{IO}(n) = c_0 + n c_1$ for $n$ used pins, with $c_0$ and $c_1$ being two constants. For the HX8K device, $c_0 = 0.28$ mW and $c_1 = 0.276$ mW, which means that every additional pin draws 0.276 mW.

• The difference in the dynamic core power consumption of the “close” and the “opposite” cases is very significant, between 50 and 120%, which means that the interconnect is a significant source of power consumption, as significant as it is for delay.

• The difference in power consumption between LP and HX devices comes only from the static power. Interestingly, we see no effect on the dynamic core power, i.e., interconnect and logic. While this is not entirely true for larger designs, the effect there is still relatively minor. The power-saving effect of using an LP1K device is still somewhat disappointing if we relate it to the increase in latency. The power reduction between HX1K and LP1K is about 10%, while the latency increase is 50%. For the larger 8K devices, this trade-off is more reasonable since the differences in power and latency are both approximately 50%.
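As a worked example of the linear I/O power model, the sketch below evaluates it with the HX8K constants given in the text. The assumption that the full adder uses $n = 5$ pins (a, b, ci, sum, co) is taken from the pin constraint file; the resulting number is a prediction of the model, not a value read from Table 3.4.

```latex
p_{IO}(n) = c_0 + n\,c_1, \qquad
p_{IO}(5) = 0.28\,\text{mW} + 5 \times 0.276\,\text{mW} = 1.66\,\text{mW}.
```

Under the same model, each additional pin adds $c_1 = 0.276$ mW, so the pin-dependent term outweighs the constant $c_0$ from the second pin onward.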

3.5 Summary

We have taken the first step in modeling and designing HW circuits. Our modeling entry point has been Verilog, and our target is a family of FPGA devices from Lattice Semiconductor. We have used the open-source tools Yosys, nextpnr, and icetime. These tools offer a low-threshold entry point to HW design, because they are free


of charge and can often be downloaded and installed within minutes. Even more importantly, they allow us to better understand the inner workings of the tools and to examine intermediate steps of synthesis. In particular, Yosys is script based, and all synthesis procedures are sequences of many simpler transformations. They can be executed one by one to study intermediate results. They can be run in a different sequence and with different options. Therefore, it is a nice didactic device that facilitates the understanding of the design flow. The designs used in this chapter are simple, which allowed us to introduce basic modeling concepts, the basics of an FPGA architecture, and the basic steps in a HW design flow. Right from the beginning, we have also considered non-functional properties because they are the motivation for designing HW in the first place. In particular, delay, resource usage, and power consumption are first-class citizens in design, equally important as functionality. This focus is at the core of the AIMS design flow, which is iterative in nature and uses analysis tools to understand and improve a design. In the following chapters, we will follow this iterative motion and examine one by one all the main components and aspects of HW design.

3.6 Exercises

Exercise 3.1 Write Verilog code for the CMOS inverter and also write a test bench for it.

Exercise 3.2 Write a structural Verilog model corresponding to the logic circuit given as follows:

[Circuit diagram not reproduced; signal labels: A, B, C, X, Y]

Exercise 3.3 Write a behavioral Verilog model for a 4-input, 8-bit multiplexer.

Exercise 3.4 Draw the schematics for the given Verilog model.

    module example_4 (a,b,c,d,F);
      input a,b,c,d;
      output F;

      wire X, Y, X_n, Y_n, F1, F2;
      and (X,a,b);
      or (Y,c,d);
      not (X_n, X);
      not (Y_n, Y);
      and (F1, X, Y);
      and (F2, X_n, Y_n);
      or (F, F1, F2);
    endmodule

Exercise 3.5 Draw the schematics for the given Verilog model.

    module not_gate (output a, input b);
      assign a = ~b;
    endmodule

    module and_gate (output c, input d,e);
      assign c = d & e;
    endmodule

    module and_gate_3 (output k, input l,m,n);
      assign k = (l & m) & n;
    endmodule

    module or_gate (output f, input h,g);
      assign f = h | g;
    endmodule

    module example_5 (F,X,Y,Z);
      input X,Y,Z;
      output F;
      wire w1,w2,w3,w4,w5,w6;
      or_gate q1(F, w1,w2);
      and_gate r1(w1, w3,Z);
      and_gate_3 t1(w2, w4,w5,Y);
      or_gate q2(w3, X,w6);
      not_gate z1(w4, X);
      not_gate z2(w5, Z);
      not_gate z3(w6, Y);
    endmodule

Exercise 3.6 Draw the schematics for the given Verilog model.

    module example_6 (a,b,c,F);
      input a,b,c;
      output reg F;

      always @(a,b,c)
      begin
        F = a;
        F = ~(b | F);
        F = ~F;
        F = ~(F | c);
      end
    endmodule

Exercise 3.7 Write a structural Verilog model for a 4-bit ripple carry adder. Then, write a Verilog test bench that tests the adder for all possible pairs of 4-bit addends. The test bench should stop and display the actual and expected outputs if there is any mismatch.


Exercise 3.8 Write a Verilog program for a customized multiplexer with four 8-bit input buses P, Q, R, and T and three select inputs S2–S0 that choose one of the buses to drive an 8-bit output bus Y according to the following table.

    S2  S1  S0  Output
    0   0   0   P
    0   0   1   P
    0   1   0   P
    0   1   1   Q
    1   0   0   P
    1   0   1   P
    1   1   0   R
    1   1   1   T

Chapter 4

Modeling Latches, Flip-Flops, and Registers

Besides computing new logic values, the temporary storing of a logic state is the second fundamental operation in every digital design. The capability to store information is not only fundamental to the functionality; its non-functional characteristics often shape the performance and cost of a design. The need for storage occurs at many levels. On chip, we store individual bits, words, and arrays of words in structures that we call buffer, register, register file, cache, and similar. Outside the computing chip, we demand much larger storage capacities in Dynamic RAMs (DRAMs), flash memories, and hard disks. On chip, the dominating metrics are often latency and power consumption. Off chip, the focus shifts more to bandwidth and capacity, but latency and power are still relevant if not critical. We observe a hierarchical architecture of the storage system, with low-latency/low-density technology being close to the computation and high-bandwidth/high-density technology further away. Our focus is on on-chip low-latency storage that is tightly integrated into the computing logic, but we note that efficient access to high-capacity off-chip memory, mostly in the form of DRAM and flash, is critical for almost any digital system.

4.1 Latch and Flip-Flop

Off-chip storage is always based on some physical effect. DRAM and flash memories use capacitor structures to store charge; other technologies use some phase shift of a physical property such as magnetic polarization, resistance, or mechanical change. But since physical phases often require a different manufacturing process or are slow, we see logic feedback loops as the main mechanism to store information on chip.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Jantsch, Taking AIMS at Digital Design, https://doi.org/10.1007/978-3-031-35605-6_4


Fig. 4.1 The RS-Latch diagram and symbol (inputs S and R; outputs Q and NQ)

4.1.1 RS-Latch

Figure 4.1 illustrates the principle of feedback-based storage with an RS-Latch based on two NOR gates. It has two inputs, but the output values depend not only on the current input but also on the current internal state of the device, which in turn depends on the input history. Thus, for predicting the output of the gate, we need to know the current input and the current state, which is summarized as follows.

    State        Input      Output
                 S    R     Q
    Set          1    0     1
    Reset        0    1     0
    Store        0    0     PrevQ
    Not allowed  1    1     0

Two of the input combinations are straightforward. For $S = 1$ and $R = 0$, the upper NOR gate is dominated by the S input, and both inputs to the lower NOR gate are 0. Thus, we have $Q = 1$ and $\overline{Q} = 0$. Similarly, for $S = 0$ and $R = 1$, the lower NOR gate is dominated by R, and we get $Q = 0$, $\overline{Q} = 1$. These are the set and reset functions of the latch. For the store state, $S = 0$ and $R = 0$, the inputs do not enforce particular values at the output, and the output is determined by the previous output values. Assuming $Q \neq \overline{Q}$, one of the outputs is 1, and the other is 0. That pattern will be reinforced by the two NOR gates; thus, the outputs do not change. However, this is only true if $Q \neq \overline{Q}$. If $Q = \overline{Q}$ and $S = 0$, $R = 0$, the outputs are not stable, but both will oscillate between 0 and 1 with a period of two gate delays. To be precise, the period is the sum of all the delays on the path $Q \rightarrow$ top NOR gate $\rightarrow \overline{Q} \rightarrow$ bottom NOR gate $\rightarrow Q$. Once set in motion, the oscillation continues until either S or R is set to 1, even if the delays of the two gates are different. Listing 4.1 shows a Verilog model of an RS-Latch, where one NOR gate has a delay of 1 ns and the other has a delay of 1.1 ns. Figure 4.2 shows a simulation that goes through all four states of the latch and illustrates the oscillating output. The RS-Latch can also be realized with two NAND gates, but then the set and reset inputs are active low.
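The set, reset, and store rows of the table can be checked with a small gate-level sketch. This is a separate illustration, not Listing 4.1: it assumes two cross-coupled NOR gates written as continuous assignments with one time unit of delay each, it deliberately avoids the not-allowed input S = R = 1 and the oscillating case, and the module names are hypothetical.

```verilog
`timescale 1ns/100ps

// Cross-coupled NOR latch as two continuous assignments (sketch only).
module rslatch_nor (input s, r, output q, nq);
  assign #1 nq = ~(s | q);    // upper NOR gate, dominated by S
  assign #1 q  = ~(r | nq);   // lower NOR gate, dominated by R
endmodule

module rslatch_nor_tb;
  reg  s, r;
  wire q, nq;
  integer errors;

  rslatch_nor dut (.s(s), .r(r), .q(q), .nq(nq));

  // Wait for the latch to settle, then compare with the expected state.
  task check (input exp_q);
    begin
      #5;
      if (q !== exp_q || nq !== !exp_q) begin
        errors = errors + 1;
        $display("MISMATCH s=%b r=%b: q=%b nq=%b, expected q=%b",
                 s, r, q, nq, exp_q);
      end
    end
  endtask

  initial begin
    errors = 0;
    s = 1; r = 0; check(1'b1);   // set
    s = 0; r = 0; check(1'b1);   // store keeps the 1
    s = 0; r = 1; check(1'b0);   // reset
    s = 0; r = 0; check(1'b0);   // store keeps the 0
    if (errors == 0) $display("Set, reset, and store behave as in the table.");
    $finish;
  end
endmodule
```

Starting with a set operation matters in simulation: before the first set or reset, both outputs are unknown (x), which mirrors the fact that the stored value depends on the input history.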


Listing 4.1 NOR-based RS-Latch

    `timescale 100ps/10ps

    module rslatch (r,s, q, nq);
      input r,s;
      output reg q, nq;


always @(r or s or q or nq) begin q