Coupled Data Communication Techniques for High-Performance and Low-Power Computing (Integrated Circuits and Systems) 1441965874, 9781441965875


English Pages 222 [214] Year 2010



Integrated Circuits and Systems

Series Editor Anantha Chandrakasan, Massachusetts Institute of Technology Cambridge, Massachusetts

For other titles published in this series, go to www.springer.com/series/7236

Ron Ho • Robert Drost Editors

Coupled Data Communication Techniques for High-Performance and Low-Power Computing

Editors Ron Ho Oracle Corporation Sun Labs VLSI Research Group 16 Network Circle UMPK 16-161 Menlo Park, CA 94025 USA [email protected]

Robert Drost Los Altos, CA 94024 USA

ISSN 1558-9412 ISBN 978-1-4419-6587-5 e-ISBN 978-1-4419-6588-2 DOI 10.1007/978-1-4419-6588-2 Springer New York Dordrecht Heidelberg London Library of Congress Control Number: 2010927932 © Springer Science+Business Media, LLC 2010 All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)

For Christina, Sawyer, and Finley – RH

For Sharon and Juliet – RJD

Foreword

Wafer-scale integration has long been the dream of system designers. Instead of chopping a wafer into a few hundred or a few thousand chips, one would just connect the circuits on the entire wafer. What an enormous capability wafer-scale integration would offer: all those millions of circuits connected by high-speed on-chip wires.

Unfortunately, the best known optical systems can provide suitably fine resolution only over an area much smaller than a whole wafer. There is no known way to pattern a whole wafer with transistors and wires small enough for modern circuits.

Statistical defects present a firmer barrier to wafer-scale integration. Flaws appear regularly in integrated circuits; the larger the circuit area, the more probable it is that a flaw appears. If such flaws were the result only of dust one might reduce their numbers, but flaws are also the inevitable result of small scale. Each feature on a modern integrated circuit is carved out by only a small number of photons in the lithographic process. Each transistor gets its electrical properties from only a small number of impurity atoms in its tiny area. Inevitably, the quantized nature of light and the atomic nature of matter produce statistical variations in both the number of photons defining each tiny shape and the number of atoms providing the electrical behavior of tiny transistors. No known way exists to eliminate such statistical variation, nor may any be possible.

Proximity communication, or coupled data communication in general, may make possible the long-sought dream of wafer-scale integration. Proximity communication permits assembly of wafer-scale systems from small parts. We can make circuit chips small enough for low defect rates, cast aside bad chips, and reassemble the good chips into wafer-scale systems.

Two properties of proximity communication suit it to wafer-scale use. First, quality: the connections between chips are nearly as good as wires on a single chip. As this book describes, proximity connections are fast, occupy small area, and consume little energy. Second, and I think much more important, is replacement: proximity communication permits one to replace chips in a big system. Together, quality and replacement make wafer-scale integration possible. Because I think replacement is so important, I’m going to devote a few more lines to it.


What makes replacement possible? Proximity communication needs neither welds nor solder. The parts are joined electrically only by the electric fields between them. These fields pass right through the top layers of glass that protect the chips. Within error limits, the communication is also insensitive to chip separation and chip alignment. If one chip in a wafer-scale assembly of hundreds of chips proves unsuitable because of a hidden defect, or through aging, or simply for product upgrade, no physical bonds prevent its replacement.

I believe that replacement will prove most useful for test. A complete system could serve as a jig that would test fresh chips in their real environment. Each fresh chip would spend only long enough in the complete system for a thorough test. A test jig smaller than a complete system might also serve to test only a single type of chip, providing it an environment indistinguishable from a full system. Such a test jig would have full speed access to every connection to or from the fresh chip. I see a huge potential for replacement to simplify and improve test.

I also see that replacement may permit a profound change in the business alliances that produce products. Without the ability to replace, one bad chip destroys an entire multi-chip module, making specialization in module assembly a poor business. Because one bad chip spoils the entire module, a contractor who assembles multi-chip modules must take responsibility not only for defects in his own process, but also for defects in separate chips. This dual responsibility is a very high barrier to contract assembly. Board-level assembly houses are common because they avoid this dual barrier in two ways. First, not only is board-level assembly an old art with a well known low defect density but also it uses packaged and well tested parts. Second and more important, at the board level some, albeit limited, replacement is possible. It is possible to remove and reuse at least the high-value chips on a board-level assembly, greatly reducing the high cost of bad parts. I believe that because proximity communication permits replacement it will also foster wafer-scale assembly houses.

Bob Johnson, formerly technical head of Burroughs, talked about using conductive grease to connect the ordinary pads on chips placed face-to-face. A large area of thin grease between facing pads would provide a connection. The thinner and much longer layer of grease reaching to other pads would produce small but manageable cross talk. I merely replaced Johnson’s grease with electric fields. Robert Drost’s fiendishly clever diagonal arrangement of pads greatly reduces cross talk.

Bob Bosnyak designed and measured some early proximity communication test chips. I recall one flawed ring oscillator test chip built for us by the MOSIS foundry service. The flaw turned out to be total omission of the metal plates on adjacent levels of metal that were to form the bulk of Bosnyak’s test structure. Nevertheless, the test chip worked, albeit at a mystifyingly small fraction of its intended speed. The mystery vanished when we discovered the omitted plates. MOSIS rebuilt the test chip for free.

The late Bob Proebsting, a pioneer and life-long designer of fine memory parts, contributed to us much knowledge about sense amplifiers. For a period, the authors of this book were, in effect, Proebsting’s post-doc students. As usual in such relationships, both the brilliant teacher and the apt students took much delight from the process. It was my joy to assemble such a mass of brainpower and to watch both its progress and the continuing delight of its participants.

Portland, Oregon, September 2009

Ivan Sutherland

Contents

Part I Introduction

1 Introduction to Coupled Data Technologies
  Ron Ho, Robert Drost
  1.1 Life has been good
  1.2 Faster computers tomorrow
    1.2.1 The end of Moore’s Law
    1.2.2 The arguments against–and for–multiple chips
  1.3 Coupled data communication
    1.3.1 This book
  References

Part II Overview of 3D Technologies

2 Power delivery, signaling and cooling for 2D and 3D integrated systems
  Muhannad Bakir, Gang Huang and Bing Dang
  2.1 Introduction
  2.2 Evolution of conventional silicon ancillary technologies: A brief overview
  2.3 Novel silicon ancillary technologies
    2.3.1 Optical I/Os
    2.3.2 Fluidic I/Os for single and 3D chips
  2.4 Power delivery for 2D and 3D systems
    2.4.1 Power delivery and design implications of 2D systems
    2.4.2 Power delivery and design implications of 3D systems
  2.5 Conclusion
  References

Part III Coupled Data Technologies

3 Capacitive Coupled Communication
  David Hopkins, Alex Chow, Frankie Liu, Dinesh D. Patil, Hans Eberle
  3.1 Introduction
  3.2 An electrical model of capacitive interchip communication
    3.2.1 Crosstalk mitigation
    3.2.2 Simulation results
  3.3 Transmitting data
  3.4 Receiving data
    3.4.1 Attenuation
    3.4.2 Loss of DC information
    3.4.3 Comparators
    3.4.4 Receiver sizing
    3.4.5 Timing schemes
  3.5 Two-dimensional arrays
  3.6 Measurement results
    3.6.1 Voltage waterfall
    3.6.2 Timing waterfall
    3.6.3 Combined eye diagram
    3.6.4 BER versus chip separation
  3.7 Prototype application: a high-radix switch
  References

4 Inductive Coupled Communications
  Noriyuki Miura, Takayasu Sakurai, and Tadahiro Kuroda
  4.1 Introduction
  4.2 Inductive-coupling channel
    4.2.1 Overview of channel characteristics
    4.2.2 Range extendability
    4.2.3 Coupling strength through Si substrate
    4.2.4 Crosstalk
  4.3 Inductive-coupling transceiver
    4.3.1 Signaling
    4.3.2 Coil design
    4.3.3 Transceiver circuit design
    4.3.4 Inter-chip communications
  4.4 Power reduction techniques
    4.4.1 Pulse shaping
    4.4.2 Daisy chain transmitter
  4.5 High-speed techniques
    4.5.1 Asynchronous transceiver
    4.5.2 Burst transmission
  4.6 Crosstalk reduction techniques
    4.6.1 Time interleaving
    4.6.2 Differential coil
  4.7 Application I: memory stacking
    4.7.1 Homogenous chip stacking
    4.7.2 Inductive-coupling up/down repeater
    4.7.3 Test chip measurement
  4.8 Application II: processor and memory stacking
    4.8.1 Heterogenous chip stacking
    4.8.2 Interface design
    4.8.3 Test chip measurement
  4.9 Conclusion
  References

5 Use of AC Coupled Interconnect in Contactless Packaging
  Paul Franzon
  5.1 Introduction: Why use ACCI?
    5.1.1 Chapter outline
  5.2 Historical Perspectives
  5.3 Capacitively Coupled Chip I/O
    5.3.1 Capacitively Coupled Channel Design
    5.3.2 ACCI Circuits
    5.3.3 ACCI Packaging
  5.4 Mid-channel Capacitively Coupled Structures
  5.5 Inductively Coupled Connectors and Sockets
  5.6 Conclusions and Future Perspectives
  References

Part IV Enabling Coupled Data Technologies

6 Aligning chips face-to-face for dense capacitive communication
  John E. Cunningham, Ashok V. Krishnamoorthy, Ivan Shubin, James G. Mitchell, Xuezhe Zheng
  6.1 Introduction
  6.2 Aligning chips face-to-face
    6.2.1 Power and ground connections between coupled chips
  6.3 A low-cost package for capacitive proximity communication
  6.4 Array packages using bridge chips
  References

Part V Extending Data Coupling Technologies

7 Delivering On-chip Bandwidth Off-chip and Out-of-box with Proximity and Optical Communication
  Ashok V. Krishnamoorthy, Jon Lexau, Xuezhe Zheng, John E. Cunningham
  7.1 Introduction
  7.2 Photonics as a long-reach interconnect
  7.3 Photonics on VLSI (optoelectronic VLSI)
  7.4 Proximity and photonic communication
  7.5 Test chip results
  7.6 Conclusion
  References

8 AC Coupled Wireless Power Delivery
  Makoto Takamiya, Kohei Onizuka, and Takayasu Sakurai
  8.1 Three dimensional stacked inter-chip wireless power delivery
  8.2 Prototype of wireless power transmission circuits
  8.3 Theoretical analysis and circuit improvements
  8.4 Summary
  References

Index

List of Contributors

Dr. Ron Ho
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Robert Drost
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Muhannad Bakir
Microelectronics Research Center, Georgia Institute of Technology, 791 Atlantic Dr. NW, Atlanta, GA 30332-0269, USA, e-mail: [email protected]

Alex Chow
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. John E. Cunningham
Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

Dr. Bing Dang
IBM T. J. Watson Research Center, 1101 Kitchawan Rd, RM 6-242, Yorktown Heights, NY 10598, USA, e-mail: [email protected]

Dr. Hans Eberle
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Prof. Paul Franzon
Department of Electrical and Computer Engineering, North Carolina State University, Box 7914, Raleigh, NC, 27695, USA, e-mail: [email protected]

David Hopkins
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Gang Huang
Intel Corporation, Ultra Mobility Group, 1501 S. MO-Pac Expy, Austin, TX 78746 USA, e-mail: [email protected]

Dr. Ashok V. Krishnamoorthy
Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

Professor Tadahiro Kuroda
Department of Electrical Engineering, Keio University, 3-14-1, Hiyoshi, Kohoku-ku, Yokohama 223-8522, JAPAN, e-mail: [email protected]

Jon K. Lexau
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Frankie Liu
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Professor Noriyuki Miura
Department of Electrical Engineering, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522 JAPAN, e-mail: [email protected]

Dr. James G. Mitchell
Sun Microsystems Chief Technology Organization, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Dr. Kohei Onizuka
Formerly with the Institute of Industrial Science, University of Tokyo, and now with Toshiba Corporation.

Dr. Dinesh D. Patil
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: [email protected]

Professor Takayasu Sakurai
Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, JAPAN, e-mail: [email protected]

Dr. Ivan Shubin
Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

Professor Makoto Takamiya
VLSI Design and Education Center, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, JAPAN, e-mail: [email protected]

Dr. Xuezhe Zheng
Sun Microsystems Chief Technology Organization, 9515 Towne Centre Drive, San Diego, CA 92121, USA, e-mail: [email protected]

Part I

Introduction

Chapter 1

Introduction to Coupled Data Technologies

Ron Ho, Robert Drost

Dr. Ron Ho and Dr. Robert Drost, Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: {ron.ho},{robert.drost}@sun.com

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_1, © Springer Science+Business Media, LLC 2010

1.1 Life has been good

The past quarter-century has seen explosive growth in the performance of computer systems. One of the first widely popular personal computers was a mid-1980s IBM PC, running on a 4.77 MHz Intel 8088 processor, stuffed with 256 KB of system memory (plus another 384 KB on an expansion card), displaying 640x200 black-and-white graphics, and storing data on 360 KB 5.25-inch floppy disks. In 2009, a typical workstation configuration sold by Sun Microsystems, the Ultra 24 Workstation, used a 3 GHz Intel Core 2 Quad processor with 8 GB of memory, displayed 2560x1600 graphics on a 30-inch LCD monitor using an Nvidia Quadro NVS 290 accelerator card, with up to 1.8 TB of Serial-Attached SCSI drives spinning at 15 krpm. Both systems cost around $4000 in contemporary dollars.

The enormous advancement in price-performance between these computer systems came from improvements in many different technologies, including storage media, displays, software systems, and so on. But certainly a large part of it was because VLSI semiconductors, and high-end microprocessors and memories in particular, have gotten faster. Figure 1.1 shows the historical performance of microprocessors, normalized to the SPECint2000 benchmark [1], and how it has seen a remarkable 35% compound annual growth rate over the past twenty-five years – a growth curve seen by virtually no other industry. The natural question prompted by this chart is, “can this growth curve continue?” Or, for the readers of this book, “what must designers do to enable it to continue?”

Fig. 1.1 Processor performance scaling over the past twenty-five years [3].

This growth in performance is popularly, though somewhat incorrectly, fully attributed to “Moore’s Law.” This is what Carver Mead at Caltech called Gordon Moore’s now-famous 1965 extrapolation of transistor density scaling [2]. Moore had argued that when optimized for lowest total cost, integrated circuit chips would, over time, contain an ever-growing number of transistors. Too few transistors per chip, and the fixed overhead of manufacturing and packaging the chips would dominate their cost; too many transistors, and random defects would excessively reduce the yield of good chips and hence increase their cost. But the right number of transistors–the number that minimized cost–would continue to grow, as wafer sizes increase and transistor dimensions decrease.

In reality, transistor density scaling has only partly fueled the growth in computer systems performance. Equally important have been rapid advances in raw transistor speeds and in aggressive design techniques, as we discuss next.
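Moore's cost argument can be made concrete with a small numerical sketch. This is only an illustration, not Moore's own model: it assumes a simple Poisson yield model, and every parameter (fixed_cost, cost_per_cm2, defect_density, and the two transistor areas) is a hypothetical value chosen so that the trade-off is visible.

```python
import math

def cost_per_transistor(n, area_per_transistor_cm2,
                        fixed_cost=1.0, cost_per_cm2=5.0, defect_density=0.5):
    """Cost per working transistor for a chip holding n transistors.

    All parameters are hypothetical. Yield follows a Poisson model,
    exp(-defect_density * area), which is an assumption of this sketch.
    """
    area = n * area_per_transistor_cm2
    chip_yield = math.exp(-defect_density * area)   # fraction of good chips
    chip_cost = fixed_cost + cost_per_cm2 * area    # packaging overhead + silicon
    return chip_cost / (n * chip_yield)

def optimal_transistor_count(area_per_transistor_cm2,
                             step=10_000, max_n=5_000_000):
    """Grid search for the transistor count that minimizes cost per transistor."""
    candidates = range(step, max_n + 1, step)
    return min(candidates,
               key=lambda n: cost_per_transistor(n, area_per_transistor_cm2))

# Two hypothetical process generations: the newer one halves transistor area.
older = optimal_transistor_count(1e-6)
newer = optimal_transistor_count(0.5e-6)
print("cost-optimal transistors per chip:", older, "->", newer)
```

With too few transistors the fixed overhead dominates, and with too many the falling yield dominates, so the minimum lies in between. In this sketch, halving the area per transistor roughly doubles the cost-optimal transistor count, which is the qualitative trend the text describes.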

1.2 Faster computers tomorrow

For a new computer system to out-perform an old computer system on the same software program, it must demonstrate improvements in the product of three terms: seconds per logic gate, logic gates per clock cycle, and clock cycles per instruction [4]. The product of these three gives the program execution time per instruction, in seconds per instruction.

The number of seconds per logic gate (approximately 10^-11 seconds, or 10 ps, in a modern 40 nm process technology) has been scaling down roughly linearly with technology for many years: a technology with half the drawn transistor dimensions of another could be expected to be twice as fast. Each new generation scales dimensions to about 70% of the previous generation's, so this doubling of speed arrives every two generations, or every five to six years.
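The arithmetic behind this three-term product can be sketched in a few lines. The 10 ps gate delay comes from the text; the 25 gates per cycle and one cycle per instruction are hypothetical values chosen for illustration, and the sketch reads the generation-to-generation shrink as a linear scale factor of 0.7.

```python
def seconds_per_instruction(seconds_per_gate, gates_per_cycle, cycles_per_instruction):
    """Execution time per instruction as the product of the three terms."""
    return seconds_per_gate * gates_per_cycle * cycles_per_instruction

def clock_frequency_hz(seconds_per_gate, gates_per_cycle):
    """Clock frequency is the reciprocal of the first two terms' product."""
    return 1.0 / (seconds_per_gate * gates_per_cycle)

def annual_growth_rate(factor, years):
    """Compound annual rate equivalent to improving by `factor` over `years`."""
    return factor ** (1.0 / years) - 1.0

SECONDS_PER_GATE = 10e-12   # 10 ps, the figure quoted for a 40 nm process
GATES_PER_CYCLE = 25        # hypothetical design point
CYCLES_PER_INSTR = 1.0      # hypothetical

print("clock frequency (GHz):",
      clock_frequency_hz(SECONDS_PER_GATE, GATES_PER_CYCLE) / 1e9)
print("seconds per instruction:",
      seconds_per_instruction(SECONDS_PER_GATE, GATES_PER_CYCLE, CYCLES_PER_INSTR))

# Two generations at a 0.7 linear scale factor roughly halve dimensions.
print("dimension ratio after two generations:", 0.7 ** 2)

# Doubling speed every five to six years, expressed as an annual rate.
for years in (5, 6):
    print(years, "years to double ->",
          round(100 * annual_growth_rate(2.0, years), 1), "% per year")
```

Gate-speed scaling alone therefore corresponds to an annual improvement of roughly 12 to 15 percent under these assumptions.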


While this improvement trend has held steady for several technology generations, designers expect it to slow down soon. This is because maintaining transistor performance directly conflicts with reducing transistor power, and power has become a primary design constraint in today’s systems. As a result, transistor designers will likely choose to reduce what they have long jokingly called their “technology entitlement,” and live with devices that are only slightly faster each process generation. But even if the scaling of logic gate delay does not slow as many designers expect, it provides at best a 2x improvement every five to six years, or approximately a 13% annual growth rate. More must be done to match the 35% historical growth rate in computer performance.

The number of logic gates per clock cycle, when combined with the seconds per logic gate, gives the clock period and hence the clock frequency, which is 2–5 GHz in modern processors. The number of logic gates per clock cycle directly measures the aggressiveness of the processor design: a CPU with thirty gates per clock cycle is much less aggressive than one with only ten gates per clock cycle; its designers have much more time per cycle to perform computation or communication. What is the limit to this design aggressiveness?

Fig. 1.2 Scaling of logic gates per clock as a function of technology generation.

Over the past twenty years, the number of logic gates per cycle has fallen as the aggressiveness of designs has increased. Pre-Pentium processors used around 100 gates per cycle. Today, the industry has settled in the range of 15–30 gates per cycle. Collectively we now understand that achieving the lower end of that range is possible but disproportionately expensive: building so-called “short-tick” machines requires much more effort and care in clock distribution, parasitic extraction, timing verification, and min-path methodology. For example, a modern processor has a clocking overhead of nearly two gates per cycle, so a ten gate-per-clock design would have only eight gate delays in which to do work, barely enough time to complete a 64-b integer addition. While doable, such designs consume not only extra design resources and non-recurring engineering (NRE) costs but also significantly more power. Therefore, the number of logic gates per cycle will most likely not fall any further. Combined with the argument above for seconds per logic gate, this predicts slow changes in clock frequency, on the order of 13% per year and likely even lower.

Therefore, the way to continue to improve processor performance must come from reducing the number of cycles per instruction. This arises through increased parallelism: pipelined or superscalar execution, vector processing, and speculation all aim to increase the number of operations concurrently executed.¹ At a larger scale, processors with multiple cores and a shared memory can be used to divide a complex problem into separate threads. Historically, such techniques have provided the balance of the performance gains shown in Figure 1.1, with designers increasingly leveraging and targeting parallelism.

However, increased parallelism–and reduced cycles per instruction–has a cost: processors by necessity also grow increasingly complex. Larger instruction windows to winnow out code independencies require larger queues and communication structures. Multiple execution pipes require more area for more adders, multipliers, and registers, as well as the switches to access these added functional blocks. Processors packed with multiple cores need to fit not only those cores on the die, but also correspondingly more cache to keep them all fed.

This last point bears repeating: suppose we increase the number of cores on a chip. If we keep the memory-to-core ratio constant, then each core still has the same amount of cache available to it, and therefore has a consistent cache miss rate. However, because of the growing number of cores, the total aggregated miss rate for the chip will go up and put pressure on the fixed off-chip I/O bandwidth; as a result, when increasing the number of cores on a chip we must in fact disproportionately increase the cache size as well, to lower miss rates and to continue to fit inside the available total chip I/O.

Thus far the transistor density scaling provided by Moore’s Law has kept up with the need for ever more complex architectures and systems, and allowed us to continue to find and to exploit parallelism on a chip. In other words, the improvements in clock cycles per instruction provided by Moore’s Law scaling have combined with the historical improvements in seconds per logic gate and logic gates per clock to give the trends in Figure 1.1.

¹ In this discussion we gloss over important distinctions between instruction-level parallelism and task-level parallelism. While they are remarkably different at an architectural level, at a physical level both require similar increases in integration and hence increased transistor counts in a package.
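The multi-core cache argument above can be illustrated with a toy model. The sketch assumes a power-law relation between cache size and miss rate with an exponent of 0.5, a common rule of thumb rather than anything stated in the text, and all constants (base cache size, base miss rate, access rate, line size) are hypothetical.

```python
BASE_CACHE_BYTES = 1 << 20   # 1 MiB of cache per core (hypothetical)
BASE_MISS_RATE = 0.02        # misses per access at the base cache size (hypothetical)
ACCESSES_PER_SEC = 1e9       # memory accesses per second per core (hypothetical)
LINE_BYTES = 64              # bytes fetched off-chip per miss (hypothetical)
MISS_EXPONENT = 0.5          # power-law rule of thumb (assumption of this sketch)

def miss_rate(cache_bytes_per_core):
    """Miss rate falls as a power law of per-core cache size (assumed model)."""
    return BASE_MISS_RATE * (cache_bytes_per_core / BASE_CACHE_BYTES) ** (-MISS_EXPONENT)

def offchip_traffic(cores, cache_bytes_per_core):
    """Aggregate off-chip traffic in bytes per second for the whole chip."""
    return cores * ACCESSES_PER_SEC * miss_rate(cache_bytes_per_core) * LINE_BYTES

def cache_per_core_needed(cores, io_bandwidth):
    """Per-core cache at which aggregate miss traffic equals io_bandwidth."""
    overload = offchip_traffic(cores, BASE_CACHE_BYTES) / io_bandwidth
    return BASE_CACHE_BYTES * overload ** (1.0 / MISS_EXPONENT)

# Fix the off-chip bandwidth at what four cores with the base cache consume.
IO_BANDWIDTH = offchip_traffic(4, BASE_CACHE_BYTES)

for cores in (4, 8, 16):
    per_core = cache_per_core_needed(cores, IO_BANDWIDTH)
    print(cores, "cores:",
          per_core / BASE_CACHE_BYTES, "MiB per core,",
          cores * per_core / BASE_CACHE_BYTES, "MiB total")
```

Under these assumptions, doubling the core count at fixed I/O bandwidth requires four times the cache per core, so total cache grows much faster than the number of cores. The exact factor depends entirely on the assumed miss-rate model.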


1.2.1 The end of Moore’s Law

“Is Moore’s Law ending?” is a perennially asked question in industry journals and conferences. For several very good reasons involving seemingly fundamental physics, feature size scaling has “always” been on its last legs. Yet the industry has stubbornly insisted on solving these problems and continually shrinking transistors and wires.

Today, foundries pattern structures with dimensions of a few tens of nanometers using light with a wavelength of 193 nm. By rights, this ought to be impossible. Yet it is done, by using optical proximity correction, phase-shifting masks, off-axis illumination, spacer masks, and some extremely expensive diffractive lenses. Atomic thinness limits in oxide gate insulators are overcome by employing metallic gates and high-permittivity liners, which happily also help reduce gate leakage currents. And a combination of mostly-air dielectrics that reduce wire parasitics, and thick deposited metals that reduce wire resistance, has helped to keep wires from overly constraining chip performance.

Will these improvement trends continue in the next ten to twenty years? While the answer “no” has been proven wrong time and time again, recent economic limitations have now supplanted technology as the likely true limit for Moore’s Law. Especially given the financial realities of the current global economic crisis, the semiconductor industry can no longer continue to enjoy a fully elastic market that supports ever-increasing global financial investment. Worse yet, new fabrication plants will each cost over 1% of the total semiconductor market, thus limiting the number of new technologies able to come on-line each year.

Gordon Moore himself pointed out that his “law” will eventually end, although he was hopeful that new technologies would delay that date–and from his talk in 2003 to the present, they certainly have. However, any industry that constantly relies on exponential growth to continue will eventually be disappointed.

Thus, Moore’s Law of transistor scaling has historically combined with logic gate scaling and clock rate scaling to enable faster and faster computers. Looking forward, Moore’s Law is the only scaling trend left, as gate scaling and clock rate scaling are both slowing down for design and integration reasons, and even Moore’s Law will not survive through the next few technology generations. What is a designer of high-end computer systems to do?

1.2.2 The arguments against–and for–multiple chips

Designers can achieve more complex systems either by exploiting Moore’s Law scaling for a single chip or by aggregating the functionality across multiple chips. An example of the former is a recent Xeon microprocessor from Intel that occupies nearly 7 cm² in area and contains as many as eight full processors and a proportionally large cache [5]. An example of the latter would be an IBM Power processor with five chips integrated on a multi-chip module (MCM).


Often, system designers resort to the multiple-chip approach because a single-chip solution is infeasible: the design would not fit within a silicon lithographic reticle, or it would be so big that the inevitable sprinkling of random defects during wafer processing would unacceptably reduce the final yield of working chips. But if the design fits, integrating within a single chip allows the functional blocks to communicate using on-chip wires, which are dense and plentiful, and can be run efficiently and with high data fidelity. On-chip wires can also be highly optimized for power and/or performance improvements using various circuit techniques [6, 7, 8, 10], so that the costs to send data between blocks on a single chip are small.

By contrast, using multiple chips entails a number of tradeoffs in power and performance that make it less appealing. Principally, chips communicate with other chips through solder connections and traces on printed circuit boards or packages. Solder connections are large and thus expensive: a single high-speed channel requires two chip solder pads, each around 100 µm on a side (and more if shields are included). While a large chip can have several thousand of these solder pads, most are required for delivering power supply current to the chip, and typically only a few hundred are left for data communication. To squeeze as much bandwidth out of these few pads as possible, designers run them at data rates much higher than the chip’s clock speed, thus serializing data into the pad and deserializing the data at the other chip. These serializer-deserializer (or “serdes”) circuits consume significant energy per transmitted bit: not only do they have to queue and dequeue the data bits, but they also have to generate and precisely align overclocked timing signals from the data stream.
Moreover, running high data rates on resistive and lossy channels requires several circuit enhancements, such as channel equalization, which consume additional power and add complexity. A natural question therefore arises: can we design systems that employ communication structures as cheap and efficient as on-chip wires, but that do not suffer from single-chip area limits? Can we build some semblance of a “virtual” chip, out of multiple chips appropriately stitched together, with the equivalent of on-chip wires? This book discusses several ways in which the answer is “yes.”
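To put rough numbers on the pad-limited picture described in this section, the short Python sketch below estimates how densely solder-pad channels can be packed and what aggregate bandwidth a given number of signal pads provides. The 100 µm pad size and the two pads per channel come from the text; the figure of 400 signal pads and the 10 Gb/s per-channel rate are illustrative assumptions only, and the function names are hypothetical.

```python
# Rough, pad-limited bandwidth estimate for a solder-bumped chip.
# From the text: ~100 um pads, two pads per high-speed channel.
# Assumed for illustration: 400 signal pads, 10 Gb/s per channel.

PAD_SIDE_UM = 100.0     # pad edge length quoted in the text
PADS_PER_CHANNEL = 2    # pads per high-speed channel quoted in the text


def channels_per_mm2(pad_side_um: float = PAD_SIDE_UM,
                     pads_per_channel: int = PADS_PER_CHANNEL) -> float:
    """Upper bound on channel density if pads were packed edge to edge."""
    pad_area_mm2 = (pad_side_um / 1000.0) ** 2
    return 1.0 / (pads_per_channel * pad_area_mm2)


def aggregate_bandwidth_gbps(signal_pads: int,
                             rate_gbps_per_channel: float,
                             pads_per_channel: int = PADS_PER_CHANNEL) -> float:
    """Total bandwidth when every complete group of pads forms one channel."""
    channels = signal_pads // pads_per_channel
    return channels * rate_gbps_per_channel


if __name__ == "__main__":
    print(f"channel density bound: {channels_per_mm2():.0f} per mm^2")
    # Assumed values: 400 signal pads at 10 Gb/s per channel.
    print(f"aggregate bandwidth: {aggregate_bandwidth_gbps(400, 10.0):.0f} Gb/s")
```

Even with edge-to-edge packing, the density bound is only about 50 channels per square millimeter, which is the gap that much smaller coupling structures aim to close.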

1.3 Coupled data communication

The idea behind coupled-data communication is that chips can communicate with each other, directly or through an intermediate layer, when placed in close proximity. Structures on one chip can interact with matching structures on the other, through electric field or magnetic field interactions, or through optical coupling. Because these structures can be small, and much smaller than traditional solder balls, they offer enormous improvements in bandwidth density, and thus the possibility of running many such channels in parallel rather than serializing them with overclocked timing. This, along with the small size and hence small capacitance of the data coupling structures, allows the circuits to be low-energy as well. Finally, the relative simplicity of coupled data circuits allows them to have relatively low latency. In many ways, then, coupled data circuits allow chip-to-chip communication with metrics similar to those of on-chip wires.

Imagine, then, a large VLSI design, integrating many blocks and units in order to achieve high system performance, but too large to fit economically on a single die. It may also ideally combine disparate technologies (CMOS, DRAM, Flash, SiGe) that cannot traditionally be manufactured in the same fabrication process. Designers might create such a system out of individual chips, each tailored to its own optimized technology, and connect them together using coupled data communication. In this way, the internal chip-to-chip communication would be small enough, fast enough, consume low enough energy, and have low enough latency that it would be akin to using on-chip wires to connect together different parts of a much larger chip. The design would be a single “virtual” chip composed of many different chips. Whether these chips are stacked vertically or spread horizontally depends on the system, the packaging, and the coupled data circuits.

This style of design, exploiting coupled data communication, offers a way past the inevitable slow-down of Moore’s Law to continue to scale overall system performance. In many-chip systems enabled by coupled data communication, designers can create systems of much greater complexity than standard silicon scaling offers; in a very real sense they can skip generations of Moore’s Law scaling.

1.3.1 This book

In this book we discuss several ways of designing these coupled data circuits in current state-of-the-art implementations. We begin with an overview of packaging technologies for many-chip integrated systems, by Bakir, Huang, and Dang, surveying their work in this field at Georgia Tech over the past several years. Capacitive coupled data communication circuits as envisioned by Sun Microsystems are next introduced by Hopkins, Chow, Liu, Patil, and Eberle, followed by a complementary chapter on inductive coupled data communication circuits from Keio University and the University of Tokyo, by Miura, Sakurai, and Kuroda. Some earlier foundational work on capacitive coupling through board traces done at North Carolina State University is then reviewed by Franzon.

Coupled data communications can require careful chip-to-chip alignment. The next chapter, by Cunningham, Krishnamoorthy, Shubin, Mitchell, and Zheng, discusses packaging technologies to overcome these requirements. Merging electrical with optical communications is the subject of the following chapter, by Krishnamoorthy, Lexau, Zheng, and Cunningham. Finally, work at the University of Tokyo on delivering power through coupled connections is introduced by Takamiya, Onizuka, and Sakurai.


References

1. http://www.spec.org
2. G. Moore, “Cramming more components onto integrated circuits,” Electronics Magazine, vol. 38, no. 8, 1965.
3. M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, K. Bernstein, “Scaling, power, and the future of CMOS,” IEEE International Electron Devices Meeting, 2005, pp. 7–15.
4. J. Hennessy, D. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, Morgan Kaufmann, 2002.
5. S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang, R. Varada, M. Ratta, S. Kottapalli, “A 45nm 8-Core Enterprise Xeon(R) Processor,” IEEE International Solid State Circuits Conference, 2009.
6. R. Ho, T. Ono, F. Liu, R. Hopkins, A. Chow, J. Schauer, R. Drost, “High-speed and low-energy capacitively driven wires,” IEEE Journal of Solid State Circuits, vol. 43, no. 1, January 2008.
7. E. Mensink, D. Schinkel, E. Klumperink, E. van Tuijl, B. Nauta, “A 0.28pJ/b 2Gb/s/ch transceiver in 90nm CMOS for 10mm on-chip interconnects,” IEEE International Solid State Circuits Conference, 2007.
8. B. Kim, V. Stojanovic, “A 4Gb/s/ch 356fJ/b 10mm equalized on-chip interconnect with nonlinear charge-injecting transmit filter and transimpedance receiver in 90nm CMOS,” IEEE International Solid State Circuits Conference, 2009.
9. G. Moore, “No exponential is forever: but ‘Forever’ can be delayed!” IEEE International Solid State Circuits Conference, 2003.
10. J. Seo, R. Ho, J. Lexau, M. Dayringer, D. Sylvester, D. Blaauw, “High bandwidth and low energy on-chip signaling using adaptive pre-emphasis in 90nm CMOS,” IEEE International Solid State Circuits Conference, 2010.

Part II

Overview of 3D Technologies

Chapter 2

Power delivery, signaling and cooling for 2D and 3D integrated systems

Muhannad Bakir, Gang Huang and Bing Dang

Dr. Muhannad Bakir, Georgia Institute of Technology, 791 Atlantic Dr. NW, Atlanta, GA 30332-0269, USA
Gang Huang, Intel Corporation, Ultra Mobility Group, 1501 S. MO-Pac Expy, Austin, TX 78746, USA
Bing Dang, IBM T. J. Watson Research Center, 1101 Kitchawan Rd, RM 6-242, Yorktown Heights, NY 10598, USA

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_2, © Springer Science+Business Media, LLC 2010

2.1 Introduction

As gigascale integrated (GSI) technology progresses beyond the 45 nm generation, the performance of a monolithic system-on-a-chip (SoC) has failed by progressively greater margins to reach the “intrinsic limits” of each particular generation of technology [1]. The root cause of this lag is the fact that the capabilities of monolithic silicon technology per se have vastly surpassed those of the ancillary or supporting technologies that are essential to the full exploitation of a high-performance SoC.

The most serious obstacle that blocks fulfillment of the ultimate performance of an SoC is inferior heat removal. The increase in clock frequency of an SoC has been virtually brought to a halt by the lack of an acceptable means for removing, for example, 200 W from a 15 × 15 mm die. In addition, the inability to remove more than 100 W/cm² per stratum is a key limiter to the successful 3D integration of high-performance ICs. A huge deficit in chip input/output (I/O) bandwidth due to insufficient I/O interconnect density is the second most serious deficiency stalling high performance gains. The excessive access time of a chip multiprocessor (CMP) for communication with its off-chip main memory is a direct consequence of the lack of, for example, a low-latency 100 THz aggregate bandwidth I/O signal network. Lastly, SoC performance has been severely constrained by the lack of I/O interconnect technology capable of supplying, for example, 200–400 A at 0.7 V to a CMP with ever-decreasing noise margins. Of course, innovation in silicon ancillary technologies will have to be done in parallel with continued innovations at the chip level, including improvements in scaled transistors, interconnects, and system architecture.

A critical technical hurdle to the above grand challenges is the realization of a low-cost integrated interconnect network that is capable of addressing the heat removal, I/O bandwidth and power delivery requirements for a gigascale SoC. Challenges in power delivery and cooling, moreover, are exacerbated with 3D system integration.

This chapter is organized as follows: Section 2.2 provides a review of current silicon ancillary technologies, and Section 2.3 describes low-cost and fully compatible electrical, optical, and fluidic, or “trimodal,” chip I/O interconnects for single and 3D chips. Power delivery for GSI systems and chip-package codesign of the power delivery network are discussed in Section 2.4. Section 2.5 is the conclusion.
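The power delivery figures quoted in this introduction can be turned into a quick back-of-the-envelope check. The Python sketch below multiplies current by voltage to obtain the delivered power and then estimates the largest delivery-path resistance that a given voltage droop budget would permit. The purely resistive path model and the 5% droop budget are assumptions made for illustration and are not specified in the text; the function names are hypothetical.

```python
# First-order power delivery arithmetic for the 200-400 A at 0.7 V example.
# Assumptions (not from the text): purely resistive path, 5% droop budget.


def delivered_power_w(current_a: float, supply_v: float) -> float:
    """Power drawn by the chip, P = I * V."""
    return current_a * supply_v


def max_path_resistance_ohm(current_a: float, supply_v: float,
                            droop_fraction: float) -> float:
    """Largest series resistance that keeps the IR drop within the budget."""
    return droop_fraction * supply_v / current_a


if __name__ == "__main__":
    supply_v = 0.7          # supply voltage quoted in the text
    droop_fraction = 0.05   # assumed droop budget, for illustration only
    for current_a in (200.0, 400.0):
        power_w = delivered_power_w(current_a, supply_v)
        r_max = max_path_resistance_ohm(current_a, supply_v, droop_fraction)
        print(f"{current_a:.0f} A: {power_w:.0f} W, "
              f"max path resistance {r_max * 1e6:.1f} micro-ohm")
```

Under these assumptions, 400 A at 0.7 V corresponds to 280 W, and the droop budget leaves less than 0.1 mΩ for the entire resistive path, which gives a sense of why power delivery is listed among the limiting ancillary technologies.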

2.2 Evolution of conventional silicon ancillary technologies: A brief overview

In order to maintain constant junction temperature with increasing power dissipation, the size of the heat sink used to cool a microprocessor has been increasing, thus imposing limits on system size, chip packing efficiency, and interconnect length between chips. A schematic illustration of a 2D system is shown in Figure 2.1. It is projected that the junction-to-ambient thermal resistance at the end of the roadmap will be less than 0.2 °C/W [2]. However, using conventional materials for the various thermal interconnects between the silicon die and the ambient (the heat spreader, the heat sink, and the thermal interface materials (TIM) at the die/heat spreader and heat spreader/heat sink interfaces), the lowest attainable thermal resistance from a conventional air-cooled heat sink is approximately 0.5 °C/W. Some reduction in the thermal resistance can be achieved with improved materials and increased air flow rate. Not only can the TIM account for a large fraction of the overall thermal resistance, but it also presents many reliability problems [3]. The cooling of hot spots (power density up to 500 W/cm²) greatly exacerbates the complexity of cooling. Thus, it is clear that more revolutionary innovations in cooling technologies are needed to 1) eliminate/improve the TIM, 2) reduce the thermal resistance of the heat sink, 3) address the cooling needs of nonuniform power dissipation (cooling of hot spots), and 4) reduce the dimensions of the chip cooling hardware.

The number of power and ground I/Os needed on a chip is a function of power dissipation and the maximum allowable on-chip power supply noise, which is due to resistive losses and δI/δt noise across the power distribution network. Because the supply voltage decreases with each technology generation (although more slowly in the future) and timing margins are shrinking, the allowable on-chip power supply noise will also decrease.
However, the increase in power dissipation and the resulting increase in current drain of an SoC will increase the power supply noise and resistive losses through the motherboard, socket, package, and chip I/Os, which can become large [4, 5, 6]. In order to maintain acceptable on-chip supply noise, the number of power and ground pads must be scaled accordingly with each technology generation, and appropriately sized on-chip wires and decoupling capacitors must be allocated for the power distribution network [7]. The above issues are discussed in more detail later in the chapter.

Fig. 2.1 A schematic illustration of a traditional “2D” multi-socket system.

Power delivery is not the only challenge. A key bottleneck to the realization of high-performance microelectronic systems is the lack of low-latency and high-bandwidth off-chip interconnects. Some of the challenges in achieving high-bandwidth chip-to-chip communication using conventional electrical interconnects include the low number of signal pins, frequency-dependent high losses in the substrate, reflections and impedance discontinuities, and susceptibility to cross-talk. The motivation for the use of microphotonics technology to overcome these challenges and leverage low-latency and high-bandwidth chip-to-chip communication has been presented [8, 9]. Significant progress has been made in developing chip-to-chip optical interconnects. Fiber-to-the-chip schemes, where an optical signal is coupled to a silicon integrated circuit through nanoscale silicon-based waveguides, have been reported [10]. However, such an approach limits the optical I/O density (because of fiber size and handling), increases the complexity of packaging, and potentially increases the cost of assembly because fibers must be manually (and serially) connected to each chip. High-density free-space optical interconnects are also being pursued for chip-to-chip communication [11, 12, 13]. However, susceptibility to misalignment and complexity in packaging are formidable challenges that have yet to be fully addressed. Optical misalignments can severely reduce the optical power delivered to the photodetector, thereby increasing the bit error rate (BER) and reducing bandwidth [14]. Moreover, such free-space optical I/O schemes are not compatible with underfill processes. Technologies to address these challenges are discussed later in the chapter.

Challenges in power delivery and cooling are exacerbated in 3D integrated systems, which have recently gained significant momentum in the semiconductor industry and are viewed as a key enabler to future system performance advances.
Three-dimensional integration may be used to partition a single chip into multiple strata to reduce on-chip global interconnect length [15] and/or to stack chips that are homogeneous or heterogeneous. An example of 3D stacking of homogeneous chips is a stack of memory chips, while an example of heterogeneous chip stacking is memory and microprocessor chips. There are a number of interconnect challenges that need to be addressed to enable stacking of high-performance die (see Figure 2.2). When two 100 W/cm² microprocessors are stacked on top of each other, for example, the net power density becomes 200 W/cm², which is beyond the heat removal limits of conventional air-cooled heat sinks [16]. Thus, cooling becomes a key limiter to the stacking of high-performance chips. Power delivery to a 3D stack of high-power chips also presents many challenges and requires careful and appropriate resource allocation at the package level, die level, and interstratal interconnect level [17]. Both issues are discussed later in the chapter.

Fig. 2.2 A schematic illustration of the challenges associated with “3D” stacking of high-performance die. These include cooling, power delivery, packaging, types of interstratal interconnects, assembly and bonding, chip- or wafer-scale integration, and more.

Fig. 2.3 Schematic illustrations of 3D technologies not using through-silicon vias (TSVs).

Figures 2.3–2.5 show representative schematic illustrations of 3D integration technologies that have been proposed to date, which fall into three categories. The first category contains 3D stacking technologies that do not utilize TSVs (Figure 2.3). The second category consists of 3D integration technologies that require TSVs (Figure 2.4), and the third category consists of monolithic 3D systems that make use of semiconductor processing to form active levels that are vertically stacked (Figure 2.5). Of course, a combination of all these technologies is possible.

The non-TSV 3D systems span a wide range of different integration methodologies. The left of Figure 2.3 illustrates stacking of fully packaged die. Although this may offer the advantages of being low cost, simplest to adopt, and fastest to market, with a modest form-factor reduction, the overhead in interconnect length and the low-density interconnects between the two die do not enable one to fully exploit the advantages of 3D integration. The middle of Figure 2.3 illustrates the most common method to stack memory die, which is based on the use of wire bonds. Naturally, this 3D technology is suitable for low-power and low-frequency chips due to the adverse effect of wire bond length, low density, and peripheral limited pad location for signaling and power delivery. The right of Figure 2.3 illustrates the use of wireless signal interconnection between different levels using inductive coupling (capacitive coupling is also possible) [18]. This is discussed later in the book. There are several derivatives of the topologies described above, which are in general hybrid die/package-level solutions. It is important to note that the non-TSV approaches rely on stacking at the die/package level (die-on-wafer is possible for inductive coupling and wire bond) and thus do not utilize wafer-scale bonding. This may serve to impose limits on economic gains from 3D integration due to the cost of the serial assembly process.

Fig. 2.4 Schematic illustrations of 3D technologies using through-silicon vias (TSVs).

Figure 2.4 illustrates 3D integration based on TSVs. The left figure illustrates bonding of die with C4 bumps and TSVs. The short interconnect lengths and high density of interconnects that this approach offers are important advantages. Compared to wire bonding, it is possible to have a number of interconnects that is several orders of magnitude larger. Although it is possible to bond at the wafer level, this approach is most suitable for die-level bonding (using a flip-chip bonder). At the right, the figure illustrates 3D stacking based on thin-film bonding (metal-metal or dielectric-dielectric) [19, 20, 21]. Not only are solder bumps eliminated in this approach, but increased interconnect density and tighter alignment accuracy may also be achieved when compared to the previous approach, because these approaches are based on wafer-scale bonding (although there are challenges in aligning 12-inch wafers, for example).

Fig. 2.5 Schematic illustrations of 3D technologies using monolithic integration.

Finally, Figure 2.5 illustrates a semiconductor-manufacturing (non-packaging) approach to 3D integration. The main enabler to this approach is the ability to deposit/grow a semiconductor film on a wafer during the IC manufacturing process using a number of techniques [22, 23]. Ultimately, this approach may offer the most integrated system but may potentially be limited to a smaller set of applications.


It is important to note that none of the above described 3D integration technologies address the need for cooling in a 3D stack of high performance chips. This is a significant omission and imposes a constraint on the ability to fully utilize the benefits of 3D technology for high-performance chips. As such, new 3D integration technologies are needed for such applications.
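The cooling limits cited in this chapter can be illustrated with a first-order, steady-state estimate in which the junction temperature rise equals the dissipated power times the junction-to-ambient thermal resistance. The Python sketch below applies that relation to the 200 W, 0.5 °C/W, and 0.2 °C/W figures quoted earlier and sums per-stratum power densities for a stack. The single lumped thermal resistance, the uniform power density, and the function names are simplifying assumptions for illustration, not the authors' analysis.

```python
# First-order thermal arithmetic using figures quoted in the chapter.
# Assumptions: one lumped junction-to-ambient resistance, uniform power
# density, and all heat from a stack leaving through a single face.


def temperature_rise_c(power_w: float, theta_ja_c_per_w: float) -> float:
    """Steady-state junction temperature rise above ambient."""
    return power_w * theta_ja_c_per_w


def power_density_w_per_cm2(power_w: float, die_side_mm: float) -> float:
    """Average power density of a square die."""
    area_cm2 = (die_side_mm / 10.0) ** 2
    return power_w / area_cm2


def stacked_power_density_w_per_cm2(strata_w_per_cm2: list[float]) -> float:
    """Net power density seen by the heat sink for vertically stacked strata."""
    return sum(strata_w_per_cm2)


if __name__ == "__main__":
    for theta in (0.5, 0.2):  # air-cooled limit and roadmap value from the text
        rise = temperature_rise_c(200.0, theta)
        print(f"200 W at {theta} C/W: {rise:.0f} C rise")
    print(f"200 W on a 15 mm die: {power_density_w_per_cm2(200.0, 15.0):.1f} W/cm^2")
    print(f"two 100 W/cm^2 strata: "
          f"{stacked_power_density_w_per_cm2([100.0, 100.0]):.0f} W/cm^2")
```

With these numbers, a 0.5 °C/W heat sink implies a 100 °C rise at 200 W, against 40 °C at 0.2 °C/W, and a two-stratum stack doubles the power density the cooling solution must handle.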

2.3 Novel silicon ancillary technologies

In order to provide all critical interconnect functions for a gigascale SoC, fully compatible, low-cost, and microscale electrical, optical, and fluidic chip I/O interconnects (or “trimodal” I/Os) have been recently proposed (Figure 2.6) [24]. A schematic illustration of a cross-section of a gigascale SoC with trimodal I/Os and SEM images are shown in Figure 2.7. The overarching strategy of this novel approach is to extend low-cost wafer-level batch processing, the key to the success of Si technology, to the ancillary technologies that have now become the millstone around the neck of Si technology itself.

Fig. 2.6 Schematic illustration of a chip with electrical, optical, and fluidic (“trimodal”) I/O interconnects.

Electrical chip I/O interconnection is achieved using solder bumps. The optical I/Os are implemented using surface-normal optical waveguides and take the form of polymer pins [25, 26, 27]. A polymer pin, like a fiber optic cable, consists of a waveguide core and a cladding. The polymer pin acts as the waveguide core, and unlike a fiber optic cable, the cladding is air. A key feature of the optical pins is that they are mechanically flexible and thus can bend to compensate for the coefficient of thermal expansion (CTE) mismatch between the chip and substrate. More details on the optical I/Os will be presented in Section 2.3.1. The fluidic I/Os are implemented using surface-normal hollow-core polymer pins, or micropipes [28, 29]. Unlike prior work on microfluidic cooling of ICs that requires millimeter-sized and bulky fluidic inlets/outlets to the microchannel heat sink, the micropipe I/Os under consideration are microscale, wafer-level batch fabricated, area-array distributed, flip-chip compatible, and mechanically compliant. Figure 2.8 illustrates the evolution of thermal interconnects, including the thermal interconnects under consideration. This is discussed in Section .

Fig. 2.7 Schematic illustration of GSI chips with trimodal I/Os. SEM images are also shown.

© 2007 IEEE

Fig. 2.8 Schematic illustration of the evolution of thermal interconnects. (MCHS: microchannel heat sink)


The fabrication process of the proposed electrical, optical, and fluidic I/Os is shown in Figure 2.9. We assume that the optical devices (detectors or sources) are monolithically or heterogeneously integrated on the CMOS chip. The fabrication process begins by partially etching through-wafer fluidic vias starting from the back side of the chip (the side closest to the heat sink), as shown in Figure 2.9b. Next, trenches are etched directly into the silicon surface (Figure 2.9c) while simultaneously completing the etch of the fluidic through-wafer vias. Following the silicon etch, the microchannel heat sink is enclosed using any of a number of techniques [30] (Figure 2.9d). This completes the fabrication of the microchannel heat sink. Next, solder bumps are fabricated on the front side of the chip using standard processes (Figure 2.9e). A photosensitive polymer film, equal in thickness to the height of the final optical and fluidic I/Os, is then spin coated on the front side of the wafer (and over the solder bumps), as shown in Figure 2.9f. Finally, the polymer film is photodefined to yield the optical and fluidic I/Os simultaneously, and the polymer I/Os are cured for one hour in a nitrogen-purged furnace set to 160 °C.

© 2007 IEEE

Fig. 2.9 Process used to fabricate the trimodal I/Os. Schematic illustration of a chip with an alternate configuration of electrical, optical, and fluidic I/O interconnects.

There are many derivatives of the electrical, optical, and fluidic I/O approach described above. One such derivative is shown in Figure 2.10, which illustrates the ability to embed each optical pin in a solder bump to create a dual-mode electrical-optical solder bump [31]. An SEM image of such dual-mode electrical-optical solder bumps is shown in Figure 2.11. Not only does this enable higher levels of integration between the electrical and optical I/O interconnects, but it also allows the aggregate number of I/Os to double for a given pitch. This approach is easily extendable to the fluidic I/Os and enables the fabrication of dual-mode electrical-fluidic I/O interconnects.


© 2007 IEEE

Fig. 2.10 Schematic of optical and fluidic pins (I/Os) embedded in conventional solder bumps.

© 2007 IEEE

Fig. 2.11 SEM images of optical pins (I/Os) embedded in conventional solder bumps.

In another approach, the optical and electrical I/Os can be assembled without having to embed the polymer pins in the solder bumps [32], as illustrated in Figure 2.7 and Figure 2.12. In order to adhere the optical I/Os to the waveguides on the substrate, the die containing the electrical and optical I/Os is dipped into a thin layer of a polymeric adhesive before bonding. When the I/Os are dipped, only the optical polymer pins make contact with the polymer adhesive. This is accomplished by fabricating the polymer pins to be taller than the solder bumps. Moreover, the adhesive is spin-coated to a thickness such that it only makes contact with the optical pins when the chip is dipped into the film. SEM images of a chip with polymer pins assembled using this approach are shown in Figure 2.13.

© 2008 IEEE

Fig. 2.12 Schematic illustration of the process used to bond electrical and optical I/Os simultaneously using a flip-chip bonder.

© 2008 IEEE

Fig. 2.13 SEM images of optical pins bonded to a substrate using the process shown in Figure 2.12.


2.3.1 Optical I/Os

The use of flexible surface-normal optical waveguides, or optical pins, has been proposed as a means of addressing the shortcomings of free-space optical I/O interconnections [33]. The height separation between the chip and the substrate has minimal effect on the optical power received at the photodetector (except for losses through the polymer pin) because the light is tightly confined within the cross-sectional area of the pin. Although we consider using polymeric materials with relatively high optical absorption losses for the fabrication of the optical pins, the optical transmission losses through the pins are small because of their very short length (height) [33, 34]. The optical pins are designed to be mechanically compliant (flexible). The low elastic modulus of the polymer and the air cladding of the waveguide contribute to the flexible nature of the optical pins. As a result, the lateral misalignment induced by chip-substrate CTE mismatch is compensated by the mechanical compliance of the optical pins, so that optical interconnection and alignment are maintained at all times between the optical components on the chip and substrate.
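A rough feel for the lateral displacement that the compliant pins must absorb comes from the standard first-order thermal expansion estimate, in which the mismatch grows with the difference in CTE, the distance from the chip center, and the temperature excursion. The Python sketch below evaluates that estimate. The CTE values (about 2.6 ppm/°C for silicon and 17 ppm/°C for an organic substrate), the 10 mm distance, and the 80 °C excursion are illustrative assumptions that do not come from the text, and the function name is hypothetical.

```python
# First-order estimate of lateral chip-to-substrate displacement caused by
# CTE mismatch. All numeric inputs below are illustrative assumptions.


def cte_mismatch_displacement_um(cte_chip_ppm_per_c: float,
                                 cte_substrate_ppm_per_c: float,
                                 distance_mm: float,
                                 delta_t_c: float) -> float:
    """Relative lateral displacement (um) at a given distance from the center."""
    delta_cte = abs(cte_substrate_ppm_per_c - cte_chip_ppm_per_c) * 1e-6
    distance_um = distance_mm * 1000.0
    return delta_cte * distance_um * delta_t_c


if __name__ == "__main__":
    displacement_um = cte_mismatch_displacement_um(
        cte_chip_ppm_per_c=2.6,        # assumed value for silicon
        cte_substrate_ppm_per_c=17.0,  # assumed value for an organic substrate
        distance_mm=10.0,              # assumed distance from the chip center
        delta_t_c=80.0,                # assumed temperature excursion
    )
    print(f"estimated lateral displacement: {displacement_um:.1f} um")
```

Under these assumed values the displacement is on the order of 10 µm, which is the same scale as the lateral displacements examined for the pins later in this section.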

© 2006 IEEE

Fig. 2.14 Experimental setup used to characterize the coupling efficiency of various diameter optical pins.

Figure 2.14 illustrates the experimental setup used to characterize the surface-normal optical coupling efficiency of the pins. A fiber was scanned along the X-axis and the Y-axis across the endface of the pin and across the surface of the aperture (at a Z-axis distance equal to the pin’s height). The relative transmitted optical intensity measurements of 50x150 µm optical pins and 50 µm optical apertures are plotted in the top of Figure 2.15. The transmitted intensities are normalized to the maximum transmission at the center of the aperture without a pin. The X- and Y-axis scans are essentially equal due to the radial symmetry of the light source and the pins. The difference between the coupling efficiency of the two measurements (using data from the X-axis scan) is plotted in the bottom of Figure 2.15.

[Figure 2.15, bottom panel: loss reduction (dB) versus lateral position (µm)]

© 2008 IEEE

Fig. 2.15 Using the experimental setup shown in Figure 2.14, the transmitted optical intensity as a function of light source lateral position above the pin (50x150 µm) and aperture is measured (top). The reduction in the coupling loss due to the optical pins ranges from 2 to 4 dB (bottom).

The data demonstrate that at the 0 µm displacement position, the optical pins enhance the coupling efficiency by approximately 2 dB compared to direct coupling into the aperture. At distances of ±25 µm from the center, the optical coupling improvement due to the pin exceeds 4 dB. The 4 dB coupling improvement is significantly larger than the 0.23 dB excess loss of the pins [33, 34], which clearly demonstrates the benefit of the pins. Note that the relative intensity curve of the optical pin is almost flat across the entire endface of the pin and drops abruptly beyond the edges of the pin (X = ±25 µm). The intensity curve of the aperture, on the other hand, resembles an inverted parabola. This matters because it shows how much the direct coupling case depends on perfect alignment: any lateral misalignment causes a fast roll-off in intensity. Even with perfect alignment during assembly, lateral misalignment between the mirror and the detector due to CTE mismatch or other factors may reduce the coupling efficiency and limit the achievable bandwidth.

When 30x150 µm pins (optical aperture 30 µm in diameter) were tested, the coupling efficiency improved by 3 to 4.5 dB [33]. This improvement is larger than the measured improvement from the 50x150 µm pin. This is significant because it demonstrates that as the optical I/O density increases and smaller PDs are used to attain higher bandwidth, optical pins become even more critical to the overall performance of the system.
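To put the decibel figures quoted above in perspective, the following minimal sketch converts them to linear power ratios. The function name and the dictionary labels are illustrative choices, not part of the original work; only the dB values come from the text.

```python
def db_to_power_ratio(db: float) -> float:
    """Convert a value in decibels to a linear power ratio."""
    return 10 ** (db / 10)


# Decibel figures quoted in the text for the optical pins.
quoted_values_db = {
    "improvement at center, 50x150 um pin": 2.0,
    "improvement at +/-25 um, 50x150 um pin": 4.0,
    "upper improvement, 30x150 um pin": 4.5,
    "excess loss of the pin": 0.23,
}

for label, db in quoted_values_db.items():
    print(f"{label}: {db} dB -> power ratio {db_to_power_ratio(db):.2f}")
```

A 4 dB improvement corresponds to roughly 2.5 times more optical power reaching the detector, whereas the 0.23 dB excess loss corresponds to a power ratio of only about 1.05.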

© 2007 IEEE

Fig. 2.16 Schematics of the experimental setups used to evaluate the optical displacement compensation of the pins.

The experimental setups used to characterize the optical displacement compensation of the optical pins are shown in Figure 2.16. Two configurations were developed to quantify the effects of pin bending on optical signal transmission. In the first configuration, labeled “scanning” in the figure, the light source (fiber) is scanned laterally across the endface of the pin (as in the earlier measurements). In the second configuration, labeled “bending” in the figure, the fiber is attached to the endface of the pin using epoxy to form an air-free interface between the source and the substrate. In the “bending” case, the controlled lateral displacement of the light source causes the optical pin to bend sideways, which helps keep the light mode confined in the pin and thus delivers the optical signal to the detector with lower coupling losses.

The relative transmitted intensities as a function of lateral displacement of the light source for the two configurations are shown at the top of Figure 2.17. The loss reduction is less than 1 dB up to 15 µm displacement and increases to as much as 4 dB at 30 µm. This is significant because typical systems have only a limited loss budget for misalignment and assembly tolerances. The top of Figure 2.17 demonstrates that for a given loss budget of, e.g., 1 dB, the 50x150 µm flexible pins double the displacement tolerance from less than 15 µm to approximately 30 µm. The 4 dB pin-assisted loss reduction at the 30 µm displacement can decrease the BER by a few orders of magnitude [33, 34]. Thus, the optical pins provide a method of reducing optical coupling losses caused by thermomechanically induced misalignment between the CTE-mismatched chip and substrate.

[Figure 2.17, top panel: relative intensity (dB) versus lateral displacement (µm); curves: Bend forward, Bend return, Scan, Scan with epoxy. Bottom panel: loss reduction (dB) versus light source lateral position (µm)]

© 2007 IEEE

Fig. 2.17 Measured optical displacement compensation using the flexible optical pins. The experimental procedure for the data labeled “Scan with epoxy” is similar to that labeled “Scan,” except that the fiber tip carried a layer of epoxy.

2.3.2 Fluidic I/Os for single and 3D chips

The process used to bond the fluidic I/Os is shown in Figure 2.18. As with the optical I/Os, the fluidic I/Os are flip-chip compatible and are batch-fabricated at the wafer level. In fact, the optical and fluidic I/Os are batch-fabricated simultaneously using the same polymer (Figure 2.19). The assembly process begins by aligning the die, which contains the electrical and fluidic I/Os, to the substrate using a flip-chip bonder. In this case, the fluidic I/Os are aligned with inlet/outlet fluidic vias that connect to substrate-level fluidic channels. Although an organic substrate could be used, a Si substrate is used in this work because it can also support very high density interconnects (Figure 2.19). The fluidic through-substrate vias connect directly to fluidic tubes attached on the other side of the substrate. As the solder bumps make contact with the copper pads on the substrate, the fluidic I/Os are inserted into the inlet/outlet vias. Once assembled, an encapsulant is applied between the die and substrate to seal the interface between the fluidic I/Os and the vias in the substrate (Figure 2.18). This approach is radically different from previously reported research [35, 36] in the area of fluidic interconnects for ICs.

[Figure 2.18 labels: fluidic I/O, die, Cu pad, Si substrate, fluidic via, encapsulant, power to heaters, electrical reading, fluid in, fluid out]

© 2007 IEEE

Fig. 2.18 Schematic of the experimental setup used to demonstrate the fluidic I/O interconnects.

In order to characterize the fluidic interconnects and the microchannel heat sink and to perform temperature measurements, thin-film Pt heaters/thermometers were fabricated on the silicon die. The tested die contained a total of 51 parallel microchannels (100 µm in width and 200 µm in height) distributed evenly across the back side of the chip (1 cm²), and a total of 32 fluidic I/Os were used. In this case, the microchannel heat sink was capped with a Pyrex wafer using an adhesive [30]. The inlet and outlet temperatures were measured using thermocouples, while the chip temperature was measured by recording the change in the resistance of the Pt heaters. Figure 2.20 plots the temperature of the coolant (DI water) at the substrate inlet and outlet and the average chip temperature when 75 W/cm² is applied to the Pt heaters/thermometers. Under a relatively large flow rate (≈104 ml/min), the average temperature rise is 12.7 °C, and the corresponding thermal resistance for the chip is approximately 0.28 °C/W. As shown in Figure 2.20, the supply power was toggled during testing to verify the consistency of the measurement results.

© 2007 IEEE

Fig. 2.19 Optical micrograph of a silicon substrate with electrical interconnects (copper) and through-substrate fluidic vias.

[Figure 2.20 plot: temperature (°C) versus time (seconds); curves: average chip temperature, outlet temperature, inlet temperature]

Fig. 2.20 Measured temperatures at the inlet and outlet as well as the average chip temperature at a flow rate of 104 ml/min. The calculated unit thermal resistance is ≈0.17 °C·cm²/W. (Overall power is ≈45 W; heating area is ≈0.6 cm²; on-chip temperature rise is 12.7 °C.)

As the microchannel heat sink was not optimized, lower thermal resistance and the cooling of higher power density can be achieved. In fact, the first example of microchannel liquid cooling demonstrated a junction-to-ambient thermal resistance of 0.09 °C/W and the cooling of 790 W/cm² [37], which demonstrates the ability to cool hot spots (up to 400 W/cm² in some processors). In this work, we focus on the integration and implementation of the fluidic I/O interconnect network rather than on the microchannel heat sink. The novel feature of the research is delivering and extracting a liquid coolant from a microchannel heat sink in a way that is compatible with CMOS process technology and conventional chip I/O technology.
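The thermal resistance values quoted for the Figure 2.20 measurement can be reproduced from the stated power density, heated area, and temperature rise. The short sketch below does this arithmetic; the function and variable names are illustrative, and only the three input numbers are taken from the text.

```python
def thermal_resistance(delta_t_c: float, power_w: float) -> float:
    """Thermal resistance in degC/W for a given temperature rise and power."""
    return delta_t_c / power_w


def unit_thermal_resistance(delta_t_c: float, flux_w_per_cm2: float) -> float:
    """Area-normalized thermal resistance in degC*cm^2/W."""
    return delta_t_c / flux_w_per_cm2


# Values reported for the Figure 2.20 measurement.
power_density = 75.0     # W/cm^2 applied to the Pt heaters
heated_area = 0.6        # cm^2
temperature_rise = 12.7  # degC, average chip temperature rise

total_power = power_density * heated_area  # about 45 W
r_chip = thermal_resistance(temperature_rise, total_power)
r_unit = unit_thermal_resistance(temperature_rise, power_density)

print(f"total power: {total_power:.1f} W")
print(f"thermal resistance: {r_chip:.2f} degC/W")
print(f"unit thermal resistance: {r_unit:.2f} degC*cm^2/W")
```

The results round to 0.28 °C/W and 0.17 °C·cm²/W, in agreement with the values given in the text and in the caption of Figure 2.20.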

[Figure 2.21 labels: solder cap, polymer micropipe]

© 2008 IEEE

Fig. 2.21 SEM micrograph of a solder-capped polymer micropipe.

In this work, an encapsulant was used to seal the fluidic I/Os after assembly. In an alternate configuration, solder may be capped on top of the polymer (or metallic) micropipe I/Os, as shown in Figure 2.21, to seal the fluidic I/Os. The plating and reflow processes are compatible with those of the solder bumps fabricated for the electrical I/Os. Thus, conventional solder could potentially be used to interconnect all I/O modes: electrical, optical, and fluidic.

It is possible to extend the microfluidic chip I/Os to 3D integrated chips, as illustrated in Figure 2.22 [1, 38, 39, 40]. The electrical interconnect network is used for power delivery and signaling between strata, and fluidic interconnects are used to enable the rejection of heat from each stratum in the 3D stack.

[Figure 2.22 labels: Die #1, Die #2, Die #3, electrical, fluidic, substrate]

© 2008 IEEE

Fig. 2.22 Schematic illustration of a 3D chip stack with electrical and fluidic chip I/O interconnects.

One implementation of this approach is shown in Figure 2.23. Each silicon die in the 3D stack contains the following features: 1) a monolithically integrated microchannel heat sink; 2) through-silicon electrical (copper) vias (TSEVs) and through-silicon fluidic (hollow) vias (TSFVs); 3) solder bumps (electrical I/Os) and microscale polymer pipes (fluidic I/Os) on the side of the chip opposite to the microchannel heat sink. Microscale fluidic interconnection between strata is enabled by the combination of through-wafer fluidic vias and polymer pipe I/O interconnects. The chips are designed such that, when they are stacked, each chip makes electrical and fluidic interconnection to the die above and below. Consequently, power delivery and signaling can be supported by the electrical interconnects (solder bumps and copper TSVs), and heat removal for each stratum can be supported by the fluidic I/Os and microchannel heat sinks. Optical TSVs [41] and I/Os may also be integrated to provide unusual flexibility in system integration.

© 2008 IEEE

Fig. 2.23 Schematic illustration of one possible implementation of the system shown in Figure 2.22.

In order to achieve high heat transfer, low thermal resistance, and low pressure drop, a relatively tall microchannel heat sink is needed (≈250 µm, for example). This necessitates a thick silicon wafer, unlike other 3D integration technologies, which seek to polish the silicon wafer to as small a thickness as possible before wafer handling and mechanical strength become limiters. Cross-sectional optical images of fabricated electrical TSVs in a silicon wafer with and without a microchannel heat sink are shown in Figure 2.24. At the bottom of the figure, the microchannel shown is 200 µm tall and 100 µm wide. The aspect ratio of the microchannel can be varied to meet the thermal resistance and pressure drop requirements of different applications [37]. TSVs with aspect ratios of 30:1 and greater have been demonstrated [42].

[Figure 2.24 labels: Si, Cu]

© 2006 IEEE

Fig. 2.24 Optical images of a silicon wafer with through-silicon electrical vias (top) and the fabrication of a microchannel heat sink with electrical TSVs (bottom).
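As a rough illustration of the channel geometry discussed above, the sketch below computes the aspect ratio and hydraulic diameter of a 100 µm by 200 µm rectangular microchannel, and the mean coolant velocity for the earlier measurement (51 channels, 104 ml/min). The equal split of flow among channels is an assumption made here for illustration; the hydraulic diameter uses the standard rectangular-duct definition D_h = 4A/P. All function and variable names are hypothetical.

```python
def hydraulic_diameter(width_m: float, height_m: float) -> float:
    """Hydraulic diameter 4*A/P of a rectangular duct, in meters."""
    return 2.0 * width_m * height_m / (width_m + height_m)


# Channel geometry and flow conditions quoted in the text.
width = 100e-6            # m
height = 200e-6           # m
num_channels = 51
total_flow_ml_min = 104.0

aspect_ratio = height / width
d_h = hydraulic_diameter(width, height)

# Assumption: the total flow divides equally among the parallel channels.
total_flow_m3_s = total_flow_ml_min * 1e-6 / 60.0
flow_per_channel = total_flow_m3_s / num_channels
mean_velocity = flow_per_channel / (width * height)

print(f"aspect ratio: {aspect_ratio:.1f}")
print(f"hydraulic diameter: {d_h * 1e6:.1f} um")
print(f"mean velocity per channel: {mean_velocity:.2f} m/s")
```

Under these assumptions the hydraulic diameter is about 133 µm and the mean velocity is about 1.7 m/s; a real design would also account for the manifold and via geometry, which this sketch ignores.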

The process used to fabricate the die is shown in Figure 2.25. The process begins by (a) fabricating electrical TSVs, followed by (b) the fabrication of trenches and microfluidic TSVs in the silicon wafer. In (c), the trenches are encapsulated to form the microchannels [30]. Vias are next formed in the overcoat polymer to simultaneously expose the electrical TSVs and form fluidic vias that ultimately allow fluid flow to the upper and lower die. Following this process step, copper pads are patterned above the electrical TSVs to facilitate solder bonding during assembly. Finally, in (d), solder bumps and microfluidic polymer micropipes (electrical and fluidic I/Os, respectively) are fabricated on the side of the wafer opposite to the microchannel heat sink using processes reported previously [24]. A two-die stack assembled using the process outlined above, together with SEM images of the trenches and microfluidic TSVs, is shown in Figure 2.26 (the microchannel heat sink was not included, in order to simplify the assembly experiment).

[Figure 2.25 panels: (a), (b), (c), (d)]

© 2007 IEEE

Fig. 2.25 Schematic illustration of the process used to fabricate silicon die, at the wafer level, that each contain electrical and microfluidic TSVs and I/Os.

© 2007 IEEE

Fig. 2.26 SEM image of a two-die stack that contains electrical and fluidic I/Os and TSVs. In this experiment, no microchannels were included, in order to simplify the process and assembly experiments.

2.4 Power delivery for 2D and 3D systems

Power consumption of GSI chips is increasing at an alarming rate [43]. Increasingly fast devices packed at unprecedented densities result in high current densities. Although the scaling of the supply voltage has slowed in recent years, the logic on an integrated circuit (IC) continues to become more sensitive to any supply voltage change because of the decreasing clock cycle and, therefore, noise margin. With this trend, power supply noise, the voltage fluctuation on power delivery networks, has become a significant factor that can substantially influence the overall system performance. As a result, the design of power delivery systems has become a very important and challenging task. Understanding complicated power delivery networks and supplying clean power to microprocessors is therefore of great significance [44, 45].

IR-drop and ΔI noise are the two main components of the power supply noise. IR-drop results from the supply current passing through the parasitic resistance of the power distribution networks. ΔI noise is caused by the inductance of the power delivery system and becomes important when a group of circuits switch simultaneously. Power supply noise consists of three distinct voltage droops [44], which result from the interactions between the chip, package, and board. The three droops are illustrated in Figure 2.27.

© 2007 IEEE

Fig. 2.27 Simulated noise droops of Intel microprocessors [46].

The third droop is related to the bulk capacitors at the board level and has a time duration of a few microseconds. The third droop influences all critical paths but can be readily minimized by using more board space for bulk capacitors [44]. The second droop is caused by the resonance between the inductive traces on the motherboard and the decoupling capacitors (decaps) in the package. The second droop has a time duration of a few hundred nanoseconds and impacts a significant number of critical paths. The first droop is caused by the package inductance and on-die capacitance. The resonance frequency of the first droop is in the range of tens of MHz to a few hundred MHz, depending on the sizes of package-level components and on-chip decaps [47]. Because adding on-chip decaps is very costly, the first droop is the most difficult of the three to suppress. The first droop also has the largest magnitude. Although the first droop is the shortest of the three, it can still adversely affect GSI circuits because its duration can reach tens of nanoseconds (ns). Chip performance can be severely degraded when the first droop affects critical paths. Because of its severe impact on high-performance chips, the first droop is the main focus of this section.

Excessive power supply noise can lead to severe performance degradation of on-chip circuitry and off-chip high-speed data links, and can even result in logic failures [44]. Thus it is vitally important to model and predict the performance of power delivery networks with the objective of minimizing supply noise.
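The origin of the first droop, a resonance between package inductance and on-die capacitance, can be illustrated with a lumped, lossless LC approximation. This is a simplification made here for illustration and is not the distributed model used in the chapter. The component values below are assumptions chosen only to land in the frequency range quoted in the text; none of them comes from the source.

```python
import math


def resonance_frequency_hz(inductance_h: float, capacitance_f: float) -> float:
    """Resonance frequency of a lumped LC tank: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inductance_h * capacitance_f))


def characteristic_impedance_ohm(inductance_h: float, capacitance_f: float) -> float:
    """Characteristic impedance of a lumped LC tank: Z0 = sqrt(L/C)."""
    return math.sqrt(inductance_h / capacitance_f)


# Assumed (hypothetical) lumped values, not taken from the chapter.
L_EFF = 5e-12     # H, effective package loop inductance of all pads in parallel
C_DIE = 200e-9    # F, total on-die decoupling capacitance
DELTA_I = 20.0    # A, step in supply current

f_res = resonance_frequency_hz(L_EFF, C_DIE)
z0 = characteristic_impedance_ohm(L_EFF, C_DIE)

# For an undamped LC tank, a current step of DELTA_I produces a peak
# voltage excursion of DELTA_I * Z0.
peak_droop = DELTA_I * z0

print(f"resonance frequency: {f_res / 1e6:.0f} MHz")
print(f"characteristic impedance: {z0 * 1e3:.1f} mOhm")
print(f"undamped peak droop: {peak_droop * 1e3:.0f} mV")
```

In this idealized picture, lowering the effective inductance or raising the on-die capacitance both reduce the characteristic impedance and hence the droop. Resistive damping, which is ignored here, reduces the peak further.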
On-chip power distribution networks consist of global and local networks. Global power distribution networks carry the supply current and distribute power across the chip. Local networks deliver the supply current from global networks to the active devices. Global networks contribute most of the parasitics and thus are the main concern of this chapter. For global distribution networks, the most common approach is to use a grid made of orthogonal interconnects routed on separate metal levels and connected through vias [48].

Wire-bond and flip-chip technologies are the two most commonly used chip-to-package interconnects [49]. Wire-bond interconnect costs less than flip-chip interconnect; however, peripheral wire-bond interconnect causes a higher power supply noise level because of its larger parasitics. In flip-chip technology, the parasitics are reduced by spreading I/O pads over the surface area of the chip, which reduces the noise. The development of GSI systems is driven not only by more efficient use of silicon real estate but also by higher I/O counts. Hence most of today’s high-performance designs use flip-chip interconnect and area-array I/Os to provide larger bandwidth between the chip and the next level of interconnection.

2.4.1 Power delivery and design implications of 2D systems

The grid structure and area-array I/O pad allocation are shown in Figure 2.28. Power is fed through the power pads from the package. The current flows through power wires and on-chip circuits, and returns to the package through ground wires and ground pads.

© 2007 IEEE

Fig. 2.28 On-chip power/ground grids and I/O pads in flip-chip technology.

Different types of modeling methods can be used to analyze power supply noise, such as circuit simulation methods, 3D electromagnetic solver methods, and compact physical models. Circuit simulations and 3D solvers are commonly used for dedicated validation after designs are completed. However, to gain sufficient physical insight, compact and accurate physical models are needed before the physical design is performed. Such models are critical in the early stages of design and can estimate the on-chip and off-chip resources needed for the power distribution network. Compact physical models can also be used to predict the power noise trends of different generations of technology on mathematical and physical grounds.

Compact physical models for ΔI noise and IR-drop have been proposed [7, 50]. Power supply noise is a dynamic effect that changes with time, and IR-drop is its steady-state limit. The transient part of the power supply noise, ΔI noise, is significant in determining the timing budget of a system. These models embody the distributed nature of on-chip power grids and display high accuracy when predicting the frequency response and time-domain transients of the power supply noise of 2-D power distribution networks.

An example of the simplified circuit model for a part of the power distribution network is shown in Figure 2.29 [50]. The segment resistance of the grid is represented by R_s. Switching current between a power grid node and the adjacent ground grid node is modeled as a current source, and J(s) represents the switching current density in the Laplace domain. Symbol C_d denotes the decoupling capacitance per unit area (including both the intentionally added decaps and the equivalent capacitance of the non-switching transistors). Symbols Δx and Δy represent the distances between two adjacent power (or ground) nodes at the same wiring level in the x and y directions, respectively. Symbol L_p (4L_p for a quarter pad) represents the per-pad loop inductance of the package.

[Figure 2.29 labels: J(s)ΔxΔy, C_d ΔxΔy]

© 2007 IEEE

Fig. 2.29 Simplified circuit model for GSI power distribution systems.
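To illustrate the steady-state (IR-drop) limit of a grid model of this kind, the following sketch solves Kirchhoff's current law on a small square resistive grid by Gauss-Seidel relaxation. It is a simplified stand-in and not the compact model of [50]: it treats only the power grid (an ideal ground is assumed), uses a uniform DC current per node in place of J(s)ΔxΔy, and holds the pad nodes at zero drop. The function name, grid size, and numerical values are all hypothetical.

```python
def ir_drop_grid(n, r_seg, i_node, pads, sweeps=2000):
    """Steady-state voltage drop (V) at each node of an n-by-n resistive grid.

    r_seg  : resistance of one grid segment (ohm)
    i_node : DC current drawn at every non-pad node (A)
    pads   : set of (row, col) nodes tied to the supply (zero drop)
    """
    drop = [[0.0] * n for _ in range(n)]
    for _ in range(sweeps):
        for r in range(n):
            for c in range(n):
                if (r, c) in pads:
                    continue  # pad nodes stay at the supply voltage
                neighbors = [
                    (r + dr, c + dc)
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < n and 0 <= c + dc < n
                ]
                total = sum(drop[a][b] for a, b in neighbors)
                # KCL at the node: sum((d_i - d_n) / r_seg) = i_node
                drop[r][c] = (total + r_seg * i_node) / len(neighbors)
    return drop


# Hypothetical example: a 5x5 cell with pads at the four corners.
corner_pads = {(0, 0), (0, 4), (4, 0), (4, 4)}
drops = ir_drop_grid(5, r_seg=0.22, i_node=0.01, pads=corner_pads)
worst = max(max(row) for row in drops)
print(f"worst-case IR-drop in this toy cell: {worst * 1e3:.2f} mV")
```

Because the system is linear, the drop scales in proportion to both the segment resistance and the node current, and the largest drop occurs at the node farthest from the pads.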

Based on this circuit model, compact physical models can be derived, and the relationships between the power supply noise and other physical parameters can be quantified, as shown in Figures 2.30–2.32.

[Figure 2.30 plot: absolute value of the worst case peak noise (V) versus proportion of the chip area occupied by decaps; curves from (2.29) and from SPICE for R_s = 0.88, 0.44, 0.22, and 0.11]

© 2007 IEEE

Fig. 2.30 The worst case peak noise as a function of the chip area occupied by decaps: Comparison between the physical model in [50] and the results of SPICE simulations for a pair of grids.

We can observe from Figures 2.30–2.32 that ΔI noise is sensitive to the amount of decap, the package-level inductance, and the number of I/O pads. Decap insertion is an effective way to reduce the noise level. However, the on-die area budget for decoupling capacitors can be limited. In this situation, package-level high-density I/O solutions, such as sea of leads (SoL) [51], can be used to suppress the power supply noise. High-density chip I/Os can greatly reduce the loop inductance of power distribution networks, resulting in smaller noise. Larger numbers of I/Os can also reduce the IR-drop.

It is also important to project the power noise trends for different generations of technology. In [50], the worst case peak noise is calculated for a high-performance microprocessor unit (MPU) for each generation from the 65 nm node (year 2007) to the 18 nm node (year 2018). Figure 2.33 suggests that supply noise could reach 25% of Vdd at the 18 nm node, compared to 12% of Vdd for current technologies, if the ITRS [52] scaling trends are followed. Excessive noise can cause severe difficulties for circuit designers, and new solutions to this supply noise problem will be needed. The importance of scaling package parameters such as the number of I/O pads is also indicated in Figure 2.33: increasing the pad number by 1.3x each generation keeps the supply noise well under control.

[Figure 2.31 plot: absolute value of the worst case peak noise (V) versus package inductance per I/O (H); curves from (2.29) and from SPICE for R_s = 0.88, 0.44, 0.22, and 0.11]

Fig. 2.31 The worst case peak noise as a function of L_p: Comparison between the physical model in [50] and the results of SPICE simulations for a pair of grids.

[Figure 2.32 plot: absolute value of the worst case peak noise (V) versus number of total power/ground I/O pads; curves from (2.29) and from SPICE for R_s = 0.88, 0.44, 0.22, and 0.11]

© 2007 IEEE

Fig. 2.32 The worst case peak noise as a function of the number of pads: Comparison between the physical model in [50] and the results of SPICE simulations.

[Figure 2.33 plot: Vnoise/Vdd versus year (2006 to 2018); curves: ITRS scaling, 1.3x pad number scaling]

© 2007 IEEE

Fig. 2.33 Technology trends of the worst case peak noise.

2.4.2 Power delivery and design implications of 3D systems

3D nanosystems can provide enormous advantages in achieving multi-functional integration, improving system speed, and reducing the power consumption of future generations of ICs [53]. 3D chip stacks have been used in commercial products, though today’s applications are mainly focused on low-power portable devices, such as flash memories and wireless chips. At the high-performance end, industry has already started to pave the way for microprocessor stacking and microprocessor-memory stacking, which will extend Moore’s Law beyond its expected limits and help break the memory bandwidth bottleneck for multi-core microprocessors [54, 55]. Through-silicon vias (TSVs) and micro-bumps are the key technologies for realizing 3D chip stacks for high-performance applications; they eliminate the need for the long metal wires that connect today’s 2-D chips together and instead rely on short vertical connections etched through the silicon wafer [54]. These TSVs and micro-bumps enable multiple chips to be stacked together, allowing greater amounts of information to be passed between them.

However, stacking multiple high-performance die may result in severe power integrity problems. As shown in Figure 2.34, if multiple high-power microprocessors are stacked together using flip-chip technology, several hundred amperes of current (or even more) will need to be delivered through a limited footprint area. In addition, the supply current flows through micro-bumps and narrow TSVs that may exhibit large parasitic inductance. These may lead to a large ΔI noise if the stacked chips switch simultaneously. Thus, power distribution networks in 3D systems need to be accurately modeled and carefully designed.

[Figure 2.34 labels: µp1 to µp4, each 100 W, 125 A; long inductive trace; package; large amount of current]

Fig. 2.34 Power integrity problem of 3D chip stack.

In [56], analytical models are derived to describe the frequency-dependent characteristics of the power supply noise in each level of the chip stack and to obtain physical insight into the rather complex power delivery networks in 3D systems. The model in [56] comes from the simplified circuit model of the power distribution network in 3D systems shown in Figure 2.35. A wire between two nodes on the ith die is modeled simply as a lumped resistance R_si. The decoupling capacitance per unit area of the ith die is represented by C_di. The current density for an active block of Die i is represented by J_i(s) in the Laplace domain. Inductance L_p is the per-pad loop inductance associated with the package, connected to the bottom-most die (Die 1). Each silicon TSV is modeled as an inductor L_via and a resistor R_via connected in series (this includes the parasitics of the micro-bumps when they are used between die). Symbols Δx and Δy represent the distances between two adjacent power (or ground) nodes in the same wiring level in the x and y directions, respectively.

[Figure 2.35 labels: J_i(s)ΔxΔy, C_di ΔxΔy]

© 2007 IEEE

Fig. 2.35 Simplified circuit model for 3D stacked system.

The models derived in [56] help address the power integrity problem of 3D integrated systems. For a worst-case analysis, the worst case peak noise value is considered. The case in which a single die is switching is considered first. This can occur when one die dissipates considerably more power than the other die in the stack, for example in a system consisting of a processor die with several memory die. It can be seen in Figure 2.36 that as the total number of stacked die increases, the noise level for the topmost die decreases as long as the number of die is less than 6. This is because the non-switching dice behave as decaps for the switching die. However, when the number of die increases beyond 6, the increase in decaps cannot compensate for the impact of the longer inductive TSV traces and micro-bumps associated with the added die, and the noise level increases.

[Figure 2.36: (a) stack of Die 1 to Die n on the package; (b) plot: absolute value of power noise (mV) versus total number of dice; curves: worst case peak noise for topmost die (SPICE), worst case peak noise for topmost die (physical model)]

Fig. 2.36 Single die switching, increasing total number of die.

© 2007 IEEE
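The current levels annotated in Figure 2.34 can be checked with simple arithmetic. The sketch below derives the supply voltage implied by the per-die labels (100 W at 125 A) and sums the current that must pass through the package footprint for the four-die stack. The function names are illustrative; the power and current values are those shown in the figure.

```python
def implied_supply_voltage(power_w: float, current_a: float) -> float:
    """Supply voltage implied by a die's power and current (V = P / I)."""
    return power_w / current_a


def total_stack_current(dies):
    """Total supply current (A) for a list of (power_w, current_a) tuples."""
    return sum(current for _power, current in dies)


# Four stacked microprocessors, each labeled 100 W and 125 A in Figure 2.34.
stack = [(100.0, 125.0)] * 4

vdd = implied_supply_voltage(*stack[0])
total_current = total_stack_current(stack)
total_power = sum(power for power, _current in stack)

print(f"implied supply voltage: {vdd:.2f} V")
print(f"total current through the package footprint: {total_current:.0f} A")
print(f"total power: {total_power:.0f} W")
```

The four-die example therefore requires 500 A at an implied 0.8 V supply, all delivered through the footprint of a single die, which is consistent with the "several hundred amperes" mentioned in the text.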

[Figure 2.37 panels: 2-D, 3-D]

© 2007 IEEE

Fig. 2.37 Achieving shorter interconnects between communicating blocks by using 3D integration.

If only one die is switching, the noise is smaller than in the single-chip case (2-D case), because the switching die can use the decaps of the non-switching die in the 3D stack. However, the activities of blocks that share the same footprint are expected to be highly correlated, because an important purpose of 3D integration is to place the blocks that communicate most as close to each other as possible, as shown in Figure 2.37. Therefore, we must consider the worst-case scenario in which all the functional blocks sharing the same footprint switch simultaneously, as shown in Figure 2.38. If the total number of die is increased and the noise levels of the topmost and bottommost levels are examined, it can be seen that when all die are switching, the noise produced in a 3D integrated system is unacceptable when compared to the single-chip case. This is especially true for the topmost die, where the noise level changes dramatically (180 mV for the single-die case as opposed to 790 mV for the 10-die case). Even for the bottommost die, methods of suppressing the noise need to be identified.

[Figure 2.38: (a) stack of Die 1 to Die n on the package; (b) plot: absolute value of power noise (mV) versus total number of dice; curves: topmost die (SPICE), topmost die (physical model), bottommost die (SPICE), bottommost die (physical model)]

© 2007 IEEE

Fig. 2.38 All die switching and increasing total number of die.

If a whole die is used as decap (100% of its area occupied by decap) and this “decap die” is stacked with the other die, the noise can be suppressed to some extent. For example, if the same setup as discussed in previous sections is adopted and four die are stacked together with one decap die, putting the decap die on the top can result in a 36% reduction in the worst-case peak noise (256 mV compared to 400 mV). Putting the decap die at the bottom of the stack can result in a 22% reduction (312 mV compared to 400 mV). Although the decap die brings improvements, more decap die are still needed to approach the noise level of a single die (182 mV). Figures 2.39b through 2.39d illustrate different schemes for using two decap die. By putting the two decap die on the top, we can suppress the noise to the level of a single chip. It can be seen that putting the decap die on the top is the best scheme for suppressing the noise of the fourth die. Instead of adding a decap die, it would be more efficient to use a high-k material between the on-chip power and ground planes. Finally, it should be emphasized that cooling also presents challenges to 3D integration (also discussed in this chapter), and the newly developed microfluidic cooling technique can potentially alleviate this problem.

Another possible solution is to use more TSVs. To examine the effectiveness of increasing the number of TSVs, in the first case a five-die stack is used, and the total number of power/ground I/Os is fixed at 2048. As shown in Figure 2.40, one cannot benefit solely by increasing the number of TSVs. Because the parasitics of the TSVs are much smaller than those of the package, only small changes in the noise level can be obtained by increasing the number of TSVs. Adding more TSVs might even be counterproductive, because TSVs consume die area that could otherwise be used for decaps or for additional noise-suppression circuits.

[Figure 2.39 panels: (a) single die on package, |Vnoise| = 182 mV; (b) |Vnoise| = 266 mV, 34% reduction; (c) |Vnoise| = 228 mV, 43% reduction; (d) |Vnoise| = 199 mV, 51% reduction. Additional per-die labels in the figure: 195 mV, 204 mV, 200 mV]

Fig. 2.39 Effect of adding two “decap” die. (a) Single die switching; (b) One “decap” die at the bottom and the other in the middle; (c) One “decap” die in the middle and the other on the top; (d) Both “decap” die on the top.
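The reduction percentages quoted in the text for a single decap die follow directly from the worst-case peak noise values. The sketch below reproduces them; the function name is an illustrative choice, and the 400 mV baseline, the 256 mV and 312 mV results, and the 182 mV single-die level are the values stated in the text.

```python
def noise_reduction_percent(baseline_mv: float, suppressed_mv: float) -> float:
    """Percentage reduction of peak noise relative to the baseline."""
    return 100.0 * (baseline_mv - suppressed_mv) / baseline_mv


BASELINE_MV = 400.0     # four-die stack without a decap die
SINGLE_DIE_MV = 182.0   # single-die reference level

single_decap_cases = {
    "decap die on top": 256.0,
    "decap die at bottom": 312.0,
}

for placement, noise_mv in single_decap_cases.items():
    reduction = noise_reduction_percent(BASELINE_MV, noise_mv)
    gap = noise_mv - SINGLE_DIE_MV
    print(f"{placement}: {reduction:.0f}% reduction, "
          f"{gap:.0f} mV above the single-die level")
```

Even the better placement leaves the noise well above the single-die level, which is why the text goes on to consider two decap die.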

In the second case, the numbers of both P/G pads and TSVs in each die are increased. This greatly reduces the power supply noise, which can even reach the level of a single chip, as shown in Figure 2.41. These two cases show that the power/ground I/Os remain the bottleneck, as they play a critical role in determining the power supply noise. The inductance of the package dominates the whole power delivery path for the first droop noise. Therefore, the power integrity problem needs an I/O solution that can provide high-density interconnection without sacrificing the mechanical attributes needed for reliability.
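The two cases above can be illustrated with a crude lumped estimate of the loop inductance seen by the topmost die. This is a sketch under strong assumptions and not the model of [56]: pads and TSVs are treated as ideal parallel inductors, mutual inductance and the distributed grid are ignored, and the per-pad and per-TSV inductances are assumed values. Only the pad count of 2048, the five-die stack, and the range of TSV counts are drawn from the text and Figure 2.40.

```python
def lumped_path_inductance(l_pad, n_pads, l_via, n_tsvs, num_dies):
    """Crude lumped loop inductance (H) seen by the topmost die.

    Pads are treated as n_pads inductors in parallel; each of the
    (num_dies - 1) TSV levels below the top die as n_tsvs in parallel.
    """
    package_part = l_pad / n_pads
    tsv_part = (num_dies - 1) * l_via / n_tsvs
    return package_part + tsv_part


# Assumed (hypothetical) per-element inductances.
L_PAD = 0.5e-9   # H per package pad
L_VIA = 50e-12   # H per TSV, including micro-bump

base = lumped_path_inductance(L_PAD, 2048, L_VIA, 10000, num_dies=5)
more_tsvs = lumped_path_inductance(L_PAD, 2048, L_VIA, 30000, num_dies=5)
more_pads = lumped_path_inductance(L_PAD, 4096, L_VIA, 10000, num_dies=5)

print(f"baseline: {base * 1e12:.3f} pH")
print(f"3x TSVs only: {more_tsvs * 1e12:.3f} pH "
      f"({100 * (base - more_tsvs) / base:.0f}% lower)")
print(f"2x pads only: {more_pads * 1e12:.3f} pH "
      f"({100 * (base - more_pads) / base:.0f}% lower)")
```

With these assumed values, tripling the TSV count changes the total by only a few percent, while doubling the pad count nearly halves it, which matches the qualitative conclusion that the package I/Os are the bottleneck.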

Fig. 2.40 Effect of adding more TSVs: fixing the number of power/ground I/Os. (a) Five-die stack (Die 1 through Die 5) on the package; (b) absolute value of power noise (mV) versus number of thru-vias per die, showing the worst-case peak noise for Die 5 from SPICE simulation and from the physical model.

Fig. 2.41 Effect of adding more TSVs and power/ground I/Os. (a) Five-die stack (Die 1 through Die 5) on the package; (b) absolute value of power noise (mV) versus number of power/ground I/Os under the bottommost die and number of TSVs for each die, showing the worst-case peak noise for Die 5 from SPICE simulations and from the physical model.

2.5 Conclusion

In order to address the ever-increasing adverse effects of conventional silicon ancillary technologies on the performance of CMOS nanosilicon technology, this chapter describes the implementation of low-cost and fully compatible electrical, optical, and fluidic, or “trimodal,” I/O interconnects. We proposed that electrical I/Os be used for power delivery and signaling, optical I/Os for massive off-chip bandwidth, and fluidic I/Os (with integrated back-side heat sink) for heat removal. A key feature of the I/O technology is that it requires 4-5 minimally demanding masking steps and is fabricated using wafer-scale batch fabrication. The trimodal I/Os are flip-chip compatible, making them usable with current assembly infrastructure, and can be extended to enable the stacking of high-performance (high-power) microprocessors.

Moreover, the aggressive scaling of CMOS integrated circuits makes the design of power distribution networks a serious challenge. This is because the supply voltages, and thus the circuit noise margins, are decreasing while the supply current and clock frequency are increasing, which increases the power-supply noise. Excessive power-supply noise can lead to severe degradation of chip performance and even logic failure. Therefore, power-supply noise modeling and power-integrity validation are of great significance in GSI system designs. In 2D systems, it is shown that ΔI noise is sensitive to the amount of decap, the package-level inductance, and the number of I/O pads. Decap insertion is an effective way to reduce the noise level. Package-level high-density I/O solutions can also be used to suppress the power supply noise. High-density chip I/Os can also ease power-integrity problems in future designs.

Power delivery challenges are exacerbated in 3D systems. The microbumps and narrow through-silicon vias (TSVs) that carry the supply current may have large parasitics. This may potentially lead to a large ΔI noise if stacked chips switch simultaneously. The relationships between the power supply noise, decap insertion, power/ground I/O allocation, and TSV allocation are discussed quantitatively. Schemes for reducing the power supply noise in 3D integrated systems are also proposed, and their impact on future 3D system designs is emphasized in this section. Liquid cooling for a 3D stack of high-performance chips is also discussed.

Acknowledgements

The authors acknowledge the support of the Interconnect Focus Center, one of five research centers funded under the Focus Center Research Program, a DARPA and Semiconductor Research Corporation program. This work is also based in part upon work supported by the National Science Foundation under Grant Number 0701560.

References

1. M.S. Bakir, J.D. Meindl. Integrated interconnect technologies for 3D nanoelectronic systems. Artech House, Boston, 2009.
2. Semiconductor Industry Association, “International Technology Roadmap for Semiconductors (ITRS),” 2007.
3. R. Prasher, “Thermal interface materials: historical perspective, status, and future directions,” Proceedings of the IEEE, vol. 94, 2006, pp. 1571–1586.
4. G. Schrom, P. Hazucha, H. Jae-Hong, V. Kursun, D. Gardner, S. Narendra, T. Karnik, and V. De, “Feasibility of monolithic and 3D-stacked DC-DC converters for microprocessors in 90 nm technology generation,” Proceedings of the IEEE International Symposium on Low Power Electronics and Design, 2004, pp. 263–268.
5. D. Mallik, K. Radhakrishnan, J. He, C.-P. Chiu, T. Kamgaing, D. Searls, and J.D. Jackson, “Advanced package technologies for high performance systems,” Intel Technology Journal, vol. 9, 2005, pp. 259–271.
6. P. Hazucha, G. Schrom, H. Jaehong, B.A. Bloechel, P. Hack, G.E. Dermer, S. Narendra, D. Gardner, T. Karnik, V. De, and S. Borkar, “A 233-MHz 80%-87% efficient four-phase DC-DC converter utilizing air-core inductors on package,” IEEE Journal of Solid-State Circuits, vol. 40, 2005, pp. 838–845.
7. K. Shakeri and J.D. Meindl, “Compact physical IR-drop models for chip/package co-design of gigascale integration (GSI),” IEEE Transactions on Electron Devices, vol. 52, 2005, pp. 1087–1096.
8. D.A.B. Miller, “Rationale and challenges for optical interconnects to electronic chips,” Proceedings of the IEEE, vol. 88, 2000, pp. 728–749.
9. D. Huang, T. Sze, A. Landin, R. Lytel, and H.L. Davidson, “Optical interconnects: out of the box forever?” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, 2003, pp. 614–623.
10. M. Lipson, “Overcoming the limitations of microelectronics using Si nanophotonics: solving the coupling, modulation and switching challenges,” Journal of Nanotechnology, vol. 15, 2004, pp. 622–627.
11. A.G. Kirk, D.V. Plant, M.H. Ayliffe, M. Chateauneuf, and F. Lacroix, “Design rules for highly parallel free-space optical interconnects,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, 2003, pp. 531–547.
12. C. Debaes, M. Vervaeke, V. Baukens, H. Ottevaere, P. Vynck, P. Tuteleers, B. Volckaerts, W. Meeus, M. Brunfaut, J. Van Campenhout, A. Hermanne, and H. Thienpont, “Low-cost microoptical modules for MCM level optical interconnections,” IEEE Journal of Selected Topics in Quantum Electronics, vol. 9, 2003, pp. 518–530.
13. Y. Ishii, S. Koike, Y. Arai, and Y. Ando, “SMT-compatible large-tolerance ‘OptoBump’ interface for interchip optical interconnections,” IEEE Transactions on Advanced Packaging, vol. 26, 2003, pp. 122–127.
14. X. Wang, F. Kiamilev, P. Gui, J. Ekman, G.C. Papen, M.J. McFadden, M.W. Haney, and C. Kuznia, “A 2-Gb/s optical transceiver with accelerated bit-error-ratio test capability,” Journal of Lightwave Technology, vol. 22, 2004, pp. 2158–2167.
15. J.W. Joyner, P. Zarkesh-Ha, and J.D. Meindl, “Global interconnect design in a three-dimensional system-on-a-chip,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, 2004, pp. 367–372.
16. G.G. Shahidi, “Evolution of CMOS technology at 32 nm and beyond,” Proceedings of the IEEE Custom Integrated Circuits Conference, 2007, pp. 413–416.
17. G. Huang, M. Bakir, A. Naeemi, H. Chen, and J.D. Meindl, “Power delivery for 3D chip stacks: physical modeling and design implication,” Proceedings of the IEEE Conference on the Electrical Performance of Electronic Packaging, 2007, pp. 205–208.
18. H. Ishikuro, N. Miura, and T. Kuroda, “Wideband inductive-coupling interface for high-performance portable system,” Proceedings of the IEEE Custom Integrated Circuits Conference, 2007, pp. 13–20.
19. J.Q. Lu, Y. Kwon, G. Rajagopalan, M. Gupta, J. McMahon, K.W. Lee, R.P. Kraft, J.F. McDonald, T.S. Cale, R.J. Gutmann, B. Xu, E. Eisenbraun, J. Castracane, and A. Kaloyeros, “A wafer-scale 3D IC technology platform using dielectric bonding glues and copper damascene patterned inter-wafer interconnects,” Proceedings of the IEEE International Interconnect Technology Conference, 2002, pp. 78–80.
20. C.S. Tan, K.N. Chen, A. Fan, and R. Reif, “A back-to-face silicon layer stacking for three-dimensional integration,” Proceedings of the IEEE International SOI Conference, 2005, pp. 87–89.
21. J.A. Burns, B.F. Aull, C.K. Chen, C.-L. Chen, C.L. Keast, J.M. Knecht, V. Suntharalingam, K. Warner, P.W. Wyatt, and D.R.W. Yost, “A wafer-scale 3D circuit integration technology,” IEEE Transactions on Electron Devices, vol. 53, 2006, pp. 2507–2516.


22. D.J. Witte, F. Crnogorac, D.S. Pickard, A. Mehta, Z. Liu, B. Rajendran, P. Pianetta, and R.F.W. Pease, “Lamellar crystallization of silicon for 3-dimensional integration,” Microelectronic Engineering, vol. 84, 2007, pp. 118.
23. J. Feng, Y. Liu, P.B. Griffin, and J.D. Plummer, “Integration of Germanium-on-Insulator and Silicon MOSFETs on a silicon substrate,” IEEE Electron Device Letters, vol. 27, 2006, pp. 911–913.
24. M.S. Bakir, B. Dang, and J.D. Meindl, “Revolutionary nanosilicon ancillary technologies for ultimate-performance gigascale systems,” Proceedings of the IEEE Custom Integrated Circuits Conference, 2007, pp. 421–428.
25. M.S. Bakir and J.D. Meindl, “Sea of polymer pillars electrical and optical chip I/O interconnections for gigascale integration,” IEEE Transactions on Electron Devices, vol. 51, 2004, pp. 1069–1077.
26. M.S. Bakir, T.K. Gaylord, K.P. Martin, and J.D. Meindl, “Sea of polymer pillars: compliant wafer-level electrical-optical chip I/O interconnections,” IEEE Photonics Technology Letters, vol. 15, 2003, pp. 1567–1569.
27. O.O. Ogunsola, H.D. Thacker, B.L. Bachim, M.S. Bakir, J. Pikarsky, T.K. Gaylord, and J.D. Meindl, “Chip-level waveguide-mirror-pillar optical interconnect structure,” IEEE Photonics Technology Letters, vol. 18, 2006, pp. 1672–1674.
28. B. Dang, M.S. Bakir, and J.D. Meindl, “Integrated thermal-fluidic I/O interconnects for an on-chip microchannel heat sink,” IEEE Electron Device Letters, vol. 27, 2006, pp. 117–119.
29. B. Dang, “Integrated input/output interconnection and packaging for GSI,” Ph.D. Thesis, Georgia Institute of Technology, 2006.
30. B. Dang, P. Joseph, M.S. Bakir, T. Spencer, P. Kohl, and J.D. Meindl, “Wafer-level microfluidic cooling interconnects for GSI,” Proceedings of the IEEE International Interconnect Technology Conference, 2005, pp. 180–182.
31. M.S. Bakir, B. Dang, O. Ogunsola, and J.D. Meindl, “‘Trimodal’ wafer-level package: fully compatible electrical, optical, and fluidic chip I/O interconnects,” Proceedings of the Electronic Component and Technology Conference, 2007.
32. M.S. Bakir, D. Bing, O.O.A. Ogunsola, R. Sarvari, and J.D. Meindl, “Electrical and optical chip I/O interconnections for gigascale systems,” IEEE Transactions on Electron Devices, vol. 54, 2007, pp. 2426–2437.
33. M.S. Bakir, A.L. Glebov, M.G. Lee, P.A. Kohl, and J.D. Meindl, “Mechanically flexible chip-to-substrate optical interconnections using optical pillars,” IEEE Transactions on Advanced Packaging, vol. 31, 2008, pp. 143–153.
34. A.L. Glebov, D. Bhusari, P. Kohl, M.S. Bakir, J.D. Meindl, and M.G. Lee, “Flexible pillars for displacement compensation in optical chip assembly,” IEEE Photonics Technology Letters, vol. 18, 2006, pp. 974–976.
35. H.Y. Zhang, D. Pinjala, T.N. Wong, and Y.K. Joshi, “Development of liquid cooling techniques for flip chip ball grid array packages with high heat flux dissipations,” IEEE Transactions on Components and Packaging Technology, vol. 28, 2005, pp. 127–135.
36. E.G. Colgan, B. Furman, A. Gaynes, W. Graham, N. LaBianca, J.H. Magerlein, R.J. Polastre, M.B. Rothwell, R.J. Bezama, R. Choudhary, K. Marston, H. Toy, J. Wakil, and J. Zitz, “A practical implementation of silicon microchannel coolers for high power chips,” Proceedings of the IEEE Semiconductor Thermal Measurement and Management Symposium, 2005, pp. 1–7.
37. D.B. Tuckerman and R.F.W. Pease, “High-performance heat sinking for VLSI,” IEEE Electron Device Letters, vol. 2, 1981, pp. 126–129.
38. C.K. King, D. Sekar, M.S. Bakir, B. Dang, J. Pikarsky, and J.D. Meindl, “3D stacking of chips with electrical and microfluidic I/O interconnects,” Proceedings of the Electronics Components and Technology Conference, 2008.
39. D. Sekar, C. King, B. Dang, T. Spencer, H.D. Thacker, P. Joseph, M.S. Bakir, and J.D. Meindl, “A 3D-IC technology with integrated microchannel cooling,” Proceedings of the International Interconnect Technology Conference, 2008.


40. M.S. Bakir, C. King, D. Sekar, H.D. Thacker, B. Dang, G. Huang, A. Naeemi, and J.D. Meindl, “3D heterogeneous integrated systems: liquid cooling, power delivery, and implementation,” Proceedings of the IEEE Custom Integrated Circuits Conference, 2008.
41. H.D. Thacker, O. Ogunsola, A. Carson, M.S. Bakir, and J.D. Meindl, “Optical through-wafer interconnects for 3D hyper-integration,” Proceedings of the IEEE Lasers and Electro-Optics Society Annual Meeting, 2006, pp. 28–29.
42. J.H. Wu, J. Scholvin, and J.A. del Alamo, “A through-wafer interconnect in silicon for RFICs,” IEEE Transactions on Electron Devices, vol. 51, 2004, pp. 1765–1771.
43. J.D. Meindl, “Low Power Microelectronics: Retrospect and Prospects,” Proceedings of IEEE, vol. 83, 1995, pp. 619–635.
44. M. Swaminathan and E. Engin. Power Integrity: Modeling and Design for Semiconductor and Systems. Prentice Hall PTR, 2007.
45. H. Zheng, B. Krauter and L.T. Pileggi, “Electrical Modeling of Integrated-Package Power/Ground Distributions,” IEEE Design and Test of Computers, vol. 20, no. 3, 2003, pp. 23–31.
46. K.L. Wong, T. Rahal-Arabi, M. Ma, and G. Taylor, “Enhancing Microprocessor Immunity to Power Supply Noise with Clock-Data Compensation,” IEEE Journal of Solid-State Circuits, vol. 41, no. 4, 2006.
47. W.D. Becker, J. Eckhardt, R.W. Frech, G.A. Katopis, E. Klink, M.F. McAllister, T.G. MacNamara, P. Muench, S.R. Richter, and H.H. Smith, “Modeling, Simulation, and Measurement of Mid-Frequency Simultaneous Switching Noise in Computer Systems,” IEEE Transactions on Components, Packaging, and Manufacturing Technology, part B, vol. 21, 1998, pp. 157–163.
48. A. Dharchoudhury, R. Panda, D. Blaauw, R. Vaidyanathan, “Design and Analysis of Power Distribution Networks in PowerPC Microprocessors,” Design Automation Conference, 1998, pp. 738–743.
49. R. Tummala. Fundamentals of Microsystems Packaging. McGraw Hill, 2001.
50. G. Huang, D. Sekar, A. Naeemi, K. Shakeri, and J.D. Meindl, “Compact physical models for power supply noise and chip/package co-design of gigascale integration,” Proceedings of the Electronic Component and Technology Conference, 2007.
51. M.S. Bakir, H.A. Reed, H.D. Thacker, P.A. Kohl, K.P. Martin, and J.D. Meindl, “Sea of Leads (SoL) ultrahigh density wafer level chip input/output interconnections,” IEEE Transactions on Electron Devices, vol. 50, no. 10, 2003, pp. 2039–2048.
52. Semiconductor Industry Association, “International Technology Roadmap for Semiconductors (ITRS),” 2004.
53. K. Banerjee, S.J. Souri, P. Kapur, and K.C. Saraswat, “3D ICs: A novel chip design for improving deep-submicrometer interconnect performance and systems-on-chip integration,” Proceedings of the IEEE, vol. 89, no. 5, 2001, pp. 602–633.
54. J.U. Knickerbocker, P.S. Andry, B. Dang, R.R. Horton, C.S. Patel, R. Polastre, K. Sakuma, E. Sprogis, C.K. Tsang, and S.L. Wright, “3D chip stacks and silicon packaging technology using through-silicon-vias (TSV) for systems integration,” 3D System Integration Conference (3D-SIC), 2007.
55. J. Held, J. Bautista, and S. Koehl, “From a few cores to many: a tera-scale computing research overview,” Research at Intel White Paper.
56. G. Huang, M. Bakir, A. Naeemi, H. Chen, and J.D. Meindl, “Power delivery for 3D chip stacks: physical modeling and design implication,” Proceedings of the Electrical Performance of Electronic Packaging, 2007, pp. 205–208.

Part III

Coupled Data Technologies

Chapter 3

Capacitive Coupled Communication

David Hopkins, Alex Chow, Frankie Liu, Dinesh D. Patil, Hans Eberle

David Hopkins, Alex Chow, Dr. Frankie Liu, Dr. Dinesh D. Patil, and Dr. Hans Eberle
Sun Microsystems Research Labs, 16 Network Circle, Menlo Park, CA 94025, USA, e-mail: {robert.hopkins},{alex.chow},{frankie.liu},{dinesh.d.patil},{hans.eberle}@sun.com

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_3, © Springer Science+Business Media, LLC 2010

3.1 Introduction

Capacitive coupled communication is a wireless chip-to-chip communication technology that uses capacitive coupling to transfer signals from a chip to neighboring chips. Its high-bandwidth, low-power, and low-latency chip-to-chip I/O capabilities enable the construction of high-performance and economical multi-chip modules (MCMs). Chips are placed face-to-face (Figure 3.1), with only a few microns of separation, such that overlapping transceiver circuits communicate through capacitive coupling between top-layer metal pads [1]. By using relatively small metal structures to communicate signals over short distances, capacitive coupled communication directly improves channel density, power, and latency to more closely match the performance of on-chip wires.

Fig. 3.1 Capacitive coupled communication between face-to-face chips. © 2003 IEEE

With capacitive coupling, chips communicate without off-chip wires or soldered connections. The absence of permanent attachment enables easy removal and replacement of individual chips. This could simplify package rework during manufacturing, solve the known-good-die issue of multi-chip packages, and further lower packaging cost [2].

While assembling chips using capacitive coupled communication yields important performance and cost benefits, it also presents a number of electrical and mechanical challenges. First, chips must be precisely aligned to ensure that each transmitter couples strongly to its corresponding receiver. As chips move apart in any of the six alignment axes (see Figure 3.2) and become misaligned from the nominal position, signal strength degrades and noise becomes more significant. Second, dense packing allows low-latency, energy-efficient communication at the expense of increased power density. Furthermore, the spatial concentration required to form a two-dimensional grid of chips in a multi-chip module necessitates new packaging solutions that can hold the chips in alignment, deliver adequate power to every chip, and extract heat from the tightly packed module.

In comparison to capacitive coupled communication, optical coupled communication may be more tolerant of z-misalignment (interchip gap). In addition, when combined with advanced techniques such as wavelength division multiplexing, optical coupling provides even higher bandwidth density than capacitive coupling. However, optical coupling requires an added layer of complexity to convert signals between the optical and electrical domains for use in standard electronic circuits. Area- and energy-efficient conversion between the optical and electrical domains, using technology compatible with standard CMOS devices and fabrication, is an area of active research today [3, 4, 5]. Inductive coupled communication has shown additional tolerance to z-misalignment (interchip gap), but suffers from severe crosstalk that must be mitigated for reliable communication [6, 7, 8].

© 2007 IEEE

Fig. 3.2 Misalignment can be in any of six axes: three translational and three rotational.

This chapter begins with the development of an electrical model for capacitive interchip communication, followed by a discussion of transceiver circuitry, two-dimensional arrays of capacitive communication links, testchip measurement results, and an application prototype.

3.2 An electrical model of capacitive interchip communication

In this section we develop an electrical model of capacitive interchip communication in order to assess the signal and noise characteristics and their effect on the performance of a communication channel. As shown previously in Figure 3.1, chips are placed face-to-face with a few microns of separation, such that overlapping transceiver circuits couple capacitively from the transmitter on one chip to the receiver on an adjacent chip. An interchip communication system may contain hundreds or thousands of capacitive coupled channels. We present here the electrical model of a single transmit-receive pair, describing the channel behavior and the effects induced upon it by neighboring communication channels.

In an isolated capacitive communication channel with nominal alignment, the transmitter pad is positioned exactly opposite the receiver pad, so that series coupling capacitance is provided by the entire overlap area. However, since capacitive channels are typically employed within a densely packed two-dimensional grid, we must amend this simplistic model to reflect the influence of both stray capacitance and crosstalk to neighboring communication channels. Each transmitter and receiver pad has additional stray capacitance coupling with its environment, including neighbors on the same chip, neighbors on the opposite chip, the substrate, and other nearby metal structures. The transmitter and receiver plates each have parasitic capacitance to fixed potentials and coupling capacitance between them. These three components yield a capacitive π-model. In addition, both the transmitter pad and the receiver pad have capacitance to neighboring pads, resulting in crosstalk. Noise couples into the receiver pad only through fringe capacitance, which may be much lower than the area capacitance. Neighbors that couple only diagonally by way of a corner are less significant and will be omitted in much of this discussion for simplicity, although it is straightforward to extend the analysis to include more remote neighbors. Figure 3.3 depicts a simple π-model with the side capacitors on either chip split into crosstalk and parasitic components.

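Fig. 3.3 Simple channel model showing crosstalk and parasitic capacitances. Though not shown here, capacitively coupled channels are typically differential. (Labeled elements: input Din, output Dout, coupling capacitance Cc, crosstalk capacitances CTXT and CRXT to the TX and RX attackers, and parasitic capacitances CTPar and CRPar.)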
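As a rough illustration of the π-model, the receiver pad can be treated as a floating node in a capacitive divider. The sketch below is a minimal example under that assumption and is not the authors' model: the function names and all capacitance values are hypothetical, and the transmitter-side capacitances (CTPar, CTXT) are assumed to load only the driver, so they do not appear.

```python
def rx_node_capacitance(c_c, c_rpar, c_rxt):
    """Total capacitance on the floating receiver pad (any consistent unit)."""
    return c_c + c_rpar + c_rxt


def received_signal(v_tx, c_c, c_rpar, c_rxt):
    """Swing coupled onto the receiver pad from its own transmitter.

    Assumes an ideal capacitive divider: the transmitter and all attackers
    are driven nodes, and the receiver pad is high impedance.
    """
    return v_tx * c_c / rx_node_capacitance(c_c, c_rpar, c_rxt)


def received_crosstalk(v_attack, c_c, c_rpar, c_rxt):
    """Swing coupled onto the receiver pad from attackers through c_rxt."""
    return v_attack * c_rxt / rx_node_capacitance(c_c, c_rpar, c_rxt)


# Hypothetical values in femtofarads and volts, chosen only for illustration.
C_C, C_RPAR, C_RXT = 10.0, 30.0, 1.0
signal = received_signal(1.0, C_C, C_RPAR, C_RXT)
crosstalk = received_crosstalk(1.0, C_C, C_RPAR, C_RXT)
print(f"signal = {signal * 1e3:.1f} mV, crosstalk = {crosstalk * 1e3:.1f} mV")
```

In this simplified view, the receiver parasitic CRPar attenuates signal and crosstalk alike, so the ratio between them is set by Cc and CRXT alone.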


A thorough understanding of the types of noise present in a capacitive coupled communication link is essential to making informed design decisions. To ease analysis of the many types of noise present in modern computer systems, noise sources are partitioned into bounded noise (Nb), such as crosstalk and power supply noise, and unbounded noise (Nu), created by Gaussian random processes. To estimate the bit-error rate (BER) of the communication link, we calculate a signal-to-noise ratio (SNR) as the signal strength (defined as half the peak-to-peak voltage swing) less the bounded noise, divided by the standard deviation of the unbounded noise. As there are a finite number of possibilities for the nearest neighbor crosstalk, a weighted sum is used to calculate the aggregate error rate. For each possible crosstalk condition, the SNR is calculated and the BER contribution is determined by using a weighting (pi) corresponding to the likelihood of that crosstalk condition. By summing over all possible crosstalk conditions, we find a BER estimate:

\[ \mathrm{SNR}_i = \frac{\mathrm{Signal} - N_{b_i}}{\sigma(N_{u_i})} \tag{3.1} \]

\[ \mathrm{BER} = \sum_i p_i \cdot \frac{1}{2} \cdot \operatorname{erfc}\left(\frac{Q_i}{\sqrt{2}}\right) = \sum_i p_i \cdot \frac{1}{2} \cdot \operatorname{erfc}\left(\frac{\mathrm{SNR}_i}{\sqrt{2}}\right) \tag{3.2} \]

where erfc(x) denotes the complementary error function, associated with the probability that a normally distributed random variable lies outside a certain region defined by the argument x.

The most important sources of random and unbounded noise are the noise currents generated by the MOS transistors in the receiving amplifier. We consider a simple model of the capacitive coupled channel, bias circuitry, and sense amplifier receiver, where the differential receiver pads are connected directly to the inputs of a receiving sense amplifier. The sense amplifier periodically samples the input data and produces full-swing digital output levels. The input bias voltage is set by weak transistors. The minimum signal strength required to satisfy a given BER can be estimated by calculating the total receiver noise referred to the sense amp inputs.
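The weighted-sum BER estimate of Eqs. (3.1) and (3.2) can be sketched in a few lines. This is a minimal example, not the authors' tool: the function names are hypothetical, the signal, noise, and attacker values are invented for illustration, and every crosstalk condition is assumed to be equally likely.

```python
import math
from itertools import product


def ber_estimate(signal, sigma_u, conditions):
    """Weighted-sum BER estimate following Eqs. (3.1) and (3.2).

    signal:     half the peak-to-peak swing at the receiver (volts)
    sigma_u:    standard deviation of the unbounded noise (volts)
    conditions: iterable of (p_i, nb_i), the probability of crosstalk
                condition i and the bounded noise it produces (volts)
    """
    ber = 0.0
    for p_i, nb_i in conditions:
        snr_i = (signal - nb_i) / sigma_u                     # Eq. (3.1)
        ber += p_i * 0.5 * math.erfc(snr_i / math.sqrt(2.0))  # Eq. (3.2)
    return ber


def equiprobable_conditions(attacker_swings):
    """Enumerate every +/- combination of the attackers, equally likely."""
    combos = list(product((-1.0, 1.0), repeat=len(attacker_swings)))
    p_i = 1.0 / len(combos)
    return [(p_i, sum(s * a for s, a in zip(signs, attacker_swings)))
            for signs in combos]


# Hypothetical numbers: 50 mV signal, 5 mV rms random noise,
# and two attackers that each couple 5 mV onto the receiver.
conditions = equiprobable_conditions([0.005, 0.005])
print(f"BER estimate: {ber_estimate(0.050, 0.005, conditions):.3e}")
```

Because erfc falls off so steeply, the sum is dominated by the worst crosstalk condition, which is why bounded noise is subtracted from the signal rather than folded into the Gaussian term.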

Fig. 3.4 Channel model showing a receiving amplifier consisting of an ideal bias circuit, an ideal sampler, and an ideal comparator, with sources of both bounded, deterministic noise Nbi and random, unbounded noise Nui. The total receiver input capacitive load, Ci, includes all parasitics. Channels are typically differential in practice.

Two independent noise sources have significant impact on the operation of the capacitive coupled receiver. The first comes from the input bias circuits, and its variance is given by kT/Ci, where Ci is the total capacitance on the input node, k the Boltzmann constant, and T the temperature. The second source of noise is the set of transistors in the sense amplifier. Computing the noise generated by the sense amplifier is much more complex. The sense amplifier is typically a non-linear and time-varying element, making it difficult to refer noise sources appropriately to the input. The sense amplifier periodically samples the inputs, amplifies the voltage difference, and then regenerates the resulting signal as the transistors go through different regions of operation. Therefore, the input-referred noise cannot be calculated by treating the sense amp as a conventional linear amplifier.

Fig. 3.5 A typical sense-amplifier used in capacitively coupled channels. (Labeled nodes: clk, in+, in-, out+, out-.)

Several different approaches to estimating the noise of a sense amplifier like that shown in Figure 3.5 have been proposed [10, 11]. A simplification that yields good results models the sense amplifier as a linear, periodically time-variant system. By calculating the output noise and referring it to the input by dividing by the low-frequency gain of the amplifier, these methods yield useful and reasonably accurate results. The low-frequency gain is the product of three terms: the gain of the differential pair at the input, the gain of the regenerative pair before the onset of regeneration, and the regenerative gain. Some simplifying assumptions, including operating the sense amplifier as fast as the technology allows, lead to a very understandable and informative result:

\[ \sigma(N_{u_i})^2 = \frac{kT}{C_i} + \frac{\gamma kT}{C_o} \tag{3.3} \]

where Ci and Co are the total capacitance at the input and output nodes of the amplifier, respectively, and γ is the excess noise factor in CMOS technologies. This reveals a fundamental tradeoff between power, maximum operating speed, and noise. Given a fixed power target, a designer can use a small amplifier (with low input and output capacitance) that functions up to a high data rate but exhibits more noise than a larger, slower amplifier.
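To put numbers on Eq. (3.3), the sketch below evaluates the input-referred noise for two amplifier sizes and converts it into a minimum signal for a target BER. It is a minimal example under stated assumptions: the function names, the capacitance values, the default γ = 1, and the 300 K temperature are all hypothetical, and bounded noise is ignored when computing the minimum signal.

```python
import math

K_BOLTZMANN = 1.380649e-23  # J/K


def input_referred_sigma(c_in, c_out, temperature=300.0, gamma=1.0):
    """Standard deviation (volts) of the unbounded noise from Eq. (3.3).

    c_in and c_out are in farads; gamma is the excess noise factor.
    """
    variance = (K_BOLTZMANN * temperature / c_in
                + gamma * K_BOLTZMANN * temperature / c_out)
    return math.sqrt(variance)


def required_snr(target_ber, lo=0.0, hi=40.0):
    """Solve 0.5 * erfc(snr / sqrt(2)) = target_ber by bisection."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * math.erfc(mid / math.sqrt(2.0)) > target_ber:
            lo = mid  # error rate still too high, so more SNR is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical amplifier sizes: a small one and one with 4x the capacitance.
small = input_referred_sigma(20e-15, 10e-15)
large = input_referred_sigma(80e-15, 40e-15)
snr = required_snr(1e-12)
print(f"small amp: sigma = {small * 1e3:.2f} mV, "
      f"min signal = {snr * small * 1e3:.2f} mV")
print(f"large amp: sigma = {large * 1e3:.2f} mV, "
      f"min signal = {snr * large * 1e3:.2f} mV")
```

Quadrupling both capacitances halves the noise, which illustrates the tradeoff described above: the larger amplifier tolerates a weaker signal but is slower for the same power.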


3.2.1 Crosstalk mitigation

For capacitive coupled channels with typical geometries, nearest neighbor crosstalk from adjacent transmitter and receiver pads is the dominant source of crosstalk. There are many different choices for the arrangement of I/O channels in a two-dimensional grid, and the arrangement affects the interaction between a channel and its nearest neighbors. Although some early work focused on single-ended signaling, differential signaling was found to have significant advantages in terms of sensitivity, noise rejection, and reduced return path ambiguity.

Single-ended signaling devotes a single pad to each signal, whereby information is encoded as changes in the voltage of that pad relative to a common reference, usually the ground voltage on the chip. Single-ended signaling has the disadvantage that all four neighboring pads may oppose a given transition, leading to substantial crosstalk. Differential signaling sends information encoded as the difference in voltage between a pair of adjacent pads. Although twice as many pads are needed for such a scheme, the benefits in terms of channel reliability are significant enough to outweigh the area penalty.

We consider three arrangements of pads for differential signaling along with single-ended signaling, as shown in Figure 3.6. Side differential signaling places differential pairs side-by-side, such that each pad shares an edge with its complementary neighbor. Corner differential signaling and butterfly differential signaling place differential pairs diagonal to one another, such that each pad shares a corner with its complementary neighbor. Each of these configurations has a distinct impact on the magnitude of crosstalk noise.

We developed an arrangement of channels called butterfly differential signaling (Figure 3.6d), which completely rejects nearest neighbor crosstalk for a receiver with good common-mode rejection. Figure 3.6d highlights a differential channel (pads A+ and A-) and four adjacent channels (B, C, D, and E). Channel A sees no net noise from channel B, because pads B+ and B- couple equally to A+; any noise due to a transition on B+ is canceled by an opposing transition on B-. Pads E+ and E- act similarly on A-. Channel A sees no net noise from D or C, because D+ and C- couple equally to A+ and A-; any noise due to a transition on D or C is thus common-mode to A. This crosstalk cancellation scheme enables reliable communication using smaller I/O pads, over a greater chip separation, and at higher data rates. This pad arrangement can also mitigate crosstalk in any 2D array of communication channels, including channels on different layers of a printed circuit board or on adjacent area solder connections.
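The cancellation argument for the butterfly arrangement can be checked by brute force. The sketch below is an idealized illustration, not an extracted model: it includes only the coupling paths named in the text, assumes every path has the same hypothetical coupling fraction k, and assumes a receiver with perfect common-mode rejection.

```python
from itertools import product


def butterfly_differential_noise(b, c, d, e, k=0.05):
    """Net differential noise on channel A, as a fraction of attacker swing.

    b, c, d, e are +1 or -1: the transition direction of channels B to E.
    k is a hypothetical coupling fraction, assumed equal for every path.
    """
    # B+ (+b) and B- (-b) both couple to A+; D+ (+d) and C- (-c) do as well.
    noise_a_plus = k * (+b) + k * (-b) + k * (+d) + k * (-c)
    # E+ (+e) and E- (-e) both couple to A-; D+ and C- couple here equally.
    noise_a_minus = k * (+e) + k * (-e) + k * (+d) + k * (-c)
    # An ideal differential receiver responds only to the difference.
    return noise_a_plus - noise_a_minus


def single_ended_worst_case(k=0.05):
    """All four edge neighbours oppose the victim's transition."""
    return 4 * k


# Try every combination of transitions on the four neighbouring channels.
worst = max(abs(butterfly_differential_noise(*transitions))
            for transitions in product((-1, 1), repeat=4))
print("butterfly worst-case differential noise:", worst)
print("single-ended worst-case noise:", single_ended_worst_case())
```

In this idealized model the B and E contributions cancel on each pad, while the D and C contributions appear identically on A+ and A- and drop out of the difference.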

3.2.2 Simulation results The signal and noise properties of capacitive coupled channels can be studied using a 3D electromagnetic field solver to extract all the coupling capacitances between pads on the two chips. The extracted coupling capacitance between corresponding signal pads indicates the available signal at the receiver, while the extracted cross-

3 Capacitive Coupled Communication

57

+

+

+

-

+

-

+

+

+

-

+

-

(a) Single-ended

(c) Corner Differential

(b) Side Differential

EE+ +

C+

D-

A-

D+ +

C-

A+

B-

B+ + (d) Butterfly Differential

Fig. 3.6 Pad arrangements for four different signaling schemes: (a) single-ended; (b) side differential; (c) corner differential; (d) butterfly differential.

coupling capacitances indicate the amount of noise that may be injected into a receiver pad. This enables a comparison of different signaling schemes in terms of their ability to reject noise. By modifying the geometries in these models, it is also possible to study how signal and noise levels are affected by chip misalignment, pad sizes, and different dielectric materials. In order to get a sense of scale, the metal and dielectric stackup from a representative modern process is shown in Figure 3.7. For a 90 nm process, there are typically six to ten copper interconnect layers sandwiched between a variety of insulating glass dielectric materials. Most of the dielectric materials have properties similar to silicon dioxide, so we use an approximate relative dielectric constant four times that of vacuum. Each interconnect layer and its respective interlayer dielectrics create a sandwich structure of approximately one micron in thickness (except for the top and bottom layers which are thicker and thinner, respectively). Figure 3.8 shows an example of a chip-to-chip pad model that can be used to extract signal and crosstalk capacitances between transmitting and receiving pads using an electromagnetic field solver. The model consists of two arrays of square pads, one on each chip. The central pad in each array is the transmitting or receiving pad. Neighboring pads that

58

Capacitive Coupled Communication

Passivation

Metal N (Top layer)

... Metal 3

Via Metal 2

Inter-layer dielectric Metal 1

Poly Well

Substrate

Fig. 3.7 Metal and dielectric stackup of a typical modern CMOS process.

introduce crosstalk noise are modeled by the eight pads surrounding the central pad. An outer ring represents all other pads on the chip. Dielectric slabs are used to represent the passivation and intermetal layers, and ground planes are used to model the presence of other circuitry beneath the arrays. In this model, there are 22 individual conductors: 9 square pads, an outer square annulus, and a ground plane on each of the two chips. The electromagnetic field solver returns a 22-by-22 matrix that gives the self and coupling capacitances of each conductor with all other conductors in the model. The self capacitances, given by the entries along the diagonal of the matrix, are the total capacitances seen by


Fig. 3.8 Chip-to-chip pad model for capacitance extraction using a 3D electromagnetic field solver. Ground planes and dielectric layers are not shown. [Figure: two facing pad arrays, labeled Chip 1 and Chip 2.]

each conductor. All other entries show the coupling capacitances between two conductors.
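As a minimal sketch of how such a solver result might be post-processed, the following Python fragment reads the signal and crosstalk capacitances out of a 22-by-22 matrix. The conductor ordering, the function names, and the numeric values are assumptions made purely for illustration; a real field solver defines its own ordering and sign convention for off-diagonal entries, which is why the sketch takes magnitudes.

```python
# Assumed conductor ordering (hypothetical; a real solver defines its own):
#   chip 1: 0..8 = 3x3 pads in row-major order (center = 4), 9 = annulus, 10 = ground plane
#   chip 2: 11..19 = 3x3 pads (center = 15), 20 = annulus, 21 = ground plane
PADS_PER_CHIP = 9
CONDUCTORS_PER_CHIP = 11
TX_CENTER = 4
RX_CENTER = CONDUCTORS_PER_CHIP + 4


def signal_capacitance(cap):
    """Coupling between the central transmitting pad and the central receiving pad."""
    return abs(cap[TX_CENTER][RX_CENTER])


def crosstalk_capacitance(cap):
    """Total coupling from the eight neighboring transmitting pads to the receiving pad."""
    return sum(abs(cap[i][RX_CENTER]) for i in range(PADS_PER_CHIP) if i != TX_CENTER)


def example_matrix():
    """Build a 22-by-22 matrix of made-up values (fF), for demonstration only."""
    n = 2 * CONDUCTORS_PER_CHIP
    cap = [[0.0] * n for _ in range(n)]
    for i in range(PADS_PER_CHIP):
        value = 10.0 if i == TX_CENTER else 0.25
        cap[i][RX_CENTER] = cap[RX_CENTER][i] = value  # matrix is symmetric
    return cap


cap = example_matrix()
print(signal_capacitance(cap), crosstalk_capacitance(cap))
```

Keeping the index convention in named constants makes it easy to adapt the sketch to whatever ordering a particular solver reports.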

Fig. 3.9 Coupling capacitance as a function of interchip spacing (z), for I/O pads on a 36 x 36 µm pitch. [Plot: signal and crosstalk capacitance (fF, logarithmic axis from 0.1 to 100) versus chip separation (µm, 0 to 30).]

Figure 3.9 shows the variation of signal and crosstalk capacitances with chip spacing in the absence of any translational or rotational misalignment. It is worth noting that the signal capacitance drops with chip spacing approximately as

$$C_{\mathrm{sig}}(z) \approx \frac{1}{k+z} \qquad (3.4)$$

for some constant k. The cross-coupling capacitance between a receiving pad and the crosstalk-inducing transmitting pads drops approximately as

$$C_{\mathrm{xtalk}}(z) \approx \log\left(1 + \frac{t}{z}\right) \qquad (3.5)$$

where t represents the metal thickness. The difference in these two characteristics results from the fact that the signal capacitance mainly consists of area capacitance; cross-coupling capacitance, on the other hand, mainly consists of fringe capacitance, which drops with plate distance more slowly. This is unfortunate because it indicates that as chips move apart, not only does the desired signal decrease, but the relative contribution of crosstalk also increases.

Fig. 3.10 Coupling capacitance as a function of in-plane misalignment (x, y), for I/O pads on a 36 x 36 µm pitch. [Plot: signal and crosstalk capacitance (fF, logarithmic axis from 0.01 to 100) versus in-plane misalignment x, y (µm, 0 to 35).]

Figure 3.10 shows the variation of signal and crosstalk capacitances with in-plane misalignment in both dimensions. For a small amount of misalignment, signal coupling does not drop significantly. However, crosstalk coupling does increase appreciably; even with crosstalk-canceling signaling, crosstalk can become significant when misalignment is more than one-quarter of the pad pitch, because coupling from corner neighbors cannot be effectively eliminated. Electronic alignment correction is therefore useful in keeping communication within an acceptable alignment range.


3.3 Transmitting data

In the simplest implementations, the transmitter circuits for each data channel consist of CMOS inverters and standard retiming elements, as necessary. The very high pad density, however, requires very high-accuracy mechanical alignment. To relax this constraint, we developed electronic alignment correction [12]. This technique shifts the location of the transmitting channel to compensate for physical misalignment between the transmitting and receiving channels (Figure 3.11). Each transmit pad is physically divided into a 4x4 array of micropads. Multiplexers on the transmitting chip steer data to the micropads that best align with the receiving chip. We determine the optimal multiplexer configuration by precisely measuring chip alignment [13]. Using more micropads per channel reduces the residual misalignment, but at the expense of additional circuit complexity and power consumption.

Fig. 3.11 Electronic alignment correction. [Figure labels: normal Tx bit location (with no misalignment); misalignment (x, y); actual Tx bit location (with misalignment); Rx pad.]
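The steering decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, the 36 µm pad pitch is borrowed from the pad model of Fig. 3.9, the transmit image is assumed to move in whole-micropad steps along each axis independently, and any limit on the steering range is ignored.

```python
def steering_offset(misalignment_um, pad_pitch_um=36.0, micropads_per_side=4):
    """Return (shift, residual) for one axis.

    shift    : whole number of micropads by which the transmit image is moved
    residual : misalignment left after the shift, in microns
    All default values are illustrative assumptions.
    """
    micropad_pitch = pad_pitch_um / micropads_per_side
    # Quantize the measured misalignment to the nearest micropad position.
    shift = round(misalignment_um / micropad_pitch)
    residual = misalignment_um - shift * micropad_pitch
    return shift, residual


# Example: a measured misalignment of (13 um, -7 um).
for axis, value in (("x", 13.0), ("y", -7.0)):
    shift, residual = steering_offset(value)
    print(axis, shift, residual)
```

With this quantization the residual never exceeds half a micropad pitch, which is why a finer micropad array reduces residual misalignment at the cost of more multiplexing.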

Electronic alignment increases power consumption due to the extra wires and multiplexers necessary for data steering. To reduce this power cost, some implementations contain a power-efficient multiplexer that uses NMOS-only pass gates (Figure 3.12). A low select signal drives M2’s gate low, making it opaque. A high select signal drives M2’s gate to one threshold voltage below VDD. It is held very weakly in this state, as M1 is off. A rising data transition bootstraps M2’s gate above VDD, allowing M2 to pass full VDD levels. Falling transitions restore the gate voltage on M2 to one threshold voltage below VDD. Because M2’s gate voltage tracks the channel voltage, the effective channel capacitance and resistance are lower. Compared to a typical CMOS pass gate, the bootstrapped NMOS pass gate reduces overall transmitter power by more than 20%, from 2.5 pJ/bit to 2.0 pJ/bit, while providing similar edge rates. Also, because the multiplexers consist of only NMOS devices, layout is more compact. This technique is used in memory design and demonstrates both high performance and reliability. However, it requires occasional data transitions to prevent droop and jitter if the data remains high over an extended time on the order of 1 ms.

Fig. 3.12 Power-efficient pass-gate circuit. © 2007 IEEE. [Figure labels: select, in, out, M1, M2, Cb; node voltages VDD, VDD − Vth, and VDD − Vth + ΔVB.]

3.4 Receiving data

The capacitive coupled channel presents two key challenges to reliable recovery of transmitted data: attenuation and loss of DC information.

3.4.1 Attenuation

Although the transmitter transitions between full CMOS levels, the coupling capacitor forms a voltage divider with the total capacitance on the receiver input node. In most practical circumstances, the coupling capacitance is a small fraction of the total capacitance, leading to significant attenuation. Simulation and measurement results show that the received voltage varies between 1% and 20% of the transmitted voltage. It is the responsibility of the receiving amplifier to restore this low-swing signal to full CMOS levels that can be easily used by standard logic gates. Performing this voltage amplification, while functioning at high speed and consuming little energy, can pose a significant challenge.

In addition, device variability increases as devices scale down with each technology generation. Device variability leads to mismatch between devices that the designer intends to be identical, creating asymmetry in differential amplifiers. This asymmetry biases the amplifier so that its decision threshold is no longer at zero differential voltage. As a result, an acceptable input signal must exceed this offset voltage in addition to the signal needed for noise margin and sensitivity. Additional circuitry can be added to the receiving amplifier to reduce these effects, but these circuits always increase complexity, power consumption, and area.

3.4.2 Loss of DC information

The capacitive coupled channel combines with any shunt conductance on the receiver input node to form a high-pass filter. Spectral content below the corner frequency of this filter is attenuated, with DC information being completely lost. This creates a problem for biasing the receiving amplifier. There are a number of ways to manage this loss of DC information and establish the appropriate DC bias for the amplifier (see Figure 3.13).

The most widely used method to deal with this limitation is data encoding. Popular schemes include 8b/10b and 64b/66b, which encode 8- and 64-bit words as 10- and 66-bit messages, respectively [14, 15]. The increased code space allows a scheme such as 8b/10b to maintain a nearly equal number of 1’s and 0’s over every two-word sequence. A data stream that has nearly as many 1’s as 0’s over a reasonable timeframe is often referred to as DC-balanced. Given a system with DC-balanced data, DC biasing of the input to the receiver can be accomplished with a simple low-pass-filtered version of the data stream.

Although this type of scheme is used widely, it presents several important drawbacks. An encoded channel requires more bandwidth, and the process of encoding and decoding the signals increases complexity, area, and energy consumption. In addition, encoding can add a significant amount of latency. Without encoding, these channels may have latency as low as a few bit periods. With 8b/10b encoding applied to each data channel, there are up to 10 additional bit periods of latency for encoding and decoding. For certain latency-sensitive applications, this is unacceptable.

More generally, it is possible to create DC-balanced data streams by applying modulation techniques used in wireless communication systems. In most of these schemes, the incoming data sequence or a derivative sequence is combined with a sinusoidal carrier. The data sequence may modulate the frequency, amplitude, or phase of the sinusoidal carrier. As long as the sinusoidal carrier is not highly correlated with the data sequence, the resulting signal is nearly DC-balanced. These modulation schemes can be combined with a variety of techniques from the wireless community including multiple-access (FDMA, CDMA and TDMA) and diversity techniques, as well as many others. In addition, as in 64b/66b, the data stream can be scrambled by mixing it with a pseudo-random bit sequence (PRBS), which dramatically reduces the likelihood of long-term DC imbalance. Unfortunately, these modulation techniques will usually increase the number of signal transitions, and therefore consume more power in the transmitter circuitry.

Another method to deal with the loss of DC information is to periodically restore the DC bias to a known state. This requires the simultaneous application of a known voltage to both the transmitter pads and the receiver pads. Although it is possible

Fig. 3.13 Methods of dealing with the loss of DC information. From top to bottom: using coding such as 8b/10b; mixing/modulation using a carrier or otherwise orthogonal code; explicit refresh of the channel; feedback with a keeper latch. [Figure: four panels, each taking unbalanced data Din to Dout. Coding panel: encoder (enc) and decoder (dec), each adding latency ΔT, with DC-balanced data on the channel at y Gbps, y > x, for x Gbps data, and receiver bias vbias. Mixing panel: mixers driven by a carrier or PRBS at both ends, with statistically DC-balanced data on the channel and receiver bias vbias. Refresh panel: refresh signals at both ends and receiver bias vbias. Keeper panel: feedback selecting between vtop and vbot, with 0 < vbot < vtop < VDD.]

to impose these conditions without interrupting the flow of data within a channel, in most practical circumstances this requires a relatively brief pause in the flow of data on the channel undergoing this periodic restoration of DC bias, which we call “refresh,” after the name given to the related process in DRAM cells. For systems that can accommodate this infrequent unavailability, refreshing the channel from time to time may be a suitable solution.

Finally, the DC bias on the receiver can be maintained continuously, by means of a feedback keeper. The feedback keeper is placed around the receiving amplifier, such that once a decision is made about whether a bit is a logic 1 or 0, the keeper maintains the voltage at the input to the receiver. Thus, even if no data transitions occur for a very long time, the data signal is correctly received. The challenge with this method, however, is supplying the feedback keeper with the appropriate high and low levels. The high and low levels result from the attenuation of the channel, which depends upon environmental factors, and will not only be different for distant channels, but may also vary with time. Setting the levels incorrectly will lead to inter-symbol interference (ISI) and degraded noise margins. In order to minimize this degradation, adaptive schemes are usually needed.
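To illustrate the scrambling approach mentioned earlier in this section, the following Python sketch XORs a data stream with a pseudo-random bit sequence and compares the DC imbalance before and after. It is a simplified, assumption-laden example: it uses a short PRBS7 sequence (polynomial x^7 + x^6 + 1) as an additive scrambler, which is not the self-synchronizing scrambler that 64b/66b actually specifies, and the function names are hypothetical.

```python
def prbs7(n, seed=0x7F):
    """Generate n bits of a PRBS7 sequence (polynomial x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    bits = []
    for _ in range(n):
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        bits.append(new_bit)
    return bits


def scramble(data_bits, seed=0x7F):
    """XOR the data with the PRBS; applying the same operation again restores the data."""
    return [d ^ p for d, p in zip(data_bits, prbs7(len(data_bits), seed))]


def disparity(bits):
    """Number of ones minus number of zeros (zero for perfectly DC-balanced data)."""
    ones = sum(bits)
    return ones - (len(bits) - ones)


# Worst case for an AC-coupled channel: a long run of ones.
data = [1] * 254
sent = scramble(data)
print(disparity(data), disparity(sent))
assert scramble(sent) == data  # the receiver recovers the original stream
```

The balance here is only statistical: a data pattern that happens to correlate with the PRBS would defeat it, which is why the text describes scrambling as reducing the likelihood of imbalance rather than bounding it.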


3.4.3 Comparators

A comparator can be used as a simple receiver for capacitive coupled links. A comparator samples the small voltage difference between the pair of input signals and decides which of the signals is larger. A comparator can be either a continuous or clocked amplifier. A clocked comparator typically operates in two phases: a reset phase and a comparison phase. During the reset phase, the comparator is drawn asymptotically toward a metastable point, readying it for a quick decision once the input signal arrives. During the comparison phase, the incoming signal tips this delicate balance, and regenerative positive feedback amplifies the voltage difference in an exponential fashion.

The usual analog circuit design trade-offs apply to clocked sense amplifiers: the smaller the initial voltage differential, the longer it takes for the comparator to resolve to full CMOS levels. As is often the case, one may also choose to trade off added latency for further amplification of the signal. Additional gain and latency are usually associated with larger power consumption, and often require more complex, multi-phase clocking schemes. On the other hand, one may also oversample the signal with a bank of comparators, so that each comparator is given more time to resolve. However, the extra cost in area and power and, more importantly, the added signal attenuation due to increased parasitic capacitance from the parallel paths usually make this an expensive option.

For capacitive links, the comparator offset may be stored on its input parasitic capacitance, in order to perform offset cancelation and increase its ability to correctly sense small signals in the presence of device variability. Because the offset voltage is stored as a voltage on the parasitic capacitor, it requires periodic re-calibration, at an interval that depends on how quickly charge is lost through leakage mechanisms on the input node. Because offset compensation requires these periodic interruptions of data flow in the channel, one may choose instead to reduce the intrinsic offset of the comparator. This can be accomplished by increasing the size of the devices in the comparator. This comes at a cost of area and power, and since the offset falls only as the inverse square root of the device area, decreasing the offset can be very expensive in terms of area.

An alternative is to implement a digital offset calibration scheme, whereby a finite number of levels of compensation are available. This is often implemented as a set of binary weighted capacitances that can be switched onto the internal nodes of the comparator to introduce an intentional offset that compensates for the device variation. The technique is quite effective if the ratio of maximum expected offset to minimum resolvable voltage difference is less than ten. If this ratio is very large, it becomes very expensive and complex to implement.

Another challenge encountered in comparators is the kickback of charge during the comparison phase. When the regenerative feedback takes the output to the rails, some of that output charge is transferred to the input through parasitic capacitances, creating hysteresis in the response, which reduces voltage margin. This kickback can be reduced considerably by buffering the input stage with a low-gain pre-amplifier stage followed by the comparator.
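The relationship between input amplitude and resolution time can be made concrete with a common first-order model of a regenerative latch, in which the differential voltage grows as ΔV(t) = ΔV0·exp(t/τ). The sketch below is an illustration under that assumed model; the time constant and supply voltage are hypothetical values, not figures from this chapter.

```python
import math


def resolve_time(v_in, v_full=1.8, tau=20e-12):
    """Time for an initial differential v_in to regenerate to v_full.

    Assumes ideal exponential regeneration, dV(t) = v_in * exp(t / tau).
    v_full (volts) and tau (seconds) are illustrative assumptions.
    """
    if v_in <= 0 or v_in > v_full:
        raise ValueError("v_in must be in (0, v_full]")
    return tau * math.log(v_full / v_in)


# Received swings of 20% and 1% of an assumed 1.8 V transmitted swing.
for fraction in (0.20, 0.01):
    t = resolve_time(fraction * 1.8)
    print(f"{fraction:.0%} swing -> {t * 1e12:.1f} ps")
```

Under this model each halving of the input signal adds a fixed τ·ln 2 to the decision time, so the penalty for small signals is logarithmic rather than linear.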


3.4.4 Receiver sizing

The sizing of data receivers has a profound impact on the sensitivity, power consumption, performance and reliability of the capacitive-coupled link. Two competing effects govern the sensitivity of the receiver (i.e. the ability of a receiver to correctly identify and amplify the incoming signal). First, the data receiver presents capacitive loading on the receiving pad; to maximize the voltage at the receiver’s input, the transistors connected to the receiving pad should be small. Second, transistor variability decreases for larger devices; to minimize threshold variations and corresponding input offset voltages, the transistors connected to the receiving pad should be large. The choice of receiver size must balance the competing goals of maximizing received voltage and minimizing threshold variation, while satisfying data rate, power consumption and area constraints.

Fig. 3.14 Variation in received voltage and device threshold uncertainty as a function of receiver size. [Plot: signal and 3σ threshold variation (V, logarithmic axis from 1E-3 to 1E+0) versus device width (µm, 0 to 30), with received-voltage curves for chip separations Z = 0, 3, 5, 10, and 15 µm.]

In general, the voltage on the receiving pad in a capacitive-coupled link is given by

$$V_r = \frac{C_{\mathrm{sig}}}{C_{\mathrm{sig}} + C_{\mathrm{pad}} + C_{\mathrm{rx}}} \qquad (3.6)$$

where $C_{\mathrm{sig}}$ is the signal coupling capacitance, $C_{\mathrm{pad}}$ is the total capacitance seen by the receiving pad (including the signal coupling capacitance), and $C_{\mathrm{rx}}$ is the capacitive loading presented by the receiver. Figure 3.14 shows the variation in $V_r$ for different chip separations, as a function of the width of a transistor whose gate terminal is connected to the receiving pad. The standard deviation in the threshold of such a device, with respect to the nominal, is given by

$$\sigma_{V_{th}} = \frac{A_{V_{th}}}{\sqrt{WL}} \qquad (3.7)$$

where W and L are the effective physical width and length of the device, and $A_{V_{th}}$ is a process-specific parameter. For normally-distributed device thresholds, about 1 in 1000 devices has a threshold more than 3σ from the mean. Figure 3.14 also shows the 3σ variation as a function of transistor width, for a device with L = 0.1 µm in a process with $A_{V_{th}}$ = 5 mV·µm. The points at which the curves of received voltage and threshold variation intersect indicate the minimum receiver size required at the corresponding chip separation in order to keep the received voltage above the 3σ threshold variation. For example, targeting a receiver fallout of no more than 1 in 1000 at a chip separation of 10 µm requires a minimum transistor width of about 17 µm. Explicit offset compensation can be added to the receiver circuit to reduce the impact of device variation and allow the use of smaller devices. As power consumption is increased with larger devices, this added circuit complexity may be worthwhile.
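The sizing procedure above amounts to finding where the received-voltage curve of Equation (3.6) crosses the 3σ threshold curve of Equation (3.7). The Python sketch below shows one way to do that search. The transmitted swing, the gate capacitance per micron of width, and the capacitance values in the example call are hypothetical numbers chosen only to exercise the code; only L = 0.1 µm and the 5 mV·µm matching parameter come from the text.

```python
import math


def received_fraction(c_sig, c_pad, c_rx):
    """Equation (3.6): capacitive divider at the receiving pad."""
    return c_sig / (c_sig + c_pad + c_rx)


def sigma_vth_mv(width_um, length_um=0.1, a_vth_mv_um=5.0):
    """Equation (3.7): threshold standard deviation in mV."""
    return a_vth_mv_um / math.sqrt(width_um * length_um)


def gaussian_tail(k):
    """One-sided probability that a normal variable exceeds k standard deviations."""
    return 0.5 * math.erfc(k / math.sqrt(2.0))


def min_width_um(c_sig_ff, c_pad_ff, v_swing_mv, c_gate_ff_per_um, widths_um):
    """Smallest candidate width whose received voltage is at least 3 sigma.

    v_swing_mv and c_gate_ff_per_um are assumptions: the transmitted swing
    scales Equation (3.6), and the receiver load grows linearly with width.
    """
    for w in widths_um:
        c_rx = c_gate_ff_per_um * w
        v_r = v_swing_mv * received_fraction(c_sig_ff, c_pad_ff, c_rx)
        if v_r >= 3.0 * sigma_vth_mv(w):
            return w
    return None


print(round(3.0 * sigma_vth_mv(17.0), 2))  # 3-sigma variation at W = 17 um, in mV
print(min_width_um(1.0, 40.0, 1800.0, 1.0, range(1, 31)))
```

For reference, `gaussian_tail(3.0)` evaluates to roughly 0.13%, the same order as the 1-in-1000 fallout quoted above.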

3.4.5 Timing schemes

In addition to data moving from one chip to another, there must also be a means of recovering a timing signal, indicating the validity of the data. This timing signal determines when the comparator samples the received signal on its input pads. Ideally, it should sample the data when it produces a strong and stable signal. Often, one aims to sample the data midway between adjacent data transitions, minimizing the bit-error rate with typical channel characteristics.

There are a number of ways to obtain a suitable timing signal. In the simplest case, both the transmitter chip and receiver chip distribute a low-skew global clock signal with identical frequency and well-controlled phase. In this case, the timing signal for the receiver circuits can be easily derived from this global clock signal. Unfortunately, such clock signals are not always available.

Obtaining a timing signal from the data can be divided into two parts: frequency acquisition and phase recovery. For capacitive coupled channels, frequency acquisition is fairly straightforward. A reference clock is typically forwarded along with the data from the transmitter chip to the receiver chip. Although unnecessary when both the transmitting and receiving chips have identical clock frequencies, the forwarded clock provides additional flexibility in the use of the capacitive coupled links.

Determining the optimal time to sample the data is more difficult. Although the forwarded clock can be configured to provide the appropriate phase, this phase will drift as the channel and environmental conditions change. As the chips move apart, for example, the received signal voltage falls. This increases the delay through the receivers amplifying the forwarded timing signal, causing a change in the relative phase of the clock and data channels. These variations can be accommodated by using a control loop to track the phase of the data channels, and adjust the delay of the forwarded clock channel to match.
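The text does not specify the form of this control loop. As one possible illustration, the sketch below uses a first-order bang-bang loop, a technique chosen here for simplicity rather than taken from the chapter: on each update the clock delay moves by one fixed step toward the desired sampling phase. The function name, step size, and phase units (fractions of a bit period) are assumptions.

```python
def track_phase(target_phase_ui, clock_phase_ui=0.0, step_ui=0.01, iterations=200):
    """First-order bang-bang phase tracking; phases are in unit intervals (UI).

    target_phase_ui : desired sampling phase derived from the data channels
    clock_phase_ui  : initial phase of the delayed forwarded clock
    Returns the clock phase after the given number of updates.
    """
    for _ in range(iterations):
        # Wrap the error into [-0.5, 0.5) UI so the loop takes the short way round.
        error = (target_phase_ui - clock_phase_ui + 0.5) % 1.0 - 0.5
        if error > 0:
            clock_phase_ui += step_ui  # sampling point is too early: add delay
        elif error < 0:
            clock_phase_ui -= step_ui  # sampling point is too late: remove delay
    return clock_phase_ui


print(track_phase(0.30))
```

Once locked, a loop of this kind dithers by about one step around the target, so the step size trades tracking speed against residual timing jitter.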

3.5 Two-dimensional arrays

Capacitive coupled I/O pads are arranged in two-dimensional arrays. In most applications, there are separate arrays of transmitting and receiving pads for bidirectional communication. Figure 3.15 shows one possible arrangement of these arrays. The transmitting array, on top, is slightly larger than the receiving array because it has an extra border of pads to enable electronic alignment correction. The jagged shape of the arrays is a result of the asymmetric pad arrangement for butterfly differential signaling.

Fig. 3.15 Two-dimensional transmitting and receiving arrays, arranged as modular slices for scalability. [Figure labels: array width W; array depth H; border pads for electronic alignment correction; transmitting array; timing channels; receiving array; Slice 1, Slice 2, Slice 3, ..., Slice N.]


The placement and dimensions of the arrays have significant effects on their performance and layout feasibility. In most applications, the arrays are placed near a chip edge to enable sufficient overlap with another chip (Figure 3.16). The shaded area, which recedes from the chip edge by an amount equal to the depth of the arrays plus some additional clearance, must be free of any I/O pads or other features that disturb the smoothness of the overlapping chip surface. This restriction may limit the locations at which the arrays are placed. The need for a keepout area also means that the arrays are often located far away from power I/O pins; surrounding areas must therefore have adequate metal coverage to ensure proper power delivery to the arrays.

Fig. 3.16 Typical locations of transmitting and receiving arrays, near the chip edge in a keepout region. [Figure labels: Chip 1 (face up); Chip 2 (face down); Tx; Rx; line of symmetry; keepout region.]

For a fixed number of I/O channels in an array, a tradeoff must be made between the width and depth of the arrays to satisfy physical and electrical constraints. The array width W is mainly limited by available area along the chip edge. It may also be constrained by power delivery if power is mainly supplied from the two side edges. The array depth H is mainly limited by the latency of signal propagation up the array. In most applications, data propagates vertically along the depth of the arrays; deep arrays may therefore require additional latches or flip-flops, adding latency and complexity. The array depth may also be constrained by wiring resources, as the width of each column (or pair of columns for differential signaling) must accommodate data wires for all the channels along the column.

For modularity and scalability, the arrays are often designed as slices, so that an arbitrary number of such slices can be assembled to satisfy the bandwidth requirements of different applications. Although the number of channels to include within a slice is somewhat arbitrary, a reasonable choice is the word width plus extra channels for control signals (e.g. arbitration, flow control, or parity bits). Each slice may also have separate timing and clock distribution. In source-synchronous designs, each slice contains a separate clocking channel. The clocking I/O pads are large compared to data I/O pads, because one timing channel clocks all data channels in the slice; a larger pad size allows a correspondingly larger timing receiver, which reduces the required fanout. In addition, it is possible to eliminate the electronic alignment correction circuits from a large timing channel, leading to a reduction in timing uncertainty and variation with power supply noise.

Additional issues complicate the design of the actual transmitting and receiving arrays. Circuitry for electronic alignment correction imposes significant demands on chip real estate and wiring resources beneath the transmitting pads. As a result, the transmitting array is typically placed close to the edge of the chip, so that data wires from the receiving pads do not need to be routed through it. The receiving array, however, also experiences wire constraints because it has fewer available metal layers; one or two metal layers beneath the top-level I/O pads are typically left empty, or with floating fill metal, in order to minimize parasitic pad loading. These wiring constraints become more stringent as I/O pad sizes scale down, although this is somewhat alleviated by the availability of more metal layers and finer wire pitches in more advanced fabrication processes.

3.6 Measurement results

We designed and tested a chip fabricated in the TSMC 180 nm CMOS process. The chip has four capacitive coupled I/O slices for a total of 72 transmit and 72 receive data channels. All channels can operate simultaneously at up to 1.8 Gbps per channel [16]. This provides a maximum aggregate I/O bandwidth of 260 Gbps on each chip, equivalent to 430 Gbps/mm². On PRBS31 data, the measured bit error rate (BER) is lower than $10^{-15}$. This BER was limited by test time, with this measurement representing 3 weeks of operation without a single error. A hand calculation gives an estimated BER of $10^{-26}$ for this system operating under nominal conditions. The combined energy cost of the transmitter, receiver and amortized clock distribution is 3.0 pJ/bit. In addition, this implementation includes electronic alignment correction capable of correcting up to ±18 microns of planar misalignment, and our noise-cancelling layout configuration scheme, which reduces the BER estimate by three orders of magnitude.
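The headline figures above can be cross-checked with a few lines of arithmetic. The sketch below is a back-of-envelope check rather than a reproduction of the authors' calculation; in particular, it assumes the BER test ran continuously at the full 1.8 Gbps rate and counts bits on a single channel, and the implied I/O area and power are derived quantities that the text does not state directly.

```python
SECONDS_PER_WEEK = 7 * 24 * 3600


def aggregate_bandwidth_gbps(tx_channels=72, rx_channels=72, rate_gbps=1.8):
    """Total I/O bandwidth when all channels run simultaneously."""
    return (tx_channels + rx_channels) * rate_gbps


def implied_io_area_mm2(bandwidth_gbps=260.0, density_gbps_per_mm2=430.0):
    """I/O area implied by the quoted bandwidth and bandwidth density."""
    return bandwidth_gbps / density_gbps_per_mm2


def io_power_watts(bandwidth_gbps, energy_pj_per_bit=3.0):
    """Power at full rate for the quoted energy per bit."""
    return bandwidth_gbps * 1e9 * energy_pj_per_bit * 1e-12


def bits_observed(rate_gbps=1.8, weeks=3, channels=1):
    """Bits transferred during the error-free run (rate and channel count assumed)."""
    return rate_gbps * 1e9 * weeks * SECONDS_PER_WEEK * channels


bw = aggregate_bandwidth_gbps()
print(f"aggregate bandwidth: {bw:.1f} Gbps")
print(f"implied I/O area:    {implied_io_area_mm2():.2f} mm^2")
print(f"I/O power at rate:   {io_power_watts(bw):.2f} W")
print(f"bits per channel:    {bits_observed():.2e}")
```

Under these assumptions a single channel carries more than $10^{15}$ bits in three weeks, which is consistent with the error-free run supporting a BER bound on the order of $10^{-15}$.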

3.6.1 Voltage waterfall

Figure 3.17 shows a plot of bit-error rate (BER) versus offset voltage, often referred to as a “waterfall plot” due to its characteristic shape. In this experiment the channels are operating at 1.6 Gbps and the differential inputs to the receiving amplifiers are intentionally biased to different voltages, to introduce a voltage offset. As the amount of offset voltage is varied, the signal-to-noise ratio varies, creating a corresponding change in the BER. For the results shown here, no bit errors were observed until the magnitude of the offset voltage exceeded 200 mV, followed by a rapid increase in BER. This indicates a channel with quite large voltage margins, and very little random noise.

Fig. 3.17 Voltage waterfall curve: variation in BER vs. voltage offset. © 2007 IEEE. [Plot: BER (logarithmic axis from 1E-15 to 1E+0) versus voltage offset (mV, −250 to 250) at 1.60 Gbps; annotated voltage margin = 207 mV.]

3.6.2 Timing waterfall

Analogous to the voltage waterfall is a timing waterfall plot (Figure 3.18). In this experiment, the relative timing between the edges of the clock channel and the data channel is intentionally skewed, to introduce a timing offset. As the amount of timing offset is increased, the sampling time of the data approaches the time when the data transitions and the BER increases. For the results shown here, no bit errors were observed until the magnitude of the timing offset exceeded 35% of the bit period, followed by a fairly rapid increase in BER. This indicates a channel with quite large timing margins, and reasonably low timing jitter.

Fig. 3.18 Timing waterfall curve: variation in BER vs. timing offset. © 2007 IEEE. [Plot: BER (logarithmic axis from 1E-15 to 1E+0) versus timing offset (axis labeled % bit period, ticks from −0.5 to 0.5) at 1.60 Gbps; annotated timing margin = 0.72 Tbit = 450 ps.]
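The timing margin annotated in Fig. 3.18 follows directly from the bit period at 1.6 Gbps. The short Python check below reproduces that arithmetic; the function names are hypothetical, and the margin is treated as the full error-free window expressed as a fraction of the bit period.

```python
def bit_period_ps(rate_gbps):
    """Bit period in picoseconds for a data rate in Gbps."""
    return 1000.0 / rate_gbps


def timing_margin_ps(rate_gbps, margin_fraction):
    """Error-free window in picoseconds, given as a fraction of the bit period."""
    return margin_fraction * bit_period_ps(rate_gbps)


print(f"{bit_period_ps(1.6):.0f} ps bit period")
print(f"{timing_margin_ps(1.6, 0.72):.0f} ps timing margin")
```

A total window of 0.72 bit periods corresponds to about ±0.36 bit periods around the nominal sampling point, in line with the roughly 35% offset quoted in the text.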

3.6.3 Combined eye diagram

By varying both voltage and timing offsets and measuring BER at each combination of voltage and timing, a type of “eye” diagram can be created. Figure 3.19 shows a contour of constant BER (1 in $10^9$ bits) as both voltage and timing offsets are varied. The figure provides a visual confirmation that the channel still has substantial margin when operating at 1.8 Gbps.

Fig. 3.19 Eye diagram: constant-BER contours of voltage and timing offsets. © 2007 IEEE. [Plot: voltage offset VPN (mV, −175 to 175) versus timing offset tm (axis labeled % bit period, ticks from −1 to 1) at 1.80 Gbps.]

3.6.4 BER versus chip separation

In this experiment, a pair of chips was mounted on a high-precision six-axis positioning system that provides sub-micron placement resolution. Using this system, BER was measured as the interchip gap was increased (Figure 3.20). As the gap was widened, no errors were observed until the gap reached about 9 microns. At that point the BER degraded quite rapidly, and by 12 microns the channel was not usable for most applications. Applying receiver offset compensation can extend this range. All measurements are taken with air as the interchip dielectric; this tolerance can be significantly increased by using interposer materials with higher permittivity.

3.7 Prototype application: a high-radix switch

Capacitive coupled communication offers a number of important advantages over conventional interchip communication technologies that can be leveraged to develop systems that have lower cost, higher reliability, more flexibility or greater capability. In order to get a flavor for these possibilities, we present an application example that highlights some of the system-level advantages gained by architecting a computing system with this technology in mind. This example is a study of high-radix switching networks enabled by extremely high bandwidth interconnects. The huge increase in chip I/O bandwidth made possible by capacitive coupling allows the system designer to completely re-architect large-scale switching systems.

Fig. 3.20 Measured BER as a function of interchip separation. © 2007 IEEE. [Plot: BER (logarithmic axis from 1E-15 to 1E+00) versus chip separation (µm, 8 to 15), with curves at 1.80 Gbps and 1.50 Gbps.]

Today, single-chip high bandwidth Ethernet and Infiniband switches are limited to about 36 ports, due to the costs of interchip I/O bandwidth. If a larger switch is required, designers resort to hierarchical, multistage topologies. There is significant cost and complexity associated with building these multistage networks. Furthermore, their performance is inferior to that of a single-stage network. For example, a multistage network can suffer from saturation under non-uniform, real-world traffic.

Thanks to the large amount of chip I/O bandwidth offered by capacitive coupled communication, it is possible to extend a single-stage switch architecture to a larger scale. Multi-chip switch fabrics can be implemented with a simple crossbar architecture that was previously only applicable to single-chip switch implementations. This simple architecture is made possible because PxC offers enough bandwidth that a large switch can be partitioned in such a way that the full bisection bandwidth can be exposed at the chip boundaries.

A large scale crossbar switch can be implemented in an MCM, where chips are interconnected by capacitive coupled I/O links. The MCM may contain a one-dimensional vector of chips or a two-dimensional array of chips. Although a vector is easy to implement and package, a matrix design affords greater flexibility and enables larger scale systems. It is easy to map such a crossbar onto a vector MCM. Figure 3.21 shows such a system, with each crossbar slice mapped onto an Island chip and the bus segments stitched together by capacitive coupled links by way of the Bridge chips. The sliced design shown in Figure 3.21 suggests an implementation through a linear array of chips; however, it is equally straightforward to map these designs onto chips arranged in a two-dimensional matrix.

Fig. 3.21 Architecture of an output-buffered crosspoint switch using capacitive coupled communication. © 2008 IEEE. [Figure labels: Input Ports 1 to 4; Output Ports 1 to 4; Island chips; Bridge chips.]

In order to demonstrate the viability of capacitive coupled I/O, we implemented a small-scale vector switch prototype [17]. This prototype extends previous demonstrations of capacitive coupled communication in several ways: 1. it uses larger chips that are more representative of high-performance systems applications and that are more challenging from a mechanical and thermal perspective, 2. it uses a larger chip assembly that consists of four Island chips and two Bridge chips, 3. it uses a prototype face-to-face chip package to align the chips, and 4. it demonstrates an actual system application in the form of a fully-functional switch. This prototype implements an Ethernet switch with four 10 Gbps ports. The internal architecture is a fully-buffered 4x4 crossbar that is vertically sliced such that each slice corresponds to a single Island chip and implements four crosspoints, one input port and one output port (Figure 3.21). Each Island chip connects to the PCB through two 16-bit wide LVDS data links (Figure 3.22). There are three such pairs of 16-bit wide capacitive coupled I/O interfaces on the left and right sides of the chip to connect to the neighboring Island chips, through face-down Bridge chips.


All links run at a data rate of 1 Gbps, corresponding to a 500 MHz DDR clock rate. Because the internal datapaths run only 1/4th as fast, there are deserializers and serializers interfacing the external I/O to the switching core. Because data transmission over these capacitive coupled communication links is assumed to be DC-balanced, packets are 8B10B-encoded when they enter an Island chip. The resulting 25% encoding overhead is accommodated by running the I/O links with a raw bandwidth of 16 Gbps. The switch core of each Island chip is mainly made up of buffer memories.
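As a quick sanity check on these figures, the following sketch recomputes the raw and payload bandwidth of one 16-bit link. The function name and the use of integer bits per second are illustrative choices, not part of the prototype.

```python
def link_bandwidth(lanes, lane_rate_bps, payload_bits=8, coded_bits=10):
    """Return (raw, payload) bandwidth in bits per second for a coded link."""
    raw = lanes * lane_rate_bps
    # 8B/10B carries 8 payload bits in every 10 transmitted bits.
    payload = raw * payload_bits // coded_bits
    return raw, payload


# 16 lanes at 1 Gbps each, as in the prototype description.
raw, payload = link_bandwidth(lanes=16, lane_rate_bps=1_000_000_000)
overhead = (raw - payload) / payload  # extra bits relative to the payload
print(raw, payload, overhead)
```

Under these numbers the coded link still carries more than the 10 Gbps required by one Ethernet port, which is consistent with the margin described above.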

Fig. 3.22 Prototype system of a crosspoint buffered switch in a packaged one-dimensional vector array (labels in the figure: Island chip, Bridge chip). Note that two bridge chips are physically implemented as one long monolithic chip, for ease of packaging.

A flat switch offers many advantages, including low and uniform latency, resistance to saturation, increased scalability, reduced chip count and cost, reduced power consumption, and higher reliability. Because the minimum forwarding delay in a multistage network is typically proportional to the number of stages, latency for a single-stage switch can be much lower than for a multistage switch. Furthermore, many multistage networks are susceptible to traffic imbalance that further increases latency. In contrast, it is much easier to guarantee that a flat switch operates in such a way that a congested output does not slow or stop traffic flow to other, uncongested outputs.

It is often beneficial for an architecture to span a wide range of switch sizes. A sliced crossbar is flexible in this respect: it allows switches to be built with any number of slices up to a maximum given by the bisection bandwidth of the system. Finally, a flat switch requires fewer switch chips than a multistage network with the same number of ports. For example, a 3-stage 288-port switch requires 36 switch chips, whereas a similar single-stage switch requires only 12 switch chips (each implementing 24 ports). Reducing component count not only reduces cost but also reduces power consumption and increases reliability.

Capacitive coupled I/O technology changes the way we build large-scale systems based on MCMs. By offering many times more chip-to-chip bandwidth than conventional I/O technologies, capacitive coupled communication allows the designer to rethink how systems are partitioned.
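The chip counts quoted for the 288-port example can be reproduced with a few lines of arithmetic. The sketch below assumes that the 3-stage network is a folded Clos (fat-tree) built from the same 24-port chips, with half of each leaf chip's ports facing the end nodes; the text does not state the topology, so this is only one plausible reading.

```python
def flat_switch_chips(ports, ports_per_chip):
    """Chips needed when every chip contributes all of its ports externally."""
    return -(-ports // ports_per_chip)  # ceiling division


def folded_clos_chips(ports, radix):
    """Chips in a folded 3-stage Clos built from chips with `radix` ports.

    Assumption: each leaf chip uses half its ports for end nodes and half
    for uplinks, and each spine chip uses all ports for leaf-facing links.
    """
    down = radix // 2
    leaves = -(-ports // down)
    uplinks = leaves * (radix - down)
    spines = -(-uplinks // radix)
    return leaves + spines


print(folded_clos_chips(288, 24), flat_switch_chips(288, 24))
```

With these assumptions the multistage design needs 24 leaf chips plus 12 spine chips, three times as many as the flat switch.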


References

1. R.J. Drost, R.D. Hopkins, R. Ho, I.E. Sutherland, “Proximity communication,” IEEE Journal of Solid-State Circuits, vol. 39, no. 9, 2004, pp. 1529–1535.
2. A. Chow, D. Hopkins, R. Drost, R. Ho, “Exploiting capacitance in high-performance computer systems,” 4th Annual IEEE International Symposium on VLSI Design, Automation, and Test, 2008, pp. 55–58.
3. A.V. Krishnamoorthy, R. Ho, X. Zheng, H. Schwetman, J. Lexau, P. Koka, G. Li, I. Shubin, J.E. Cunningham, “Computer systems based on silicon photonic interconnects,” Proceedings of the IEEE, vol. 97, no. 7, 2009.
4. J. Cunningham, X. Zheng, I. Shubin, R. Ho, J. Lexau, A.V. Krishnamoorthy, M. Asghari, D. Feng, J. Luff, H. Liang, C. Kung, “Optical proximity communication in packaged SiPhotonics,” 5th IEEE International Conference on Group IV Photonics, 2008.
5. X. Zheng, P. Koka, H. Schwetman, J. Lexau, R. Ho, I. Shubin, J. Cunningham, A.V. Krishnamoorthy, “A Silicon photonic WDM network for high performance macrochip communications,” Proceedings, SPIE Photonics West, Vol. 7221: Photonics packaging, integration, and interconnects IX, 2009.
6. N. Miura, Y. Kohama, Y. Sugimori, H. Ishikuro, T. Sakurai, T. Kuroda, “A high-speed inductive-coupling link with burst transmission,” IEEE Journal of Solid-State Circuits, vol. 44, no. 3, 2009, pp. 947–955.
7. N. Miura, H. Ishikuro, T. Sakurai, T. Kuroda, “A 0.14 pJ/b inductive-coupling inter-chip data transceiver with digitally-controlled precise pulse shaping,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2007, pp. 358–359.
8. N. Miura, D. Mizoguchi, M. Inoue, K. Niitsu, Y. Nakagawa, M. Tago, M. Fukaishi, T. Sakurai, T. Kuroda, “A 1 Tb/s 3W inductive-coupling transceiver for inter-chip clock and data link,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2006, pp. 424–425.
9. N. Miura, D. Mizoguchi, M. Inoue, H. Tsuji, T. Sakurai, T. Kuroda, “A 195 Gb/s 1.2 W 3D-stacked inductive inter-chip wireless superconnect with transmit power control scheme,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2005, pp. 264–265.
10. J. Kim, B.S. Leibowitz, J. Ren, C.J. Madden, “Simulation and analysis of random decision errors in clocked comparators,” IEEE Transactions on Circuits and Systems I, in press.
11. P. Nuzzo, F. De Bernardinis, P. Terreni, G. Van der Plas, “Noise analysis of regenerative comparators for reconfigurable ADC architectures,” IEEE Transactions on Circuits and Systems I, vol. 55, no. 6, 2008, pp. 1441–1454.
12. R. Drost, R. Ho, R. Hopkins, I. Sutherland, “Electronic alignment for proximity communication,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2004, pp. 144–518.
13. A. Chow, R. Hopkins, R. Ho, R. Drost, “Measuring 6D chip alignment in multi-chip packages,” 6th Annual IEEE Conference on Sensors, 2007, pp. 1307–10.
14. A.X. Widmer, P.A. Franaszek, “A DC-balanced, partitioned-block, 8B/10B transmission code,” IBM Journal of Research and Development, vol. 27, no. 5, 1983, pp. 440–452.
15. R. Walker, R. Dugan, “64b/66b low-overhead coding proposal for serial links,” IEEE 802.3 HSSG 10G Study proposal, January 12, 2000.
16. D. Hopkins, A. Chow, R. Bosnyak, B. Coates, J. Ebergen, S. Fairbanks, J. Gainsley, R. Ho, J. Lexau, F. Liu, T. Ono, J. Schauer, I. Sutherland, R. Drost, “Circuit techniques to enable 430 Gb/s/mm/mm proximity communication,” Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2007, pp. 368–369.
17. H. Eberle, P.J. Garcia, J. Flich, J. Duato, R. Drost, N. Gura, D. Hopkins, W. Olesinski, “High-radix crossbar switches enabled by proximity communication,” Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, 2008.

Chapter 4

Inductive Coupled Communications

Noriyuki Miura, Takayasu Sakurai, and Tadahiro Kuroda

Professor Noriyuki Miura, Department of Electrical Engineering, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan
Professor Takayasu Sakurai, Institute of Industrial Science, University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan
Professor Tadahiro Kuroda, Department of Electrical Engineering, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan

R. Ho and R. Drost (eds.), Coupled Data Communication Techniques for High-Performance and Low-Power Computing, Integrated Circuits and Systems, DOI 10.1007/978-1-4419-6588-2_4, © Springer Science+Business Media, LLC 2010

4.1 Introduction

Inductive coupled communication is a wireless communication technology for three-dimensionally (3D) stacked chips in a package. As discussed in a previous chapter, capacitive coupled communication (see Figure 4.1) uses a pair of metal electrodes that forms a capacitive-coupling channel (essentially a capacitor) as a vertical wireless data link between stacked chips. In inductive coupled communication, a pair of metal coils creates an inductive-coupling channel (essentially a transformer) between stacked chips. Both are pure digital circuit solutions compatible with standard CMOS technology. The metal electrodes and the metal coils can be fabricated using IC interconnect layers. No additional wafer or mechanical processes are required, and hence they are inexpensive. In addition, since the capacitive- and inductive-coupling channels create inter-chip links without any physical or mechanical contact, electrostatic-discharge (ESD) protection devices are not needed, which allows the inter-chip link to be high-speed, low-power, and small in area. Moreover, since both channels are AC-coupled, they can connect chips operating under different supply voltages without level shifters. As described previously, these two wireless communication technologies have many potential advantages over wired mechanical solutions such as micro bumps and through-Si vias (TSVs). However, electromagnetic and circuit co-optimization is necessary to deliver high-performance and highly reliable operation.

This chapter deals with inductive coupled communications. First, the basic characteristics of the inductive-coupling channel are explained in comparison with the capacitive-coupling channel. Next, channel and transceiver circuit co-design is described. Several circuit techniques for performance enhancement are introduced and evaluated by test-chip measurements. Finally, example applications of inductive coupled communications are discussed and prototype demonstrations are presented.

Fig. 4.1 Capacitive (left) and inductive (right) coupled communications technologies. The left panel is labeled Metal Electrode and the right panel Metal Coil. © 2008 IEEE

4.2 Inductive-coupling channel

This section covers channel characteristics, coupling range through silicon, and crosstalk issues.

4.2.1 Overview of channel characteristics

Figure 4.2 illustrates an inductive-coupling channel model. A transmitter (Tx) coil is driven by a transmit current IT. As IT changes, a magnetic field H is generated and a received voltage VR is induced in a receiver (Rx) coil. As shown in the equivalent circuit of the channel in Figure 4.2, the Tx and Rx coils are each modeled as a parallel resonator, where L, C, and R represent the self-inductance, parasitic capacitance, and parasitic resistance of the coil, respectively. The magnetic coupling between the coils is given by the mutual inductance M. Based on this equivalent circuit, the transfer function of the inductive-coupling channel is given by

$$V_R = \frac{1}{(1 - \omega^2 L_R C_R) + j\omega C_R R_R} \cdot j\omega M \cdot \frac{1}{(1 - \omega^2 L_T C_T) + j\omega C_T R_T} \cdot I_T \qquad (4.1)$$

Equation 4.1 can be expressed as

$$V_R = B_R(\omega) \cdot j\omega M \cdot B_T(\omega) \cdot I_T \qquad (4.2)$$

$$B(\omega) = \frac{1}{(1 - \omega^2 LC) + j\omega CR} \qquad (4.3)$$

The first term BR(ω) and the third term BT(ω) in Equation 4.2 denote bandwidth limitations due to the parasitic C and R. In an ideal inductive-coupling channel without any parasitics (C = 0, R = 0),

$$V_R = j\omega M \cdot I_T = M\,\frac{dI_T}{dt} \qquad (4.4)$$

Fig. 4.2 Channel model (left) and frequency characteristics of inductive coupling (right). The right panel plots the gain |B(ω)| and the trans-impedance |VR/IT| [Ω] against frequency [Hz], with fSR marked.

As Equation 4.4 indicates, the ideal inductive-coupling channel functions as a first-order differentiator jωM, and its frequency characteristics are proportional to the frequency, as shown in Figure 4.2. In an actual inductive-coupling channel, the bandwidth limitation of the resonator B(ω) described above is applied on both the transmitter and the receiver side. As shown in Figure 4.2, B(ω) behaves as a second-order low-pass filter with peaking at the self-resonant frequency of the coil, fSR:

$$f_{SR} = \frac{1}{2\pi\sqrt{LC}} \qquad (4.5)$$
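To make the shape of this response concrete, the sketch below evaluates Equations 4.2, 4.3, and 4.5 numerically. The coil values (10 nH, 25 fF, 100 Ω) and the mutual inductance (2 nH) are hypothetical numbers chosen only for illustration; they are not taken from the chapter.

```python
import math


def resonator_response(freq_hz, ind, cap, res):
    """B(w) from Eq. 4.3 for one coil with inductance, capacitance, resistance."""
    w = 2.0 * math.pi * freq_hz
    return 1.0 / complex(1.0 - w * w * ind * cap, w * cap * res)


def self_resonant_frequency(ind, cap):
    """f_SR from Eq. 4.5."""
    return 1.0 / (2.0 * math.pi * math.sqrt(ind * cap))


def transimpedance(freq_hz, mutual, tx, rx):
    """V_R / I_T from Eq. 4.2; tx and rx are (L, C, R) tuples."""
    w = 2.0 * math.pi * freq_hz
    return (resonator_response(freq_hz, *rx) * 1j * w * mutual
            * resonator_response(freq_hz, *tx))


# Hypothetical coil parameters (not from the text): L, C, R and mutual inductance.
COIL = (10e-9, 25e-15, 100.0)
MUTUAL = 2e-9

f_sr = self_resonant_frequency(COIL[0], COIL[1])
for f in (f_sr / 100, f_sr / 10, f_sr, 3 * f_sr):
    z = abs(transimpedance(f, MUTUAL, COIL, COIL))
    print(f"{f / 1e9:8.3f} GHz  |VR/IT| = {z:10.3f} ohm")
```

Well below fSR the printed magnitude tracks ωM, the differentiator behavior of Equation 4.4, and it peaks near fSR before rolling off, as a band-pass response would.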


Overall, the inductive-coupling channel behaves as a band-pass filter with a peak at around fSR. Figure 4.2 shows that the channel operates as a differentiator at frequencies below fSR, which means that the bandwidth of the inductive-coupling channel is determined by fSR. In this frequency range, the received voltage VR is approximately given by Equation 4.4, and therefore the VR amplitude is proportional to the trans-impedance ωM. Here, ωM can be rewritten as

$$\omega M = \omega k \sqrt{L_T L_R} \qquad (4.6)$$

where k is the coupling coefficient between the transmitter and receiver coils. To increase the operating frequency ω, the channel bandwidth, and thus fSR, has to be increased, which requires the self-inductance L to be reduced (see Equation 4.5). As a result, $\omega\sqrt{L_T L_R}$ stays roughly constant in most cases, and it is the coupling coefficient k that determines the trans-impedance and hence the VR amplitude. k is defined by the ratio of received to transmitted magnetic flux. It is approximately given by the coil diameter D and the communication distance X between the coils as

$$k = \left\{ \frac{0.25}{(X/D)^2 + 0.25} \right\}^{1.5} \qquad (4.7)$$

Fig. 4.3 Calculated coupling coefficient depending on communication distance and coil diameter. The plot shows the coupling coefficient k on a logarithmic axis against X/D, with the saturation, linear, square, and cubic regions marked.

Figure 4.3 plots k calculated by Equation 4.7 as a function of X/D. Figure 4.3 shows that the X/D dependency of k can be classified into four regions according to the value of X/D:

1. Saturation region: X/D < 1/5
2. Linear region: 1/5 < X/D < 1/3
3. Square region: 1/3 < X/D < 1
4. Cubic region: 1 < X/D
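The region boundaries and Equation 4.7 translate directly into a small helper. The sketch below is illustrative only: the function names are made up, and the handling of values that fall exactly on a boundary (assigned to the upper region) is an arbitrary choice, since the text uses strict inequalities.

```python
def coupling_coefficient(x_over_d):
    """Coupling coefficient k from Eq. 4.7 as a function of X/D."""
    return (0.25 / (x_over_d ** 2 + 0.25)) ** 1.5


def operating_region(x_over_d):
    """Classify X/D into the four regions listed above.

    Assumption: a value exactly on a boundary is placed in the upper region.
    """
    if x_over_d < 1 / 5:
        return "saturation"
    if x_over_d < 1 / 3:
        return "linear"
    if x_over_d < 1:
        return "square"
    return "cubic"


for ratio in (0.1, 0.25, 0.5, 2.0):
    print(ratio, operating_region(ratio), round(coupling_coefficient(ratio), 4))
```

For example, a 30 µm coil communicating over 15 µm gives X/D = 0.5, which falls in the middle of the square region.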

Many early prototypes of inductive-coupling transceivers operated in the square region [1, 2, 3, 4, 5, 6]. Since the received signal is strongly attenuated, by the square of any variation ΔX in the communication distance, a data recovery scheme with high noise immunity is required, such as a synchronous scheme in which the received voltage is sampled by a synchronous clock. In this approach, the receiver is not exposed to noise except at the sampling moment. As a result, the signal-to-noise ratio (SNR) is improved, enabling highly reliable data communication in the square region. Details of the synchronous transceiver are described in Section 4.3.

In the linear region, the signal attenuation is milder, so asynchronous data recovery schemes can be used. An asynchronous inductive-coupling transceiver [7, 8] achieved high-speed operation by eliminating the complicated timing controller used in synchronous schemes. Details of the asynchronous transceiver are discussed in Section 4.5.

In the cubic region, k degrades with the cube of X/D; however, a receiver with multiple amplification stages can communicate even in such an adverse region [9]. In the saturation region, k asymptotically approaches one and the coupling efficiency becomes very high. The channel in this region is mainly used for wireless power delivery, as in [10].

4.2.2 Range extendability

The inductive-coupling channel can extend its communication distance simply by increasing the coil diameter. As described in the previous section, the coupling gain of the inductive-coupling channel is governed by the coupling coefficient k. Equation 4.7 shows that k depends only on the ratio of the distance to the coil diameter, X/D. Therefore, even if the communication distance is extended, the inductive-coupling channel can keep its coupling gain constant by increasing the coil diameter in proportion.

For the capacitive-coupling channel, on the other hand, it is difficult to extend the communication distance. Figure 4.4 depicts a simplified model of a capacitive-coupling channel. A transmitter electrode is driven by a transmit voltage VT. As VT changes, an electric field E is generated and a received voltage VR is induced in a receiver electrode. The received voltage VR is approximately given as:

$$V_R = \frac{C_C}{C_C + C_{SUB}}\,V_T \qquad (4.8)$$

Fig. 4.4 Simplified channel model of capacitive coupling. The model shows electrodes of area S separated by distance X, the coupling capacitance CC, and the substrate capacitance CSUB across distance XSUB.

where CC is the capacitance between the electrodes and CSUB is that between the receiver electrode and the substrate. Modeling CC and CSUB as simple parallel-plate capacitors, we have

$$V_R = \frac{X_{SUB}}{X_{SUB} + X}\,V_T \qquad (4.9)$$

where X is the distance between the electrodes and XSUB is that between the receiver electrode and the substrate. As Equation 4.9 indicates, VR is reduced for long-distance communication, since CC decreases with increasing X and VT is limited by the supply voltage VDD. Even if the electrode is enlarged, VR hardly increases because CC and CSUB both increase in the same way. As a result, the communication distance of the capacitive-coupling channel is limited.
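The contrast between the two channels can be illustrated by evaluating Equations 4.7 and 4.9 side by side. In the sketch below, the substrate distance of 5 µm and the list of communication distances are hypothetical values picked for illustration, and the coil diameter is scaled as D = 2X to mimic the range-extension argument.

```python
def capacitive_gain(x_um, x_sub_um):
    """V_R / V_T from Eq. 4.9 (parallel-plate model, independent of area)."""
    return x_sub_um / (x_sub_um + x_um)


def inductive_coupling(x_um, d_um):
    """Coupling coefficient k from Eq. 4.7."""
    return (0.25 / ((x_um / d_um) ** 2 + 0.25)) ** 1.5


X_SUB_UM = 5.0  # hypothetical electrode-to-substrate distance

for x_um in (15.0, 30.0, 60.0):
    d_um = 2.0 * x_um  # coil diameter scaled with distance
    print(x_um,
          round(capacitive_gain(x_um, X_SUB_UM), 4),
          round(inductive_coupling(x_um, d_um), 4))
```

The capacitive gain falls steadily as X grows regardless of electrode size, while k stays fixed because X/D is held constant.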

4.2.3 Coupling strength through Si substrate

Another advantage of inductive coupling is its coupling strength through a Si substrate. Figure 4.5 plots simulated S21 parameters of the capacitive- and inductive-coupling channels through the substrate. When the substrate resistivity is reduced to between 1 and 0.1 Ω·cm (the typical resistivity of p+ Si), the electric field of capacitive coupling is significantly attenuated in the substrate, causing a rapid decrease in the S21 parameter. As a result, it is difficult for the capacitive-coupling channel to communicate through the Si substrate. Capacitive coupling is therefore used only for data links in face-to-face chip stacks.

Inductive coupling, on the other hand, uses magnetic fields for signal transmission. Even if a p+ Si substrate is inserted between the coils, the magnetic field is only minimally attenuated by eddy currents in the Si substrate. Since the S21 parameter is degraded by only several percent through the substrate, the inductive-coupling channel can be applied not only to face-to-face but also to face-up, face-down, and even back-to-back chip stacks. This variety of stacking options gives chip designers more flexibility in 3D integration. In addition, conventional and hence inexpensive packaging technologies can be used for power delivery to the stacked chips. For example, in a back-to-back chip stack with inductive-coupling communications [11], a processor chip is mounted face down on a package using C4 bumps and an SRAM chip is glued on it face up, with power provided by conventional wire-bonding. The inductive-coupling channel in this stack communicates through the substrates of both the processor and the SRAM chips. Further details are given in Section 4.8.

Fig. 4.5 Simulated S21 parameters of inductive and capacitive coupling through substrate. The plot shows normalized S21 at 10 GHz against substrate resistivity ρ [Ω·cm], with the p+ Si resistivity marked.

4.2.4 Crosstalk

Distributing capacitive- or inductive-coupling channels in an area array increases data bandwidth. However, since both technologies communicate wirelessly, crosstalk between neighboring channels may degrade performance. Crosstalk is stronger in the inductive-coupling channel than in the capacitive-coupling channel. In capacitive coupling, the electric field is confined inside the capacitor, so the crosstalk is inherently small. Only the crosstalk from the nearest adjacent channels needs to be considered, and it can be easily reduced by a ground shield structure [12]. Therefore, crosstalk is not a serious issue in capacitive-coupling communications.

In inductive coupling, on the other hand, the magnetic field of a coil easily extends to adjacent coils. Figure 4.6 shows the inductive-coupling crosstalk calculated by a theoretical model based on the Biot-Savart law [13]. When the horizontal distance between the coils exceeds twice the coil diameter (Y > 2D), the crosstalk decreases rapidly, as $1/Y^3$. As a result, the crosstalk from channels at Y ≥ 3D is negligibly small. However, the crosstalk from channels at Y ≤ 2D cannot be ignored. Unfortunately, inductive-coupling crosstalk cannot be reduced by a ground shield structure; the channel pitch has to be increased to suppress crosstalk in the channel array. Figure 4.6 plots the aggregated crosstalk in the channel array as a function of the channel pitch P. To suppress the crosstalk sufficiently (to about −20 dB), the channel pitch has to be increased to between 2D and 3D. For high-density channel arrangements, crosstalk reduction techniques are required. Circuit solutions based on time-division [3, 4] and space-division multiplexing [14] have been presented. Further details are described in Section 4.6.

Fig. 4.6 Calculated crosstalk from a single channel (left) and from multiple channels in array (right). Both panels plot the crosstalk-to-signal ratio [dB] for D = 3X: the left against normalized horizontal distance Y/D, the right against normalized channel pitch P/D.

4.3 Inductive-coupling transceiver

This section presents the basic design theory of the inductive-coupling transceiver, taking a prototype synchronous inductive-coupling transceiver as a design example [3, 4]. First, the signaling scheme is discussed and the characteristics of the transmitted and received signals are analyzed. Next, coil layout design is explained, together with a design guideline for optimizing the channel characteristics. Transceiver circuit design is then described. Finally, an inductive-coupling transceiver designed according to this theory is evaluated in inter-chip communication.

4.3.1 Signaling

Signaling schemes used in wireless communications are classified into carrier modulation and pulse modulation. Long-distance wireless communications (e.g. mobile phones and wireless LAN) employ carrier modulation. In carrier modulation, the signal spectrum is concentrated around the carrier frequency, so out-of-band noise can be filtered out to improve the signal-to-noise ratio (SNR). This enables highly reliable wireless communication even if the communication distance is long and the channel is lossy. However, carrier modulation requires complicated analog circuits, such as a voltage-controlled oscillator, low-noise amplifier, mixer, and filter, which results in high power and large area consumption in the transceiver.

Pulse modulation, on the other hand, spreads the signal spectrum over a wide frequency band, making noise filtering more difficult. As a result, SNR is significantly degraded in a lossy channel, and it is hard to apply pulse modulation to long-distance wireless communications. In an inductive-coupling channel, however, the coils are coupled in close proximity, providing a low-loss wireless channel. This guarantees a high SNR even if a wide frequency band is used. Therefore, pulse modulation can be used in inductive-coupling communications.

Fig. 4.7 Bi-phase modulation (BPM). The figure shows the transmitter (a pulse generator driven by Txclk and Txdata), the receiver clocked by Rxclk, and waveforms of Txclk, Txdata, IT, VR, Rxclk, and Rxdata against time.

Pulse-modulated signals can be generated by simple digital circuits using digital clock and data. Complicated analog circuits are not needed, enabling a transceiver to be low-power and small-area. Bi-Phase Modulation (BPM) can be employed for the data link (see Figure 4.7) [3, 4]. At the rising edge of the transmitter clock Txclk, a transmitter produces positive or negative pulse current IT , according to Txdata. A positive pulse is generated when Txdata is High and a negative pulse is generated when Txdata is Low. The IT signal induces a positive or negative pulse-shaped voltage VR in the receiver coil. The receiver directly samples VR by the receiver clock Rxclk, and recovers digital data Rxdata.
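The BPM link described above can be mimicked with a short behavioral model. The sketch below assumes a Gaussian current pulse in normalized units (unit amplitude, unit width, unit mutual inductance) and approximates the channel by a numerical time derivative; the function names and the sampling instant are illustrative choices, not details of the actual transceiver.

```python
import math


def tx_current(t, bit, i_p=1.0, tau=1.0):
    """Transmit pulse: positive for a 1 bit, negative for a 0 bit (assumed Gaussian)."""
    polarity = 1.0 if bit else -1.0
    return polarity * i_p * math.exp(-4.0 * t * t / (tau * tau))


def rx_voltage(t, bit, mutual=1.0, dt=1e-6):
    """Received voltage M * dI_T/dt, using a central difference."""
    return mutual * (tx_current(t + dt, bit) - tx_current(t - dt, bit)) / (2.0 * dt)


def bpm_receive(bits, sample_time=-0.35):
    """Sample the first half of each received double pulse and decide by polarity."""
    return [1 if rx_voltage(sample_time, b) > 0 else 0 for b in bits]


data = [1, 0, 1, 1, 0]
print(bpm_receive(data))
```

Sampling in the first half of the received pulse recovers the data directly; sampling in the second half (a positive `sample_time`) would see the opposite polarity, which is why the receiver clock phase matters.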

Fig. 4.8 Characteristics of transmitted and received BPM pulse in time domain (left) and frequency domain (right). Annotations in the figure: transmit energy ETX ≈ τ·IP·VDD, slew rate SP = IP/τ, received amplitude VP ≈ 1.7·M·IP/τ, and spectral peak fP ≈ 0.45/τ with 2fP ≈ 0.9/τ. © 2007 IEEE

For inductive-coupling channel design, it is necessary to understand the time and frequency characteristics of the transmitted and received pulse signals. The transmit current IT can be modeled as a Gaussian pulse (see Figure 4.8),

$$I_T(t) = I_P \exp\left(-\frac{4t^2}{\tau^2}\right) \qquad (4.10)$$

where IP is the pulse amplitude and τ is the pulse width. Through the inductive-coupling channel, the received voltage VR is given by the time derivative of IT:

$$V_R(t) = M\,\frac{dI_T(t)}{dt} = -M I_P\,\frac{8t}{\tau^2} \exp\left(-\frac{4t^2}{\tau^2}\right) \qquad (4.11)$$

As mentioned previously, VR becomes a Gaussian monocycle (double) pulse (see Figure 4.8). The receiver samples the former or the latter half of the double pulse to detect the polarity of the transmitted pulse and hence the transmitted data. The pulse width of VR is given by τ/2; therefore, τ ultimately determines the receiver's sampling timing margin. The received pulse amplitude VP is obtained from Equation 4.11 as

$$V_P = 2\sqrt{\frac{2}{e}}\,\frac{I_P}{\tau}\,M \approx 1.7\,M\,\frac{I_P}{\tau} = 1.7\,M S_P \qquad (4.12)$$

where VP is determined by the slew rate SP = IP/τ. The right side of Figure 4.8 depicts the frequency spectra of the transmitted and received signals, IT(ω) and VR(ω). IT(ω) is a Gaussian distribution. The derivative property of the channel, jωM, removes the low-frequency components of IT(ω). As a result, VR(ω) becomes a convex distribution with a peak frequency of fP. To deliver the VR pulse signal without distortion, the inductive-coupling channel requires a frequency bandwidth of 2fP, which is given by

$$2 f_P = \frac{2\sqrt{2}}{\pi\tau} \approx \frac{0.9}{\tau} \qquad (4.13)$$

The pulse width τ thus determines the required channel bandwidth. The inductive-coupling channel is designed to maximize the mutual inductance M while keeping the bandwidth (self-resonant frequency fSR) above 2fP.
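The closed-form results for the received amplitude and the required bandwidth can be cross-checked numerically. In the sketch below, the pulse width of 180 ps matches the design example later in this section, while the mutual inductance (1 nH) and pulse amplitude (5 mA) are hypothetical values used only to exercise the formulas.

```python
import math


def rx_voltage(t, mutual, i_p, tau):
    """Received voltage from Eq. 4.11."""
    return -mutual * i_p * (8.0 * t / tau ** 2) * math.exp(-4.0 * t ** 2 / tau ** 2)


def received_peak(mutual, i_p, tau):
    """Closed-form received amplitude V_P from Eq. 4.12."""
    return 2.0 * math.sqrt(2.0 / math.e) * mutual * i_p / tau


def required_bandwidth(tau):
    """Required channel bandwidth 2*f_P from Eq. 4.13."""
    return 2.0 * math.sqrt(2.0) / (math.pi * tau)


TAU = 180e-12      # pulse width from the design example
MUTUAL = 1e-9      # hypothetical mutual inductance [H]
I_P = 5e-3         # hypothetical pulse amplitude [A]

# Brute-force search for the peak of |V_R(t)| on a fine time grid.
grid = [(-1.0 + i / 1000.0) * TAU for i in range(2001)]
numeric_peak = max(abs(rx_voltage(t, MUTUAL, I_P, TAU)) for t in grid)

print(numeric_peak, received_peak(MUTUAL, I_P, TAU))
print(required_bandwidth(TAU) / 1e9, "GHz")
```

The grid search agrees with the factor of about 1.7 in Equation 4.12, and a 180 ps pulse calls for roughly 5 GHz of channel bandwidth.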

4.3.2 Coil design

The characteristics of an inductive-coupling channel are determined by the communication distance X and the coil layout. Figure 4.9 illustrates an example of the coil layout. It is defined by four layout parameters: diameter D, number of turns n, line width w, and line space s. There is a complex relationship between the layout parameters and the circuit parameters k, L, C, and R, which ultimately decide the channel characteristics [15]. Here we describe a basic design guideline based on a first-order approximation of this relationship.

First, the coil diameter D is determined by the communication distance X. As discussed in Section 4.2.1, the coupling coefficient as well as the operating region of the inductive-coupling channel is defined by X/D (Equation 4.7 and Figure 4.3). The synchronous transceiver in this study (Figure 4.7) operates within the square region of the inductive-coupling channel (1/3 < X/D < 1). The coil diameter D is set to around 2X so that the transceiver operates in the middle of the square region.

Next, the channel bandwidth is adjusted by the number of coil turns n. Recall from Section 4.2.1 that the channel bandwidth is equal to the self-resonant frequency of the coil, fSR. As Equation 4.5 shows, fSR is determined by the product of L and C. To a first-order approximation,

$$L \propto D n^2 \qquad (4.14)$$

$$C \propto D n \qquad (4.15)$$

Fig. 4.9 Metal inductor layout, showing the diameter D, turns n, line width w, and line space s.

Typically, interconnections in the upper metal layers are used for the coil to reduce parasitic substrate capacitance. C then consists mostly of the parasitic capacitance between the wires and the floating capacitance of the wires, so it is proportional to the total wire length Dn. Since D is already determined by X, both C and L take their minimum values when n is set to 1, which gives the maximum value of fSR. Then, following the relationships in Equations 4.14 and 4.15, n is increased until fSR falls to the signal bandwidth 2fP. As a result, L is optimized to maximize the mutual inductance M while keeping the channel bandwidth required for pulse signal transmission.

Finally, the line width w is chosen to keep the parasitic resistance R within an appropriate range. R is inversely proportional to w. If w is too narrow and R is too high, the channel bandwidth is limited by the RC delay of the wire rather than by the LC resonant frequency. On the other hand, if w is too wide and R is too low, the Q factor of the coil (ωL/R) increases, causing ringing in the received signal and hence inter-symbol interference (ISI), which degrades the bit error rate (BER). To avoid ISI, the Q factor should be reduced to two or three. The simplest design guideline is to use a w of around 1–2% of D. The line space s changes C only slightly, so the minimum line space allowed in the process can be used.


Based on the design guideline, the coil layout parameter can be roughly optimized. Fine tuning and evaluation of the channel characteristics should be done by iterative calculation using an electro-magnetic field solver and a circuit simulator.
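The first-order guideline can be written down as a short procedure. The sketch below is only a rough starting point of the kind the guideline describes: it assumes that the single-turn self-resonant frequency is already known (for example from a field solver), and the 45 GHz value used here is invented for illustration. With D fixed, Equations 4.5, 4.14, and 4.15 imply that fSR scales as 1/n^1.5, which is the only scaling law the code uses.

```python
import math


def design_coil(distance_um, f_sr_single_turn_hz, pulse_width_s):
    """First-order coil sizing following the guideline in this section.

    Returns (diameter_um, turns, (min_width_um, max_width_um)).
    Assumption: f_sr_single_turn_hz is the self-resonant frequency at n = 1.
    """
    # Step 1: D is about 2X, the middle of the square region.
    diameter_um = 2.0 * distance_um

    # Step 2: increase n while f_SR stays at or above 2*f_P (Eq. 4.13).
    bandwidth_hz = 2.0 * math.sqrt(2.0) / (math.pi * pulse_width_s)
    turns = 1
    while f_sr_single_turn_hz / (turns + 1) ** 1.5 >= bandwidth_hz:
        turns += 1

    # Step 3: line width of roughly 1-2% of D.
    width_range_um = (0.01 * diameter_um, 0.02 * diameter_um)
    return diameter_um, turns, width_range_um


# X = 15 um and tau = 180 ps; the 45 GHz single-turn f_SR is hypothetical.
print(design_coil(15.0, 45e9, 180e-12))
```

If the single-turn fSR is already below the required bandwidth, the function simply returns one turn; a real design would then have to revisit D or τ, and in any case the result should be refined with electromagnetic and circuit simulation as noted above.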

4.3.3 Transceiver circuit design

Figure 4.10 depicts an inductive-coupling transceiver circuit for BPM together with its operating waveforms. The pulse generator in the transmitter consists of a NAND gate and a delay line built from an inverter chain. Taking the NAND of the transmitter clock Txclk and its delayed, inverted copy Txclkd generates a negative pulse, Pulse, in every clock cycle. The pulse width is determined by the delay of the inverter chain, τ. The H-bridge driver that follows generates a positive or negative pulse current IT according to the transmit data Txdata. When Txdata is High, P1 is ON, N2 is driven by X2, and a positive pulse is generated. When Txdata is Low, P2 is ON, N1 is driven by X1, and a negative pulse is generated.

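Fig. 4.10 Inductive-coupling BPM data transceiver. The figure shows the transmitter (a pulse generator with delay τ and an H-bridge driver with P1, P2, N1, N2, X1, and X2), the receiver (a receiver coil biased at VB and a latch comparator with N3, N4, N5, P3, and P4 clocked by Rxclk), and waveforms of Txclk, Txclkd, Pulse, Txdata, IT, VR, Rxclk, VSP, VSN, and Rxdata against time. © 2007 IEEE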

The pulse amplitude of IT is determined by the channel width of N1 and N2. The pulse slew rate of IT is determined by the slew rate of the gate input of N1 and N2.
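The transmitter logic can be sketched as a discrete-time behavioral model. This is a simplification under stated assumptions: signals are sampled lists of 0 and 1, the inverter-chain delay is a whole number of samples, the clock is assumed Low before the first sample, and the H-bridge is reduced to a sign selection. None of the function names come from the chapter.

```python
def pulse_generator(txclk, delay):
    """Active-low pulse: NAND of Txclk and its delayed, inverted copy."""
    pulse = []
    for i, clk in enumerate(txclk):
        delayed = txclk[i - delay] if i >= delay else 0  # clock assumed Low before t = 0
        txclkd = 1 - delayed
        pulse.append(1 - (clk & txclkd))
    return pulse


def h_bridge(pulse, txdata):
    """Normalized I_T: +1 or -1 while Pulse is Low, depending on Txdata; 0 otherwise."""
    return [0 if p else (1 if d else -1) for p, d in zip(pulse, txdata)]


txclk = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0]
txdata = [1] * 8 + [0] * 8
pulse = pulse_generator(txclk, delay=2)
print(pulse)
print(h_bridge(pulse, txdata))
```

Each rising edge of the clock produces a Low pulse lasting exactly `delay` samples, and the data bit present at that moment selects the direction of the current through the coil.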


The received voltage VR in the receiver coil is given by the derivative of IT. The receiver coil is biased at VB through a high resistance of several kΩ to set the input common mode of the receiver circuit. Since the channel is AC-coupled, an arbitrary bias voltage can be applied to the receiver without affecting the transmitter.

The receiver circuit is a latch comparator. It directly samples VR on the receiver clock Rxclk and detects the polarity of the pulses to recover the digital data Rxdata. The latch comparator consists of a sense amplifier and an SR latch. The first-stage sense amplifier has two operating phases, depending on Rxclk. When Rxclk is Low, the sense amplifier is in the pre-charge phase, in which the output voltages VSP and VSN are both pre-charged High by the PMOS transistors P3 and P4. In this phase, since the inputs of the SR latch are both High, Rxdata holds its value. When Rxclk goes High, the sense amplifier enters the evaluation phase, in which P3 and P4 are OFF, N3 is ON, and the NMOS differential pair is activated. If VR is positive at the rising edge of Rxclk, N4 is strongly ON and VSP is pulled Low while VSN is kept High. As a result, Rxdata in the latch becomes High. When Rxclk goes Low again, VSP is pre-charged and Rxdata is held. If VR is negative in the evaluation phase, VSN is pulled down and Rxdata becomes Low.

In this receiver, VSP and VSN both drop temporarily at the rising edge of Rxclk. The SR latch operates erroneously if both VSP and VSN drop to Low, so the input threshold voltage of the SR latch should be designed carefully to avoid this.
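The two-phase behavior of the latch comparator can be captured in a purely logical model. The sketch below ignores all analog effects (including the temporary drop of both outputs mentioned above), treats an input of exactly zero as negative, and assumes the sense amplifier holds its decision for the rest of the evaluation phase; these are simplifying assumptions, not properties stated in the text.

```python
def latch_comparator(samples, rxdata=0):
    """Behavioral model of the sense amplifier plus SR latch.

    samples: iterable of (rxclk, vr) pairs, rxclk being 0 or 1.
    Returns the Rxdata value after each sample.
    """
    out = []
    prev_clk = 0
    vsp, vsn = 1, 1
    for rxclk, vr in samples:
        if not rxclk:
            vsp, vsn = 1, 1                      # pre-charge phase: both outputs High
        elif not prev_clk:
            # Rising edge: evaluate the polarity of VR once.
            vsp, vsn = (0, 1) if vr > 0 else (1, 0)
        # Active-low SR latch: VSP Low sets Rxdata, VSN Low resets it,
        # both High holds the previous value.
        if (vsp, vsn) == (0, 1):
            rxdata = 1
        elif (vsp, vsn) == (1, 0):
            rxdata = 0
        prev_clk = rxclk
        out.append(rxdata)
    return out


trace = [(0, 0.0), (1, 0.05), (1, -0.05), (0, 0.0), (1, -0.05), (0, 0.03)]
print(latch_comparator(trace))
```

In this trace the input changes sign while the clock is still High, and the output does not follow it, which reflects the point that the receiver is exposed to the input only at the sampling moment.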

4.3.4 Inter-chip communications

Two test chips, a transmitter and a receiver (see Figure 4.11), were fabricated in 180 nm CMOS. The transmitter chip is thinned to 10 µm and stacked face-up over the receiver chip with 5 µm-thick glue. The communication distance X between the transmitter and the receiver is therefore 15 µm. The coil diameter of the transceiver is 30 µm (2X), so that the transceiver is tested in the middle of the square region of the inductive-coupling channel. The delay line in the transmitter is designed to set the transmit pulse width τ to 180 ps, for a receiver timing margin of 150 ps. From Equation 4.13, the required bandwidth of the inductive-coupling channel is calculated to be 5 GHz. Following the design guideline above, the coil layout is optimized to maximize the mutual inductance M while keeping the self-resonant frequency fSR higher than 5 GHz. The transceiver circuit is placed under the coil to save layout area. To evaluate the interference between the circuit and the coil, a transceiver whose circuit is placed beside the coil is also implemented.

Figure 4.12 presents measurement results for the inductive-coupling transceiver whose circuit is placed under the coil. The left side of the figure shows a snapshot of the transmitted and received data waveforms. It confirms that 2^23 - 1 pseudo-random binary sequence (PRBS) data at 1 Gb/s is delivered correctly through the inductive-coupling transceiver. The right side of the figure depicts the measured timing bathtub curve. A timing margin of 200 ps is achieved for a BER under 10^-12,


[Figure 4.11 labels: transmitter with Tx circuit under a 30 µm Tx coil, stacked face-up (left); receiver with Rx circuit under a 30 µm Rx coil (right).]

Fig. 4.11 Die photos of inductive-coupling transmitter (left) and receiver (right).

which is 50 ps wider than designed. This is because the transmit pulse width τ increased owing to process variation in the delay of the inverter chain. The transceiver whose circuit is placed beside the coil was also measured; its timing bathtub curve shows no difference, so interference between the transceiver circuit and the coil is negligible. The transceiver consumes 2.6 mW in the transmitter and 0.2 mW in the receiver from a 1.8 V supply.

In this section, we described the basic design theory of inductive-coupling transceivers. The next three sections introduce circuit techniques for performance improvement. The effectiveness of each technique is evaluated by test-chip measurements.
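The measured power figures can be restated as energy per bit, the unit used in the later comparisons. The following is a minimal sketch, assuming the quoted 2.6 mW and 0.2 mW were measured at the 1 Gb/s data rate of this experiment; the function name `energy_per_bit_pj` is hypothetical.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Convert power to energy per bit.

    1 mW / 1 Gb/s = 1e-3 W / 1e9 b/s = 1e-12 J/b, so mW divided by Gb/s
    gives pJ/b directly.
    """
    if data_rate_gbps <= 0:
        raise ValueError("data rate must be positive")
    return power_mw / data_rate_gbps


# Values quoted in the text; the 1 Gb/s operating point is assumed.
tx_pj = energy_per_bit_pj(2.6, 1.0)
rx_pj = energy_per_bit_pj(0.2, 1.0)
total_pj = tx_pj + rx_pj
print(f"Tx {tx_pj:.1f} pJ/b, Rx {rx_pj:.1f} pJ/b, total {total_pj:.1f} pJ/b")
```

Under that assumption the prototype spends about 2.6 pJ/b in the transmitter and 0.2 pJ/b in the receiver, roughly 2.8 pJ/b in total.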

[Figure 4.12 panels: snapshot of Txclk, Txdata, and Rxdata waveforms for 2^23 - 1 PRBS data at 1 Gb/s (left); bathtub curve of BER (10^-3 down to 10^-12) versus sampling timing [ps], showing a timing margin of 200 ps (right).]

Fig. 4.12 Measured snapshot of data waveforms (left) and measured timing bathtub curve (right).

4.4 Power reduction techniques

This section introduces circuit techniques for power reduction in the inductive-coupling transceiver. As the measurement results in the previous section show, power dissipation in the transmitter dominates that in the receiver. The latch comparator in the receiver consumes only the charge and discharge energy CVDD^2, which is equivalent to the energy consumed in four CMOS gates. In addition, this energy dissipation can be reduced effectively by device scaling. In the transmitter, on the other hand, the output H-bridge driver consumes a large short-circuit current to generate the transmit pulse current IT. In this section, two circuit techniques are introduced for generating the transmit current efficiently.

4.4.1 Pulse shaping

The transmitter's energy dissipation ETX depends strongly on the transmit pulse shape. Understanding this relationship allows the pulse shape to be optimized for effective use of charge. ETX is given by the product of the supply voltage VDD and the total electric charge Q carried by the IT pulse. As shown in Figure 4.8, Q is equal to the area of the IT pulse. Thus, ETX is given as

$$E_{TX} = Q V_{DD} \approx \tau I_P V_{DD} \qquad (4.16)$$

Recall from Section 4.3.1 that the amplitude of the received voltage VP is determined by the pulse slew rate SP. Using SP from Equation 4.13, we can rewrite Equation 4.16 as

$$E_{TX} = \tau^2 S_P V_{DD} \qquad (4.17)$$


Equations 4.17 and 4.13 indicate that, by reducing the pulse width τ at constant SP, the transmitter's energy dissipation can be reduced in proportion to τ² while VP remains constant.
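The quadratic dependence in Equation 4.17 can be checked numerically. The sketch below is only an illustration: the function name `transmit_energy` is hypothetical, and the slew rate and pulse widths are assumed values chosen for the example, not measured data.

```python
import math


def transmit_energy(tau_s: float, slew_rate_a_per_s: float, vdd_v: float) -> float:
    """Approximate transmit energy per pulse, E_TX ~ tau^2 * S_P * V_DD (Eq. 4.17)."""
    return tau_s ** 2 * slew_rate_a_per_s * vdd_v


S_P = 5.0e7     # A/s, hypothetical slew rate used only for illustration
V_DD = 1.8      # V, supply voltage of the 180 nm test chip
TAU = 120e-12   # s, illustrative starting pulse width

e_full = transmit_energy(TAU, S_P, V_DD)
e_half = transmit_energy(TAU / 2, S_P, V_DD)
ratio = e_half / e_full
print(f"E_TX(tau/2) / E_TX(tau) = {ratio:.2f}")
```

Because SP and VDD cancel in the ratio, halving τ at constant slew rate leaves one quarter of the energy regardless of the assumed values.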

[Figure 4.13 labels: transmit current IT (peak IP, slew rate SP) and received voltage VR = M dIT/dt (amplitude VP ≈ 1.7 M SP) versus time. Left: pulse width τ, ETX ≈ τ² SP VDD. Right: pulse width τ/2, ETX ≈ τ² SP VDD / 4.]

Fig. 4.13 Waveform sketch of transmitted current and received voltage when pulse width is τ (left) and pulse width is τ/2 (right).

Figure 4.13 sketches this relationship conceptually, with the IT pulse approximated by a simple triangular waveform. It shows that, when the pulse width of IT is reduced from τ to τ/2, the area (total electric charge) of IT is reduced to 1/4 while VP stays constant. Consequently, reducing the pulse width τ is effective for power reduction in the inductive-coupling transmitter.

However, as we saw in the test-chip measurements in Section 4.3.4, the transmit pulse shape changes with variations in process, voltage, and temperature (PVT) and in chip thickness (communication distance). A precise pulse-shaping circuit is therefore needed to adjust the pulse width and the slew rate against these variations. In addition, since a narrower pulse reduces the receiver's timing margin, a robust timing design is required to maintain the BER. To solve these problems, a digitally controlled pulse-shaping circuit and a timing control circuit have been introduced [5], [6].

Figure 4.14 depicts the pulse-shaping circuit. It consists of pulse width, pulse slew rate, and pulse amplitude controls. In the pulse width control, a 4-phase clock generator provides 0°, 45°, 90°, and 135° clocks to two phase interpolators (PIs). One PI interpolates a clock phase between 0° and 45° in steps of 1/256 UI, which is equivalent to about 4 ps at 1 GHz operation. The other PI is a dummy circuit that always outputs the 135° clock. A succeeding AND gate generates a pulse clock that determines the pulse width τ. The pulse slew rate is digitally controlled by variable capacitors. The pulse amplitude is


[Figure 4.14 labels: Tx chip with pulse width control (5 bit, 1/256-UI step; 4-phase clock 0°, 45°, 90°, 135° from Txclk; PI 0° to 45° and dummy PI at 135°), pulse slew rate control (4 bit), and pulse amplitude control (5 bit, driver widths 2^0 w to 2^4 w), driving IT from Txdata and Pulse (width τ); Rx chip with Rxclk, VR, Rx, and Rxdata. © 2008 IEEE]

Fig. 4.14 Digitally controlled pulse shaping circuit.

digitally controlled by changing the channel width of the NMOS transistors in the H-bridge driver.

Figure 4.15 describes the timing design. An inductive-coupling clock link is located adjacent to the data link, so that timing jitter caused by supply noise and temperature variations is rejected effectively as common-mode noise. A sampling timing controller calibrates the timing shift due to process variations.

Stacked test chips were fabricated in 180 nm and 90 nm CMOS (Figure 4.16). In both, the transmitter chip is stacked face-up over the receiver chip. The transmitter chip is 10 µm thick and the glue layer is 5 µm thick, so the communication distance between the transmitter and the receiver is 15 µm. The coil diameter is 30 µm for the data link and 200 µm for the clock link. The data rate is 1 Gb/s. The experimental conditions (distance, coil size, and data rate) are identical to those of the previous prototype inductive-coupling transceiver (Figure 4.11).

The left side of Figure 4.17 presents measured bathtub curves of the transceiver in 180 nm CMOS. Using the pulse-shaping circuit, the pulse width is reduced from 120 ps to 60 ps. The received pulse amplitude VP is adjusted to 60 mV by the transmit pulse amplitude and slew rate controls. The measurements confirm that ETX decreases in proportion to τ². When τ is set to the minimum pulse width of 60 ps, ETX is reduced to 0.13 pJ/b, which is 17 times lower than in the previous prototype design. However, in this case, the timing margin for a BER under 10^-12 is reduced to 25 ps. Static timing variation due to process variations can be calibrated by using the timing controller

[Figure 4.15 labels: Tx chip with pulse width control (4-phase clock 0°, 45°, 90°, 135° from Txclk; PI 5 bit, 0° to 45°; PI 135°), data link (Txdata, Tx, IT) and clock link (Tx, ITC); Rx chip with sampling timing control (4-phase clock 45°, 90°, 135°; PIs with 1-bit and 5-bit controls, 0° to 135°), VRC, Rxclk, VR, Rx, and Rxdata. © 2008 IEEE]

Fig. 4.15 Sampling timing controller.

[Figure 4.16 labels: Tx chip (10 µm thick) stacked on Rx chip, with data link (30 µm) and clock link (200 µm); 180 nm CMOS (left) and 90 nm CMOS (right). © 2007 IEEE]

Fig. 4.16 Stacked test chips of low-power inductive-coupling transceiver in 180 nm (left) and 90 nm CMOS (right).


[Figure 4.17 panels: BER (1 down to 10^-12) versus sampling timing [ps] at VP = 60 mV and 1 Gb/s. Left, 180 nm CMOS: τ = 60 ps, ETX = 0.13 pJ/b, 25 ps margin. Right, 90 nm CMOS: τ = 60 ps, ETX = 0.11 pJ/b, ERX = 0.03 pJ/b, 30 ps margin. © 2008 IEEE]

Fig. 4.17 Measured bathtub curves in 180 nm (left) and 90 nm CMOS (right).

and the timing can be adjusted within the 25 ps timing margin. The robustness of the timing design against dynamic timing variation (jitter) is measured by intentionally injecting power-supply noise. An individual load is connected to the local supply of each transmitter and receiver chip, and the load is changed randomly at various frequencies. The data transceiver communicates at 1 Gb/s with a BER under 10^-12 under supply noise of 350 mV peak-to-peak (±10% of VDD). This confirms that, with source-synchronous transmission, timing jitter caused by the supply noise is rejected effectively and kept within the timing margin of 25 ps.

The right side of Figure 4.17 shows a measured bathtub curve of the transceiver in 90 nm CMOS. The pulse width is 60 ps. The transceiver operates at 1 Gb/s with a BER under 10^-12 and a timing margin of 30 ps. Device scaling reduces the energy dissipation in the transmitter and receiver to 0.11 pJ/b and 0.03 pJ/b, respectively. The total energy dissipation is 0.14 pJ/b, which is 1/20 of that of the prototype transceiver.
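The calibration resolution can be put in perspective with a short calculation relating the 1/256-UI phase-interpolator step quoted for the pulse-width control to the 25 ps timing margin. This is a sketch under two assumptions: one UI equals the 1 ns clock period at 1 GHz, and the sampling timing controller has the same step size as the pulse-width interpolator (the text states the step only for the latter). The function name `pi_step_ps` is hypothetical.

```python
def pi_step_ps(clock_ghz: float, steps_per_ui: int = 256) -> float:
    """Phase-interpolator step in picoseconds, assuming one UI is one clock period."""
    if clock_ghz <= 0 or steps_per_ui <= 0:
        raise ValueError("clock frequency and step count must be positive")
    ui_ps = 1000.0 / clock_ghz
    return ui_ps / steps_per_ui


step_ps = pi_step_ps(1.0)                     # step at 1 GHz
span_5bit_ps = step_ps * 2 ** 5               # range covered by a 5-bit code
span_45deg_ps = (45.0 / 360.0) * 1000.0       # 45 degrees of a 1 ns period
steps_in_margin = int(25.0 // step_ps)        # whole steps inside a 25 ps margin

print(f"step = {step_ps:.2f} ps, 5-bit span = {span_5bit_ps:.0f} ps, "
      f"steps within 25 ps = {steps_in_margin}")
```

The step comes out near 3.9 ps, consistent with the "about 4 ps" quoted in the text, and 32 such steps cover exactly the 0° to 45° range (125 ps). Under the stated assumption, about six steps fit inside the 25 ps margin.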

4.4.2 Daisy chain transmitter

Another power reduction technique is the daisy-chain transmitter [16]. The left of Figure 4.18 shows a channel array of conventional H-bridge transmitters; here, the transmit pulse currents IT1 to ITN are dissipated separately in each transmitter. The right of Figure 4.18 depicts a daisy-chain transmitter, in which multiple transmitter channels are concatenated so that the transmit current is reused between the channels. The polarity of the transmit current in each transmitter coil is determined by switching NMOS transistors according to the transmit data of the adjacent channels. The BPM pulse signals are finally generated by the pulse generator connected to the bottom NMOS transistors. In this transmitter, the energy efficiency improves as the number of concatenated transmitter stages N increases. Ideally, the power dissipation per transmitter stage is reduced to 1/N. In practice, N is restricted by the bandwidth limitation caused by the increasing number of serially connected NMOS transistors in the current path.

[Figure 4.18 labels: conventional H-bridge transmitters with currents IT1 to ITN driven by Txdata1 to TxdataN (left); daisy-chain transmitters with switches driven by Txdata1, Txdata1•Txdata2, Txdata2•Txdata3, ..., TxdataN-1•TxdataN, and TxdataN (right); each with a pulse generator clocked by Txclk. © 2008 IEEE]

Fig. 4.18 Conventional H-bridge parallel transmitters (left) and daisy-chain transmitters (right).

Figure 4.19 shows microphotographs of stacked test chips in 90 nm CMOS. A transmitter chip is back-ground to a thickness of 10 µm and stacked face-up over a receiver chip with 5 µm-thick glue. As a result, the communication distance is 15 µm. The transmitter chip integrates daisy-chain transmitters with N = 2, 4, and 6 concatenated stages. The coil diameter is 30 µm. The experimental setup is identical to that of the previous inductive-coupling transceiver with the pulse-shaping circuit (Figure 4.16).

Figure 4.20 presents the measured energy dissipation of the daisy-chain transmitter as a function of the number of concatenated stages N. When N = 4, the energy dissipation is reduced to 35 fJ/b for the same data rate (1 Gb/s/ch), BER (under 10^-12), and timing margin (30 ps) as in the previous inductive-coupling transceiver with pulse shaping. When N exceeds six, the bandwidth limitation due to the stacked NMOS transistors starts to degrade the performance. Device scaling will improve


[Figure 4.19 labels: transmitter (upper) and receiver (lower); Tx chip (10 µm thick) on Rx chip. © 2008 IEEE]

Fig. 4.19 Stacked test chips.

frequency characteristics of the transistors, enabling N to be increased beyond six for further power reduction.
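The ideal 1/N scaling of the daisy-chain transmitter can be compared with the measured N = 4 point. The sketch below takes the 0.11 pJ/b (110 fJ/b) transmitter energy of the pulse-shaping transceiver in 90 nm CMOS as the single-stage baseline, which is an assumption about how the comparison is normalized; the function name `ideal_energy_per_channel_fj` is hypothetical.

```python
def ideal_energy_per_channel_fj(single_stage_fj: float, n_stages: int) -> float:
    """Ideal per-channel transmit energy when N stages reuse one current."""
    if n_stages < 1:
        raise ValueError("number of stages must be at least 1")
    return single_stage_fj / n_stages


BASELINE_FJ = 110.0      # fJ/b, pulse-shaping transmitter in 90 nm CMOS (assumed baseline)
MEASURED_N4_FJ = 35.0    # fJ/b, measured value quoted for N = 4

ideal = {n: ideal_energy_per_channel_fj(BASELINE_FJ, n) for n in (1, 2, 4, 6)}
overhead = MEASURED_N4_FJ / ideal[4]

for n, energy in ideal.items():
    print(f"N = {n}: ideal {energy:.1f} fJ/b")
print(f"measured / ideal at N = 4: {overhead:.2f}")
```

With this baseline, the ideal value at N = 4 is 27.5 fJ/b, so the measured 35 fJ/b is roughly 27% above it. That gap is consistent with the remark that practical factors keep the reduction from reaching the full 1/N.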

4.5 High-speed techniques

In this section we introduce a number of high-speed circuit techniques. Compared to wired solutions such as micro bumps and TSVs, inductive-coupling communication has an advantage in high-speed operation: non-contact circuits do not need highly capacitive ESD protection circuits and can therefore have improved channel bandwidth. Without the ESD protection circuits, the load capacitance of the inductive-coupled channel can be reduced to less than 10 fF. As a result, the self-resonant frequency of the coil, and hence the channel bandwidth, can be designed to be higher than 100 GHz in 180 nm CMOS [7, 8]. Furthermore, since the inductive-coupling channel is formed using on-chip structures, the bandwidth can be further improved by device scaling. The inductive-coupling channel therefore does not limit the data rate of the transceiver. By optimizing the transceiver circuit topology, the data rate can be maximized up to the performance limitations of the transistors. However, in the synchronous inductive-


Normalized Energ gy Dissip pation

ETX=110fJ/b (Pulse Shaping Only) 1.0 Data Rate=1Gb/s. BER