The Computer Engineering Handbook


© 2002 by CRC Press LLC


www.crcpress.com

Fernanda, for your pride, courage, and dignity as a fighting woman


Preface

Purpose and Background

Computer engineering is such a vast field that it is difficult, almost impossible, to present everything in a single book. The problem is exacerbated by the fact that the field of computers and computer design has been changing so rapidly that by the time this book is introduced some of the issues may already be obsolete. However, we have tried to capture what is fundamental and therefore of lasting value, as well as the trends, new directions, and new developments. This book could easily fill thousands of pages, because there are so many issues in computer design and so many new fields emerging daily. We hope that in the future CRC Press will come out with new editions covering some of the more specialized topics in more detail. Given that, and many other limitations, we are aware that some areas were not given sufficient attention and others were not covered at all. However, we hope that the areas covered are covered very well, given that they are written by specialists who are recognized as leading experts in their fields. We are thankful for their valuable time and effort.

Organization

This book contains a dozen sections. First, we start with fabrication and technology, which has been a driving factor for the electronic industry. No sector of the industry has experienced such tremendous growth; the progress has surpassed what we thought possible, and limits once regarded as fundamental have been broken several times. When the first 256 kbit DRAM chips were introduced, the "alpha particle scare" (the problem of alpha particles discharging the memory cell) predicted that radiation effects would limit further scaling of memory chip dimensions. Twenty years later, we have reached 256 Mbit DRAM chips, a thousand-fold improvement in density, and we see no limit to further scaling. In fact, memory capacity has been tripling every two years, while the number of transistors on the processor chip has been doubling every two years. The next section deals with computer architecture and computer system organization, a top-level view. Several architectural concepts and organizations of computer systems are described, and the section ends with a description of performance evaluation measures, which are the bottom line from the user's point of view. Important design techniques are described in two separate sections, one of which deals exclusively with the power consumed by the system. Power consumption is becoming the most important issue as computers start to penetrate large consumer product markets, and in several cases low power consumption is more important than the performance the system can deliver. The penetration of computer systems into the consumer market is described in the sections dealing with signal processing, embedded applications, and future directions in computing. Finally, reliability and testability of computer systems is described in the last section.


Locating Your Topic

Several avenues are available to access desired information. A complete table of contents is presented at the front of the book. Each of the sections is preceded by an individual table of contents, and each chapter begins with its own table of contents. Each contributed article contains comprehensive references; some also contain a "To Probe Further" section, in which various sources such as books, journals, magazines, and periodicals are discussed. To be in tune with modern times, some of the authors have also included Web pointers to valuable resources and information. We hope our readers will find this appropriate and of much use. A subject index has been compiled to provide a means of accessing information; it can also be used to locate definitions, as the page on which the definition of each key defining term appears is given in the index. The Computer Engineering Handbook is designed to provide answers to most inquiries and to direct inquirers to further sources and references. We trust that it will meet the needs of our readership.

Acknowledgments

The value of this book rests entirely on the work of many experts and their excellent contributions, and I am grateful to them. They spent hours of their valuable time without any compensation and with the sole motivation of providing learning material and helping enhance the profession. I would like to thank Prof. Saburo Muroga, who provided editorial advice, reviewed the content of the book, made numerous suggestions, and encouraged me to do it. I am indebted to him as well as to the other members of the advisory board. I would like to thank my colleague and friend Prof. Richard Dorf for asking me to edit this book and trusting me with this project. Kristen Maus worked tirelessly to put all of this material into decent shape, as did Nora Konopka of CRC Press. My son, Stanisha, helped me with my English. It is their work that made this book.


Editor-in-Chief

Vojin G. Oklobdzija is a Fellow of the Institute of Electrical and Electronics Engineers and a Distinguished Lecturer of the IEEE Solid-State Circuits and IEEE Circuits and Systems Societies. He received his M.Sc. and Ph.D. degrees from the University of California, Los Angeles in 1978 and 1982, as well as a Dipl. Ing. (MScEE) from the Electrical Engineering Department, University of Belgrade, Yugoslavia in 1971. From 1982 to 1991 he was at the IBM T. J. Watson Research Center in New York, where he made contributions to the development of RISC architecture and processors. In the course of this work he obtained a patent on register renaming, which enabled an entirely new generation of superscalar processors. From 1988 to 1990 he was a visiting faculty member at the University of California, Berkeley, while on leave from IBM. Since 1991, Prof. Oklobdzija has held various consulting positions. He was a consultant to Sun Microsystems Laboratories, AT&T Bell Laboratories, Hitachi Research Laboratories, Silicon Systems/Texas Instruments Inc., and Siemens Corp., where he was principal architect of the Siemens/Infineon TriCore processor. Currently he serves as an advisor to SONY and Fujitsu Laboratories.

In 1988 he started Integration, which was incorporated in 1996. Integration Corp. has delivered several successful processor and encryption processor designs (see: www.integration-corp.com).

Prof. Oklobdzija has held various academic appointments besides his current one at the University of California. In 1991, as a Fulbright professor, he helped develop programs at universities in South America. From 1996 to 1998 he taught courses in Silicon Valley through the University of California, Berkeley Extension, and at Hewlett-Packard. He holds seven US, four European, one Japanese, and one Taiwanese patent in the area of computer design, with seven more currently pending. Prof. Oklobdzija is a member of the American Association for the Advancement of Science and the American Association of University Professors. He serves on the editorial boards of the IEEE Transactions on VLSI Systems and the Journal of VLSI Signal Processing. He has served on the program committees of the International Conference on Computer Design, the International Symposium on VLSI Technology, and the Symposium on Computer Arithmetic. In 1997 he was General Chair of the 13th Symposium on Computer Arithmetic, and he has served as a program committee member of the International Solid-State Circuits Conference (ISSCC) since 1996. He has published over 120 papers in the areas of circuits and technology, computer arithmetic, and computer architecture, and has given over 100 invited talks and short courses in the USA, Europe, Latin America, Australia, China, and Japan.


Editorial Board

Krste Asanovic
Massachusetts Institute of Technology
Cambridge, Massachusetts

William Bowhill
Compaq/DEC
Shrewsbury, Massachusetts

Anantha Chandrakasan
Massachusetts Institute of Technology
Cambridge, Massachusetts

Hiroshi Iwai
Tokyo Institute of Technology
Yokohama, Japan

Saburo Muroga
University of Illinois
Urbana, Illinois

Kevin Nowka
IBM Austin Research Laboratory
Austin, Texas

Takayasu Sakurai
Tokyo University
Tokyo, Japan

Alan Smith
University of California, Berkeley
Berkeley, California

Ian Young
Intel Corporation
Hillsboro, Oregon


Contents

SECTION I  Fabrication and Technology

1  Trends and Projections for the Future of Scaling and Future Integration Trends  Hiroshi Iwai and Shun-ichiro Ohmi

2  CMOS Circuits
   2.1  VLSI Circuits  Eugene John
   2.2  Pass-Transistor CMOS Circuits  Shunzo Yamashita
   2.3  Synthesis of CMOS Pass-Transistor Logic  Dejan Marković
   2.4  Silicon on Insulator (SOI)  Yuichi Kado

3  High-Speed, Low-Power Emitter Coupled Logic Circuits  Tadahiro Kuroda

4  Price-Performance of Computer Technology  John C. McCallum

SECTION II  Computer Systems and Architecture

5  Computer Architecture and Design
   5.1  Server Computer Architecture  Siamack Haghighi
   5.2  Very Large Instruction Word Architectures  Binu Mathew
   5.3  Vector Processing  Krste Asanovic
   5.4  Multithreading, Multiprocessing  Manoj Franklin
   5.5  Survey of Parallel Systems  Donna Quammen
   5.6  Virtual Memory Systems and TLB Structures  Bruce Jacob

6  System Design
   6.1  Superscalar Processors  Mark Smotherman
   6.2  Register Renaming Techniques  Dezsö Sima
   6.3  Predicting Branches in Computer Programs  Kevin Skadron
   6.4  Network Processor Architecture  Tzi-cker Chiueh

7  Architectures for Low Power  Pradip Bose

8  Performance Evaluation
   8.1  Measurement and Modeling of Disk Subsystem Performance  Jozo J. Dujmović, Daniel Tomasevich, and Ming Au-Yeung
   8.2  Performance Evaluation: Techniques, Tools, and Benchmarks  Lizy Kurian John
   8.3  Trace Caching and Trace Processors  Eric Rotenberg

9  Computer Arithmetic
   9.1  High-Speed Computer Arithmetic  Earl E. Swartzlander, Jr.
   9.2  Fast Adders and Multipliers  Gensuke Goto

SECTION III  Design Techniques

10  Timing and Clocking
    10.1  Design of High-Speed CMOS PLLs and DLLs  John George Maneatis
    10.2  Latches and Flip-Flops  Fabian Klass
    10.3  High-Performance Embedded SRAM  Cyrus (Morteza) Afghahi

11  Multiple-Valued Logic Circuits  K. Wayne Current

12  FPGAs for Rapid Prototyping  James O. Hamblen

13  Issues in High-Frequency Processor Design  Kevin J. Nowka

SECTION IV  Design for Low Power

14  Low-Power Design Issues  Hemmige Varadarajan, Vivek Tiwari, Rakesh Patel, Hema Ramamurthy, Shahram Jamshidi, Snehal Jariwala, and Wenjie Jiang

15  Low-Power Circuit Technologies  Masayuki Miyazaki

16  Techniques for Leakage Power Reduction  Vivek De, Ali Keshavarzi, Siva Narendra, Dinesh Somasekhar, Shekhar Borkar, James Kao, Raj Nair, and Yibin Ye

17  Dynamic Voltage Scaling  Thomas D. Burd

18  Low-Power Design of Systems on Chip  Christian Piguet

19  Implementation-Level Impact on Low-Power Design  Katsunori Seno

20  Accurate Power Estimation of Combinational CMOS Digital Circuits  Hendrawan Soeleman and Kaushik Roy

21  Clock-Powered CMOS for Energy-Efficient Computing  Nestoras Tzartzanis and William Athas

SECTION V  Embedded Applications

22  Embedded Systems-on-Chips  Wayne Wolf

23  Embedded Processor Applications  Jonathan W. Valvano

SECTION VI  Signal Processing

24  Digital Signal Processing  Fred J. Taylor

25  DSP Applications  Daniel Martin

26  Digital Filter Design  Worayot Lertniphonphun and James H. McClellan

27  Audio Signal Processing  Adam Dabrowski and Tomasz Marciniak

28  Digital Video Processing  Todd R. Reed

29  Low-Power Digital Signal Processing  Thucydides Xanthopoulos

SECTION VII  Communications and Networks

30  Communications and Computer Networks  Anna Hać

SECTION VIII  Input/Output

31  Circuits for High-Performance I/O  Chik-Kong Ken Yang

32  Algorithms and Data Structures in External Memory  Jeffrey Scott Vitter

33  Parallel I/O Systems  Peter J. Varman

34  A Read Channel for Magnetic Recording
    34.1  Recording Physics and Organization of Data on a Disk  Bane Vasić and Miroslav Despotović
    34.2  Read Channel Architecture  Bane Vasić, Pervez M. Aziz, and Necip Sayiner
    34.3  Adaptive Equalization and Timing Recovery  Pervez M. Aziz
    34.4  Head Position Sensing in Disk Drives  Ara Patapoutian
    34.5  Modulation Codes for Storage Systems  Brian Marcus and Emina Šoljanin
    34.6  Data Detection  Miroslav Despotović and Vojin Šenk
    34.7  An Introduction to Error-Correcting Codes  Mario Blaum

SECTION IX  Operating System

35  Distributed Operating Systems  Peter Reiher

SECTION X  New Directions in Computing

36  SPS: A Strategically Programmable System  M. Sarrafzadeh, E. Bozorgzadeh, R. Kastner, and S. O. Memik

37  Reconfigurable Processors
    37.1  Reconfigurable Computing  John Morris
    37.2  Using Configurable Computing Systems  Danny Newport and Don Bouldin
    37.3  Xtensa: A Configurable and Extensible Processor  Ricardo E. Gonzalez and Albert Wang

38  Roles of Software Technology in Intelligent Transportation Systems  Shoichi Washino

39  Media Signal Processing
    39.1  Instruction Set Architecture for Multimedia Signal Processing  Ruby Lee
    39.2  DSP Platform Architecture for SoC Products  Gerald G. Pechanek
    39.3  Digital Audio Processors for Personal Computer Systems  Thomas C. Savell
    39.4  Modern Approximation Iterative Algorithms and Their Applications in Computer Engineering  Sadiq M. Sait and Habib Youssef

40  Internet Architectures  Borko Furht

41  Microelectronics for Home Entertainment  Yoshiaki Hagiwara

42  Mobile and Wireless Computing
    42.1  Bluetooth—A Cable Replacement and More  John F. Alexander and Raymond Barrett
    42.2  Signal Processing ASIC Requirements for High-Speed Wireless Data Communications  Babak Daneshrad
    42.3  Communication System-on-a-Chip  Samiha Mourad and Garret Okamoto
    42.4  Communications and Computer Networks  Mohammad Ilyas
    42.5  Video over Mobile Networks  Abdul H. Sadka
    42.6  Pen-Based User Interfaces—An Applications Overview  Giovanni Seni and Jayashree Subrahmonia
    42.7  What Makes a Programmable DSP Processor Special?  Ingrid Verbauwhede

43  Data Security  Matt Franklin

SECTION XI  Testing and Design for Testability

44  System-on-Chip (SoC) Testing: Current Practices and Challenges for Tomorrow  R. Chandramouli

45  Testing of Synchronous Sequential Digital Circuits  U. Glaeser, Z. Stamenković, and H. T. Vierhaus

46  Scan Testing  Chouki Aktouf

47  Computer-Aided Analysis and Forecast of Integrated Circuit Yield  Z. Stamenković and N. Stojadinović

I  Fabrication and Technology

1  Trends and Projections for the Future of Scaling and Future Integration Trends  Hiroshi Iwai and Shun-ichiro Ohmi
   Introduction • Downsizing below 0.1 µm • Gate Insulator • Gate Electrode • Source and Drain • Channel Doping • Interconnects • Memory Technology • Future Prospects

2  CMOS Circuits  Eugene John, Shunzo Yamashita, Dejan Marković, and Yuichi Kado
   VLSI Circuits • Pass-Transistor CMOS Circuits • Synthesis of CMOS Pass-Transistor Logic • Silicon on Insulator (SOI)

3  High-Speed, Low-Power Emitter Coupled Logic Circuits  Tadahiro Kuroda
   Active Pull-Down ECL Circuits • Low-Voltage ECL Circuits

4  Price-Performance of Computer Technology  John C. McCallum
   Introduction • Computer and Integrated Circuit Technology • Processors • Memory and Storage—The Memory Hierarchy • Computer Systems—Small to Large • Summary


1  Trends and Projections for the Future of Scaling and Future Integration Trends

Hiroshi Iwai
Tokyo Institute of Technology

Shun-ichiro Ohmi
Tokyo Institute of Technology

1.1  Introduction
1.2  Downsizing below 0.1 µm
1.3  Gate Insulator
1.4  Gate Electrode
1.5  Source and Drain
1.6  Channel Doping
1.7  Interconnects
1.8  Memory Technology
1.9  Future Prospects

1.1 Introduction

Recently, information technology (IT), such as the Internet, i-mode, cellular phones, and car navigation, has spread very rapidly all over the world. IT is expected to dramatically raise the efficiency of our society and greatly improve the quality of our lives. It should be noted that the progress of IT owes entirely to that of semiconductor technology, especially silicon LSIs (Large Scale Integrated circuits). Silicon LSIs provide high-speed, high-frequency operation of a tremendous number of functions with low cost, low power, small size, light weight, and high reliability. Over these 30 years, the gate length of the metal oxide semiconductor field effect transistor (MOSFET) has been reduced by a factor of 100, the density of DRAM has increased 500,000 times, and the clock frequency of MPUs has increased 2,500 times, as shown in Table 1.1. Without such marvelous progress of LSI technologies, today's great success in information technology would not have been realized at all. The origin of the concept of the solid-state circuit can be traced back to the beginning of the last century, as shown in Fig. 1.1. More than 70 years ago, J. Lilienfeld, using Al/Al2O3/Cu2S as an MOS structure, invented the concept of the MOSFET. Then, 54 years ago, the first transistor (bipolar) was realized using germanium. In 1960, two years after the invention of the integrated circuit (IC), the first MOSFET was realized by using a Si substrate and a SiO2 gate insulator [1]. Since then, Si and SiO2 have been the key materials for electronic circuits. It took, however, several more years until the silicon MOSFET evolved into silicon ICs and further grew into silicon LSIs. Silicon LSIs became popular in the market from the beginning of the 1970s as 1 kbit DRAMs and 4 bit MPUs (microprocessors).


TABLE 1.1  Past and Current Status of Advanced LSI Products

Year       Min. Lg (µm)   Ratio    DRAM     Ratio     MPU      Ratio
1970/72    10             1        1 k      1         750 k    1
2001       0.1            1/100    512 M    256,000   1.7 G    2,300

FIGURE 1.1  History of LSI in the 20th century (year 2001: a new century for the solid-state circuit):
• 73 years since the concept of the MOSFET (1928, J. Lilienfeld, MOSFET patent)
• 54 years since the first transistor (1947, J. Bardeen and W. Brattain, bipolar transistor)
• 43-42 years since the first integrated circuits (1958, J. Kilby, IC; 1959, R. Noyce, planar technology)
• 41 years since the first Si-MOSFET (1960, D. Kahng, Si-MOSFET)
• 38 years since the first CMOS (1963, CMOS, by F. Wanlass and C. T. Sah)
• 31 years since the first 1 kbit DRAM (or LSI) (1970, Intel 1103)
• 16 years since CMOS became the major technology (1985, Toshiba 1 Mbit CMOS DRAM)

In the early 1970s, LSIs started by using PMOS technology, in which threshold voltage control was easier, but PMOS was soon replaced by NMOS, which was more suitable for high-speed operation. It was the middle of the 1980s when CMOS became the mainstream of silicon LSI technology because of its capability for low power consumption. Now CMOS technology has realized 512 Mbit DRAMs and 1.7 GHz clock MPUs, and the gate length of the MOSFETs in such LSIs is as small as 100 nm. Figure 1.2 shows cross sections of NMOS LSIs in the early 1970s and of present CMOS LSIs. The old NMOS LSI technology contains only several film layers made of Si, SiO2, and Al, which are basically composed of only five elements: Si, O, Al, B, and P. Now the structure has become very complicated, and many more layers and many more elements are involved. In the past 30 years, transistors have been miniaturized significantly. Thanks to this miniaturization, the number of components and the performance of LSIs have increased significantly. Figures 1.3 and 1.4 show microphotographs of 1 kbit and 256 Mbit DRAM chips, respectively. The individual tiny rectangular units barely recognizable within the 16 large rectangular units of the 256 Mbit DRAM correspond to 64 kbit DRAMs. It can be said that the downsizing of the components has driven the tremendous development of LSIs. Figure 1.5 shows the past and future trends of the downsizing of MOSFET parameters and LSI chip properties, mainly for high-performance MPUs. The future trend was taken from ITRS'99 (International Technology Roadmap for Semiconductors) [2]. In order to maintain the continuous progress of LSIs in the future, every parameter has to be shrunk continuously at almost the same rate as before. However, it was anticipated that shrinking the parameters beyond the 0.1 µm generation would face severe difficulties due to various kinds of expected limitations, and that a huge effort would be required at the research and development level in order to overcome them. In this chapter, silicon technology from past to future is reviewed for advanced CMOS LSIs.


FIGURE 1.2  Cross-sections of (a) an NMOS LSI in 1974 (6 µm technology): Si substrate, field SiO2, gate SiO2, poly-Si gate electrode, source/drain diffusion, interlayer dielectrics (SiO2 + BPSG), Al interconnects, and PSG passivation; only a few layers and materials, essentially the atoms Si, O, Al, P, and B (plus H, N, Cl), are involved; and (b) a 0.1 µm CMOS LSI in 2001: ultra-thin gate SiO2, CoSi2, W contact and via plugs, and low-k interlayer dielectrics, with a large number of layers and many kinds of materials and atoms.

FIGURE 1.3  1 kbit DRAM (TOSHIBA).


FIGURE 1.4  256 Mbit DRAM (TOSHIBA).

FIGURE 1.5  Trends of CPU and DRAM parameters (ITRS roadmap values at introduction): (a) minimum logic Vdd, MPU gate length Lg, junction depth Xj, DRAM half pitch, drain current Id, and equivalent Tox versus year (1970-2020), compared with the wavelength of the electron, the tunneling limit in SiO2, and the bond length of Si atoms (physical limit); (b) MPU maximum current, clock frequency, chip size, power, and number of transistors, and DRAM capacity and chip size, versus year.


1.2 Downsizing below 0.1 µm

In digital circuit applications, a MOSFET functions as a switch. Thus, complete cut-off of the leakage current in the "off" state and low resistance, or high current drive, in the "on" state are required. In addition, small capacitances are required for the switch to turn on and off rapidly. When the gate length is made small, even in the "off" state the space charge region near the drain (the high-potential region near the drain) touches the source at a depth where the gate bias cannot control the potential, resulting in a leakage current from source to drain via the space charge region, as shown in Fig. 1.6. This is the well-known short-channel effect of MOSFETs. The short-channel effect is often measured as the threshold voltage reduction of MOSFETs when it is not severe. In order for a MOSFET to work as a component of an LSI, the capability of switching off, that is, the suppression of the short-channel effects, is the first priority in MOSFET design. In other words, the suppression of the short-channel effects limits the downsizing of MOSFETs.

In the "on" state, reduction of the gate length is desirable because it decreases the channel resistance of the MOSFET. However, when the channel resistance becomes as small as the source and drain resistance, further improvement in the drain current or the MOSFET performance cannot be expected. Moreover, in short-channel MOSFET design, the source and drain resistance often tends even to increase in order to suppress the short-channel effects. Thus, it is important to consider ways of reducing the total resistance of MOSFETs while keeping the short-channel effects suppressed. The capacitances of MOSFETs usually decrease with downsizing, but care should be taken when the fringing portion is dominant or when the impurity concentration of the substrate is large in the short-channel transistor design. Thus, suppression of the short-channel effects, together with improvement of the total resistance and capacitances, is required for MOSFET downsizing. In other words, without improvement of the MOSFET performance, downsizing becomes almost meaningless even if the short-channel effect is completely suppressed.

To suppress the short-channel effects and thus to secure good switching-off characteristics of MOSFETs, the scaling method was proposed by Dennard et al. [3], in which the parameters of MOSFETs are shrunk or increased by the same factor K, as shown in Figs. 1.7 and 1.8, resulting in the reduction of the space charge region by the same factor K and suppression of the short-channel effects. In the scaling method, the drain current, Id (= W/L × V²/tox), is reduced to 1/K. Even though the drain current is reduced to 1/K, the propagation delay time of the circuit is reduced to 1/K, because the gate charge is reduced to 1/K². Thus, scaling is advantageous for high-speed operation of LSI circuits. If the increase in the number of transistors is kept at K², the power consumption of the LSI, which is calculated as (1/2)fnCV² as shown in Fig. 1.7, stays constant and does not increase with the scaling. Thus, in ideal scaling, no power increase occurs.

FIGURE 1.6  Short-channel effect at downsizing: in the off state (gate and source at 0 V, drain at Vdd), the space charge region around the drain reaches the source, and a leakage current flows from source to drain through the space charge region.

TABLE 1.2  Real Scaling (Research Level)

                           1972        2001        Ratio    Limiting Factor
Gate length                6 µm        0.1 µm      1/60
Gate oxide                 100 nm      2 nm        1/50     Gate leakage, TDDB
Junction depth             700 nm      35 nm       1/20     Resistance
Supply voltage             5 V         1.3 V       1/3.8    Vth
Threshold voltage          0.8 V       0.35 V      1/2      Subthreshold leakage
Electric field (Vd/tox)    0.5 MV/cm   6.5 MV/cm   13       TDDB

FIGURE 1.7  Parameters change by ideal scaling (scaling factor K):
Drain current: Id → 1/K
Gate area: Sg = Lg · Wg → 1/K²
Gate capacitance: Cg = a · Sg/tox → 1/K
Gate charge: Qg = Cg · Vg → 1/K²
Propagation delay time: tpd = a · Qg/Id → 1/K
Clock frequency: f = 1/tpd → K
Chip area: Sc: set constant → 1
Number of transistors in a chip: n → K²
Power consumption: P = (1/2) · f · n · Cg · Vd² → 1

FIGURE 1.8  Ideal scaling method: dimensions X, Y, Z → 1/K; voltage V → 1/K; substrate doping Na → K; depletion width D (∝ V/Na) → 1/K; current I → 1/K.
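The scaling relations listed in Figs. 1.7 and 1.8 can be checked numerically. The short Python sketch below applies an ideal scaling factor K to a hypothetical baseline device and confirms that capacitance, current, and delay improve by 1/K while chip power stays constant when the transistor count grows as K² and the clock frequency as K. All baseline numbers are round, illustrative values, not data from the chapter.

```python
# Numerical check of the ideal-scaling relations in Figs. 1.7 and 1.8.
# Baseline device values are hypothetical round numbers used only to verify ratios.

def device_quantities(Lg, W, tox, V, n, f):
    """Return (Cg, Id, tpd, P) for a simplified device/chip description."""
    eps_ox = 3.45e-11                 # F/m, permittivity of SiO2
    Cg = eps_ox * Lg * W / tox        # gate capacitance
    Id = (W / Lg) * V ** 2 / tox      # drain current (proportional form used in the text)
    tpd = Cg * V / Id                 # propagation delay (proportional form)
    P = 0.5 * f * n * Cg * V ** 2     # chip power consumption
    return Cg, Id, tpd, P

def scaled_ratios(K, Lg=1e-6, W=10e-6, tox=20e-9, V=5.0, n=1e6, f=10e6):
    base = device_quantities(Lg, W, tox, V, n, f)
    # Ideal scaling: dimensions and voltage shrink by 1/K, transistor count
    # rises by K^2, and clock frequency rises by K.
    scaled = device_quantities(Lg / K, W / K, tox / K, V / K, n * K ** 2, f * K)
    return {name: s / b for name, b, s in zip(("Cg", "Id", "tpd", "P"), base, scaled)}

print(scaled_ratios(K=2.0))   # expect Cg, Id, tpd -> 0.5 (= 1/K) and P -> 1.0
```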

However, the actual scaling of the parameters has been different from the ideal scaling originally proposed, as shown in Table 1.2 and in Fig. 1.5(a). The major difference is the supply voltage reduction. The supply voltage was not reduced in the early LSI generations in order to keep compatibility with the supply voltage of conventional systems, and also in order to obtain higher operation speed under a higher electric field. The supply voltage started to decrease from the 0.5 µm generation, because the electric field across the gate oxide would otherwise have exceeded 4 MV/cm, which had been regarded as the limit in terms of TDDB (time-dependent dielectric breakdown), although recently the maximum field has been allowed to rise to higher values, and because hot-carrier-induced degradation of the short-channel MOSFETs would have been above the allowable level.


FIGURE 1.9  Subthreshold leakage current at low Vth: on the log Id versus Vg characteristic, lowering Vth increases the drain leakage current at Vg = 0 V (from the 10^-10 A range toward the 10^-6 A range).

FIGURE 1.10  Scaling limitation factors for Si MOSFETs below 0.1 µm: supply voltage Vd (1/K, 1.2-0.9 V; operation speed), threshold voltage Vth (1/K, 0.3 V; subthreshold leakage), gate oxide tox (1/K, 3 nm; tunneling, TDDB), junction depth xj (1/K, 40 nm; resistance increase), substrate doping concentration (K, 10^18 cm^-3; junction leakage through tunnel diodes), and device dimension/lithography (1/K, 0.1 µm).

However, it is now not easy to reduce the supply voltage further because of difficulties in reducing the threshold voltage of the MOSFETs. Too small a threshold voltage leads to a significantly large subthreshold leakage current even at a gate voltage of 0 V, as shown in Fig. 1.9. If it had been necessary to reduce the supply voltage of 0.1 µm MOSFETs by the same ratio as the dimension reduction, the supply voltage would have been 0.08 V (= 5 V/60) and the threshold voltage 0.013 V (= 0.8 V/60), and the scaling method would have broken down. The use of a voltage higher than expected from the original scaling is one reason for the increase in power. The increase in the number of transistors in a chip by more than the factor K² is another reason. In fact, the transistor size decreases by a factor of 0.7 and the transistor area by a factor of 0.5 (= 0.7 × 0.7) for every generation, and thus the number of transistors is expected to increase by a factor of 2. In reality, however, the increase cannot wait for the downsizing, and the actual increase is by a factor of 4. The area needed for the additional factor of 2 is earned by increasing the chip area by a factor of 1.5 and further by extending the area in the vertical direction, introducing multilayer interconnects, double polysilicon, and trench/stack DRAM capacitor cells.

In order to downsize MOSFETs to sub-0.1 µm dimensions, further modification of the scaling method is required, because some of the parameters have already reached their scaling limits in the 0.1 µm generation, as shown in Fig. 1.10. In the 0.1 µm generation, the gate oxide thickness is already below the direct-tunneling leakage limit of 3 nm. The substrate impurity concentration (or the channel impurity concentration) has already reached 10^18 cm^-3. If the concentration is further increased, the source-substrate and drain-substrate junctions become highly doped pn junctions and act as tunnel diodes, and the isolation of the source and drain from the substrate can no longer be maintained.
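To make the subthreshold constraint on Vth concrete, the sketch below estimates how the Vg = 0 V leakage grows as Vth is lowered, using the standard exponential dependence I_off ∝ 10^(-Vth/S). Both the subthreshold swing S and the reference current are assumed, illustrative values, not data from the chapter.

```python
# Rough estimate of subthreshold leakage at Vg = 0 V for different Vth values
# (cf. Fig. 1.9).  Assumes I_off ~ I_at_Vth * 10**(-Vth/S); the swing S and the
# current level at threshold are hypothetical illustrative values.

def i_off_na_per_um(vth_v, swing_v=0.085, i_at_vth_na=300.0):
    """Drain leakage per micron of gate width at Vg = 0 V (nA/um)."""
    return i_at_vth_na * 10.0 ** (-vth_v / swing_v)

for vth in (0.8, 0.35, 0.25, 0.1):
    print(f"Vth = {vth:4.2f} V  ->  I_off ~ {i_off_na_per_um(vth):.3g} nA/um")
```

Every ~85 mV taken off Vth costs about one decade of standby leakage per transistor, which is why the threshold, and with it the supply voltage, cannot follow the ideal 1/K factor.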


FIGURE 1.11  Top view of 40 nm gate length MOSFETs [4].

FIGURE 1.12  Cross-sectional TEM image of 1.5 nm gate oxide [5].

The threshold voltage has already decreased to 0.3-0.25 V, and further reduction causes a significant increase in subthreshold leakage current. Further reduction of the threshold voltage, and thus of the supply voltage, is difficult. In the 1990s, fortunately, these difficulties were shown to be solvable by the invention of new techniques, further modification of the scaling, and some new findings on short gate length MOSFET operation. In the following, examples of the solutions for the front end of line are described.

In 1993, the first successful operation of sub-50 nm n-MOSFETs was reported [4], as shown in Fig. 1.11. In the fabrication of these MOSFETs, 40 nm gate electrodes were realized by introducing a resist-thinning technique using oxygen plasma. In the scaling, the substrate (or channel doping) concentration was not increased any further, and the gate oxide thickness was not decreased (because it was not believed that MOSFETs with direct-tunneling gate leakage operate normally); instead, decreasing the junction depth more aggressively than in ordinary scaling was found to be effective in suppressing the short-channel effect and thus obtaining good operation in the sub-50 nm region. A 10-nm-deep S/D junction was realized by introducing solid-phase diffusion by RTA from a PSG gate sidewall.


FIGURE 1.13  Epitaxial channel MOSFETs (June 1993) [9]: channel ion implantation followed by selective Si epitaxial growth and MOSFET fabrication.

FIGURE 1.14  S4D (Silicided Silicon-Sidewall Source and Drain) MOSFETs (June 1995) [9]: low resistance with an ultra-shallow junction, using a TiSi2-silicided Si sidewall and SiN spacer self-aligned to the gate.

In 1994, it was found that MOSFETs with gate SiO2 thinner than 3 nm (for example, 1.5 nm, as shown in Fig. 1.12 [5]) operate quite normally when the gate length is small. This is because the gate leakage current decreases in proportion to the gate length, while the drain current increases in inverse proportion to the gate length; as a result, the gate leakage current can be negligibly small in the normal operation of MOSFETs. The performance of the 1.5 nm gate oxide devices was record-breaking, even at low supply voltage.

In 1993, it was proposed that an ultrathin epitaxial layer, shown in Fig. 1.13, is very effective for realizing super-retrograde channel impurity profiles that suppress the short-channel effects, and it was confirmed by simulation that 25 nm gate length MOSFETs operate well [6]. In 1993 and 1995, epitaxial channel MOSFETs with buried [7] and surface [8] channels, respectively, were fabricated, and high drain current drive with excellent suppression of the short-channel effects was experimentally confirmed.

In 1995, a new raised (or elevated) S/D structure was proposed, as shown in Fig. 1.14 [10]. In this structure, the extension portion of the S/D is elevated and self-aligned to the gate electrode by using a silicided silicon sidewall. By minimizing the Si3N4 spacer width, the extension S/D resistance was dramatically reduced.

In 1991, NiSi salicide was presented for the first time, as shown in Fig. 1.15 [10]. NiSi has several advantages over TiSi2 and CoSi2 salicides, especially for use in the sub-50 nm regime. Because NiSi is a monosilicide, silicon consumption during the silicidation is small, and silicidation can be accomplished at low temperature. These features are suitable for ultra-shallow junction formation.


FIGURE 1.15  NiSi salicide [10].

FIGURE 1.16  ITRS'99 roadmap (with NTRS'94, the 2000 update, and Intel data and plans for comparison): (a) CPU clock frequency, (b) gate length, (c) supply voltage, and (d) gate insulator thickness (with the SiO2 direct-tunneling limit and possible high-k insulators indicated).

With NiSi salicide, there was no narrow-line effect (an increase in the sheet resistance of narrow silicide lines) and no bridging failure caused by the formation of a silicide path on the gate sidewall between the gate and S/D. NiSi contact resistances to both n+ and p+ Si are small. These properties are suitable for reducing the source, drain, and gate resistance of sub-50 nm MOSFETs.

The previous discussion provides examples of possible solutions that the authors found in the 1990s for the sub-50 nm gate length generation; many solutions have also been found by others. In any case, with the possible solutions demonstrated for the sub-50 nm generation, as well as the keen competition among semiconductor chipmakers for high performance, the downsizing trend, or roadmap, has been significantly accelerated since the late 1990s, as shown in Fig. 1.16. The first roadmap for downsizing was published in 1994 by the SIA (Semiconductor Industry Association, USA) as NTRS'94 (National Technology Roadmap for Semiconductors) [11]; at that time, the roadmap was not yet an international version. In NTRS'94, the clock frequency was expected to stay at 600 MHz in year 2001 and to exceed 1 GHz only in 2007; however, it has already reached 2.1 GHz for 2001 in ITRS 2000 [12].


In order to realize high clock frequencies, the gate length reduction was accelerated. In fact, in NTRS'94 the gate length was expected to stay at 180 nm in year 2001 and to reach 100 nm only in 2007, but the gate length is 90 nm in 2001 according to ITRS 2000, as shown in Fig. 1.16(b). The real world is even more aggressive. As shown in Fig. 1.16(a), the clock frequency of Intel's MPUs already reached 1.7 GHz [12] in April 2001, and Intel's roadmap for gate length reduction is extremely aggressive, as shown in Fig. 1.16(b) [13,14]. In that roadmap, a 30-nm gate length CMOS MPU with 70-nm node technology is to be sold in the market in year 2005, several years in advance of the ITRS 2000 prediction.

With the increase in clock frequency and the decrease in gate length, together with the increase in the number of transistors in a chip, the tremendous increase in power consumption becomes the main issue. In order to suppress the power consumption, the supply voltage should be reduced aggressively, as shown in Fig. 1.16(c), and in order to maintain high performance under the low supply voltage, the gate insulator thickness should be reduced very aggressively as well. In NTRS'94, the gate insulator was not expected to become thinner than 3 nm throughout the period covered by the roadmap, but it is already 1.7 nm in products in 2001, and it is expected to be 1.0 nm in 2005 according to ITRS'99 and 0.8 nm in Intel's roadmap, as shown in Fig. 1.16(d). In terms of the total gate leakage current of an entire LSI chip for mobile cellular phone use, in which standby power consumption should be minimized, 2 nm is already too thin. Thus, high-k materials, which in NTRS'94 were assumed to be introduced after year 2010 at the earliest, are now very seriously investigated in order to replace SiO2 and to extend the limit of gate insulator thinning. The introduction of new materials is being considered not only for the gate insulator but for almost every portion of the CMOS structure. More detailed explanations of new technologies for future CMOS are given in the following sections.

1.3 Gate Insulator

Figure 1.17 shows the gate length (Lg) versus gate oxide thickness (tox) data published at recent conferences [4,5,14-19]. The x-axis at the bottom represents the production year corresponding to the gate length according to ITRS 2000. The solid curve in the figure is the Lg versus tox relation according to ITRS 2000 [12]. It should be noted that most of the published MOSFETs maintain the scaling relationship between Lg and tox predicted by ITRS 2000. Figures 1.18 and 1.19 show Vd versus Lg and Id (or Ion) versus Lg curves, respectively, obtained from the data published at the conferences.

FIGURE 1.17  Trend of Tox: gate length versus equivalent oxide thickness for published devices (Toshiba'93, Toshiba'94, Lucent'99, IBM'99 SOI, Intel'99, Intel'00, and Intel plans), indicating which devices fall almost within and which fall outside the ITRS specification.

FIGURE 1.18  Trend of Vdd: gate length versus supply voltage for published devices (Toshiba, Tox 1.5 nm; Lucent, Tox 1.3 nm; Intel, Tox 2 nm; Intel plan, Tox 0.8 nm).

FIGURE 1.19  Trend of drain current: Ion versus gate length for published devices (Toshiba'93 and '94, IBM'99 SOI, Lucent'99, Intel'99, Intel'00, Intel 2000 plan, and the NEC'99 EJ-MOSFET) compared with the ITRS'99 plan, distinguishing devices with ITRS scaling parameters from those with thicker gate insulators than ITRS.

From the data, it can be estimated that MOSFETs will operate quite well, satisfying the Ion value specified by the roadmap, down to the generation around Lg = 30 nm. One small concern is that Ion starts to decrease from Lg = 100 nm and could become smaller than the value specified by the roadmap from Lg = 30 nm. This is due to the increase in the S/D extension resistance in small gate length MOSFETs. In order to suppress the short-channel effects, the junction depth of the S/D extension needs to be reduced aggressively, resulting in a high sheet resistance. This should be solved by raised (or elevated) S/D structures. The effect is even more significant in the operation of an 8-nm gate length EJ-MOSFET [20], as shown in Fig. 1.19. In that structure, the S/D extension consists of an inversion layer created by a high positive bias applied to a second gate electrode, which is placed to cover the 8-nm first gate electrode and the S/D extension area. Thus, reduction of the S/D extension resistance will be another limiting factor for CMOS downsizing, coming next after the limit in thinning the gate SiO2.

In any case, it seems at this moment that the SiO2 gate insulator can be used down to sub-1 nm thickness with sufficient MOSFET performance. A concern was raised in 1998 that TDDB (time-dependent dielectric breakdown) would limit SiO2 gate insulator thinning at tox = 2.2 nm [21]; however, recent results suggest that TDDB is acceptable down to tox = 1.5-1.0 nm [22-25]. Thus, the SiO2 gate insulator can probably be used until the 30 nm gate length generation for high-speed MPUs.


This is a big change from the earlier predictions. Until only several years ago, most people did not believe in the possibility of gate SiO2 thinning below 3 nm because of the direct-tunneling leakage current, and until only two years ago many people were sceptical about the use of sub-2 nm gate SiO2 because of the TDDB concern. However, even if excellent characteristics of MOSFETs with high reliability are confirmed, the total gate leakage current of an entire LSI chip can become the limiting factor. It should be noted that a gate leakage current of 10 A/cm² flows across the gate SiO2 at tox = 1.2 nm and 100 A/cm² at tox = 1.0 nm. Nevertheless, AMD has claimed that 1.2 nm gate SiO2 (actually oxynitrided) can be used for high-end MPUs [26]. Furthermore, Intel has announced that a total-chip gate leakage current density of even 100 A/cm² is allowable for their MPUs [14], and that even 0.8 nm gate SiO2 (actually oxynitrided) can be used for products in 2005 [15]. The total gate leakage current could be minimized by providing plural gate oxide thicknesses within a chip and by limiting the number of ultra-thin-oxide transistors; however, in any case, such a high gate leakage current density is a big burden for mobile devices, in which reduction of standby power consumption is critically important. In the cellular phone application, even the leakage current at tox = 2.5 nm would be a concern. Thus, the development of a high dielectric constant (high-k) gate insulator with small gate leakage current is strongly demanded. However, intensive study and development of high-k gate dielectrics started only a few years ago, and it is expected that we have to wait at least another few years until a high-k insulator becomes mature enough for production.

The necessary conditions for the dielectrics are as follows [27]: (i) the dielectrics remain in the solid phase at process temperatures of up to about 1000 K, (ii) the dielectrics are not radioactive, and (iii) the dielectrics are chemically stable at the Si interface at high process temperature, meaning that no barrier film is necessary between the Si and the dielectric. Considering these conditions, the white columns in the periodic table of the elements shown in Fig. 1.20 remain as metals whose oxides could be used as high-k gate insulators [27]. It should be noted that, from this point of view, Ta2O5 is now regarded as not very suitable for use as the gate insulator of a MOSFET. Figure 1.21 shows the statistics of the high-k dielectrics (excluding Si3N4) and their formation methods published recently [28-43]. In most cases, capacitance equivalent thicknesses to SiO2 (CET) of 0.8-2.0 nm were tested for the gate insulators of MOS diodes and MOSFETs, and leakage currents several orders of magnitude lower than that of SiO2 film were confirmed. Higher TDDB reliability than in the SiO2 case was also reported.
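As a rough illustration of the chip-level impact of the leakage densities quoted above (10 A/cm² at tox = 1.2 nm, 100 A/cm² at tox = 1.0 nm), the sketch below converts a gate leakage current density into a total chip gate leakage current and power for an assumed transistor count, average gate area, and supply voltage. The chip parameters are hypothetical examples chosen only to show orders of magnitude; they are not figures from the text.

```python
# Convert gate leakage current density into total chip gate leakage (cf. Section 1.3).
# The transistor count, average gate area, and supply voltage are assumed,
# illustrative values, not data from the chapter.

def chip_gate_leakage(j_a_per_cm2, n_transistors=50e6, avg_gate_area_um2=0.02, vdd=1.2):
    total_gate_area_cm2 = n_transistors * avg_gate_area_um2 * 1e-8  # um^2 -> cm^2
    i_total = j_a_per_cm2 * total_gate_area_cm2                     # A
    p_total = i_total * vdd                                         # W
    return i_total, p_total

for tox_nm, j in ((1.2, 10.0), (1.0, 100.0)):   # leakage densities quoted in the text
    i, p = chip_gate_leakage(j)
    print(f"tox = {tox_nm} nm: J = {j:5.1f} A/cm^2 -> I = {i:.2f} A, P = {p:.2f} W")
```

Roughly a watt of standby power from gate leakage alone is acceptable for a desktop MPU but rules out such oxides for battery-powered, standby-limited products, which is the motivation for high-k insulators given in the text.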

FIGURE 1.20  Metal oxide gate insulators reported since Dec. 1998 [27]: a periodic table marking elements whose oxides react with Si or fail for other reasons, and those whose oxides have been reported as gate insulators since Dec. 1999 (MRS, IEDM, ECS, VLSI). Plotted on the material given by J. R. Hauser at the IEDM Short Course on Sub-100 nm CMOS (1999).


FIGURE 1.21  Recently reported (a) high-k materials: ZrO2, Zr silicate, Zr aluminate, Zr-Al silicate, HfO2, Hf silicate, La2O3, La silicate, LaAlO3, Pr2O3, CeO2, Gd2O3, Y2O3, Y silicate, BST, Al2O3, Al silicate, Ta2O5, Ta silicate, TiO2, Ti silicate, TixTayO, and TaOxNy; and (b) deposition methods: MOCVD, LPCVD, CVD, ALCVD, MBE (amorphous), sputtering, RPE, and PLD.

Among the candidates, ZrO2 [29-31,34-37] and HfO2 [28,32,34,36,38-40] have become popular because their dielectric constants are relatively high and because ZrO2 and HfO2 were believed to be stable at the Si interface. In reality, however, the formation and growth of an interfacial layer made of silicate (ZrSixOy, HfSixOy) or SiO2 at the Si interface during the MOSFET fabrication process has been a serious problem. This interfacial layer reduces the total capacitance and is thought to be undesirable for obtaining high MOSFET performance. An ultrathin nitride barrier layer seems to be effective in suppressing the interfacial layer formation [37]. There is a report that the mobility of MOSFETs with ZrO2, even with these interfacial layers, was significantly degraded by several tens of percent, while the mobility with entirely Zr silicate gate dielectrics was the same as that with SiO2 gate film [31]. Thus, there is an argument that a thicker interfacial silicate layer would help improve the mobility as well as suppress the gate leakage current; however, in another experiment, it was reported that the mobility of HfO2 gate oxide MOSFETs was not degraded [38]. As another problem, it was reported that ZrO2 and HfO2 easily form micro-crystals during thermal processing [31,33].

Compared with ZrO2 and HfO2, La2O3 film has been reported to have better characteristics at this moment [33]. No interfacial silicate layer was formed, and the mobility was not degraded at all. The dielectric constant was 20-30. Another merit of the La2O3 insulator is that no micro-crystal formation was found in the high-temperature processing of MOSFET fabrication [33]. There is a strong concern about its hygroscopic property, although that property was not observed in the paper [33]. However, a different paper has been published [34] in which La2O3 film is reported to form a silicate very easily during thermal processing. Thus, we have to watch the next reports of the La2O3 experiments. Crystalline Pr2O3 film grown epitaxially on a silicon substrate is reported to have a small leakage current [42]; however, significant film volume expansion from absorbing moisture from the air was observed. La and Pr are just two of the 15 elements in the lanthanoid series, and there is a possibility that some other lanthanoid oxide has even better characteristics for the gate insulator. Fortunately, the abundance of the lanthanoids, Zr, and Hf in the earth's crust is much larger than that of Ir, Bi, Sb, In, Hg, Ag, Se, Pt, Te, Ru, and Au, as shown in Fig. 1.22.

Al2O3 [41,43] is another candidate, though its dielectric constant is only around 10. The biggest problem for Al2O3 is that the film thickness dependence of the flatband shift due to fixed charge is so strong that controlling the flatband voltage is very difficult; this problem should be solved before it is used in production. There is a possibility that Zr, Hf, La, and Pr silicates will be used for the next generation of gate insulators, with the dielectric constant sacrificed to around 10 [31,35,37]; as described above, the silicates have been reported to prevent the formation of micro-crystals and the degradation of mobility. Furthermore, stacked Si3N4 and SiO2 layers may be used for mobile device applications. Si3N4 could be introduced soon, even though its dielectric constant is not very high [44-46], because it is relatively mature for use in silicon LSIs.
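The impact of an unwanted interfacial layer can be quantified with the series-capacitor relation for equivalent oxide thickness: the SiO2-equivalent thicknesses of the stacked films simply add. The sketch below evaluates this for an assumed HfO2 film over a thin SiO2-like interfacial layer; the thicknesses and dielectric constants are illustrative assumptions, not measured values from the chapter.

```python
# SiO2-equivalent thickness (EOT/CET, ignoring quantum and poly-depletion terms)
# of a high-k / interfacial-layer gate stack.  Capacitances add in series, so
# the SiO2-equivalent thicknesses of the layers add.  All values are assumed.

K_SIO2 = 3.9

def eot_nm(t_highk_nm, k_highk, t_interfacial_nm, k_interfacial=K_SIO2):
    """SiO2-equivalent thickness of a two-layer gate dielectric stack (nm)."""
    return (t_highk_nm * K_SIO2 / k_highk
            + t_interfacial_nm * K_SIO2 / k_interfacial)

# 3 nm of HfO2 (k ~ 20 assumed) alone vs. the same film over a 0.7 nm SiO2-like
# interfacial layer: the interfacial layer dominates the scaled EOT.
print(f"HfO2 only        : EOT = {eot_nm(3.0, 20.0, 0.0):.2f} nm")
print(f"HfO2 + 0.7 nm IL : EOT = {eot_nm(3.0, 20.0, 0.7):.2f} nm")
```

Under these assumptions a 0.7 nm interfacial layer more than doubles the equivalent thickness of the stack, which is why its suppression (or its deliberate use as a silicate) is such a central issue in the discussion above.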


FIGURE 1.22  Clarke number (crustal abundance, in ppb) of the elements, from the most abundant (O, Si, Al, Fe, Ca, Na, K, Mg, Ti, ...) through Zr, the lanthanoids (Ce, Y, La, Nd, Pr, Sm, Gd, Dy, Yb, Er, ...), and Hf, down to the least abundant (I, Bi, Sb, Cd, In, Hg, Ag, Se, Pt, Te, Pd, Ru, Rh, Au, Ir, Os, Re).

1.4 Gate Electrode

Figure 1.23 shows the changes in the gate electrode of MOSFETs. Originally, an Al gate was used, but it was soon replaced by a poly-Si gate because of the latter's compatibility with the high-temperature processing and acid-solution cleaning steps of MOSFET fabrication. In particular, the poly-Si gate formation step can be placed before the S/D (source and drain) formation, which enables easy self-alignment of the S/D to the gate electrode, as shown in the figure. In the metal gate case, the gate electrode formation has to come at the final part of the process to avoid the high-temperature and acid steps, and thus self-alignment is difficult. With a damascene gate process, self-alignment is possible, but the process becomes complicated, as shown in the figure [47]. A refractory metal gate with a conventional gate electrode process and structure would be another solution, but RIE (reactive ion etching) of such metals with good selectivity to the gate dielectric film is very difficult at this moment.

As shown in Fig. 1.24, the poly-Si gate has the serious problem of depletion layer formation, and this effect cannot be ignored when the gate insulator becomes thin. Thus, despite the above difficulties, a metal gate is desirable and assumed to be necessary for future CMOS devices. However, there is another difficulty in introducing metal gates to CMOS. For advanced CMOS, the work function of the gate electrode should be chosen differently for n- and p-MOSFETs to adjust the threshold voltages to their optimum values. Channel doping can shift the threshold voltage, but cannot adjust it to the right value while maintaining good control of the short-channel effects. Thus, an n+-doped poly-Si gate is used for NMOS and a p+-doped poly-Si gate for PMOS. In the metal gate case, it is assumed that two different metals would have to be used for N- and PMOS in the same manner, as shown in Table 1.3. This makes the process even more complicated and makes device engineers hesitate to introduce the metal gate. Thus, in the short range (probably down to the 70 or 50 nm node), heavily doped poly-Si or poly-SiGe gate electrodes will be used, but in the long range, metal gates should be seriously considered.


TABLE 1.3  Candidates for Metal Gate Electrodes (work function, unit: eV)

Midgap:            W 4.52, TiN 4.7, Ru 4.71
Dual gate, NMOS:   Hf 3.9, Zr 4.05, Al 4.08, Ti 4.17, Ta 4.19, Mo 4.2
Dual gate, PMOS:   RuO2 4.9, WN 5.0, Ni 5.15, Ir 5.27, Mo2N 5.33, TaN 5.41, Pt 5.65

FIGURE 1.23  Gate electrode formation change: Al gate; poly-Si gate; polycide gate (MoSi2 or WSi2 on poly Si); salicide gate (CoSi2 on poly Si); poly-metal gate (W/WNx on poly Si); and metal gate (TiN, Mo, Al, W, etc.). Conventional poly-Si processing allows self-alignment of the S/D to the gate and self-aligned contacts, whereas the Al gate required a misalignment margin between S/D and gate and overlapped, non-self-aligned contacts. A research-level metal gate can be made with a damascene flow: dummy poly-Si gate, ILD deposition and CMP, removal of the dummy gate, then gate dielectric, barrier metal, metal fill, and CMP.

FIGURE 1.24  Depletion in the poly-Si gate: under positive bias, a depletion layer forms in the poly-Si gate above the ion-implanted region and adds to the effective gate dielectric thickness over the inversion layer.

1.5 Source and Drain

Figure 1.25 shows the changes in the S/D (source and drain) formation process and structure. The S/D becomes shallower with every new generation in order to suppress the short-channel effects. Formerly, the extension part of the S/D was called the LDD (lightly doped drain) region, and a low doping concentration was required in order to suppress the electric field at the drain edge and hence the hot-carrier effect; the source side was made symmetrical with the drain side for process simplicity. Recently, the major concern in S/D formation has been how to realize an ultra-shallow extension with low resistance. Thus, the extension should be doped as heavily as possible and the activation of the impurity should be as high as possible. Table 1.4 shows the trends of the junction depth and sheet resistance of the extension requested by ITRS 2000. As the generations proceed, the junction depth becomes shallower, but at the same time the sheet resistance must be reduced. This is extremely difficult. In order to satisfy this request, various doping and activation methods are being investigated. As doping methods, low-energy implantation at 2-0.5 keV [48] and low-energy plasma doping [49] are thought to be the most promising at this moment. The problems of low-energy doping are a lower retained dose and a lower activation rate of the implanted species [48]. As the activation method, high-temperature spike lamp annealing [48] is the best approach at this moment. In order to suppress the diffusion of the dopant and to keep the over-saturated activation of the dopant, the spike should be as steep as possible. Laser annealing [50] can realize very high activation, but the very high temperature, above the melting point, at the silicon surface is a concern. Usually a laser can anneal only the surface of the doped layer, and thus the deeper portion may need to be annealed by combining it with spike lamp annealing.

TABLE 1.4  Trend of S/D Extension by ITRS

Year:                               1999, 2000, 2001, 2002, 2003, 2004, 2005, 2008, 2011, 2014
Technology node (nm):               180 (1999), 130 (2002), 100 (2005), 70 (2008), 50 (2011), 35 (2014)
Gate length (nm):                   140, 120, 100, 85, 80, 70, 65, 45, 32, 22
Extension Xj (nm):                  42-70, 36-60, 30-50, 25-43, 24-40, 20-35, 20-33, 16-26, 11-19, 8-13
Extension sheet resistance (Ω/sq):  350-800, 310-760, 280-730, 250-700, 240-675, 220-650, 200-625, 150-525, 120-450, 100-400
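The sheet resistance targets in Table 1.4 translate into a series resistance contribution of roughly R = R_sheet × (extension length / device width) per side. The snippet below works this out for two generations using mid-range sheet resistances from the table and assumed extension lengths; the extension lengths are illustrative guesses, not values from the table.

```python
# Series resistance contributed by the S/D extension per side:
#   R_ext ~ R_sheet * (extension length / device width)
# Extension lengths are assumed illustrative values; sheet resistances are
# taken from roughly the middle of the ranges in Table 1.4.

def r_ext_ohm_um(r_sheet_ohm_sq, ext_length_nm):
    """Extension resistance normalized to 1 um of device width (ohm * um)."""
    width_nm = 1000.0                      # 1 um reference width
    return r_sheet_ohm_sq * ext_length_nm / width_nm

cases = [
    ("1999 (180 nm node)", 575.0, 60.0),   # mid-range R_sheet, assumed 60 nm extension
    ("2014 (35 nm node)",  250.0, 12.0),   # mid-range R_sheet, assumed 12 nm extension
]
for label, r_sheet, ext_len in cases:
    print(f"{label}: R_ext ~ {r_ext_ohm_um(r_sheet, ext_len):.0f} ohm*um per side")
```

Even a few tens of ohm·µm per side is comparable to the inverse of the Ion targets in Fig. 1.19, which is why the extension resistance, and hence elevated S/D structures, become so important at small gate lengths.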

FIGURE 1.25  Source and drain change: from gas/solid-phase diffusion (P, B) to ion-implanted diffused layers (As, P, B), then LDD (lightly doped drain) structures (P, As, B, BF2), and finally low-energy ion-implanted extensions (As, BF2) with pocket/halo implants (As, BF2, In).

TABLE 1.5  Physical Properties of Silicides

                            MoSi2    WSi2    C54-TiSi2   CoSi2     NiSi
Resistivity (µΩ cm)         100      70      10~15       18~25     30~40
Forming temperature (°C)    1000     950     750~900     550~900   400
Diffusion species           Si       Si      Si          Co*       Ni

* Si (CoSi), Co (Co2Si).

FIGURE 1.26  Elevated source and drain: comparison of LDD, SPDD, and S4D structures in terms of series resistance (Ω·µm) versus Vg at Vd = -1.5 V, and subthreshold swing S (mV/decade) versus Lg.

In order to further reduce the sheet resistance, an elevated S/D structure for the extension is necessary, as shown in Fig. 1.26 [6]. Elevated S/D will be introduced at the latest from the sub-30 nm gate length generation, because the sheet resistance of the S/D will be the major limiting factor of device performance in that generation. Salicide is a very important technique for reducing the resistance of the extrinsic part of the S/D (the resistance of the deep S/D and the contact resistance between the S/D and metal). Table 1.5 shows the changes in salicide/silicide materials. CoSi2 is the material currently used for the salicide. For the future, NiSi is regarded as promising because it consumes less silicon during the silicidation reaction [10].

1.6 Channel Doping

Channel doping is an important technique not only for adjusting the threshold voltage of MOSFETs but also for suppressing the short-channel effects. As described in the explanation of the scaling method, the doping of the substrate, or of the channel region, should be increased with the downsizing of the device dimensions; however, doping the entire substrate too heavily causes several problems, such as too high a threshold voltage and too low a breakdown voltage of the S/D junctions. Thus, the heavily doped portion should be limited to the place where suppression of the depletion layer is necessary, as shown in Fig. 1.27. In other words, a retrograde doping profile, in which only a deep portion is heavily doped, is required. To realize an extremely sharp retrograde profile, growth of undoped epitaxial silicon on the heavily doped channel region is the most suitable method, as shown in the figure [7-9]. This is called the epitaxial channel technique. The epitaxial channel will be necessary from the sub-50 nm gate length generations.

© 2002 by CRC Press LLC

Boron Concentration (cm-3)

Retrograde profile 1019 1018 1017 1016

epi Si

10150

Si sub.

0.1

D

S

0.2 0.3 0.4 0.5 Depth (µm) D

S

Depletion region

Depletion region

Highly doped region

FIGURE 1.27

Retrograde profile.

Al

Al-Si

Al-Cu Al-(Si)-Cu Ti/TiN

FIGURE 1.28

Al-Si-Cu

Global

Cu TaN SiN

W Intermediate Local

W

Interconnect change.

1.7 Interconnects Figure 1.28 shows the changes of interconnect structures and materials. Aluminium has been used for many years as the interconnect metal material, but now it is being replaced by cupper with the combination of dual damascene process shown in Fig. 1.29, because of its superior characteristics on the resistivity and electromigration [51,52]. Figure 1.30 shows some problems for the CMP process used for the copper damascene, which is being solved. The major problem for future copper interconnects is the necessity of diffusion barrier layer, as shown in Fig. 1.31. The thickness of the barrier layer will consume major part of the cross-section area of copper interconnects with the reduction of the dimension, because it is very difficult to thin the barrier films less than several nanometers. This leads to significant increase in the resistance of the interconnects. Thus, in 10 years, diffusion-barrier-free copper interconnects process should be developed.

© 2002 by CRC Press LLC

Photo resist SiN ILD (Low k) SiO 2 Si

Seed Cu layer

FIGURE 1.29

TaN

Cu

Dual damascene for Cu.

Cu dishing

if input1 = '0' then state

(T)

if any (LSB) Average

Loop

KPD of Phase

(LSB)

Detector

tin

fg 2nd order path ->

z-1 1-z -1

Filter

T(z)

KV (T/LSB)

(LSB/T)

t (k)

(T)

tout (a)

(b)

FIGURE 34.27 Linearized model: (a) second-order DPLL loop filter, (b) timing loop with phase detector modeled by its average signal gain. closed loop magnitude response 5

mag. resp. (dB)

0

-5

-10

-15

-20

0

0.001

0.002

0.003

0.004

0.005

0.006

0.007

0.008

0.009

0.01

FREQUENCY

FIGURE 34.28

Closed loop frequency response of SLT DPLL for low pg and fg update gains.

The open loop DPLL transfer function, G(z), incorporating the loop filter L(z) and clock update gain is −1

−1 fg z z −L  --------------- -------------  G ( z ) = K V 1 – z −1 + p g z 1 – z −1    

Referring to the timing loop model of Fig. 34.27(b), the closed loop transfer functiont (Tout /Tin) = H(z) is

Kp G ( z ) H ( z ) = --------------------------1 + Kp G ( z )

(34.28)

Note that Kp has dimensions of LSB/T, KV and G(z) have dimensions of T/LSB and H(z) is a transfer function with respect to two time quantities. The effective noise bandwidth is then,

ENB = 2



0.5

0

2

H ( f ) df

An example of a closed loop transfer function for the SLT DPLL is shown in Fig. 34.28 for LOW update gains. To find the effect of AWG noise, n(k), first convert the σn to an effective timing noise by dividing by the rms slope, σs, of the signal that is obtained during the numerical generation of the signal slopes

© 2002 by CRC Press LLC

and calculation of the slope generating filter coefficients. Now it can be multiplied by the square root of the ENB to determine the corresponding noise induced timing jitter σj (units of T). Therefore,

sn s j = ----- ENB ss

(34.29)

The equivalent model for the above method of analysis is shown in Fig. 34.27(b). For the SLT-based DPLL, the total jitter is simply the above σj. For the MM DPLL the phase detector output noise is colored; however, we know its properties here and can examine its effect from this point onwards. The only difference is that the closed loop transfer function seen by the MM phase detector output noise is,

G(z) F ( z ) = --------------------------1 + Kp G ( z )

(34.30)

The noise jitter is then obtained as,

2∫

sj =

0.5

0

2

P n ( f ) F ( f ) df

(34.31)

where Pn(f ) is the noise p.s.d. at the phase detector output. Figure 34.29 plots the jitter performance of the SLT- and MM-based DPLLs for three sets of (pg,fg): LOW, MED, HGH. Shown are the output, noise-induced timing jitter of the loop for four channel error event rates. Observe that the MM timing loop’s output noise jitter is almost the same but slightly better than that of the SLT-based timing loop. Jitter and BER Simulation Results Simulations on the SLT-based timing loop and the MM loop are run within the simulator framework described in Fig. 34.22. The same DPLL loop filter structure is used for both systems. Simulations are run at a channel bit density bit of 2.8 without noise and SNRs, which correspond with channel EERs of SLT / MM ANALYTICAL OUTPUT JITTER 3 SLT HGH MM HGH SLT MED MM MED SLT LOW MM LOW

2.5

T/64

2

1.5

1

0.5

-6

10

-5

-4

10

10 EER

FIGURE 34.29

Analytically calculated output jitter for SLT and MM timing loops.

© 2002 by CRC Press LLC

-3

10

TABLE 34.2 Simulation-Based Timing Loop Output Jitter σjt (Units of T/64) Performance of SLT and MM Timing Loops for Final EERs of Zero (Noiseless), −4 −2 10 , and 10 SLT pg, fg GAINS

EER 0

LOW MED HGH

0.49 0.49 0.67

EER 10

−4

1.30 1.69 2.67

MMPD EER 10

−2

2.18 2.99 4.86

EER 0 0.45 0.46 0.70

EER 10

−4

1.16 1.56 2.67

EER 10

−2

1.86 2.51 4.38

Phase Transient For Approx 0.1875T (12T/64) Samp Phase Error 5

SLT Phase (T/64)

0

-5 -10 -15 -20

300

400

500

600

700

800

900

1000

1100

1200

1300

1400

400

500

600

700

800

900

1000

1100

1200

1300

1400

MM Phase (T/64)

5

0

-5 -10 -15 -20

300

Symbols

FIGURE 34.30 simulation. −4

SLT and MM DPLL reaction to 0.1875T (12T/64) phase step. Low pg, fg gains. No noise in this

−2

10 and 10 . The steady-state jitter is examined in the DPLL output phase and the response of the timing loop to a phase step in the data field. Figure 34.30 shows the transient phase response plots of the SLT and MM DPLLs responding to a 0.1875T (12T/64) phase step in data field for the same LOW pg and fg settings. Note that they have very similar responses. Table 34.2 shows the steady-state output jitter of the two timing loops for various combinations of gains and noise levels corresponding to EERs −4 −2 of 10 and 10 . The settled DPLL phases show some nonzero jitter without additive noise from quantization effects. Timing jitter at the DPLL output is measured by measuring the standard deviation of the DPLL phase. Again, observe that the two timing loops have very similar jitter numbers although the MM timing loop jitter is slightly lower. Finally, the Viterbi detector BER performance is examined instead of the timing loop jitter performance for the read channel architecture of Fig. 34.31 employing the MM and SLT timing loops. Observe that the BERs of the two systems are practically indistinguishable. Conclusions An overview of timing recovery methods for PRML magnetic recording channels, including interpolative and traditional symbol rate VCO-based timing recovery methods, was provided. Also reviewed was the MMSE timing recovery from first principles and its formulation in the framework of a SLTbased timing gradient. A framework for analyzing the performance of the timing loops in terms of output noise jitter was provided. The jitter calculation is based on obtaining linearized Z domain closed loop transfer functions of the timing loop. Also compared was the output timing jitter, due to input noise,

© 2002 by CRC Press LLC

BER Comparison of SLT and MM Timing Loops MM SLT

-2

10

-3

BER

10

-4

10

-5

10

19

19.5

20

20.5

21

21.5

22

22.5

23

23.5

24

24.5

SNR (dB)

FIGURE 34.31

Simulated BERs of practical read channel using SLT and MM timing loops.

of the SLT and MM timing loops—two commonly used timing loops. The jitter performance of the MM loop is almost the same but very slightly better than that obtained with the SLT-based timing loop; however, the Viterbi BER performance of read channel systems employing the two timing loops are practically indistinguishable.

References 1. A. Oppenheim and R. Schafer, Discrete Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1989. 2. K. Fisher, W. Abbott, J. Sonntag, and R. Nesin, “PRML detection boosts hard-disk drive capacity” IEEE Spectrum, pp. 70–76, November, 1996. 3. M. E. Van Valkenburg, Analog Filter Design, Holt Rinehart Winston, 1982. 4. R. Schaumann, M. Ghausi, and K. Laker, Design of Analog Filters, Prentice-Hall, Englewood Cliffs, NJ, 1990. 5. J. Park and L. R. Carley, “Analog complex graphic equalizer for EPR4 channels,” IEEE Transactions on Magnetics, pp. 2785–2787, September, 1997. 6. A. Bishop, et al., “A 300 Mb/s BiCMOS disk drive channel with adaptive analog equalizer,” Digests, Int. Solid State Circuits Conf., pp. 46–47, 1999. 7. P. Pai, A. Brewster, and A. Abidi, “Analog front-end architectures for high speed PRML magnetic recording channels,” IEEE Transactions on Magnetics, pp. 1103–1108, March 1995. 8. R. Cideciyan and F. Dolivo, et al., “A PRML system for digital magnetic recording,” IEEE Journal on Selected Areas in Communications, pp. 38–56, January, 1992. 9. G. Mathew, et al., “Design of analog equalizers for partial response detection in magnetic recording,” IEEE Transactions on Magnetics, pp. 2098–2107. 10. S. Qureshi, “Adaptive equalization,” in Proceedings of the IEEE, September, 1973, pp. 1349–1387. 11. S. Haykin, Communication Systems, John Wiley & Sons, New York, 1992, pp. 487–497. 12. S. Haykin, Adaptive Filter Theory, Prentice-Hall, 1996. 13. J. Bergmans, Digital Baseband Transmission and Recording, Kluwer Academic Publishers, Dordrecht, the Netherlands, 1996. 14. P. Aziz and J. Sonntag, “Equalizer architecture tradeoffs for magnetic recording channels,” IEEE Transactions on Magnetics, pp. 2728–2730, September, 1997. 15. L. Du, M. Spurbeck, and R. Behrens, “A linearly constrained adaptive FIR filter for hard disk drive read channels,” in Proceedings, IEEE Int. Conf. on Communications, pp. 1613–1617.

© 2002 by CRC Press LLC

16. J. Sonntag, et al., “A high speed low power PRML read channel device,” IEEE Transactions on Magnetics, pp. 1186–1189, March, 1995. 17. D. Welland, et al., “Implementation of a digital read/write channel with EEPR4 detection,” IEEE Transactions on Magnetics, pp. 1180–1185, March 1995. 18. G. Tuttle, et al., “A 130 Mb/s PRML read/write channel with digital-servo detection,” Digests, IEEE Int. Solid-State Circuits Conf., 1996. 19. J. Chern, et al., “An EPRML digital read/write channel IC,” Digests, IEEE Int. Solid State Circuits Conf., 1997. 20. N. Nazari, “A 500 Mb/s disk drive read channel in 0.25 µm CMOS incorporating programmable noise predictive Viterbi detection and trellis coding,” Digests, Intl. Solid-State Circuits Conf., pp. 78–79, 2000. 21. M. Spurbeck and R. Behrens, “Interpolated timing recovery for hard disk drive read channels,” in Proceedings, IEEE Int. Conf. on Communications, 1997, pp. 1618–1624. 22. Z. Wu and J. Cioffi, “A MMSE interpolated timing recovery scheme for the magnetic recording channel”, in Proceedings, IEEE Int. Conf. on Communications, 1997, pp. 1625–1629. 23. A. Patapoutian “On phase-locked loops and Kalman filters,” IEEE Transactions on Communications, pp. 670–672, May, 1999. 24. K. Mueller and M. Muller, “Timing recovery in digital synchronous data receivers,” IEEE Transactions on Communications, pp. 516–531, May, 1976. 25. F. Dolivo, W. Schott, and G. Ungerbock, “Fast timing recovery for partial response signaling systems,” IEEE Conf. on Communications, pp. 18.5.1–18.5.4, 1989. 26. S. Qureshi, “Timing recovery for equalized partial-response systems,” IEEE Transactions on Communications, pp. 1326–1331, December, 1976. 27. H. Shafiee, “Timing recovery for sampling detectors in digital magnetic recording,” IEEE Conf. on Communications, pp. 577–581, 1996. 28. J. Bergmans, “Digital baseband transmission and recording,” Kluwer Academic Publishers, Dordrecht, the Netherlands, pp. 500–513, 1996. 29. P. Aziz and S. Surendran “Symbol rate timing recovery for higher order partial response channels,” IEEE Journal on Selected Areas in Communications, April, 2001 (to appear).

34.4 Head Position Sensing in Disk Drives Ara Patapoutian Introduction Data in a disk drive is stored in concentric tracks on one or more disk platters. As the disks spin, a magnetic transducer known as a read/write head, transfers information between the disks and a user [1]. When the user wants to access a given track, the head assembly moves the read/write head to the appropriate location. This positioning of the head is achieved by use of a feedback servo system as shown in Fig. 34.32. First, a position sensor generates a noisy estimate of the head location. Then by comparing the difference between this estimate and the desired location, a controller is able to generate a signal to adjust the actuator accordingly. Two known approaches are used in sensing the head position. An external optical device can be used to estimate the head position by emitting a laser light and then by measuring the reflected beam. This approach is relatively expensive, may need frequent calibrations, and at present is limited to servo writers, which are discussed later. In the second approach, a read head, which is designed primarily to detect the recorded user data pattern, will itself sense position specific magnetic marks recorded on a disk surface. Using statistical signal-processing techniques, the read waveform is decoded into a head position estimate. At present this second approach is preferred for disk drives and is the topic of this article. In an embedded servo scheme, as shown in Fig. 34.33, a portion of each platter, which is divided into multiple wedges, is reserved to provide radial and sometimes angular position information for the read

© 2002 by CRC Press LLC

position sensor

read/write head

desired position actuator track

controller

spinning disk

FIGURE 34.32

Position control loop for a disk drive.

data field servo field

a wedge

FIGURE 34.33

a disk platter

Data and servo fields on a disk drive. digital field

preamble

FIGURE 34.34

address mark

track address

wedge address

servo bursts

A generic composition of a servo field.

head. These reserved wedges are known as servo fields, and the number of wedges per surface varies significantly amongst different products. A generic servo wedge provides radial estimates in two steps. On a disk surface, each track is assigned a number known as the track address. These addresses are included in a servo field, providing complete radial position information with accuracy of up to a track. In other words, the information provided by a track address is complete but coarse. The positional error signal (PES) complements the track address by providing a more accurate estimate within a track. By combining these two estimates, a complete and accurate head position estimate can be obtained. A wedge may also contain coarse information regarding angular position if a specific address is assigned to each wedge. The user data field, with its own address mark and timing capability, can complement the wedge address by providing finer angular position estimates. A typical wedge will have multiple sub-fields, as shown in Fig. 34.34. A periodic waveform, known as a preamble, provides ample information to calibrate the amplitude of the waveform and, if necessary, to acquire the timing of the recorded pattern. Frame synchronization, or the start of a wedge, is recognized

© 2002 by CRC Press LLC

by a special sequence known as the address mark. This is followed by the track and wedge addresses, and finally by the servo burst that provides information regarding the PES. These multiple sub-fields can be divided into two distinct areas. Since the address mark, track address, and wedge address are all encoded as binary strings, they are referred to as the digital field, as shown in Fig. 34.34. By contrast, ignoring quantization effects of the read channel, the periodic servo burst field is decoded to a real number representing the analog radial position. Thus, the format designs as well as the demodulation techniques for the digital and burst fields are fundamentally different. The digital field demodulator is known as the detector while the servo burst field demodulator is known as the estimator. Despite their differences, the two fields are not typically designed independently of each other. For example, having common sample rates and common front-end hardware simplifies the receiver architecture significantly. Furthermore, it makes sense to use coherent or synchronous detection algorithms with coherent estimation algorithms and vice versa. Having a reserved servo field comes at the expense of user data capacity. A major optimization goal is to minimize the servo field overhead for a given cost and reliability target. Both the servo format design as well as that of the detectors/estimators in the read channel chip of a disk drive are optimized to minimize this overhead. This chapter section reviews position sensing formats and demodulators. Because estimation and detection are well-known subjects, presented in multiple textbooks [2,3], issues that are particular to disk drive position sensors are emphasized. Furthermore, rather than the servo control loop design, the statistical signal processing aspects of position sensing are presented. For a general introduction to disk drive servo control design, the reader is referred to [4], where the design of a disk drive servo is presented as a case study of a control design problem. In general, because of the proprietary nature of this technology, the literature regarding head position sensing is limited to a relatively few published articles, with the exception of patents. When a disk drive is first assembled in a factory, the servo fields have to somehow be recorded on the disk platters. Once a drive leaves the factory, these fields will only be read and never rewritten. Traditionally, an expensive external device, known as the servo writer, introduced in the subsection on “Servo Writers,” records the servo fields. In general, the servo writing process constrains and affects the servo field format choices as well as the demodulator performance. In the next section, the digital field format and detection approaches are addressed, while in the subsection on “The Burst Field,” the servo burst format and PES estimation approaches are introduced.

Servo Writers After a disk drive is assembled, the function of a servo writer is to record the servo wedges on a drive. While the disks are spinning, an external servo writer senses the radial position usually through the head assembly using the reflection of a laser beam. An external mechanical device moves the head assembly. Finally, an external read head locks on a clocking track on a disk to locate the angular position. By knowing both the radial and angular position, as well as controlling the radial position, the servo writer records the wedges, track by track, using the native head of the drive. Servo writing has been an expensive process. The servo writing time per disk is an important interval that disk manufacturers try to minimize and is proportional to the number of tracks per disk surface, to the spin of the disk drive, and to the number of passes needed to record a track. Since the number of tracks per disk is increasing faster than the spin speed, the servo writer time becomes a parameter that needs to be contained. To this end, the disk drive industry has attempted to minimize both servo writer time and the servo writer cost. Self-servo writing is a procedure where the wedges are written by the disk drive itself without using any external device [5,6]. Here, the servo writing time is increased but the process is less costly. Many hybrid proposals also use a combination of an external servo writer to record some initial marks and then complete the wedges by using the drive itself. An example of such a process is the printed media approach [7,8], where a master disk “stamps” each disk, and afterward the drive completes writing the wedges.

© 2002 by CRC Press LLC

transition

track n erase band track n +1 radial incoherence

FIGURE 34.35

Servo writer impairments: erase bands and radial incoherence.

In general, the servo writer cannot record an arbitrary wedge format. For example, it is very difficult for a servo writer that records wedges track-by-track to record a smooth angled transition across the radius. Furthermore, the wedges generated by a servo writer are not ideal. For example, servo writers that record wedges track-by-track create an erase band between tracks [9], where due to head and disk media characteristics, no transition is recorded in a narrow band between two adjacent tracks. Similarly, because of uncertainties in the angular position, two written tracks may not be aligned properly causing radial incoherence between adjacent tracks. These two impairments are illustrated in Fig. 34.35. In summary, the servo writer architecture affects both the wedge format design as well as the demodulator performance of a disk drive sensor.

The Digital Field The digital servo field has many similarities to the disk drive user data field [10] and to a digital communications system [2]. Each track in a digital field is encoded and recorded as a binary string similar to a data field. What differentiates a digital servo field from others is its short block length, and more importantly its off-track detection requirement. Before discussing these differences, let us start by saying that a magnetic recording channel, for both the data and servo digital fields, is an intersymbol interference (ISI) channel. When read back, the information recorded in a given location modifies the waveform not only at that given location but also in the neighboring locations. Finite length ISI channels can be optimally detected using sequence detectors [11], where at least theoretically, all the field samples are observed before detecting them as a string of ones and zeros. For about a decade now, such sequence detectors have been employed in disk drives to detect the user data field. The digital servo field length is very short relative to a data field. The present data sector length is around 512 bytes long, whereas the servo digital information string is only a few bytes long. So, whereas the percentage of overhead attributable to a preamble, address marks, and error correcting codes (ECC) is relatively small compared to the user data field, the overhead associated with a digital servo field can easily exceed one hundred percent. For example, it is well known that ECC coding efficiency increases with block length, i.e. codes with very short block lengths have weak error correction capability. One strategy in minimizing the preamble field length is to use asynchronous detection, which usually trades performance for format, since it does not require exact timing information. A simple strategy in minimizing the digital field is to write only partial information per wedge [12]. For example, with a smart controller, the absolute track or wedge address may not be needed, since it may be predicted using a state machine; however, such strategies improve format efficiency at the expense of performance robustness and complexity. Offtrack Detection A primary factor that differentiates digital servo fields from other types of detection channels is the requirement to detect data reliably at any radial position, even when the read head is between two adjacent tracks. In contrast, a user data field is expected to be read reliably only if the read head is directly above

© 2002 by CRC Press LLC

FIGURE 34.36 location.

X

+

-

+

+

+

-

Y

+

-

-

+

+

-

An example of two Gray-coded track addresses. The two addresses are different only in the third

that specific track. As will be discussed shortly, such a constraint influences the ECC as well as sequence detection strategies. A related concern is the presence of radial incoherence, and the erase field introduced during servo writing that are present when the read head straddles two tracks. The detector performance will suffer from such impairments. Formats that tolerate such impairments are desired. Because the recorded address mark and wedge address does not vary from one track to the next, the emphasis is on track addresses. When the read head is in the middle of two adjacent tracks, with track addresses X and Y, the read waveform is the superposition of the waveforms generated from each of the respective addresses. In general, the resulting waveform cannot be decoded reliably to any one of the two track addresses. A common solution is the use of a Gray code to encode track addresses, as shown in Fig. 34.36, where any two adjacent tracks differ in their binary address representation in only one symbol value. Hence, for the moment ignoring ISI, when the head is midway between adjacent tracks, the detector will decode the address bits correctly except for the bit location where the two adjacent tracks differ, that is, for the two track addresses labeled as X and Y, the decoder will decode the waveform to either track address X or Y, introducing an error of at most one track. By designing a radially periodic servo burst field, with period of at least two track widths, track number ambiguity generated by track addresses is resolved; however, as will be discussed next, Gray codes complicate the use of ECC codes and sequence detectors. A Gray code restricts two adjacent tracks to differ in only a single position, or equivalently forcing the Hamming distance between two adjacent track addresses to be one. Adding an ECC field to the digital fields is desirable since reliable detection of track addresses is needed in the presence of miscellaneous impairments such as electronic and disk media noise, radial incoherence, erase bands, etc.; however, any ECC has a minimum Hamming distance larger than one. That is, it is not possible to have two adjacent track-addresses be Gray and ECC encoded simultaneously. If an ECC field is appended to each track address, it can be used only when the head is directly above a track. A possible alternative is to write the track addresses multiple times with varying radial shifts so that, at any position, the head is mostly directly above a track address [13]. Such a solution improves reliability at the expense of significant format efficiency loss. Another complication of introducing Gray codes results from the interaction of these codes with the ISI channel. Consider an ISI free channel where the magnetic transitions are written ideally and where the read head is allowed to be anywhere between two adjacent Gray coded track addresses X and Y. As was discussed earlier, the track address reliability, or the probability that the decoded address is neither X nor Y, is independent of the read head position. Next, it is shown that for an ISI channel the detector performance depends on the radial position. In particular, consider the simple ISI channel with pulse response 1 − D, which approximates a magnetic recording channel. For such a channel, events of length two are almost as probable as errors of length one (same distance but different number of neighbors). 
Now, as the head moves from track X to Y, the waveform modification introduced by addresses X and Y, at that one location where the two tracks differ, can trigger an error of length two. The detector may decode the received waveform to a different track address Z, which may lie far from addresses X or Y. In other words, in an ISI channel, whenever the head is between two tracks X and Y, the probability that the received waveform is decoded to some other address Z increases.

© 2002 by CRC Press LLC

Z

Z 2dh dh dh

dh

dh

X

Y (a)

FIGURE 34.37

X

3

2dh

Y (b)

Signal space representation of three codewords. Configuration (a) ISI free (b) with ISI.

For readers familiar with signal space representation of codewords, the ISI free and 1 + D channels are shown for the three addresses X, Y, and Z with their decision regions in Fig. 34.37. Let dh denote to the shortest distance from codeword X to the decision boundaries of codeword Z. As shown in Fig. 34.37, when the head is midway between tracks X and Y, the shortest distance to cross the decision boundaries of codeword Z is reduced by a factor of 3 (or 4.77 dB). Therefore, when the head is in the middle of two tracks, represented by addresses X and Y, the probability that the decoded codeword is Z increases significantly. For an arbitrary ISI channel this reduction factor in shortest distance varies, and it can be shown to be at most 3. A trivial solution to address both the ECC and ISI complications introduced by the Gray code is not to use any coding and to write address bits far enough from each other to be able to ignore ISI effects. Then a simple symbol-by-symbol detector is sufficient to detect the address without the need for a sequence detector. Actually this is a common approach taken in many disk drive designs; however, dropping ECC capability affects reliability and forcing the magnetic recording channel to behave as ISI free requires additional format. Another approach is to use symbol based codes, such as a bi-phase code, rather than sequence-based codes, that is, rather than maximizing the minimum distance between any two codewords, the distance between two symbols is maximized. For example, in a magnetic recording channel, a bi-phase code produces a positive pulse at the middle of the symbol for a symbol “1” and a negative pulse for a symbol “0,” increasing symbol reliability [13,14]. In this example, it can be shown that the ISI related degradations are minimized and the detector performance is improved. A fundamentally different approach would not make use of a Gray code at all. Instead, codes would be designed from scratch in such a way that for any two addresses X and Y the distance between X and Y would increase, as they are radially located further away from each other.

The Burst Field In the previous subsection, track addresses were introduced, which provide head position information to about single-track accuracy. To be able to read the data field reliably, it is essential to position the read head directly upon the desired track within a small fraction of a track. To this end the track number addresses are complemented with the servo burst field, where the analog radial position is encoded in a periodic waveform such as a sinusoid. Three ways to encode a parameter in a sinusoidal waveform are used: amplitude, phase, and frequency [3]. Servo burst fields are also periodic radially. Because the track address already provides an absolute position, such a periodicity does not create any ambiguity. In a disk platter, information is recorded in one of two stable domains. Hence, a servo burst is recorded as a periodic binary pattern. The read back waveform, at the head output, is periodic and will contain both the fundamental and higher harmonics. The sinusoidal waveform is obtained by retaining only the fundamental harmonic. For a given format budget, it is possible to maximize the power of the read back

© 2002 by CRC Press LLC

saturated region two servo bursts A and B A-B track n

radial position

A B

track n +1

linear region

read head

(a) Amplitude burst format

(b) Position error transfer function

FIGURE 34.38 The amplitude burst format and its position error transfer function as the head moves from centertrack n to center-track n + 1.

waveform by optimizing the fundamental period [15]. If recorded transitions get too close, ISI destroys most of the signal power. On the other hand, if transitions are far from each other, then the read back pulses are isolated and contain little power. In this subsection, first the impairment sources are identified. Afterward, various servo burst formats and their performances are discussed [16,17]. Finally, various estimator characteristics and options are reviewed. Impairments Here, impairments in a servo burst field are classified into three categories: servo-writer induced, read head induced, and read channel induced. Not all impairments are present in all servo burst formats. As was discussed in the subsection on “Servo Writers”, when the servo-writer records wedges track-bytrack, erase band as well as radial incoherence may be present between tracks, degrading the performance of some of the servo burst formats. Also, the duty cycle of the recorded periods may be different than the intended 50%. Finally, write process limitations result in nonideal recorded transitions. The read head element as well as the preamplifier, which magnifies the incoming signal, generate electronic noise, modeled by additive white Gaussian noise (AWGN). Also, in many situations the width of the read head element ends up, being shorter than the servo burst radial width as shown in Fig. 34.38 (a). As will be discussed shortly, for some formats, this creates saturated radial regions where the radial estimates are not reliable [9]. Finally, the rectangular approximation of the read head element shown in Fig. 34.38(a) is not accurate. More specifically, different regions of the read head may respond differently to a magnetic flux. Hence, the read head profile may be significantly different than a rectangle [18,19]. The read channel, while processing the read waveform, induces a third class of errors. Most present estimators are digitally implemented and have to cope with quantization error. If only the first harmonic of the received waveform is desired then suppressing higher harmonics may leave residues that may interact with the first harmonic inside the estimator. Furthermore, sampling a waveform with higher harmonic residues creates aliasing effects, where higher harmonics fold into the first harmonic. Many read channel estimators require that the phase, frequency, or both phase and frequency of the incoming waveform are known. Any discrepancy results in estimator degradation. Finally, estimator complexity constraints result in suboptimal estimators, further degrading the accuracy of the position estimate. Formatting Strategies At present, the amplitude servo burst format, shown in Fig. 34.38(a), is the most common format used in the disk drive industry. Depending on the radial position of the read head, the overlap between the head and the bursts A and B varies. Through this overlap, or amplitude variation, it is possible to estimate the radial position. First, the waveforms resulting from the overlap of the read head and the burst fields

© 2002 by CRC Press LLC

FIGURE 34.39 B, respectively.

track n

A

B′

track n+1

A′

B

Alternative burst formats where A′ and B′ are either orthogonal to or of opposite polarity of A and

A and B are transformed into amplitude estimates. These amplitude estimates are then subtracted from each other and scaled to get a positional estimate. As the head moves from track center n to track center (n + 1), the noiseless positional estimate, known as position error transfer function, is plotted in Fig. 34.38(b). Here, since the radial width of the servo burst is larger than the read element, any radial position falls into either the linear region, where radial estimate is accurate, or in the saturated region, where the radial estimate is not accurate [9]. One solution to withstand saturated regions is to include multiple burst pairs, such that any radial position would fall in the linear region of at least one pair of bursts. The obvious drawback of such a strategy is the additional format loss. The amplitude format just presented does not suffer from radial incoherence since two bursts are not recorded radially adjacent to each other. Because nonrecorded areas do not generate any signal, in Fig. 34.38(a) only 50% of the servo burst format is recorded with transitions or utilized. In an effort to improve the position estimate performance, the whole allocated servo area can be recorded. As a result, at least two alternative formats have emerged, both illustrated by Fig. 34.39. In the first improved format, burst A is radially surrounded by an antipodal or “opposite polarity” burst A′. For example, if burst A is recorded as ++−−++−−… then burst A′ is recorded as −−++−−++…. For readers familiar with digital communications, the difference between the amplitude and antipodal servo burst formats can be compared to the difference between on-off and antipodal signaling. In on-off signaling, a symbol “0” or “1” is transmitted while in antipodal signaling 1 or −1 is transmitted. Antipodal signaling is 6 dB more efficient than on-off signaling. Similarly, it can be shown that the antipodal servo burst format gives a 6-dB advantage with respect to amplitude servo burst format under the AWGN assumption [17]. Instead of recording A′ to be the opposite polarity of A, another alternative is to record a pattern A′ that is orthogonal to A. For example, it is possible to pick up two sinusoids with different frequencies such that the two waveforms are orthogonal over a finite burst length interval. The resulting format is known as the dual frequency format [20]. Inside the read channel, two independent estimates of the head position can be obtained from two estimators, each tuned to one of the two frequencies. The final radial estimate is the average of the two estimates, resulting in a 3-dB improvement with respect to the amplitude format, again under AWGN assumption. Unlike the amplitude format, these more sophisticated formats are in general more sensitive to other impairments such as erase band and radial incoherence. A fundamentally different format is presented in Fig. 34.40. Here, the transitions are skewed and the periodic pattern gradually shifts in the angular direction as the radius changes. The radial information is stored in the phase of the period, so it is called the phase format. In Fig. 34.40 two burst fields A and B are presented where the transition slopes have the same magnitude but opposite polarities. An estimator makes two phase estimates, one from the sinusoid in field A and another one from the sinusoid in field B. By subtracting the second phase estimate from the first, and then by scaling the result, the radial position estimate can be obtained. 
Similar to the antipodal format, it can be shown that the phase pattern is 6 dB superior to the amplitude pattern [17] under AWGN. A major challenge for the phase format is successfully recording the skewed transitions on a disk platter without significant presence of radial incoherence and erase band.

© 2002 by CRC Press LLC

A

radial period

FIGURE 34.40

++ +

B

skewed transitions

read head

The phase format.

Position Estimators Estimating various parameters of a sinusoid is well documented in textbooks [3]. A decade ago position estimators were mostly implemented by analog circuitry, whereas at present, digital implementation is the norm and the one considered in this article [21–25]. One way of classifying estimators is to determine whether the phase and/or the frequency of the incoming waveform are known. Assume that the amplitude of a noisy sinusoid needs to be determined. If the phase of this waveform is known, a matched filter can be used to generate the amplitude estimate. This is known as coherent estimation. Under certain assumptions and performance criteria such a filter becomes optimal. When the phase of the waveform is not known, but the frequency is known, then two matched filters can be used, one tuned to a sine waveform while the other filter is tuned to a cosine waveform. The outputs of the two filters are squared and added to give the energy estimate of the waveform. This is known as noncoherent estimation and is equivalent to computing the Fourier transform at the first harmonic. Other ad hoc estimators include the peak estimator and digital area estimators [26], which respectively estimate the averaged peak and the mean value of the unsigned waveform. Neither of these estimators requires the phase or the frequency of the waveform. For the amplitude format, all the estimators mentioned here can be used. For the antipodal format, the phase of the waveform is needed and therefore a single matched filter is the required estimator. For dual frequency format, we need two estimators, each tuned to a different frequency. Since the two waveforms are orthogonal to each other, an estimator tuned to one of the waveforms will not observe the other waveform. Each estimator can utilize a single matched filter for coherent estimation or two matched filters for noncoherent estimation. Finally, for phase estimation, two matched filters are utilized, similar to noncoherent estimation; however, rather than squaring and adding the filter outputs, the inverse tangent function is performed on the ratio of the filter outputs.

References 1. Comstock, R.L. and Workman, M.L., Data storage in rigid disks, in Magnetic Storage Handbook, 2nd ed., Mee, C.D. and Daniel, E.D., Eds., McGraw-Hill, New York, 1996, chap. 2. 2. Proakis, J.G., Digital Communications, 4th ed., McGraw-Hill, New York, 2000. 3. Kay, S.M., Fundamentals of Statistical Signal Processing: Estimation Theory, Prentice-Hall, Englewood Cliffs, NJ, 1993. 4. Franklin, G.F., Powell, D.J., and Workman, M.L., Digital control of dynamic systems, 3rd ed., AddisonWesley, Reading, MA, 1997, chap. 14. 5. Brown, D.H., et al., Self-servo writing file, US patent 06,040,955, 2000. 6. Liu, B., Hu, S.B., and Chen, Q. S., A novel method for reduction of the cross track profile asymmetry of MR head during self servo-writing, IEEE Trans. on Mag., 34, 1901, 1998. 7. Bernard, W.R. and Buslik, W.S., Magnetic pattern recording, U.S. patent 03,869,711, 1975.

© 2002 by CRC Press LLC

8. Tanaka, S., et al., Characterization of magnetizing process for pre-embossed servo pattern of plastic hard disks, IEEE Trans. on Mag., 30, 4209, 1994. 9. Ho, H.T. and Doan, T., Distortion effects on servo position error transfer function, IEEE Trans. on Mag., 33, 2569, 1997. 10. Bergmans, J.W.M., Digital Baseband Transmission and Recording, Kluwar Academic Publishers, Dordrecht, 1996. 11. Forney, G.D., Maximum-likelihood sequence estimation of digital sequences in the presence of intersymbol interference, IEEE Trans. on Info. Thy., 18, 363, 1972. 12. Chevalier, D., Servo pattern for location and positioning of information in a disk drive, U.S. patent 05,253,131, 1993. 13. Leis, M.D., et al., Synchronous detection of wide bi-phase coded servo information for disk drive, U.S. patent 05,862,005, 1999. 14. Patapoutian, A., Vea, M.P., and Hung, N.C., Wide biphase digital servo information detection, and estimation for disk drive using servo Viterbi detector, U.S. patent 05,661,760, 1997. 15. Patapoutian, A., Optimal burst frequency derivation for head positioning, IEEE Trans. on Mag., 32, 3899, 1996. 16. Sacks, A.H., Position signal generation in magnetic disk drives, Ph.D. dissertation, Carnegie Mellon University, Pittsburgh, 1995. 17. Patapoutian, A., Signal space analysis of head positioning formats, IEEE Trans. on Mag., 33, 2412, 1997. 18. Cahalan, D. and Chopra, K., Effects of MR head track profile characteristics on servo performance, IEEE Trans. on Mag., 30, 4203, 1994. 19. Sacks, A.H. and Messner, W.C., MR head effects on PES generation: simulation and experiment, IEEE Trans. on Mag., 32, 1773, 1996. 20. Cheung, W.L., Digital demodulation of a complementary two-frequency servo PES pattern, U.S. patent 06,025,970, 2000. 21. Tuttle, G.T., et al., A 130 Mb/s PRML read/write channel with digital-servo detection, Proc. IEEE ISSCC’96, 64, 1996. 22. Fredrickson, L., et al., Digital servo processing in the Venus PRML read/write channel, IEEE Trans. on Mag., 33, 2616, 1997. 23. Yada, H. and Takeda, T., A coherent maximum-likelihood, head position estimator for PERM disk drives, IEEE Trans. on Mag., 32, 1867, 1996. 24. Kimura, H., et al., A digital servo architecture with 8.8 bit resolution of position error signal for disk drives, IEEE Globecom’97, 1268, 1997. 25. Patapoutian, A., Analog-to-digital converter algorithms for position error signal estimators, IEEE Trans. on Mag., 36, 395, 2000. 26. Reed, D.E., et al., Digital servo demodulation in a digital read channel, IEEE Trans. on Mag., 34, 13, 1998.

34.5 Modulation Codes for Storage Systems ∨

Brian Marcus and Emina Soljanin Introduction Modulation codes are used to constrain the individual sequences that are recorded in data storage channels, such as magnetic or optical disk or tape drives. The constraints are imposed in order to improve the detection capabilities of the system. Perhaps the most widely known constraints are the runlength limited (RLL(d,k)) constraints, in which ones are required to be separated by at least d and no more than k zeros. Such constraints are useful in data recording channels that employ peak detection: waveform peaks, corresponding to data ones, are detected independently of one another. The d-constraint helps

© 2002 by CRC Press LLC

increase linear density while mitigating intersymbol interference, and the k-constraint helps provide feedback for timing and gain control. Peak detection was widely used until the early 1990s. Although it is still used today in some magnetic tape drives and some optical recording devices, most high density magnetic disk drives now use a form of maximum likelihood (Viterbi) sequence detection. The data recording channel is modeled as a linear, discrete-time, communications channel with inter-symbol interference (ISI), described by its transfer N function and white Gaussian noise. The transfer function is often given by h(D) = (1 − D)(1 + D) , where N depends on and increases with the linear recording density. Broadly speaking, two classes of constraints are of interest in today’s high density recording channels: (1) constraints for improving timing and gain control and simplifying the design of the Viterbi detector for the channel and (2) constraints for improving noise immunity. Some constraints serve both purposes. Constraints in the first class usually take the form of a PRML (G, I) constraint: the maximum run of zeros is G and the maximum run of zeros, within each of the two substrings defined by the even indices and odd indices, is I. The G-constraint plays the same role as the k-constraint in peak detection, while the I-constraint enables the Viterbi detector to work well within practical limits of memory. Constraints in the second class eliminate some of the possible recorded sequences in order to increase the minimum distance between those that remain or eliminate the possibility of certain dominant error events. This general goal does not specify how the constraints should be defined, but many such constraints have been constructed; see [20] and the references therein for a variety of examples. Bounds on the capacities of constraints that avoid a given set of error events have been given in [26]. Until recently, the only known constraints of this type were the matched-spectral-null (MSN) constraints. They describe sequences whose spectral nulls match those of the channel and therefore increase its minimum distance. For example, a set of DC-balanced sequences (i.e., sequences of ±1 whose accumulated digital sums are bounded) is an MSN constraint for the channel with transfer function h(D) = 1 − D, which doubles its minimum distance [18]. During the past few years, significant progress has been made in defining high capacity distance enhancing constraints for high density magnetic recording channels. One of the earliest examples of such a constraint is the maximum transition run (MTR) constraint [28], which constrains the maximum run of ones. We explain the main idea behind this type of distance-enhancing codes in the subsection on “Constraints for ISI Channels.” Another approach to eliminating problematic error events is that of parity coding. Here, a few bits of parity are appended to (or inserted in) each block of some large size, typically 100 bits. For some of the most common error events, any single occurrence in each block can be eliminated. In this way, a more limited immunity against noise can be achieved with less coding overhead [5]. 
Coding for more realistic recording channel models that include colored noise and intertrack interference are discussed in the subsection on “Channels with Colored Noise and Intertrack Interference.” The authors point out that different constraints, which avoid the same prescribed set of differences, may have different performance on more realistic channels. This makes some of them more attractive for implementation. For a more complete introduction to this subject, the reader is referred to any one of the many expository treatments, such as [16,17,24].

Constrained Systems and Codes Modulation codes used in almost all contemporary storage products belong to the class of constrained codes. These codes encode random input sequences to sequences that obey the constraint of a labeled directed graph with a finite number of states and edges. The set of corresponding constrained sequences is obtained by reading the labels of paths through the graph. Sets of such sequences are called constrained systems or constraints. Figures 34.41 and 34.42 depict graph representations of an RLL constraint and a DC-balanced constraint. Of special interest are those constraints that do not contain (globally or at certain positions) a finite number of finite length strings. These systems are called systems of finite type (FT). An FT system X

© 2002 by CRC Press LLC

FIGURE 34.41

RLL (1,3) constraint.

FIGURE 34.42

DC-balanced constraint.

over alphabet A can always be characterized by a finite list of forbidden strings F = {w1,…, wN} of symbols A in A. Defined this way, FT systems will be denoted by X F . The RLL constraints form a prominent class of FT constraints, while DC-balanced constraints are typically not FT. Design of constrained codes begins with identifying constraints, such as those described in the Introduction, that achieve certain objectives. Once the system of constrained sequences is specified, information bits are translated into sequences that obey the constraints via an encoder, which usually has the form of a finite-state machine. The actual set of sequences produced by the encoder is called a constrained code and is often denoted C. A decoder recovers user sequences from constrained sequences. While the decoder is also implemented as a finite-state machine, it is usually required to have a stronger property, called sliding-block decodablility, which controls error propagation [24]. The maximum rate of a constrained code is determined by Shannon capacity. The Shannon capacity or simply capacity of a constrained system, denoted by C, is defined as

log 2 N ( n ) C = lim --------------------n n→ ∞

where N(n) is the number of sequences of length n. The capacity of a constrained system represented by a graph G can be easily computed from the adjacency matrix (or state transition matrix) of G (provided that the labeling of G satisfies some mildly innocent properties). The adjacency matrix of G with r states and aij edges from state i to state j, 1 ≤ i, j ≤ r, is the r × r matrix A = A(G) = {aij}r ×r . The Shannon capacity of the constraint is given by

C = log 2 λ ( A ) where λ(A) is the largest real eigenvalue of A. The state-splitting algorithm [1] (see also [24]) gives a general procedure for constructing constrained codes at any rate up to capacity. In this algorithm, one starts with a graph representation of the desired constraint and then transforms it into an encoder via various graph-theoretic operations including splitting and merging of states. Given a desired constraint and a desired rate p/q ≤ C, one or more rounds of state splitting are performed; the determination of which states to split and how to split them is q p governed by an approximate eigenvector, i.e., a vector x satisfying A x ≥ 2 x. Many other very important and interesting approaches are used to constrained code construction—far too many to mention here. One approach combines state-splitting with look-ahead encoding to obtain a very powerful technique which yields superb codes [14]. Another approach involves variable-length and time-varying variations of these techniques [2,13]. Many other effective coding constructions are described in the monograph [17].

© 2002 by CRC Press LLC

FIGURE 34.43

Two equivalent constraints: (a) F = {11} NRZI and (b) F = {101, 010} NRZ.

For high capacity constraints, graph transforming techniques, such as the state-splitting algorithm, may result in encoder/decoder architectures with formidable complexity. Fortunately, a block encoder/ decoder architecture with acceptable implementation complexity for many constraints can be designed by well-known enumerative [6], and other combinatorial [32] as well as heuristic techniques [25]. Translation of constrained sequences into the channel sequences depends on the modulation method. Saturation recording of binary information on a magnetic medium is accomplished by converting an input stream of data into a spatial stream of bit cells along a track where each cell is fully magnetized in one of two possible directions, denoted by 0 and 1. Two important modulation methods are commonly used on magnetic recording channels: non-return-to-zero (NRZ) and modified non-return-to-zero (NRZI). In NRZ modulation, the binary digits 0 and 1 in the input data stream correspond to 0 and 1 directions of cell magnetizations, respectively. In NRZI modulation, the binary digit 1 corresponds to a magnetic transition between two bit cells, and the binary digit 0 corresponds to no transition. For example, the channel constraint which forbids transitions in two neighboring bit cells, can be accomplished by either F = {11} NRZI constraint or F = {101, 010} NRZ constraint. The graph representation of these two constraints is shown in Fig. 34.43. The NRZI representation is, in this case, simpler.

Constraints for ISI Channels This subsection discusses a class of codes known as codes, which avoid specified differences. This is the only class of distance enhancing codes used in commercial magnetic recording systems. Two main reasons for this are: these codes simplify the channel detectors relative to the uncoded channel and even high rate codes in this class can be realized by low complexity encoders and decoders. Requirements A number of papers have proposed using constrained codes to provide coding gain on channels with high ISI (see, for example, [4,10,20,28]). The main idea of this approach can be described as follows [20]. Consider a discrete-time model for the magnetic recording channel with possibly constrained input ∞ a = {an} C ∈ {0,1} , impulse response {hn}, and output y = {yn} given by

yn = n

3

∑a m

h

m n−m

+ ηn

(34.32)

2

n

4

3

where h(D) = ∑ nhn D = (1 − D)(1 + D) (E PR4) or h(D) = ∑nhn D = (1 − D)(1 + D) (E PR4), ηn are 2 2 independent Gaussian random variables with zero mean and variance σ . The quantity 1/σ is referred to as the signal-to-noise ratio (SNR). The minimum distance of the uncoded channel (34.32) is 2 d min = ⑀min (D) ≠ 0 h ( D ) ⑀ ( D )

© 2002 by CRC Press LLC

2

i−l

where ⑀ (D) = Σ i=0 ⑀iD ,(⑀i ∈{−1,0,1}, ⑀0 = 1,⑀l−1 ≠ 0) is the polynomial corresponding to a normalized l−1 input error sequence ⑀ = { ⑀ i } i=0 of length l, and the squared norm of a polynomial is defined as the sum 2 of its squared coefficients. The minimum distance is bounded from above by ||h(D)|| , denoted by i

2

d MFB = h ( D )

2

(34.33)

This bound is known as the matched-filter bound (MFB) and is achieved when the error sequence of length l = 1, i.e., ⑀ (D) = 1, is in the set

min arg ⑀ ( D )≠0 h ( D ) ⑀ ( D )

2

(34.34)

2

2

For channels that fail to achieve the MFB, i.e., for which d min < ||h(D)|| , any error sequences ⑀(D) for which 2

2

d min ≤ h ( D ) ⑀ ( D ) < h ( D )

2

(34.35)

{ −1,0,1 }

are of length l ≥ 2 and may belong to a constrained system X L , where L is an appropriately chosen finite list of forbidden strings. For code C, the set of all admissible nonzero error sequences is written as ∞

E ( C ) = { ⑀ ∈ { – 1,0,1 } ⑀ ≠ 0, ⑀ = ( a – b ),a,b ∈ C } { −1,0,1 }

Given the condition E ( C ) ⊆ X L {0,1} can be identified so that

, the least restrictive finite collection F of blocks over the alphabet

{ 0,1 }

C ⊆ XF

{ −1,0,1 }

⇒ E ( C ) ⊆ XL

(34.36)

Definitions A constrained code is defined by specifying F, the list of forbidden strings for code sequences. Prior to that one needs to first characterize error sequences that satisfy (34.35) and then specify L, the list of forbidden strings for error sequences. Error event characterization can be done by using any of the methods described by Karabed, Siegel, and Soljanin in [20]. Specification of L is usually straightforward. A natural way to construct a collection F of blocks forbidden in code sequences based on the collection L of blocks forbidden in error sequences is the following. From the above definition of error sequences  = {⑀i} we see that ⑀i = 1 requires ai = 1 and ⑀i = −1 requires ai = 0, i.e., ai = (1 + ⑀i)/2. For each block wE ∈ L, construct a list F wE of blocks of the same length l according to the rule: l

i

i

i

F wE = { w C ∈ { 0,1 } w C = ( 1 + w E )/2 for all i for which w E ≠ 0 }. Then the collection F obtained as F =  wE∈L F wE satisfies requirement (34.36); however, the con{ 0,1 } strained system X F obtained this way may not be the most efficient. (Bounds on the achievable rates of codes which avoid specified differences were found recently in [26].) 2 The previous ideas are illustrated in the example of the E PR4 channel. Its transfer function is h(D) = 3 3 2 2 (1 − D)(1 + D) , and its MFB is ||(1 − D)(1 + D) ⋅ 1|| = 10. The error polynomial ⑀(D) = 1 − D + D is 3 2 the unique error polynomial for which ||(1 − D)(1 + D) ⑀(D)|| = 6, and the error polynomials ⑀(D) = i i l−1 2 5 6 7 1 − D + D + D − D + D and ⑀ ( D ) = Σ i=0 ( – 1 ) D for l ≥ 4 are the only polynomials for which 3 2 ||(1 − D)(1 + D) ⑀(D)|| = 8 (see, for example, [20]).

© 2002 by CRC Press LLC

FIGURE 34.44

Possible pairs of sequences for which error event + − +00 may occur.

It is easy to show that these error events are not in the constrained error set defined by the list of forbidden error strings L = {+ − + 00, + − + −}, where + denotes 1 and − denotes −1. To see this, note that an 2 error sequence that does not contain the string + − + 00 cannot have error polynomials ⑀(D) = 1 − D + D 2 5 6 7 or ε(D) = 1 − D + D + D − D + D , while an error sequence that does not contain string + − + − cannot i i l−1 have an error polynomial of the form ⑀ ( D ) = Σ i=0 ( – 1 ) D for l ≥ 4. Therefore, by the above procedure of defining the list of forbidden ode strings, we obtain the F = {+ − +} NRZ constraint. Its capacity is about 0.81, and a rate 4/5 c code into the constraint was first given in [19]. In [20], the following approach was used to obtain several higher rate constraints. For each of the error strings in L, we write all pairs of channel strings whose difference is the error string. To define F, look for the longest string(s) appearing in at least one of the strings in each channel pair. For the example above and the + − + 00 error string, a case-by-case analysis of channel pairs is depicted in Fig. 34.44. We can distinguish two types (denoted by A and B in the figure) of pairs of code sequences involved in forming an error event. In a pair of type A, at least one of the sequences has a transition run of length 4. In a pair of type B, both sequences have transition runs of length 3, but for one of them the run starts at an even position and for the other at an odd position. This implies that an NRZI constrained system that limits the run of 1s to 3 when it starts at an odd position, and to 2 when it starts at an even position, eliminates all possibilities shown bold-faced in Fig. 34.44. In addition, this constraint eliminates all error sequences containing the string + − + −. The capacity of the constraint is about .916, and rate 8/9 block codes with this constraint have been implemented in several commercial read channel chips. More about the constraint and the codes can be found in [4,10,20,28].

Channels with Colored Noise and Intertrack Interference Magnetic recording systems always operate in the presence of colored noise intertrack interference, and data dependent noise. Codes for these more realistic channel models are studied in [27]. The following is a brief outline of the problem. The data recording and retrieval process is usually modeled as a linear, continuous-time, communications channel described by its Lorentzian step response and additive white Gaussian noise. The most common discrete-time channel model is given by Eq. (34.32). Magnetic recording systems employ channel equaln N ization to the most closely matching transfer function h(D) = Σnhn D of the form h(D) = (1 − D)(1 + D) . This equalization alters the spectral density of the noise, and a better channel model assumes that the ηn in Eq. (34.32) are identically distributed, Gaussian random variables with zero mean, variance σ 2, and 2 normalized cross-correlation E{ηnηk}/σ = ρn−k. In practice, there is always intertrack interference (ITI), i.e., the read head picks up magnetization from an adjacent track. Therefore, the channel output is given by

yn =

∑a m

h

m n−m

+

∑x m

g

m n−m

+ ηn

(34.37)

where {gn} is the discrete-time impulse response of the head to the adjacent track, and x = {xn} ∈ C is the sequence recorded on that track. Assuming that the noise is white.

© 2002 by CRC Press LLC

In the ideal case (34.32), the probability of detecting b given that a was recorded is equal to Q(d()/σ), where d(⑀) is the distance between a and b given by 2

d () =

∑  ∑ ⑀ n

m

 m h n−m

2

(34.38)

Therefore, a lower bound, and a close approximation for small σ, to the minimum probability of an error-event in the system is given by Q(dmin,C /σ), where

d min, C = min  ∈⑀ c d (  ) is the channel minimum distance of code C. We refer to

d min =

min ∈ { −1,0,1 }



d()

(34.39)

as the minimum distance of the uncoded channel, and to the ratio dmin,C /dmin as the gain in distance of code C over the uncoded channel. In the case of colored noise, the probability of detecting b given that a was recorded equals to Q(∆()/σ), where ∆() is the distance between a and b given by

∑  ∑

2 2

⑀ h  2 ∆ (  ) = -----------------------------------------------------------------------------------------------∑ n∑ k  ∑ m⑀m hn−m ρn−k  ∑ m⑀m hk−m n

m m n−m

Therefore, a lower bound to the minimum probability of an error-event in the system is given by Q(∆min,C/σ), where

∆ min,C = min  ∈EC ∆ (  ) In the case of ITI (Eq. 34.37), an important factor is the probability of detecting sequence b given that sequence a was recorded on the track being read and sequence x was recorded on an adjacent track. This probability is

Q ( δ ( ,x )/ σ ) where δ (,x) is the distance between a and b in the presence of x given by [30]

1 2 δ ( ,x ) = ------------------------------------------------2 ∑ n  ∑ m⑀m hn−m

∑  ∑ n

2

 ∑ n ∑ x g  ∑ ⑀ h  m⑀ m h n−m +  m m n−m  m m n−m

2

Therefore, a lower bound to the minimum probability of an error-event in the system is proportional to Q(δmin,C /σ), where

δ min,C = min δ ( ,x )  ≠ 0 ,x ∈ C

© 2002 by CRC Press LLC

Distance δmin,C can be bounded as follows [30]:

δ min,C ≥ ( 1 – M )d min,C

(34.40)

where M = maxn,x∈C Σmxm gn−m, i.e., M is the maximum absolute value of the interference. Note that M = Σn|gn|. We will assume that M < 1. The bound is achieved if and only if there exists an , d() = dmin, C, for which Σm⑀mhn−m ∈{−1, 0, 1} for all n, and there exists an x ∈C such that Σmxmgn−m = + M whenever Σm⑀mhn−m = ±1.

An Example Certain codes provide gain in minimum distance on channels with ITI and colored noise, but not on the AWGN channel with the same transfer function. This is best illustrated using the example of the 2 partial response channel with the transfer function h(D) = (1 − D)(1 + D) known as EPR4. It is well 2 known that for the EPR4 channel d min = 4. Moreover, as discussed in the subsection on “Constraints for ISI Channels,” the following result holds: Proposition 1. Error events ⑀(D) such that 2

2

d ( ⑀ ) = d min = 4 take one of the following two forms: k−1

⑀(D) =

∑D

2j

k≥1

,

j=0

or l−1

⑀(D) =

∑ ( –1 ) D , i

i

l≥3

i=0

Therefore, an improvement of error-probability performance can be accomplished by codes which eliminate the error sequences ⑀ containing the strings −1 +1 −1 and +1 −1 +1. Such codes were extensively studied in [20]. In the case of ITI (Eq. 34.37), it is assumed that the impulse response to the reading head from an adjacent track is described by g(D) = αH(D), where the parameter α depends on the track to head distance. Under 2 2 2 this assumption, the bound (34.40) gives δ min ≥ d min ( 1 – 4 α ) . The following result was shown in [30]: Proposition 2. Error events ⑀(D) such that

min x ∈C

2

2

2

2

δ ( ,x ) = δ min = d min ( 1 – 4 α ) = 4 ( 1 – 4 α )

2

take the following form: l−1

⑀(D) =

∑ ( –1 ) D , i

i

l≥5

i=0 2

2

2

For all other error sequences for which d (⑀) = 4, we have minx苸C δ (, x) = 4(1 − 3α) .

© 2002 by CRC Press LLC

Therefore, an improvement in error-probability performance of this channel can be accomplished by limiting the length of strings of alternating symbols in code sequences to four. For the NRZI type of recording, this can be achieved by a code that limits the runs of successive ones to three. Note that the set of minimum distance error events is smaller than in the case with no ITI. Thus, performance improvement can be accomplished by higher rate codes that would not provide any gain on the ideal channel. Channel equalization to the EPR4 target introduces cross-correlation among noise samples for a range of current linear recording densities (see [27] and references therein). The following result was obtained in [27]: Proposition 3. Error events ⑀(D) such that 2

2

∆ (  ) = ∆ min take the following form: l−1

⑀(D) =

∑ ( –1 ) D , i

i=0

i

l ≥ 3, l odd

Again, the set of minimum distance error events is smaller than in the ideal case (white noise), and performance improvement can be provided by codes which would not give any gain on the ideal channel. For example, since all minimum distance error events have odd parity, a single parity check code can be used.

Future Directions Soft-Output Decoding of Modulation Codes Detection and decoding in magnetic recording systems is organized as a concatenation of a channel detector, an inner decoder, and an outer decoder, and as such should benefit from techniques known as erasure and list decoding. To declare erasures or generate lists, the inner decoder (or channel detector) needs to assess symbol/sequence reliabilities. Although the information required for this is the same one necessary for producing a single estimate, some additional complexity is usually required. So far, the predicted gains for erasure and list decoding of magnetic recording channels with additive white Gaussian noise were not sufficient to justify increasing the complexity of the channel detector and inner and outer decoder; however, this is not the case for systems employing new magneto-resistive reading heads, for which an important noise source, thermal asperities, is to be handled by passing erasure flags from the inner to the outer decoder. In recent years, one more reason for developing simple soft-output channel detectors has surfaced. The success of turbo-like coding schemes on memoryless channels has sparked the interest in using them as modulation codes for ISI channels. Several recent results show that the improvements in performance turbo codes offer when applied to magnetic recording channels at moderate linear densities are even more dramatic than in the memoryless case [12,29]. The decoders for turbo and low-density parity check codes (LDPC) either require or perform much better with soft input information which has to be supplied by the channel detector as its soft output. The decoders provide soft outputs which can then be utilized by the outer Reed–Solomon (RS) decoder [22]. A general soft-output sequence detection was introduced in [11], and it is possible to get information on symbol reliabilities by extending those techniques [21,31]. Reversed Concatenation Typically, the modulation encoder is the inner encoder, i.e., it is placed downstream of an error-correction encoder (ECC) such as an RS encoder; this configuration is known as standard concatenation (Fig. 34.45). This is natural since otherwise the ECC encoder might well destroy the modulation properties before

© 2002 by CRC Press LLC

FIGURE 34.45

Standard concatenation.

FIGURE 34.46

Reversed concatenation.

passing across the channel; however, this scheme has the disadvantage that the modulation decoder, which must come before the ECC decoder, may propagate channel errors before they can be corrected. This is particularly problematic for modulation encoders of very high rate, based on very long block size. For this reason, a good deal of attention has recently focused on a reversed concatenation scheme, where the encoders are concatenated in the reversed order (Fig. 34.46). Special arrangements must be made to ensure that the output of the ECC encoder satisfies the modulation constraints. Typically, this is done by insisting that this encoder be systematic and then re-encoding the parity information using a second modulation encoder (the “parity modulation encoder”), whose corresponding decoder is designed to limit error propagation; the encoded parity is then appended to the modulation-encoded data stream (typically a few merging bits may need to be inserted in between the two streams in order to ensure that the entire stream satisfies the constraint). In this scheme, after passing through the channel the modulationencoded data stream is split from the modulation-encoded parity stream, and the latter is then decoded via the parity modulation decoder before being passed on to the ECC decoder. In this way, many channel errors can be corrected before the data modulation decoder, thereby mitigating the problem of error propagation. Moreover, if the data modulation encoder has high rate, then the overall scheme will also have high rate because the parity stream is relatively small. Reversed concatenation was introduced in [3] and later in [23]. Recent interest in the subject has been spurred on by the introduction of a lossless compression scheme, which improves the efficiency of reversed concatenation [15], and an analysis demonstrating the benefits in terms of reduced levels of interleaving [8]; see also [9]. Research on fitting soft decision detection into reversed concatenation can be found in [7,33].

References 1. R. Adler, D. Coppersmith, and M. Hassner, “Algorithms for sliding-block codes,” IEEE Trans. Inform. Theory, vol. 29, no. 1, pp. 5–22, Jan. 1983. 2. J. Ashley and B. Marcus, “Time-varying encoders for constrained systems: an approach to limiting error propagation,” IEEE Trans. Inform. Theory, 46 (2000), 1038–1043. 3. W. G. Bliss, “Circuitry for performing error correction calculations on baseband encoded data to eliminate error propagation,” IBM Tech. Discl. Bull., 23 (1981), 4633–4634. 4. W. G. Bliss, “An 8/9 rate time-varying trellis code for high density magnetic recording,” IEEE Trans. Magn., vol. 33, no. 5, pp. 2746–2748, Sept. 1997. 5. T. Conway, “A new target response with parity coding for high density magnetic recording,” IEEE Trans. Magn., vol. 34, pp. 2382–2386, 1998. 6. T. Cover, “Enumerative source encoding,” IEEE Trans. Inform. Theory, pp. 73–77, Jan. 1973.

© 2002 by CRC Press LLC

7. J. Fan, “Constrained coding and soft iterative decoding for storage,” PhD Dissertation, Stanford University, 1999. 8. J. Fan and R. Calderbank, “A modified concatenated coding scheme, with applications to magnetic data storage,” IEEE Trans. Inform. Theory, 44 (1998), 1565–1574. 9. J. Fan, B. Marcus, and R. Roth, “Lossless sliding-block compression of constrained systems,” IEEE Trans. Inform. Theory, 46 (2000), 624–633. 10. K. Knudson Fitzpatrick, and C. S. Modlin, “Time-varying MTR codes for high density magnetic recording,” in Proc. 1997 IEEE Global Telecommun. Conf. (GLOBECOM ’97), Phoenix, AZ, Nov. 1997, pp. 1250–1253. 11. J. Hagenauer and P. Hoeher, “A Viterbi algorithm with soft–decision outputs and its applications,” in Proc. 1989 IEEE Global Telecommun. Conf. (GLOBECOM ’89), Dallas, TX, Nov. 1989, pp. 1680–1687. 12. C. Heegard, “Turbo coding for magnetic recording,” in Proc. 1998 Information Theory Workshop, San Diego, CA, Feb. 8–11, 1998, pp. 18–19. 13. C. D. Heegard, B. H. Marcus, and P. H. Siegel, “Variable-length state splitting with applications to average runlength-constrained (ARC) codes,” IEEE Trans. Inform. Theory, 37 (1991), 759–777. 14. H. D. L. Hollmann, “On the construction of bounded-delay encodable codes for constrained systems,” IEEE Trans. Inform. Theory, 41 (1995), 1354–1378. 15. K. A. Schouhamer Immink, “A practical method for approaching the channel capacity of constrained channels,” IEEE Trans. Inform. Theory, 43 (1997), 1389–1399. 16. K. A. Schouhamer Immink, P. H. Siegel, and J. K. Wolf, “Codes for Digital Recorders,” IEEE Trans. Infor. Theory, vol. 44, pp. 2260–2299, Oct. 1998. 17. K. A. Schouhamer Immink, Codes for Mass Data Storage, Shannon Foundation Publishers, The Netherlands, 1999. 18. R. Karabed and P. H. Siegel, “Matched spectral null codes for partial response channels,” IEEE Trans. Inform. Theory, 37 (1991), 818–855. 19. R. Karabed and P. H. Siegel, “Coding for higher order partial response channels,” in Proc. 1995 SPIE Int. Symp. on Voice, Video, and Data Communications, Philadelphia, PA, Oct. 1995, vol. 2605, pp. 115–126. 20. R. Karabed, P. H. Siegel, and E. Soljanin, “Constrained coding for binary channels with high intersymbol interference,” IEEE Trans. Inform. Theory, vol. 45, pp. 1777–1797, Sept. 1999. 21. K. J. Knudson, J. K. Wolf, and L. B. Milstein, “Producing soft–decision information on the output of a class IV partial response Viterbi detector,” in Proc. 1991 IEEE Int. Conf. Commun. (ICC ’91), Denver, CO, June 1991, pp. 26.5.1.–26.5.5. 22. R. Koetter and A. Vardy, preprint 2000. 23. M. Mansuripur, “Enumerative modulation coding with arbitrary constraints and post-modulation error correction coding and data storage systems,” Proc. SPIE, 1499 (1991), 72–86. 24. B. Marcus, R. Roth, and P. Siegel, “Constrained systems and coding for recording channels,” Chapter 20 of Handbook of Coding Theory, edited by V. Pless, C. Huffman, 1998, Elsevier. 25. D. Modha and B. Marcus, “Art of constructing low complexity encoders/decoders for constrained block codes,” IEEE J. Sel. Areas in Comm., (2001), to appear. 26. B. E. Moision, A. Orlitsky, and P. H. Siegel, “On codes that avoid specified differences,” IEEE Trans. Inform. Theory, vol. 47, pp. 433–441, Jan. 2001. 27. B. E. Moision, P. H. Siegel, and E. Soljanin, “Distance Enhancing Codes for High-Density Magnetic Recording Channel,” IEEE Trans. Magn., submitted, Jan. 2001. 28. J. Moon and B. Brickner, “Maximum transition run codes for data storage systems,” IEEE Trans. Magn., vol. 32, pp. 
3992–3994, Sept. 1996. 29. W. Ryan, L. McPheters, and S. W. McLaughlin, “Combined turbo coding and turbo equalization for PR4-equalized Lorentzian channels,” in Proc. 22nd Annual Conf. Inform. Sciences and Systems, Princeton, NJ, March 1998.

© 2002 by CRC Press LLC

30. E. Soljanin, “On–track and off–track distance properties of Class 4 partial response channels,” in Proc. 1995 SPIE Int. Symp. on Voice, Video, and Data Communications, Philadelphia, PA, vol. 2605, pp. 92–102, Oct. 1995. 31. E. Soljanin, “Simple soft-output detection for magnetic recording channels,” in 1998 IEEE Int. Symp. Inform. Theory (ISIT’00), Sorrento, Italy, June 2000. 32. A. J. van Wijngaarden and K. A. Schouhamer Immink “Combinatorial construction of high rate runlength-limited codes,” Proc. 1996 IEEE Global Telecommun. Conf. (GLOBECOM ’96), London, U.K., pp. 343–347, Nov. 1996. 33. A. J. van Wijngaarden and K. A. Schouhamer Immink, “Maximum run-length limited codes with error control properties,” IEEE J. Select. Areas Commun., vol. 19, April 2001. 34. A. J. van Wijngaarden and E. Soljanin, “A combinatorial technique for constructing high rate MTR–RLL codes,” IEEE J. Select. Areas Commun., vol. 19, April 2001.

34.6 Data Detection ∨

Miroslav Despotovi´c and Vojin Senk Introduction Digital magnetic recording systems transport information from one time to another. In communication society jargon, it is said that recording and reading information back from a (magnetic) medium is equivalent to sending it through a time channel. There are differences between such channels. Namely, −5 −6 in communication systems, the goal is a user error rate of 10 or 10 . Storage systems, however, often −12 require error rates of 10 or better. On the other hand, the common goal is to send the greatest possible amount of information through the channel used. For storage systems, this is tantamount to increasing recording density, keeping the amount redundancy as low as possible, i.e., keeping the bit rate per recorded pulse as high as possible. The perpetual push for higher bit rates and higher storage densities spurs a steady increment of the amplitude distortion of many types of transmission and storage channels. When recording density is low, each transition written on the magnetic medium results in a relatively isolated peak of voltage, and peak detection method is used to recover written information; however, when PW50 (pulse width at half maximum response) becomes comparable with the channel bit period, the peak detection channel cannot provide reliable data detection, due to intersymbol interference (ISI). This interference arises because the effects of one readback pulse are not allowed to die away completely before the transmission of the next. This is an example of a so-called baseband transmission system, i.e., no carrier modulation is used to send data. Impulse dispersion and different types of induced noise at the receiver end of the system introduce combination of several techniques (equalization, detection, and timing recovery) to restore data. This chapter section gives a survey of most important detection techniques in use today assuming ideal synchronization. Increasing recording density in new magnetic recording products necessarily demands enhanced detection techniques. First detectors operated at densities at which pulses were clearly separated, so that very simple, symbol-by-symbol detection technique was applied, the so-called peak detector [30]. With increased density, the overlap of neighboring dispersed pulses becomes so severe (i.e., large intersymbol interference—ISI) that peak detector could not combat with such heavy pulse shape degradation. To accomplish this task, it was necessary to master signal processing technology to be able to implement more powerful sequence detection techniques. This chapter section will both focus on this type of detection already applied in commercial products and give advanced procedures for searching the detection trellis to serve as a tutorial material for research on next generation products.

Partial-Response Equalization In the classical peak detection scheme, an equalizer is inserted whose task is just to remove all the ISI so that an isolated pulse is acquired, but the equalization will also enhance and colorize the noise (from readback process) due to spectral mismatch. The noise enhancement obtained in this manner will increase

© 2002 by CRC Press LLC

y (t ) Received signal

FIGURE 34.47

Whitened Matched Filter

yn

PR

¢

t=nT

Channel

Equalizer

sequence

yn PR sequence

in

Viterbi Algorithm

Estimated input sequence

Maximum-likelihood sequence detector.

with recording density and eventually become intolerable. Namely, since such a full equalization is aimed at slimming the individual pulse, so that it does not overlap with adjacent pulses, it is usually too aggressive and ends up with huge noise power. Let us now review the question of recording density, also known as packing density. It is often used to specify how close two adjacent pulses stay to each other and is defined as PW50/T (see Chapter 34.1 for definition). Whatever tricks are made with peak detection systems, they barely help at PW50/T ratios above 1. Section 34.6 discusses two receiver types that run much less rapidly out of steam. These are the partialresponse equalizer (PRE) and the decision-feedback equalizer (DFE). Both are rooted in old telegraph tricks and, just as is the case with peak detector, they take instantaneous decisions with respect to the incoming data. Section 34.6 will focus mainly on these issues, together with sequence detection algorithms that accompany partial-response (PR) equalization. What is PR equalization? It is the act of shaping the readback magnetic recording signal to look like the target signal specified by the PR. After equalization the data are detected using a sequence detector. Of course, quantization by an analog-to-digital converter (ADC) occurs at some point before the sequence detector. The common readback structure consists of a linear filter, called a whitened matched filter, a symbolrate sampler (ADC), a PRE, and a sequence detector, Fig. 34.47. The PRE in this scheme can also be put before the sampler, meaning that it is an analog, not a digital equalizer. Sometimes part of the equalizer is implemented in the analog, the other part in the digital domain. In all cases, analog signal, coming from the magnetic head, should have a certain and constant level of amplification. This is done in a variable gain amplifier (VGA). To keep a signal level, VGA gets a control signal from a clock and gain recovery system. In the sequel, we will assume that VGA is already (optimally) performed. In the design of equalizers and detectors, low power dissipation and high speed are both required. The error performances need to be maintained as well. So far, most systems seek for the implementations in the digital domain, as is the case in Fig. 34.47, but it has been shown that ADC may contribute to the high-frequency noise during the PR target equalization, causing a long settling time in clock recovery loop, as well as degrading performance [33]. In addition, the ADC is also usually the bottleneck for the low-power highspeed applications. On the other hand, the biggest problem for an analog system is the imperfection of circuit elements. The problems encountered with analog systems include nonideal gain, mismatch, nonlinear hold step, offset, etc. Let us now turn to the blocks shown in Fig. 34.47. The first of them, the whitened matched filter, has the following properties [7]: Simplicity: a single filter producing single sample at the output is all that is needed. The response of the filter is either chosen to be causal and hence realizable, or noncausal, meaning some delay has to be introduced, yielding better performance. Sufficiency: the filter is information lossless, in the sense that its sampled outputs are a set of sufficient statistics for estimation of the input sequence. Whiteness: the noise components of the sampled outputs are independent identically distributed Gaussian random variables. 
The whiteness and sufficiency property follow from the fact that the set of waveforms at the output of the matched filter is an orthonormal basis for the signal space.

© 2002 by CRC Press LLC

The next block is PRE. What is PR? Essential to PR techniques is that the PR sequence is obtained from the channel sequence via a simple linear filter. More specifically, the impulse response of this filter is such that the overall response is modeled as having only a few small integer-valued coefficients, the condition actually considered crucial for the system to be called PR. This condition subsequently yields relatively simple sequence detectors. The correlative level coding [3], also known as PR [31] is adopted in digital communication applications for long time. Kobayashi [9] suggested in 1971 that this type of channels can be treated as a linear finite state machine, and thus can be represented by the state diagram and its time instant labeled counterpart, trellis diagram. Consequently, its input is best inferred using some trellis search technique, the best of them (if we neglect complexity issues) being the Viterbi algorithm [2] (if one is interested in maximizing the likelihood of the whole sequence; otherwise, a symbol-bysymbol detector is needed). Kobayashi also indicated that the magnetic recording channel could be regarded as the PR channel due to the inherent differentiation property in the readback process [8]. This is both present in inductive heads and in magnetoresistive (MR) heads, though the latter are directly sensitive to magnetization and not to its change (this is due to the fact that the head has to be shielded). In other words, the pulse will be read only when the transition of opposite magnet polarities is sensed. Basic to the PR equalization is the fact that a controlled amount of ISI is not suppressed by the equalizer, but rather left for a sequence detector to handle. The nature of the controlled ISI is defined by a PR. A proper match of this response to the channel permits noise enhancement to remain small even when amplitude distortion is severe. In other words, PR equalization can provide both well-controlled ISI and spectral match. PR equalization is based on two assumptions: • The shape of readback signal from an isolated transition is exactly known and determined. • The superposition of signals from adjacent transitions is linear. Furthermore, it is assumed that the channel characteristics are fixed and known, so that equalization need not be adaptive. The resulting PR channel can be characterized using D-transform of the sequences M−1 i that occur, X(D) = I(D)H(D) [7] where H(D) = Σ i=0 hiD , D represents the delay factor in D-transform and M denotes the order of the PR signals. When modeling differentiation, H(D) = 1 − D. The finite state machine (FSM) of this PR channel is known as the dicode system since there are only two states in the transition diagram. The most unclear signal transformation in Fig. 34.47 is equalization. What does it mean that the pulse of voltage should look like the target signal specified by the PR (the so-called PR target)? To answer this question let us consider the popular Class IV PR, or PRIV system. For magnetic recording systems with PW50/T approximately equal to 2, comparatively little equalization is required to force the equalized channel to match a class-4 PR (PR4) channel where H(D) = (1 − D)(1 + 2 D) = 1 − D . Comparing to the Lorentzian model of Chapter 34.1, PR4 channel shows more emphasis in the high frequency domain. 
The equalizer with the PR4 as the equalization target thus suppresses the low frequency components and enhances the high frequency ones, degrading the performance of alldigital detectors since the quantization noise, that is mainly placed at higher frequencies, is boosted up. The isolated pulse shape in a PR4 system is shown in Fig. 34.48. The transition is written at time instant t = 0, where T is the channel bit period. The shape is oscillating and the pulse values at integer number of bit periods before the transition are exactly zeros. Obviously, it is this latter feature that should give us future advantage; however, at t = 0 and at t = T, i.e., one bit period later, the values of the pulse are equal to “1”. The pulse of voltage reaches its peak amplitude of 1.273 at one half of the bit period. Assume that an isolated transition is written on the medium and the pulse of voltage shown in Fig. 34.48 comes to the PRML system. The PR4 system requires that the samples of this pulse should correspond to the bit periods. Therefore, samples of the isolated PR4 pulse will be 00…011000 … (of course, “1” is used for convenience, and in reality it corresponds to some ADC level). Because the isolated transition has two nonzero samples, when the next transition is written, the pulses will interfere. Thus, writing two pulses adjacent to each other will introduce superposition between them,

© 2002 by CRC Press LLC

FIGURE 34.48

Capacity of PR4 channel.

FIGURE 34.49

Capacity of EPR4 channel.

usually called a dipulse response, as shown in Fig. 34.49. Here, the samples are […,0,0,1,0,−1,0,0,…], resulting from 0001 1 000 + 0 0 0 0 −1 −1 0 0 \

from the first transition from the second transition

0 0 0 1 0 −1 0 0

Now, there is no concern about linear ISI; once the pulses can be reduced to the predetermined simple shape, the data pattern is easily recovered because superposition of signals from adjacent transitions is known.

© 2002 by CRC Press LLC

In the above example, we see that sample “1” is suppressed by “−1” from the next transition. It is a simple matter to check that all possible linear combinations of the samples result in only three possible values {−1, 0, +1} (naturally, it is that all parts of the system are working properly, i.e., equalization, gain, and timing recovery, and that the signal is noise free). A positive pulse of voltage is always followed by a negative pulse and vice versa, so that the system can be regarded as an alternative mark inversion (AMI) code. The higher bit capacity of the PR4 channel can best be understood from Fig. 34.48. It is observed that PR4 channel provides a 50% enhancement in the recording density as compared with the peak detection (fully equalized) one, since the latter requires isolation of single bits from each other. In the next figure, we see that the EPR4 channel (explained later) adds another 33% to this packing density. PR4 has another 2 advantage over all the other PR systems; since H(D) = 1 − D , the current symbol is correlated to the second previous one, allowing the system to be modeled as two interleaved dicode channels, implying the use of simple dicode detectors for even and odd readback samples. RLL coding is necessary in this case, since nonideal tracking and timing errors result in a residual intermediate term (linear in D) that induces correlation between two interleaved sequences, and thus degrades systems that rely on decoupled detection of each of them. RLL codes are widely used in conjunction with PR equalization in order to eliminate certain data strings that would render tracking and synchronization difficult. If PR4 target is used, a special type of RLL coding is used, characterized by (0,G/I). Here, G and I denote the maximum number of consecutive zeros in the overall data string, and in the odd/even substrings, respectively. The latter parameter ensures proper functioning of the clock recovery mechanism if deinterleaving of the PR4 channel into two independent dicode channels is performed. The most popular is the (0,4/4) code, whose data rate is 7/8, i.e., whose data loss is limited to 12.5%. Other PR targets are used besides PR4. The criterion of how to select the appropriate PR target is based on spectral matching, to avoid introducing too much equalization noise. For instance, for PW50/T ≈ 2.25, it is better to model ISI pattern as the so-called EPR4 (i.e. extended class-4 partial response) 2 2 3 channel with H(D) = (1 + D) (1 − D) = 1 + D − D − D . As the packing density goes up, more low frequency components are being introduced (low compared to 1/T, that also increases as T is shortened, in reality those frequencies are higher than those met for lower recording densities, respectively greater T). This is the consequence of the fact that intersymbol interference blurs the boundary between individual pulses, flattening the overall response (in time domain). The additional 1 + D term in the target PR effectively suppresses the unwanted high frequencies. EPR4 enables even higher capacities of the magnetic recording systems than PRIV, observing the difference of 33% in the recording density displayed in Fig. 34.49; however, a practical implementation of EPR4 is much more complex than is the case with PR4. First, the deinterleaving idea used for PR4 cannot be implemented. Second, the corresponding state diagram (and consequently trellis) now has eight states instead of four (two if deinterleaving is used). 
Furthermore, its output is five-leveled, instead of ternary for the PR4 and the dicode channel, so that a 4.4 dB degradation is to be expected with a threshold detector. Naturally, if sequence detector is used, such as Viterbi algorithm (VA), this loss does not exist, but its elimination is obtained at the expense of a significantly increased complexity of the detector. Furthermore, if such a detector can be used, EPR4 has a performance advantage over PR4 due to less equalization noise enhancement, cf. Fig. 34.50. Let us reconsider the PR equalizer shown in Fig. 34.47. Following the approach from Reference 44, j2π Ω j2π Ω j2π Ω j2π Ω 2 its aim is to transform the input spectrum Y′(e ) into a spectrum Y(e ) = Y′(e )|C(e )| , where j2π Ω j2π Ω j2π Ω j2π Ω 2 j2π Ω C(e ) is the transfer function of the equalizer. The spectrum Y(e ) = I(e )|H(e )| + N(e ) where H (D) is the PR target. For instance, duobinary PR target (H(D) = 1 + D) enhances low frequencies and suppresses those near the Nyquist frequency Ω = 0.5, whereas dicode H(D) = (1 − D) does the opposite: it suppresses low frequencies and enhances those near Ω = 0.5. j2π Ω In principle, the spectral zeros of H(e ) can be undone via a linear (recursive) filter, but this would excessively enhance any noise components added. The schemes for tracking the input sequence to the system based on the PR target equalized one will be reviewed later in this chapter section. For instance,

© 2002 by CRC Press LLC

FIGURE 34.50

Equalization noise enhancement in PR channels. in

xn

hn

+ +

in

in−2

D2

(a) in

dn

x

d n−2

hn

xn

MM

in

D2

(b) FIGURE 34.51

(a) PR4 recursive restoration of information sequence and (b) precoder derived from it.

for a PR4 system, a second-order recursive filter can in principle be used to transform its input into an estimate of the information sequence, Fig. 34.51. Unfortunately, if an erroneous estimate is produced at any moment, all subsequent estimates will be in error (in fact, they will no longer be in the alphabet {−1, 0, 1}, enabling error monitoring and simple forms of error correction [31]). To avoid this catastrophic error propagation, resort can be taken to a precoder. 2 Let us analyze the functioning of this precoder in the case of the PR4 channel (1 − D ) (generalization to other PR channels is trivial). Its function is to transform in into a binary sequence dn = indn−2 to which the PR transformation is applied, Fig. 34.51(b). This produces a ternary sequence

x n = ( d ∗ h ) n = d n – d n−2 = i n d n−2 – d n−2 = ( i n – 1 )d n−2 Because dn−2 cannot be zero, xn is zero iff in − 1 = 0, i.e., in = 1. Thus, the estimate iˆn of in can be formed by means of the memoryless mapping (MM)

1, x n = 0 iˆn = 0, else  This decoding rule does not rely on past data estimates and thereby avoids error propagation altogether. In practice, the sequences in and dn are in the alphabet {0, 1} rather than {−1, 1}, and the multiplication in Fig. 34.51(b) becomes a modulo-2 addition (where 0 corresponds to 1, and 1 to −1). The precoder does not affect the spectral characteristics of an uncorrelated data sequence. For correlated data, however, precoding need not be spectrally neutral. It is instructive to think of the precoder as a first-order recursive filter with a pole placed so as to cancel the zero of the partial response. The filter uses a modulo-2 addition instead of a normal addition and as a result the cascade of filter and PR, while memoryless, has a nonbinary output. The MM serves to repair this “deficiency.”

© 2002 by CRC Press LLC

Feedback detector

in

iˆn

xn + +

hk

iˆn − 2 (a)

D2

Feedback detector

in

hn (b)

FIGURE 34.52

iˆn

xn + +

P(D)

Feedback detector.

Catastrophic error propagation can be avoided without precoder by forcing the output of the recursive filter of Fig. 34.52 to be binary (Fig. 34.52(a)). An erroneous estimate în−2= −in−2 leads to a digit

ιˆn = x n + ιˆn−2 = i n – i n−2 + ιˆn−2 = i n – 2i n−2 whose polarity is obviously determined by in−2. Thus, the decision ιˆn that is taken by the slicer in Fig. 34.52(a) will be correct if in happens to be the opposite of in−2. If data is uncorrelated, this will happen with probability 0.5, and error propagation will soon cease, since the average number of errors in a burst is 2 1 + 0.5 + (0.5) + … = 2. Error propagation is thus not a serious problem. The feedback detector of Fig. 34.52 is easily generalized to arbitrary partial response H(D). For purposes of normalization, H(D) is assumed to be causal and monic (i.e., hn = 0 for n < 0 and h0 = 1). The nontrivial taps h1, h2 ,…together form the “tail” of H(D). This tail can be collected in P(D), with pn = 0 for n ≤ 0 and pn = hn for n ≥ 1. Hence, hn = δn + pn, where the Kronecker delta function δn represents the component h0 = 1. Hence

xn = ( i ∗ h )n = ( i ∗ ( δ + p ) )n = in + ( i ∗ p )n . The term (i ∗ p)n depends exclusively on past digits in−1, in−2,… that can be replaced by decisions ιˆn−1, ιˆn−2 ,…. Therefore, an estimate ιˆn of the current digit in can be formed according to ιˆn = xk( ιˆ ∗ p)n as in Fig. 34.52(b). As before, a slicer quantizes ιˆn into binary decisions ιˆn so as to avoid catastrophic error propagation. The average length of bursts of errors, unfortunately, increases with the memory order of H(D). Even so, error propagation is not normally a serious problem [21]. In essence, the feedback detector avoids noise enhancement by exploiting past decisions. This viewpoint is also central to decision-feedback equalization, to be explained later. Naturally, all this can be generalized to nonbinary data; but in magnetic recording, so far, only binary data are used (the so-called saturation recording). The reasons for this are elimination of hysteresis and the stability of the recorded sequence in time. Let us consider now the way the PR equalizer from Fig. 34.47 is constructed. In Fig. 34.53, a discretej2π Ω time channel with transfer function F(e ) transforms in into a sequence yn = (i ∗ f )n + un, where un is j2π Ω the additive noise with power spectral density U(e ), and yn represents the sampled output of a j2π Ω whitened matched filter. We might interpret F(e ) as comprising two parts: a transfer function j2π Ω H(e ) that captures most of the amplitude distortion of the channel (the PR target) and a function

© 2002 by CRC Press LLC

F (e j 2π Ω ) in

H (e j 2πhΩk )

FIGURE 34.53 j2π Ω

xn

un

Fr (e j 2π Ω)

Equalizer

+

yn

−1

C (e j 2π Ω ) = Fr (e j 2π Ω )

xˆ n

Interpretation of PR equalization. j2π Ω

j2π Ω

Fr(e ) = F(e )/H(e ) that accounts for the remaining distortion. The latter distortion has only a small amplitude component and can thus be undone without much noise enhancement by a linear equalizer with transfer function

C(e

j2 π Ω

j2 π Ω

H(e ) 1 ) = --------------------- = --------------------j2 π Ω j2 π Ω F(e ) ) Fr ( e

This is precisely the PR equalizer we sought for. It should be stressed that the subdivision in Fig. 34.53 is only conceptual. The equalizer output is a noisy version of the “output” of the first filter in Fig. 34.53 and is applied to the feedback detector of Fig. 34.52, to obtain decision variables ιˆn′ and ιˆn. The precoder and MM of Fig. 34.51 are, of course, also applicable and yield essentially the same performance. The choice of the coefficients of the PRE in Fig. 34.47 is the same as for full-response equalization and is explained in the subsection on “Adaptive Equalization and Timing Recovery.” Interestingly, zeroforcing here is not as bad as is the case with full-response signaling and yields approximately the same result as minimum mean-square equalization. To evaluate the performance of the PRE, let us assume that all past decisions that affect ιˆn are correct and that the equalizer is zero forcing (see “Adaptive Equalization and Timing Recovery” for details). The only difference between ιˆn′ and ιˆn is now the filtered noise component (u ∗ c)n with variance

2

σ ZFPRE =



0.5

– 0.5

U(e

j2 π Ω

) C(e

j2π Ω

j2 π Ω

2

) dΩ

=

j2 π Ω

j2 π Ω

2

U(e ) H(e ) ----------------------------------------------dΩ j2 π Ω 2 – 0.5 ) F(e



0.5

j2π Ω

Because |H(e )| was selected to be small wherever |F(e )| is small, the integrand never becomes very large, and the variance will be small. This is in marked contrast with full-response equalization. Here, j2π Ω H(e ) = 1 for all Ω, and the integrand in the above formula can become large at frequencies where j2π Ω j2π Ω |F(e )| is small. Obviously, the smallest possible noise enhancement occurs if H(e ) is selected so that the integrand is independent of frequency, implying that the noise at the output of the PRE is white. j2π Ω This is, in general, not possible if H(e ) is restricted to be PR (i.e., small memory-order, integer-valued). The generalized feedback detector of Fig. 34.52, on the other hand, allows a wide variety of causal j2π Ω responses to be used, and here |H(e )| can be chosen at liberty. Exploitation of this freedom leads to decision feedback equalization (DFE).

Decision Feedback Equalization This subsection reviews the basics of decision feedback detection. It is again assumed that the channel characteristics are fixed and known, so that the structure of this detector need not be adaptive. Generalizing to variable channel characteristics and adaptive detector structure is tedious, but straightforward. A DFE detector shown in Fig. 34.54, utilizes the noiseless decision to help remove the ISI. There are two types of ISI: precursor ISI (ahead of the detection time) and postcursor (behind detection time). Feedforward equalization (FFE) is needed to eliminate the precursor ISI, pushing its energy into the postcursor domain. Supposing all the decisions made in the past are correct, DFE reproduces exactly the modified

© 2002 by CRC Press LLC

Prefilter yn

cn

in

+ +

P(D)

FIGURE 34.54

Decision feedback equalizer.

FIGURE 34.55 Precursor and postcursor ISI elimination with DFE (a) sampled channel response, (b) after feedforward filter and (c) slicer output.

postcursor ISI (with extra postcursor ISI produced by the FFE during the elimination of precursor ISI), thus eliminating it completely, Fig. 34.55. If the length of the FFE can be made infinitely long, it should be able to completely suppress the precursor ISI, redistributing its energy into the postcursor region, where it is finally cancelled by feedback decision part. No spectrum inverse is needed for this process, so noise boosting is much less than is the case with linear equalizers. The final decision of the detector is made by the memoryless slicer, Fig. 34.54. The reason why a slicer can perform efficient sequence detection can be explained with the fact that memory of the DFE system is located in two equalizers, so that only symbol-by-symbol detection can suffice. In terms of performance, the DFE is typically much closer to the maximum likelihood sequence detector than to the LE. If the equalization target is not the main cursor, but a PR system, a sequence detection algorithm can be used afterwards. A feasible way to implement this with minimum additional effort is the tree search algorithm used instead of VA [6]. The simple detection circuitry of a DFE, consisting of two equalizers and one slicer, makes implementation possible. The DFE may be regarded as a generalization of the PRE. In the DFE, the trailing portion of the ISI is not suppressed by a forward equalizer but rather canceled by a feedback filter that is excited by past decisions. Fortunately, error propagation is typically only a minor

© 2002 by CRC Press LLC

Equalizer errror _ Sampled channel data

Forward Filter

+ Recovered data

+_

D D

RAM address

.....

RAM containing estimates of ISI

D

FIGURE 34.56

Block diagram of a RAM-based DFE.

problem and it can, in fact, be altogether avoided through a technique that is called Tomlinson/Harashima precoding. Performance differences between zero-forcing and minimum mean-square equalizers tend to be considerably smaller in the DFE case than for the LE, and as a result it becomes more dificult to reap SNR benefits from the modulation code. It can be proved that DFE is the optimum receiver with no detection delay. If delay is allowed, it is better to use trellis-based detection algorithms. RAM-Based DFE Detection Decision feedback equalization or RAM-based DFE is the most frequent alternative to PRML detection. Increase of bit density leads to significant nonlinear ISI in the magnetic recording channel. Both the linear DFE [12,26] and PRML detectors do not compensate for the nonlinear ISI. Furthermore, the implementation complexity of a Viterbi detector matched to the PR channel grows exponentially with the degree of channel polynomial. Actually, in order to meet requirements for a high data transfer rate, high-speed ADC is also needed. In the RAM-based DFE [19,24], the linear feedback section of the linear DFE is replaced with a look-up table. In this way, detector decisions make up a RAM address pointing to the memory location that contains an estimate of the post cursor ISI for the particular symbol sequence. This estimate is subtracted from the output of the forward filter forming the equalizer output. Look-up table size is manageable and typically is less than 256 locations. The major disadvantage of this approach is that it requires complicated architecture and control to recursively update ISI estimates based on equalizer error.

Detection in a Trellis A trellis-based system can be simply described as a FSM (Finite State Machine) whose structure may be displayed with the aid of a graph, tree, or trellis diagram. A FSM maps input sequences (vectors) into output sequences (vectors), not necessarily of the same length. Although the system is generally nonlinear and time-varying, linear fixed trellis based systems are usually met. For them,

F ( a ⋅ i [0,∞) ) = a ⋅ F ( i [0,∞) ) ′ + i ″[0,∞) ) = F ( i [0,∞) ′ ) + F ( i [0,∞) ″ ) F ( i [0,∞) where a is a constant, i[0,∞) is any input sequence and F(i[0,∞)) is the corresponding output sequence. It is assumed that input and output symbols belong to a subset of a field. Also, for any d > 0, if x[0,∞) = ′ ′ , = 0[0,d). It is easily = 0[0,d) then F(i ′[0,∞) ) = x ′[0,∞) , where x ′l = xl−d, x [0,d) F(i[0,∞)) and i ′l = il−d, i [0,d) verified that F(⋅) can be represented by the convolution, so that x = i ∗ h, where h is the system impulse

© 2002 by CRC Press LLC

response (this is also valid for different lengths of x and i with a suitable definition of h). If h is of finite duration, M denotes the system memory length. Let us now consider a feedforward FSM with memory length M. At any time instant (depth or level) l, the FSM output xl depends on the current input il and M previous inputs il −1,…, il −M. The overall M functioning of the system can be mapped on a trellis diagram, whereon a node represents one of q encoder states (q is the cardinality of the input alphabet including the case when the input symbol is actually a subsequence), while a branch connecting two nodes represents the FSM output associated to the transition between the corresponding system states. A trellis, which is a visualization of the state transition diagram with a time element incorporated, is characterized by q branches stemming from and entering each state, except in the first and last M branches (respectively called head and tail of the trellis). The branches at the lth time instant are labeled by sequences xl ∈ X. A sequence of l information symbols, i[0,l) specifies a path from the root node to a node at the lth level and, in turn, this path specifies the output sequence x[0,l) = x0 • x1 • … • xl −1, where • denotes concatenation of two sequences. The input can, but need not, be separated in frames of some length. For framed data, where the length of each input frame equals L branches (thus L q-ary symbols) the length of the output frame is L + M branches (L + M output symbols), where the M known symbols (usually all zeros) are added at the end of the sequence to force the system into the desired terminal state. It is said that such systems suffer a fractional rate loss by L/(L + M). Clearly, this rate loss has no asymptotic significance. In the sequel, the detection of the input sequence, i(0,∞), will be analyzed based on the corrupted output sequence y[0,∞) = x[0,∞) + u[0,∞). Suppose there is no feedback from the output to the input, so that

P [ y n x 0 ,…,x n−1 , x n , y 0 ,…,y n−1 ] = P [ y n x n ] and N

P [ y 1 ,…,y N x 1 ,…,x N ] =

∏ P[y n=1

n

xn ]

Usually, u(0,∞) is a sequence that represents additive white Gaussian noise sampled and quantized to enable digital processing. The task of the detector that minimizes the sequence error probability is to find a sequence which maximizes the joint probability of input and output channel sequences

P [ y [0,L+M) , x [0,L+M) ] = P [ y [0,L+M) x [0,L+M) ]P [ x [0,L+M) ] Since usually the set of all probabilities P[x[0,L+M)] is equal, it is sufficient to find a procedure that maximizes P[y[0,L+M)| x[0,L+M)], and a decoder that always chooses as its estimate one of the sequences that maximize it or

µ ( y [0,L+M) x [0,L+M) ) = A log 2 P [ y [0,L+M) x [0,L+M) ] L+M

– f ( y [0,L+M) ) = A

∑ log ( P [ y l=0

2

l

xl ] – f ( yl ) )

(where A ≥ 0 is a suitably chosen constant, and f(⋅) is any function) is called a maximum-likelihood decoder (MLD). This quantity is called a metric, µ . This type of metric suffers one significant disadvantage because it is suited only for comparison between paths of the same length. Some algorithms, however,

© 2002 by CRC Press LLC

employ a strategy of comparing paths of different length or assessing likelihood of such paths with the aid of some thresholds. The metric that enables comparison for this type of algorithms is called the Fano metric. It is defined as

P [ y [0,l) , x [0,l) ] µ F ( y [0,l) x [0,l) ) = A log 2 -------------------------------P [ y [0,l) ] l

= A





[y x ] ∑  log P-------------------- – R  n=0

n

2



n

P [ yn ]



If the noise is additive, white, and Gaussian (an assumption that is not entirely true, but that usually yields systems of good performances), the probability distribution of its sample is 2 1  ( yn – xn )  p [ y n x n ] = ----------------2 exp  – ---------------------- 2 2 πσ 2σ  

The ML metric to be used in conjunction with such a noise is the logarithm of this density, and thus 2 proportional to −(yn − xn) , i.e., to the negative squared Euclidean distance of the readback and supposed written signal. Thus, maximizing likelihood amounts to minimizing the squared Euclidean distance of the two sequences, leading to minimizing the squared Euclidean distance between two sampled sequences 2 given by Σ n (y n – x n ) . The performance of a trellis-based system, as is the case with PR systems, depends on the detection algorithm employed and on the properties of the system itself. The distance spectrum is the property of the system that constitutes the main factor of the event error probability of a ML (optimum) detector, 45 if the distance is appropriately chosen for the coding channel used. For PR channels with additive white Gaussian noise, it is the squared Euclidean distance that has to be dealt with. Naturally, since the noise encountered is neither white, nor entirely Gaussisan, this is but an approximation to the properly chosen distance measure. As stated previously, the aim of the search procedure is to find a path with the highest possible likelihood, i.e., metric. There are several possible classifications of detecting procedures. This classification is in-line with systematization made in coding theory, due to fact that algorithms developed for decoding in a trellis are general so that it could be applied to problem of detection in any trellis-based system as well. According to detector’s strategies in extending the most promising path candidates we classify them into breadthfirst, metric-first, and depth-first, bidirectional algorithms, and into sorting and nonsorting depending on whether the procedure performs any kind of path comparison (sifting or sorting) or not. Moreover, detecting algorithms can be classified into searches that minimize the sequence or symbol error rate. The usual measure of algorithm efficiency is its complexity (arithmetic and storage) for a given probability of error. In the strict sense, arithmetic or computational complexity is the number of arithmetic operations per detected symbol, branch, or frame; however, it is a usual practice to track only the number of node computations, which makes sense because all such computations require approximately the same number of basic machine instructions. A node computation (or simply computation) is defined as the total number of nodes extended (sometimes it is the number of metrics computed) per detected branch or information frame i[0,L+M). One single computation consists of determining the state in which the node is computing the metrics of all its successors. For most practical applications with finite frame length, it is usually sufficient to observe node computations since a good prediction of search duration can be precisely predicted. Nevertheless, for asymptotic behavior it is necessary to track the sorting requirements too. Another important aspect of complexity is storage (memory or space), which is the amount of auxiliary storage that is required for detecting memory, processors working in parallel, etc. Thus, space complexity of an algorithm is the size (or number) of resources that must be reserved for

© 2002 by CRC Press LLC

its use, while the computational, or more precisely time complexity, reflects the number of accesses to this resources taking into account that any two operations done in parallel by the spatially separated processors should be counted as one. The product of these two, the time-space complexity, is possibly the best measure of the algorithm cost for it is insensitive to time-space tradeoff such as parallelization or the use of precomputed tables, although it also makes sense to keep the separate track of these two. Finally, for selecting which algorithm to use, one must consider additional details that we omit here, but which can sometimes cause unexpected overall performance or complicate the design of a real-time detector. They include complexity of the required data structure, buffering needs, and applicability of available hardware components. Basic Breadth-First Algorithms The Viterbi Algorithm (VA) The VA was introduced in 1967 as a method of decoding convolutional codes. Forney showed in 1972 [7] that the VA solves the maximum-likelihood sequence detection (MLSD) problem in the presence of ISI and additive white noise. Kobayashi and Tang [8] recognized that this algorithm is possible to apply in magnetic recording systems for detection purposes. Strategy to combine Viterbi detector with PR equalization in magnetic recording channel resulted with many commercial products. The VA is an optimal decoding algorithm in the sense that it always finds the nearest path to the noisy modification of the FSM output sequence x[0, L + M), and it is quite useful when FSM has a short memory. The key to Viterbi (maximum-likelihood, ML) decoding lies in the Principle of Nonoptimality [17]. If the ′ and i ″ terminate at the same state of the trellis and paths i [0,l [0,l ) )

µ  y [0,l ) , x ′[0,l ) > µ  y [0,l ) , x ″[0,l ) ″ cannot be the first l branches of one of the paths i[0, L+M) that maximize the overall sequence then i [0,l ) metric. This principle which some authors call the Principle of Optimality literally specifies the most efficient MLD procedure for decoding/detecting in the trellis. To apply VA as an ML sequence detector for a PR channel, we need to define the channel trellis describing the amount of controlled ISI. Once we define the PR channel polynomial, it is an easy task. 2 An example of such trellis for PR4 channel with P(D) = 1 − D is depicted in Fig. 34.57. The trellis for this channel consists of four states according to the fact that channel input is binary and channel memory is 2, so that there are four possible state values (00, 10, 01, 11). Generally, if the channel input sequence can take q values, and the PR channel forms the ISI from the past M input symbols, then the PR channel M can be described by a trellis with q states. Branches joining adjacent states are labeled with the pair of 2 expected noiseless symbols in the form channel_output/channel_ input. Equalization to P(D) = 1 − D results in ternary channel output, taking values {0, ±1}. Each noiseless output channel sequence is obtained by reading the sequence of labels along some path through the trellis.

FIGURE 34.57

PR4 channel trellis.

© 2002 by CRC Press LLC

Now the task of detecting i[0,∞) is to find x[0,∞) that is closest to y[0,∞) in the Euclidean sense. Recall that we stated as an assumption that channel noise is AWGN, while in magnetic recording systems after equalization the noise is colored so that the minimum-distance detector is not an optimal one, and additional post-processing is necessary, which will be addressed later in this chapter. The Viterbi algorithm is a classical application of dynamic programming. Structurally, the algorithm M contains q lists, one for each state, where the paths whose states correspond to the label indices are stored, compared, and the best one of them retained. The algorithm can be described recursively as follows: 1. Initial condition: Initialize the starting list with the root node (the known initial state) and set its metric to zero, l = 0. 2. Path extension: Extend all the paths (nodes) by one branch to yield new candidates, l = l + 1, and find the sum of the metric of the predecessor node and the branch metric of the connecting branch M (ADD). Classify these candidates into corresponding q lists (or less for l < M). Each list (except in the head of the trellis) contains q paths. 3. Path selection: For each end-node of extended paths determine the maximum/minimum* of these sums (COMPARE) and assign it to the node. Label the node with the best path metric to it, selecting (SELECT) that path for the next step of the algorithm (discard others). If two or more paths have the same metric, i.e., if they are equally likely, choose the best one at random. Find the best of all ′ δ. the survivor paths, x′[0,l ), and its corresponding information sequence i ′[0,l ) and release the bit i l− Go to step 2. In the description of the algorithm we emphasized three Viterbi-characteristic operations—add, compare, select (ADC)—that are performed in every recursion of the algorithm. So today’s specialized signal processors have this operation embedded optimizing its execution time. Consider now the amount of M “processing” done at each depth l, where all of the q states of the trellis code are present. For each state it is necessary to compare q paths that merge in that state, discard all but the best path, and then compute and send the metrics of q of its successors to the depth l + 1. Consequently, the computational complexity of the VA exponentially increases with M. These operations can be easily parallelized, but then the number of parallel processors rises as the number of node computations decreases. The total time-space complexity of the algorithm is fixed and increases exponentially with the memory length. The sliding window VA decodes infinite sequences with delay of δ branches from the last received one. In order to minimize its memory requirements (δ + 1 trellis levels), and achieve bit error rate only insignificantly higher than with finite sequence VA, δ is chosen as δ ≈ 4M. In this way, the Viterbi detector introduces a fixed decision delay. Example Assume that a recorded channel input sequence x, consisting of L equally likely binary symbols from the alphabet {0, 1}, is “transmitted” over PR4 channel. The channel is characterized by the trellis of Fig. 34.57, i.e., all admissible symbol sequences correspond to the paths traversing the trellis from l = 0 to l = L, with one symbol labeling each branch, Fig. 34.58. 
Suppose that the noisy sequence of samples at the channel output is y = 0.9, 0.2, –0.6, –0.3, 0.6, 0.9, 1.2, 0.3,… If we apply a simple symbol-by-symbol detector to this sequence, the fifth symbol will be erroneous due to the hard quantization rule for noiseless channel output estimate

 – 1 y k < – 0.5  yˆ k =  1 y k > 0.5  0 otherwise 

*It depends on whether the metric or the distance is accumulated.

© 2002 by CRC Press LLC

FIGURE 34.58

Viterbi algorithm detection on the PR4 trellis.

The Viterbi detector will start to search the trellis accumulating branch distance from sequence y. In the first recursion of the algorithm, there are two paths of length 1 at the distance 2

d ( y,0 ) = ( 0.9 – 0 ) = 0.81 2

d ( y,1 ) = ( 0.9 – 1 ) = 0.01 from y. Next, each of the two paths of length 1 are extended in two ways forming four paths of length 2 at squared Euclidean distance from the sequence y 2

d ( y, ( 0,0 ) ) = 0.81 + ( 0.2 – 0 ) = 0.85 2

d ( y, ( 0,1 ) ) = 0.81 + ( 0.2 – 1 ) = 1.45 2

d ( y, ( 1,0 ) ) = 0.01 + ( 0.2 – 0 ) = 0.05 2

d ( y, ( 1,1 ) ) = 0.01 + ( 0.2 – 1 ) = 0.65 and this accumulated distance of four paths labels the four trellis states. In the next loop of the algorithm each of the paths are again extended in two ways to form eight paths of length 3, two paths to each node at level (depth) 3. Node 00 2

d ( y, ( 0,0,0 ) ) = 0.85 + ( −0.6 – 0 ) = 1.21 2

d ( y, ( 1,0, – 1 ) ) = 0.05 + ( −0.6 + 1 ) = 0.21

surviving path

Node 10 2

d ( y, ( 0,0,1 ) ) = 0.85 + ( −0.6 – 1 ) = 3.41 2

d ( y, ( 1,0,0 ) ) = 0.05 + ( −0.6 – 0 ) = 0.41

surviving path

Node 01 2

d ( y, ( 0,1,0 ) ) = 1.45 + ( −0.6 – 0 ) = 1.81 2

d ( y, ( 1,1, – 1 ) ) = 0.65 + ( −0.6 + 1 ) = 0.81

© 2002 by CRC Press LLC

surviving path

Viterbi I 1-D

odd Input 2

Output

even Viterbi II 1-D

FIGURE 34.59 Implementation of 1-D Viterbi detector with two half-rate, 1-D detectors.

1/0 0/0

000

1 0/-

111

1/1

1/1 001

110

0/0

011 0/-1

0/-2

100

1/2

1/0 1/2 0/-2

100

1/0

001

011

0/1

1/0

0/-1

000

1/1

110 0/-1

-1/0

FIGURE 34.60

111

1/0

(1,7) coded EPR4 channel.

Node 11 2

d ( y, ( 0,1,1 ) ) = 1.45 + ( −0.6 – 1 ) = 4.01 2

d ( y, ( 1,1,0 ) ) = 0.65 + ( −0.6 – 0 ) = 1.01

surviving path

Four paths of length 3 are selected as the surviving most likely paths to the four trellis nodes. The procedure is repeated and the detected sequence is produced after a delay of 4M = 8 trellis sections. Note, Fig. 34.58, that the symbol-by-symbol detector error is now corrected. Contrary to this example, a 4-state PR4ML detector is implemented with two interleaved 2-state dicode, (1 − D), detectors each operating at one-half the symbol rate of one full-rate PR4 detector [35]. The sequence is interleaved, such that the even samples go to the first and the odd to the second dicode detector, Fig. 34.59, so the delay D in the interleaved detectors is actually twice the delay of the PR4 detector. A switch at the output resamples the data to get them out in the correct order. For other PR channels this type of decomposition is not possible, so that their complexity can become great for real-time processing. In order to suppress some of the states in the corresponding trellis diagram of those PR systems, thus simplifying the sequence detection process, some data loss has to be introduced. For instance, in conjunction with precoding (1,7) code prohibits two states in EPR4 trellis: [101] and [010]. This can be used to reduce the 8-state EPR4 trellis to 6-state trellis depicted in Fig. 34.60 and the number of add-compare-select units in the VA detector to 4. The data rate loss is 33% in this case. Using the (2,7) code eliminates two more states, paying the complexity gain by a 50% data rate loss. Because VA involves addition, multiplication, compare and select functions, which require complex circuitry at the read side, simplifications of the receiver for certain PRs were sought. One of them is the dynamic threshold technique [22]. This technique implies generating a series of thresholds. The readback samples are compared with them, just as for the threshold detector, and are subsequently included in their modification. While preserving the full function of the ML detector, this technique saves a substantial fraction of the necessary hardware. Examples of dynamic threshold detectors are given in [30] and [6].

© 2002 by CRC Press LLC

Noise-Predictive Maximum Likelihood Detectors (NPLD) Maximum likelihood detection combined with PR equalization is a dominant type of detection electronics in today’s digital magnetic recording devices. As described earlier, in order to simplify hardware realization of the receiver, the degree of the target PR polynomial is chosen to be small with integer coefficients to restrict complexity of Viterbi detection trellis. On the other hand, if the recording density is increased, to produce longer ISI, equalization to the same PR target will result in substantial noise enhancement and detector performance degradation. Straightforward solution is to increase the duration of the target PR polynomial decreasing the mismatch between channel and equalization target. Note that this approach leads to undesirable increase in detector complexity fixing the detector structure in a sense that its target polynomial cannot be adapted to changing channel density. The (NPML) detector [20,32] is an alternative data detection method that improves reliability of the PRML detector. This is achieved by embedding a noise prediction/whitening process into the branch metric computation of a Viterbi detector. Using reduced-state sequence-estimation [43] (see also the description of the generalized VA in this chapter), which limits the number of states in the detector trellis, compensates for added detector complexity. A block diagram of a NPML system is shown in Fig. 34.61. The input to the channel is binary sequence, i, which is written on the disk at a rate of 1/T. In the readback process data are recovered via a lowpass filter as an analog signal y(t), which can be expressed as y(t) = Σ n i n h(t − nT) + u(t), where h(t) denotes the pulse response and u(t) is the additive white Gaussian noise. The signal y(t) is sampled periodically at times t = nT and shaped into the PR target response by the digital equalizer. The NPML detector then performs sequence detection on the PR equalized sequence y and provides an estimate of the binary information sequence i. Digital equalization is performed to fit the overall system transfer function to some PR target, e.g., the PR4 channel. M The output of the equalizer yn + in + Σ i=1 fi xn−i + wn consists of the desired response and an additive total distortion component wn, i.e., the colored noise and residual interference. In conventional PRML detector, an estimate of the recorded sequence is done by the minimum-distance criteria as described for the Viterbi detector. If the mismatch between channel and PR target is significant, the power of distortion component wn can degrade the detector performance. The only additional component compared to the Viterbi detector, NPML noise-predictor, reduces the power of the total distortion by NPVA detector

u(t)

i

Magnetic Recording Channel

+

AGC

Lowpass Filter

y(t)

PR digital equalizer

y

i

Viterbi detector

FIGURE 34.61

Predictor P(D)

.....

.....

t=nT

Block diagram of NPVA detector. N

Whitened PR equalizer output :

yn − ∑ wn−i pi i =1

j

= ( xn −1 xn − 2 )

.....

Channel memory state

xn ( S k ) − xn−2 ( S j ) / xn ( S k )

FIGURE 34.62

NPML metric computation for PR4 trellis.

© 2002 by CRC Press LLC

S k = ( xn xn−1 )

whitening the noise prior to the Viterbi detector. The whitened total distortion component of the PR equalized output yn is N

w n – wˆ n = w n –

∑w i=1

p

n−i i

1 2 N where the N-coefficient MMSE predictor transfer polynomial is P(D) = p1D + p2D + … + pN D . Note ˆ that an estimate of the current noise sample w n is formed based on estimates of previous N noise samples. Assuming the PR4 equalization of sequence y, the metric of the Viterbi detector can be modified in order to compensate for distortion component. In this case, the equalizer output is yn = xn − xn−2 + wn and the NPML distance is

  yn – 

2

N

∑w i=1

 =  yn – 

 p  – ( x n ( S k ) – x n−2 ( S j ) ) 

n−i i N

2

 ( y n−i – xˆ n−i ( S j ) – xˆ n−i−2 ( S j ) )p i – ( x n ( S k ) – x n−2 ( S j ) ) i=1 



where xˆ n−i ( S j ), xˆ n−i−2 ( S j ) represent past decisions taken from the Vitrebi survivor path memory associated with state Sj. The last expression gives the flavor of this technique, but it is not suitable for implementation so that the interested reader can find details in [20] how to modify this equation for RAM look-up realization. Furthermore, in the same paper, a description of the general procedure to compute the predictor coefficients based on the autocorrelation of the total distortion wn at the output of a finite-length PR equalizer is given. Postprocessor As explained earlier, Viterbi detector improves the performance of a read channel by tracing the correct path through the channel trellis [8]. Further performance improvement can be achieved by using soft output Viterbi algorithm (SOVA) [14]. Along with the bit decisions, SOVA produces the likelihood of these decisions, that combined create soft information. In principle, soft information can be passed to hard drive controller and used in RS decoder that resides there, but at the present time soft decoding of RS codes is still too complex to be implemented at 1 Gb/s speeds. Alternatively, much shorter inner code is used. Because of the nonlinear operations on bits performed by the modulation decoder logic, the inner code is used in inverse concatenation with modulation encoder in order to simplify calculation of bit likelihood. Due to the channel memory and noise coloration, Viterbi detector produces some error patterns more often than others [5], and the inner code is designed to correct these so-called dominant error sequences or error events. The major obstacle for using soft information is the speed limitations and hardware complexity required to implement SOVA. Viterbi detector is already a bottleneck and the most complex block in a read channel chip, occupying most of the chip area, and the architectural challenges in implementing even more complex SOVA would be prohibitive. Therefore, a postprocessor architecture is used [18]. The postprocessor is a block that resides after Viterbi detector and comprises the block for calculating error event likelihood and an inner-soft error event correcting decoder. The postprocessor is designed by using the knowledge on the set of dominant error sequences E = {e i } 1≤i and their occurrence probabilities P = (p i ) 1≤i . The index i is referred to as an error type, while the position of the error event end within a codeword is referred as an error position. The relative frequencies of error events will strongly depend on recording density [36]. The detection is based on the fact that we can calculate the likelihoods of each of dominant error sequences at each point in time. The parity bits detect the errors, and provide localization in error type and time. The likelihoods are then used to choose the most likely error events for corrections.

© 2002 by CRC Press LLC

The error event likelihoods are calculated as the difference in the squared Euclidean distances between the signal and the convolution of maximum likelihood sequence estimate and the channel PR, versus that between the signal and the convolution of an alternative data pattern and the channel PR. During each clock cycle, the best M of them are chosen, and the syndromes for these error events are calculated. Throughout the processing of each block, a list is maintained of the N most likely error events, along with their associated error types, positions and syndromes. At the end of the block, when the list of candidate N error events is finalized, the likelihoods and syndromes are calculated for each of ( L ) combinations of Lset candidate error events that are possible. After disqualifying those L-sets of candidates, which overlap in the time domain, and those candidates and L-sets of candidates, which produce a syndrome which does not match the actual syndrome, the candidate or L-set of candidates, which remains and which has the highest likelihood is chosen for correction. Finding the error event position and type completes decoding. The decoder can make two types of errors: it fails to correct if the syndrome is zero, or it makes a wrong correction if the syndrome is nonzero, but the most likely error event or combination of error events does not produce the right syndrome. A code must be able to detect a single error from the list of dominant error events and should minimize the probability of producing zero syndrome when more than one error event occurs in a codeword. Consider a linear code given by an (n − k) × n parity check matrix H. We are interested in capable of correcting or detecting dominant errors. If all errors from a list were contiguous and shorter than m, a cyclic n − k = m parity bit code could be used to correct a single error event [16]; however, in reality, the error sequences are more complex, and occurrence probabilities of error events of lengths 6, 7, 8 or more are not negligible. Furthermore, practical reasons (such as decoding delay, thermal asperities, etc.) dictate using short codes, and consequently, in order to keep the code rate high, only a relatively small number of parity bits is allowed, making the design of error event detection codes nontrivial. The code redundancy must be used carefully so that the code is optimal for a given E. The parity check matrix of a code can be created by a recursive algorithm that adds one column of H at a time using the criterion that after adding each new column, the code error-event-detection capabilities are still satisfied. The algorithm can be described as a process of building a directed graph whose vertices are labeled by the portions of parity check matrix long enough to capture the longest error event, and whose edges are labeled by column vectors that can be appended to the parity check matrix without violating the error event detection capability [4]. To formalize code construction requirements, for each T error event from E, denote by si,l a syndrome of error vector σl (ei) (si,l = σl (ei) · H ), where σl (ei) is an l-time shifted version of error event ei. The code should be designed in such a way that any shift of any dominant error sequence produces a nonzero syndrome, i.e., that s i,l ≠ 0 for any 1 ≤ i ≤ I and 1 ≤ l ≤ n. In this way, a single error event can be detected (relying on error event likelihoods to localize the error event). 
The correctable shifts must include negative shifts as well as shifts larger than n in order to cover those error events that straddle adjacent codewords, because the failure to correct straddling events significantly affects the performance. A stronger code could have a parity check matrix that guaranties that syndromes of any two-error event-error position pairs ((i1, l1), (i2, l2)) are different, i.e., s i1 ,l1 ≠ s i2 ,l2 . This condition would result in a single error event correction capability. The codes capable of correcting multiple error events can be defined analogously. We can even strengthen this property and require that for any two shifts and any two dominant error events, the Hamming distance between any pair of syndromes is larger than δ ; however, by strengthening any of these requirements the code rate decreases. If Li is a length of the ith error event, and if L is the length of the longest error event from E, (L = max 1≤ i ≤ I { L i }), then it is easy to see that for a code capable of detecting an error event from E that ends at position j, the linear combination of error events and the columns of H from j − L + 1 to j has to be nonzero. More precisely, for any i and any j (ignoring the codeword boundary effects)



1≤m≤L i

T

e i,m ⋅ h j−Li +m ≠ 0

where ei,m is the mth element of the error event ei, and hj is the jth column of H.

© 2002 by CRC Press LLC

Advanced Algorithms and Algorithms under Investigation This subsection gives a brief overview of less complex procedures for searching the trellis. It is intended to give background information that can be used in future development if it shows up that NPVA detectors and postprocessing are not capable of coping with ever-increasing storage densities and longer PRs needed for them. In such cases, a resort has to be made to some sort of reduced complexity suboptimal algorithms, whose performance is close to optimal. Explained algorithms are not yet implemented in commercial products, but all of them are a natural extension of already described procedures for searching the trellis. Other Breadth-First Algorithms The M-Algorithm Since most survivors in the VA usually possess much smaller metrics than does the best one, all the states or nodes kept are not equally important. It is intuitively reasonable to assume that unpromising survivors can be omitted with a negligible probability of discarding the best one. The M-algorithm [10] is one M such modification of the VA; all candidates are stored in a single list and the best M ≤ q survivors are selected from the list in each cycle. The steps of the M-algorithm are: 1. Initial condition: Initialize the list with the root node and set its metric to zero. 2. Path extension: Extend all the paths of length l by one branch and classify all contenders (paths of length l + 1) into the list. If two or more paths enter the same state keep the best one. 3. Path selection: From the remaining paths find the best M candidates and delete the others. If l = L + M, take the only survivor and transfer its corresponding information sequence to the output (terminated case, otherwise use the sliding window variation). Otherwise, go to step 2. Defined in this way, the M-algorithm performs trellis search, while, when the state comparison in step 2 is omitted, it searches the tree, saving much time on comparisons but with slightly increased error probability. When applied to decoding/detecting infinitely long sequences, it is usual that comparisons performed in step 2 are substituted with the so-called ambiguity check [10] and a release of one decoded branch. In each step this algorithm performs M node computations, and employing any sifting procedure (since the paths need not be sorted) perform ∼Mq metric comparisons. If performed, the Viterbi-type 2 discarding of step 2 requests ∼M q state and metric comparisons. This type of discarding can be performed with ∼M log2 M comparisons (or even linearly) but than additional storage must be provided. The space complexity grows linearly with the information frame length L and parameter M. The Generalized Viterbi Algorithm In contrast to the VA, which is a multiple-list single survivor algorithm, the M-algorithm is a single-list multiple-survivor algorithm. The natural generalization to a multiple-list multiple-survivor algorithm was first suggested by Hashimoto [39]. Since all the lists are not equally important, this algorithm, M originally called the generalized Viterbi algorithm (GVA), utilizes only q 1 lists (labels), where M1 ≤ M. M−M 1 +1 In each list from all q paths, it retains the best M1 candidates. The algorithm can be described as follows: 1. Initial condition: Initialize the starting label with the root node and set its metric to zero. 2. Path extension: Extend all the paths from each label by one branch and classify all successors into the appropriate label. 
If two or more paths enter the same state keep the best one. 3. Path selection: From the remaining paths of each label find the best M1 and delete the others. If l = L + M, take the only survivor and transfer its information sequence to the output (for the terminated case, otherwise use the sliding window variant). Go to step 2. When M1 = M, and M1 = 1, the GVA reduces to the VA, and for M1 = 0, M1 = M it reduces to the M-algorithm. Like the M-algorithm, GVA in each step performs M1 node computations per label, and employing any sifting procedure ∼M1 q metric comparisons. If performed, the Viterbi-type discarding of 2 step 2 requests ∼ M 1q or less state and metric comparisons per label.

© 2002 by CRC Press LLC

Metric-First Algorithms Metric-first and depth-first sequential detection is a name for a class of algorithms that compare paths according to their Fano metric (one against another or with some thresholds) and on that basis decide which node to extend next, which to delete in metric first procedures or whether to proceed with current branch or go back. These algorithms generally extend fewer nodes for the same performance, but have increased sorting requirements. Sequential detecting algorithms have a variable computation characteristic that results in large buffering requirements, and occasionally large detecting delays and/or incomplete detecting of the received sequence. Sometimes, when almost error-free communication is required or when retransmission is possible, this variable detecting effort can be an advantage. For example, when a detector encounters an excessive number of computations, it indicates that a frame is possibly very corrupted meaning that the communication is insufficiently reliable and can ultimately cause error patterns in detected sequence. In such situations the detector gives up detecting and simply requests retransmission. These situations are commonly called erasures, and detecting incomplete. A complete decoder such as the Viterbi detector/decoder would be forced to make an estimate, which may be wrong. The probability of buffer overflow is several orders of magnitude larger than the probability of incorrect decision when the decoder operates close to the so-called (computational) cutoff rate. The performance of sequential detecting has traditionally been evaluated in terms of three characteristics: the probability of sequence error, the probability of failure (erasure), and the Pareto exponent associated with detecting effort. The Stack Algorithm The stack (or ZJ) algorithm was for the first time suggested by Zigangirov [1] and later independently by Jelinek [1]. As its name indicates, the algorithm contains a stack (in fact, a list) of already searched paths of varying lengths, ordered according to their metric values. At each step, the path at the top of the stack (the best one) is replaced by its q successors extended by one branch, with correspondingly augmented metrics. The check whether two or more paths are in the same state is not performed. This algorithm has its numerous variations and we first consider the basic version that is closest to Zigangirov’s: 1. Initial condition: Initialize the stack with the root node and set its Fano metric to zero (or some large positive number to avoid arithmetic with negative numbers, but low enough to avoid overflow). 2. Path extension: Extend the best path from the stack by one branch, delete it, sort all successors, and then merge them with the stack so that it is ordered according to the path metrics. 3. Path selection: Retain the best Z paths according to the Fano metric. If the top path has the length l = L + M branches, transfer its information sequence to the output (terminated case; otherwise, a sliding window version has to be used); otherwise, go to step 2. It is obvious that this algorithm does not consider path merging since the probability that the paths of the same depth and the same state can be stored in the stack simultaneously is rather small. Nonetheless, K some authors [1] propose that a following action should be added to the step 2: If any of the 2 new paths merges with a path already in the stack, keep the one with the higher metric. 
The stack algorithm is based on the nonselection principle [17]. If the paths i ′[ 0,L+M ) and i ″[ 0,L+M ) through the tree diverge at depth j and

min{u  x ′[ 0,l ) , y [ 0,l ) } > min{u  x ″[ 0,l ) , y [ 0,l ) }   l ∈ [ j+1,L+M )   l ∈[ j+1,L+M ) then i ″[ 0, L+M ) cannot be the path at the top of the stack when the stack algorithm stops.

© 2002 by CRC Press LLC

The computational complexity of the stack algorithm is almost unaffected by the code memory length, but well depends on the channel performance. Its computational complexity is a random variable and so is its stack size if not otherwise limited. The upper bound on the computational complexity is given by

P[C ≥ η] < Aη

−ρ

0 n − k + 1 we would contradict Corollary 1. 䊐 Codes meeting the Singleton bound are called maximum distance separable (MDS). In fact, except for trivial cases, binary codes are not MDS. In order to obtain MDS codes, we will define codes over larger fields, like the so-called Reed Solomon codes, to be described later in the chapter. A second bound is also given relating the redundancy and the minimum distance of an [n, k, d] code the so-called Hamming or volume bound. Let us denote by V(r) the number of elements in a sphere of n radius r whose center is an element in GF(2) . It is easy to verify that r

V(r) =

n

∑  i 

(34.52)

i=0

We then have: Lemma 4 (Hamming bound) Let C be a linear [n, k, d] code, then

n – k ≥ log 2 V ( ( d – 1 )/2 )

(34.53)

Proof: Notice that the 2 spheres with the 2 codewords as centers and radius ( d – 1 )/2 are disjoint. k The total number of vectors contained in these spheres is 2 V ( ( d – 1 )/2 ) . This number has to be smaller than or equal to the total number of vectors in the space, i.e., k

k

n

k

2 ≥ 2 V ( ( d – 1 )/2 ) Inequality (34.53) follows immediately from Eq. (34.54).

(34.54) 䊐

A perfect code is a code for which inequality Eq. (34.53) is in effect equality. Geometrically, a perfect k code is a code for which the 2 spheres of radius ( d – 1 )/2 and the codewords as centers cover the whole space.

© 2002 by CRC Press LLC

Not many perfect codes exist. In the binary case, the only nontrivial linear perfect codes are the Hamming codes (to be presented in the next subsection) and the [23,12,7] Golay code. For details, the reader is referred to [4].

Syndrome Decoding, Hamming Codes, and Capacity of the Channel This subsection studies the first important family of codes, the so-called Hamming codes. As will be shown, Hamming codes can correct up to one error. Let C be an [n, k, d] code with parity check matrix H. Let u be a transmitted vector and r a possibly corrupted received version of u . We say that the syndrome of r is the vector s of length n − k given by

s = rH

T

(34.55)

Notice that, if no errors occurred, the syndrome of r is the zero vector. The syndrome, however, tells us more than a vector being in the code or not. For instance, as before, that u was transmitted and r was received, where r = u ⊕ e , e an error vector. Notice that, T

T

T

T

s = r H = ( u ⊕ e )H = u H ⊕ e H = e H

T

because u is in C. Hence, the syndrome does not depend on the received vector but on the error vector. In the next lemma, we show that to every error vector of weight ≤(d − 1)/2 corresponds a unique syndrome. Lemma 5 Let C be a linear [n, k, d] code with parity check matrix H. Then, there is a 1-1 correspondence between errors of weight ≤(d − 1)/2 and syndromes. T

Proof: Let e 1 and e 2 be two distinct error vectors of weight ≤(d − 1)/2 with syndromes s 1 = e 1H and s 2 = e HT. If s 1 = s 2 , then s = ( e ⊕ e )Η T = s 1 ⊕ s 2 = 0 , hence e ⊕ e ∈ C . But e ⊕ e has weight 2 1 2 1 2 1 2 ≤d − 1, a contradiction. 䊐 Lemma 5 gives the key for a decoding method that is more efficient than exhaustive search. We can construct a table with the 1-1 correspondence between syndromes and error patterns of weight ≤(d − 1)/2 and decode by look-up table. In other words, given a received vector, we first find its syndrome and then we look in the table to which error pattern it corresponds. Once we obtain the error pattern, we add it to the received vector, retrieving the original information. This procedure may be efficient for small codes, but it is still too complex for large codes. Example 3 Consider the code whose parity matrix H is given by (34.51). We have seen that this is a [5, 2, 3] code. We have six error patterns of weight ≤1. The 1-1 correspondence between these error patterns and the syndromes can be immediately verified to be

00000 ↔ 000 10000 ↔ 011 01000 ↔ 110 00100 ↔ 100 00010 ↔ 010 00001 ↔ 001 T

For instance, assume that we receive the vector r = 10111. We obtain the syndrome s = r H = 100. Looking at the table above, we see that this syndrome corresponds to the error pattern e = 00100. Adding this error pattern to the received vector, we conclude that the transmitted vector was r ⊕ e = 10011. 䊐

© 2002 by CRC Press LLC

r

r

Given a number r of redundant bits, we say that a [2 − 1, 2 − r − 1, 3] Hamming code is a code r having an r × (2 − 1) parity check matrix H such that its columns are all the different nonzero vectors of length r. A Hamming code has minimum distance 3. This follows from its definition and Corollary 1. Notice that any two columns in H, being different, are linearly independent. Also, if we take any two different columns and their sum, these three columns are linearly dependent, proving our assertion. A natural way of writing the columns of H in a Hamming code, is by considering them as binary numbers on base 2 in increasing order. This means, the first column is 1 on base 2, the second column r T is 2, and so on. The last column is 2 − 1 on base 2, i.e., (1, 1,…, 1) . This parity check matrix, although nonsystematic, makes the decoding very simple. In effect, let r be a received vector such that r = v ⊕ e, where v was the transmitted codeword and e T is an error vector of weight 1. Then, the syndrome is s = e H , which gives the column corresponding to the location in error. This column, as a number on base 2, tells us exactly where the error has occurred, so the received vector can be corrected. Example 4 Consider the [7, 4, 3] Hamming code C with parity check matrix

 0 0 0 1 1 1 1   H =  0 1 1 0 0 1 1  1 0 1 0 1 0 1  

(34.56)

T

Assume that vector r = 1100101 is received. The syndrome is s = r H = 001, which is the binary representation of the number 1. Hence, the first location is in error, so the decoder estimates that the transmitted vector was v = 0100101. 䊐 We can obtain 1-error correcting codes of any length simply by shortening a Hamming code. This procedure works as follows: assume that we want to encode k information bits into a 1-error correcting r r code. Let r be the smallest number such that k ≤ 2 − r − 1. Let H be the parity check matrix of a [2 − 1, r r 2 − r − 1, 3] Hamming code. Then construct a matrix H′ by eliminating some 2 − r − 1 − k columns from H. The code whose parity check matrix is H′ is a [k + r, k, d] code with d ≥ 3, hence it can correct one error. We call it a shortened Hamming code. For instance, the [5,2,3] code whose parity check matrix is given by (34.51) is a shortened Hamming code. In general, if H is the parity check matrix of a code C, H ′ is a matrix obtained by eliminating a certain number of columns from H and C ′ is the code with parity check matrix H ′, we say that C ′ is obtained by shortening C. r r r r A [2 − 1, 2 − r − 1, 3] Hamming code can be extended to a [2 , 2 − r − 1, 4] Hamming code by r adding to each codeword a parity bit, that is, the exclusive-OR of the first 2 − 1 bits. The new code is called an extended Hamming code. So far, we have not talked about probabilities of errors. Assume that we have a binary symmetric channel (BSC), i.e., the probability of a 1 becoming a 0 or of a 0 becoming a 1 is p < .5. Let Perr be the probability of error after decoding using a code, i.e., the output of the decoder does not correspond to the originally transmitted information vector. A fundamental question is the following: given a BSC with bit error probability p, does it exist a code of high rate that can arbitrarily lower Perr? The answer, due to Shannon, is yes, provided that the code has rate below a parameter called the capacity of the channel, as defined next. Definition 2 Given a BSC with probability of bit error p, we say that the capacity of the channel is

C ( p ) = 1 + p log 2 p + ( 1 – p ) log 2 ( 1 – p )

© 2002 by CRC Press LLC

(34.57)

Theorem 1 (Shannon) For any ⑀ > 0 and R < C(p), there is an [n, k] binary code of rate k/n ≥ R with Perr < ⑀. For a proof of Theorem 1 and some of its generalizations, the reader is referred to [5], or even to Shannon’s original paper [6]. Theorem 1 has enormous theoretical importance. It shows that reliable communication is not limited in the presence of noise, only the rate of communication is. For instance, if p = .01, the capacity of the channel is C(.01) = .9192. Hence, there are codes of rate ≥.9 with Perr arbitrarily small. It also tells us not to look for codes with rate .92 making Perr arbitrarily small. The proof of Theorem 1, though, is based on probabilistic methods and the assumption of arbitrarily large values of n. In practical applications, n cannot be too large. The theorem does not tell us how to construct efficient codes, it just asserts their existence. Moreover, when we construct codes, we want them to have efficient encoding and decoding algorithms. In the last few years, coding methods approaching the Shannon limit have been developed, the so-called turbo codes. Although great progress has been made towards practical implementations of turbo codes, in applications like magnetic recording their complexity is still a problem. A description of turbo codes is beyond the scope of this introduction. The reader is referred to [2].

Codes over Bytes and Finite Fields So far, we have considered linear codes over bits. Next we want to introduce codes over larger symbols, ν mainly over bytes. A byte of size ν is a vector of ν bits. Mathematically, bytes are vectors in GF(2) . Typical cases in magnetic and optical recording involve 8-bit bytes. Most of the general results in the previous sections for codes over bits easily extend to codes over bytes. It is trivial to multiply bits, but we need a method to multiply bytes. To this end, the theory of finite fields has been developed. Next we give a brief introduction to the theory of finite fields. For a more complete treatment, the reader is referred to chapter 4 of [4]. We know how to add two binary vectors, we simply exclusive-OR them componentwise. What we need now is a rule that allows us to multiply bytes while preserving associative, distributive, and multiplicative inverse properties, i.e., a product that gives to the set of bytes of length ν the structure of a field. To this end, we will define a multiplication between vectors that satisfies the associative and commutative properties, it has a 1 element, each nonzero element is invertible and it is distributive with respect to the sum operation. Recall the definition of the ring Zm of integers modulo m: Zm is the set {0, 1, 2,…, m − 1}, with a sum and product of any two elements defined as the residue of dividing by m the usual sum or product. It is not difficult to prove that Zm is a field if and only if m is a prime number. Using this analogy, we will ν give to (GF(2)) the structure of a field. ν Consider the vector space (GF(2)) over the field GF(2). We can view each vector as a polynomial of degree ≤ν − 1 as follows: the vector a = (a0, a1,…, aν −1) corresponds to the polynomial a(α) = a0 + a1α ν−1 +…+ αν−1 α . ν ν The goal is to give to (GF(2)) the structure of a field. We will denote such a field by GF(2 ). The sum ν ν in GF(2 ) is the usual sum of vectors in (GF(2)) . We need now to define a product. Let f(x) be an irreducible polynomial (i.e., it cannot be expressed as the product of two polynomials of smaller degree) of degree ν whose coefficients are in GF(2). Let a(α) and b(α) be two elements of ν ν GF(2 ). We define the product between a(α) and b(α) in GF(2 ) as the unique polynomial c(α) of degree ≤ν − 1 such that c(α) is the residue of dividing the product a(α)b(α) by f(α) (the notation g(x)  h(x) (mod f(x)) means that g(x) and h(x) have the same residue after dividing by f(x), i.e., g(α) = h(α)). ν The sum and product operations defined above give to GF(2 ) a field structure. The role of the irreducible polynomial f(x) is the same as the prime number m when Zm is a field. In effect, the proof ν that GF(2 ) is a field when m is irreducible is essentially the same as the proof that Zm is a field when m ν is prime. From now on, we denote the elements in GF(2 ) as polynomials in α of degree ≤ν − 1 with coefficients in GF(2). Given two polynomials a(x) and b(x) with coefficients in GF(2), a(α)b(α) denotes

© 2002 by CRC Press LLC

3

TABLE 34.3

The Finite Field GF(8) Generated by 1 + x + x

Vector

Polynomial

Power of α

Logarithm

000 100 010 001 110 011 111 101

0 1 α α2 1+α α + α2 2 1 + α+ α 2 1+α

0 1 α α2 α3 α4 α5 α6

−∞ 0 1 2 3 4 5 6

ν

the product in GF(2 ), while a(x)b(x) denotes the regular product of polynomials. Notice that, for the ν irreducible polynomial f(x), in particular, f(α) = 0 in GF(2 ), since f(x)  0(mod f(x)). ν So, the set GF(2 ) given by the irreducible polynomial f(x) of degree ν is the set of polynomials of degree ≤ν − 1, where the sum operation is the regular sum of polynomials, and the product operation is the residue of dividing by f(x) the regular product of two polynomials. Example 5 Construct the field GF(8). Consider the polynomials of degree ≤2 over GF(2). Let f(x) = 1 + 3 x + x . Since f(x) has no roots over GF(2), it is irreducible (notice that such an assessment can be made 3 only for polynomials of degree 2 or 3). Let us consider the powers of α modulo f(α). Notice that α = 3 4 3 2 5 4 2 α + f(α ) = 1 + α. Also, α = αα = α (1 + α) = α + α . Similarly, we obtain α = αα = α(α + α ) = α 2 + α 3 = 1 + α + α 2, and α 6 = αα 5 = α + α 2 + α 3 = 1 + α 2. Finally, α 7 = αα 6 = α + α 3 = 1. Note that every nonzero element in GF(8) can be obtained as a power of the element α. In this case, α is called a primitive element and the irreducible polynomial f(x) that defines the field is called a primitive polynomial. It can be proven that it is always the case that the multiplicative group of a finite field is cyclic, so there is always a primitive element. A convenient description of GF(8) is given in Table 34.3. The first column in Table 34.3 describes the element of the field in vector form, the second one as a polynomial in α of degree ≤2, the third one as a power of α, and the last one gives the logarithm (also called Zech logarithm): it simply indicates the corresponding power of α. As a convention, we denote by − ∞ the logarithm corresponding to the element 0. 䊐 It is often convenient to express the elements in a finite field as powers of α ; when we multiply two ν of them, we obtain a new power of α whose exponent is the sum of the two exponents modulo 2 − 1. ν Explicitly, if i and j are the logarithms of two elements in GF(2 ), then their product has logarithm i + j ν (mod (2 − 1)). In the example above, if we want to multiply the vectors 101 and 111, we first look at their logarithms. They are 6 and 5, respectively, so the logarithm of the product is 6 + 5(mod 7) = 4, corresponding to the vector 011. In order to add vectors, the best way is to express them in vector form and add coordinate to coordinate in the usual way.

Cyclic Codes In the same way we defined codes over the binary field GF(2), we can define codes over any finite field ν ν n GF(2 ). Now, a code of length n is a subset of (GF(2 )) , but since we study only linear codes, we require that such a subset is a vector space. Similarly, we define the minimum (Hamming) distance and the generator and parity check matrices of a code. Some properties of binary linear codes, like the Singleton bound, remain the same in the general case. Others, such as the Hamming bound, require some modifications. ν Consider a linear code C over GF(2 ) of length n. We say that C is cyclic if, for any codeword (c0, c1, …, cn−1) 苸 C, then (cn−1, c0, c1,…, cn−2) 苸 C. In other words, the code is invariant under cyclic shifts to the right.

© 2002 by CRC Press LLC

ν

If we write the codewords as polynomials of degree −2 .

PSUB.sss Ra, Ra, Rb PADD.sss Ra, Ra, Rb

Clip ai to within the arbitrary range 15 [vmin, vmax], where –2 < vmin < vmax < 15 2 − 1.

Clip the signed integer ai to an unsigned integer within the range [0, vmax], 16 where vmax < 2 − 1. ci = max(ai, bi) Packed maximum operation

15

Rb contains the value (−2 + vmin). If ai < vmin, this 15 operation clips ai to −2 at the low end. ai is at least vmin . 15

PADD.sss Ra, Ra, Rb

Rb contains the value (2 − 1 − vmax). This operation 15 clips ai to 2 − 1 on the high end.

PSUB.sss Ra, Ra, Rd

Rd contains the value (2 − 1 − vmax + 2 − vmin). This 15 operation clips ai to −2 at the low end. 15 Re contains the value (−2 + vmin). This operation clips ai to vmax at the high end and to vmin at the low end.

PADD.sss Ra, Ra, Re Clip the signed integer ai to an unsigned integer within the range [0, vmax], 15 where 0 < vmax < 2 − 1.

15

15

15

15

PADD.sss Ra, Ra, Rb

Rb contains the value (2 − 1 − vmax). This operation 15 clips ai to 2 − 1 at the high end.

PSUB.uus Ra, Ra, Rb

This operation clips ai to vmax at the high end and to zero at the low end. If ai < 0, then ai = 0 else ai = ai. If ai was negative, it gets clipped to zero, else remains same. If ai > bi, then ci = (ai − bi) else ci = 0.

PADD.uus Ra , Ra , 0

PSUB.uuu Rc , Ra , Rb PADD Rc , Rb , Rc

If ai > bi, then ci = ai else ci = bi.

ci = |ai − bi|

PSUB.uuu Re, Ra, Rb

If ai > bi, then ei = (ai − bi) else ei = 0.

Packed absolute difference operation

PSUB.uuu Rf , Rb, Ra PADD Rc , Re , Rf

If ai bi, then ci = |ai − bi|, else ci = |bi − ai|.

Note: ai and bi are the subwords in the registers Ra and Rb, respectively, where i = 1, 2, …, k, and k denotes the number of subwords in a register. Subword size n, is assumed to be two bytes (i.e., n = 16) for this table.

TABLE 39.2 Summary of the Integer Register, Subword Sizes, and Subtraction Options Supported by the Different Architectures Architectural Feature Size of integer registers (bits) Supported subword sizes (bytes) Modular arithmetic Supported saturation options

IA-64

MAX-2

MMX

SSE-2

AltiVec

64 1, 2, 4 Y sss, uuu, uus for 1, 2 byte

64 2 Y sss, uus for 2 byte

64 1, 2, 4 Y sss, uuu for 1, 2 byte

128 1, 2, 4, 8 Y sss, uuu for 1, 2 byte

128 1, 2, 4 Y uuu, sss for 1, 2, 4 byte

Packed Average Packed average instructions are very common in media applications such as pixel averaging in MPEG-2 encoding, motion compensation, and video scaling. In a packed average, the pairs of corresponding subwords in the two source registers are added to generate intermediate sums. Then, the intermediate sums are shifted right by one bit, so that any overflow bit is shifted in on the left as the most significant bit. The beauty of the average operation is that no overflow can occur, and two operations (add followed by a one bit right shift) are performed in one operation. In a packed average instruction, 2n operations are performed in a single cycle, where n is the number of subwords. In fact, even more operations are performed in a packed average instruction, if the rounding applied to the least significant end of the result is considered. Here, two different rounding options have been used:

© 2002 by CRC Press LLC

TABLE 39.3

Summary of the packed add and packed subtract Instructions and Variants

Integer Operations

IA-64

MAX-2

MMX

SSE-2

√ √ √ √ √ √

√ √ √ √ √

√ √ √ √



c i = ai + b i ci = ai + bi (with saturation) c i = ai − b i ci = ai − bi (with saturation) ci = average(ai , bi) ci = average(ai , − bi) [c2i, c2i+1] = [a2i + a2i+1, b2i + b2i+1] lsbit(ci) = carryout(ai + bi) lsbit(ci) = carryout(ai − bi) ci = compare(ai , bi) Move mask ci = max(ai , bi) ci = min(ai , bi)

√ √

√ a √

c = Σ ai – bi





a



3DNow!

AltiVec



√ √ √ √ √

√ √

√ √ √ √

√ a

a

√ √ √

√ √ √





√ √

This operation is realized by using saturation arithmetic.

FIGURE 39.9

PAVG Rc, Ra, Rb: Packed average instruction using the round away from zero option.

• Round away from zero: A one is added to the intermediate sums, before they are shifted to the right by one bit position. If carry bits were generated during the addition operation, they are inserted into the most significant bit position during the shift right operation (see Fig. 39.9). • Round to odd: Instead of adding one to the intermediate sums, a much simpler OR operation is used. The intermediate sums are directly shifted right by one bit position, and the last two bits of each of the subwords of the intermediate sums are ORed to give the least significant bit of the final result. This makes sure that the least significant bit of the final results are set to 1 (odd) if at least one of the two least-significant bits of the intermediate sums are 1 (see Fig. 39.10).

© 2002 by CRC Press LLC

FIGURE 39.10 PAVG Rc, Ra, Rb: Packed average instruction using the round to odd option. (From Intel, IA-Architecture Software Developer’s Manual, Vol. 3, Instruction Set Reference, Rev. 1.1, July 2000. With permission.)

This rounding mode also performs unbiased rounding under the following assumptions. If the intermediate result is uniformly distributed over the range of possible values, then half of the time the bit shifted out is zero, and the result remains unchanged with rounding. The other half of the time the bit shifted out is one: if the next least significant bit is one, then the result loses –0.5, but if the next least significant bit is a zero, then the result gains +0.5. Because these cases are equally likely with a uniform distribution of the result, the round to odd option tends to cancel out the cumulative averaging errors that may be generated with repeated use of the averaging instruction. Accumulate Integer Sometimes, it is useful to add adjacent subwords in the same register. This can, for example, facilitate the accumulation of streaming data. An accumulate integer instruction performs an addition of the subwords in the same register and places the sum in the upper half of the target register, while repeating the same process for the second source register and using the lower half of the target register (Fig. 39.11). Save Carry Bits This instruction saves the carry bits from a packed add operation, rather than the sums. Figure 39.12 shows such a save carry bits instruction in AltiVec: a packed add is performed and the carry bits are written to the least significant bit of each result subword in the target register. A similar instruction saves the borrow bits generated when performing packed subtract instead of packed add. Packed Compare Instructions Sometimes, it is necessary to compare pairs of subwords. In a packed compare instruction, pairs of subwords are compared according to the relation specified by the instruction. If the condition is true for a subword pair, the corresponding field in the target register is written with a 1-mask. If the condition is

© 2002 by CRC Press LLC

FIGURE 39.11

ACC Rc, Ra, Rb: Accumulate integer working on registers with two subwords.

FIGURE 39.12

Save carry bits instruction.

false, the corresponding field in the target register is written with a 0-mask. Alternatively, a true or false bit is generated for each subword, and this set of bits is written into the least significant bits of the result register. Some of the architectures have compare instructions that allow comparison of two numbers for 3 all of the 10 possible relations, whereas others only support a subset of the most frequent relations. A typical packed compare instruction is shown in Fig. 39.13 for the case of four subwords. When a mask of bits is generated as in Fig. 39.13, often a move mask instruction is also provided. In a move mask instruction, the most significant bits of each of the subwords are picked, and these bits are placed into the target register, in a right aligned field (see Fig. 39.14). In different algorithms, either the subword mask format generated in Fig. 39.13 or the bit mask format generated in Fig. 39.14 is more useful. Two common comparisons used are finding the larger of a pair of numbers, or the smaller of a pair of numbers. In the packed maximum instruction, the greater of the subwords in the compared pair gets written to the corresponding subword in the target register (see Fig. 39.15). Similarly, in the packed minimum instruction, the smaller of the subwords in the compared pair gets written to the corresponding subword in the target register. As described in the earlier section on saturation arithmetic, instead of special instructions for packed maximum and packed minimum, MAX-2 performs packed maximum and 3

Two numbers a and b can be compared for one of the following 10 possible relations: equal, less-than, less-thanor-equal, greater-than, greater-than-or-equal, not-equal, not-less-than, not-less-than-or-equal, not-greater-than, not-greater-than-or-equal. Typical notation for these relations are as follows respectively: =, =, !=, !=.

© 2002 by CRC Press LLC

FIGURE 39.13

Packed compare instruction. Bit masks are generated as a result of the comparisons made.

FIGURE 39.14

Move mask Rb, Ra.

FIGURE 39.15

Packed maximum instruction.

© 2002 by CRC Press LLC

FIGURE 39.16

SAD Rc, Ra, Rb: Sum of absolute differences instruction.

packed minimum operations by using packed add and packed subtract instructions with saturation arithmetic (see Fig. 39.8). An ALU can be used to implement comparisons, maximum and minimum instructions with a subtraction operation; comparisons for equality or inequality is usually done with an exclusive-or operation, also available in most ALUs. Sum of Absolute Differences A more complex, multi-cycle instruction is the sum of absolute differences (SAD) instruction (see Fig. 39.16). This is used for motion estimation in MPEG-1 and MPEG-2 video encoding, for example. In a SAD instruction, the two packed operands are subtracted from one another. Absolute values of the resulting differences are then summed up. Although useful, the SAD instruction is a multi-cycle instruction with a typical latency of three cycles. This can complicate the pipeline control of otherwise single cycle integer pipelines. Hence, minimalist multimedia instruction sets like MAX-2 do not have SAD instructions. Instead, MAX-2 uses generic packed add and packed subtract instructions with saturation arithmetic to perform the SAD operation (see Fig. 39.8(b) and Table 39.1).

Packed Multiply Instructions

Multiplication of Two Packed Integer Registers

The main difficulty with packed multiplication of two n-bit integers is that the product is twice as long as each operand. Consider the case where the register size is 64 bits and the subwords are 16 bits. The result of the packed multiplication will be four 32-bit products, which cannot be accommodated in a single 64-bit target register. One solution is to use two packed multiply instructions. Figure 39.17 shows a packed multiply high instruction, which places only the more significant upper halves of the products into the target register.


FIGURE 39.17 Packed multiply high instruction.

FIGURE 39.18 Packed multiply low instruction.

Figure 39.18 shows a packed multiply low instruction, which places only the less significant lower halves of the products into the target register. IA-64 generalizes this with its packed multiply and shift right instruction (see Fig. 39.19), which does a parallel multiplication followed by a right shift. Instead of being able to choose either the upper or the lower half of the products to be put into the target register, it allows multiple different 16-bit fields from each of the 32-bit products to be chosen and placed in the target register. Ideally, saturation

In IA-64 the right-shift amounts are limited to 0, 7, 15, or 16 bits, so that only 2 bits in the packed multiply and shift right instruction are needed to encode the four shift amounts.


FIGURE 39.19 The generalized packed multiply and shift right instruction.

arithmetic is applied to the shifted products, to guard against the loss of significant "1" bits in selecting the 16-bit results. IA-64 also allows the full product to be saved, but for only half of the pairs of source subwords. Either the odd or the even indexed subwords are multiplied. This makes sure that only as many full products as can be accommodated in one target register are generated. These two variants, the packed multiply left and packed multiply right instructions, are depicted in Figs. 39.20 and 39.21.

Another variant is the packed multiply and accumulate instruction. Normally, a multiply and accumulate operation requires three source registers. The PMADDWD instruction in MMX requires only two source registers by performing a packed multiply followed by an addition of two adjacent subwords (see Fig. 39.22). Instructions in the AltiVec architecture may have up to three source registers. Hence, AltiVec's packed multiply and accumulate uses three source registers. In Fig. 39.23, the instruction packed multiply high and accumulate starts just like a packed multiply instruction, selects the more significant halves of the products, then performs a packed add of these halves and the values from a third register. The instruction packed multiply low and accumulate is the same, except that only the less significant halves of the products are added to the subwords from the third register.

Multiplication of a Packed Integer Register by an Integer Constant

Many multiplications in multimedia applications are with constants, instead of variables. For example, in the inverse discrete cosine transform (IDCT) used in the compression and decompression of JPEG images and MPEG-1 and MPEG-2 video, all the multiplications are by constants. This type of multiplication can be further optimized for simpler hardware, lower power, and higher performance simultaneously by using


FIGURE 39.20 Packed multiply left instruction where only the odd indexed subwords of the two source registers are multiplied.

FIGURE 39.21 Packed multiply right instruction where only the even indexed subwords of the two source registers are multiplied.

packed shift and add instructions [14,15,20]. Shifting a register left by n bits is equivalent to multiplying it by 2^n. Since a constant number can be represented as a binary sequence of ones and zeros, using this number as a multiplier is equivalent to a left shift of the multiplicand by n bits for each nth position where there is a 1 in the multiplier, followed by an add of each shifted value to the result register. As an example, consider multiplying the integer register Ra by the constant C = 11. The following instruction sequence performs this multiplication. Assume Ra initially contains the value 6.

Initial values: C = 11 = 1011 (binary) and Ra = 6 = 0110 (binary)

Instruction               Operation        Result
Shift left 1 bit Rb,Ra    Rb = Ra << 1     Rb = 12
Add Rb,Rb,Ra              Rb = Rb + Ra     Rb = 18
Shift left 3 bit Rc,Ra    Rc = Ra << 3     Rc = 48
Add Rb,Rb,Rc              Rb = Rb + Rc     Rb = 66
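The same sequence can be written directly in C; the function name is ours, and the comments map each statement to the corresponding instruction above.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply Ra by the constant 11 (binary 1011) using only shifts and adds,
 * mirroring the instruction sequence above. */
static int32_t mul_by_11(int32_t ra)
{
    int32_t rb = ra << 1;   /* Shift left 1 bit Rb,Ra : Rb = 2 * Ra  */
    rb += ra;               /* Add Rb,Rb,Ra           : Rb = 3 * Ra  */
    int32_t rc = ra << 3;   /* Shift left 3 bit Rc,Ra : Rc = 8 * Ra  */
    rb += rc;               /* Add Rb,Rb,Rc           : Rb = 11 * Ra */
    return rb;
}

int main(void)
{
    assert(mul_by_11(6) == 66);   /* matches the worked example above */
    return 0;
}
```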

In summary, the packed multiply variants compute the following (ai and bi denote subwords of the two source registers, ci subwords of the third source or target register, and di subwords of the target register):

Packed multiply left: [c2i, c2i+1] = a2i ∗ b2i
Packed multiply right: [c2i, c2i+1] = a2i+1 ∗ b2i+1
Packed multiply and accumulate: [c2i, c2i+1] = a2i ∗ b2i + a2i+1 ∗ b2i+1
Packed multiply high and accumulate: di = upper_half(ai ∗ bi) + ci
Packed multiply low and accumulate: di = lower_half(ai ∗ bi) + ci
Packed shift left and add: ci = (ai << n) + bi

…200 Hz) resolution, and can utilize a self-contained pen that remembers everything written. Special bar code paper provides absolute position and page tracking. Optical methods based on CMOS technology lend themselves to low-power, low-cost, and highly integrated designs. These features suggest that optical tracking will play a significant role in future pen systems.

Handwriting Recognition

Handwriting is a very well-developed skill that humans have used for over 5,000 years as a means of communicating and recording information. With the widespread acceptance of computers, the future role of handwriting in our culture might seem questionable. However, as we discussed in the introduction


FIGURE 42.35 The handwriting recognition problem. The image of a handwritten character, word, or phrase is classified as one of the symbols, or symbol strings, from a known list. Some systems use knowledge about the language in the form of dictionaries (or Lexicons) and frequency information (i.e., Language Models) to aid the recognition process. Typically, a score is associated with each recognition result.

FIGURE 42.36 Different handwriting styles. In (a), Latin characters are used, from top to bottom, according to the presumed difficulty in recognition (adapted from Tappert, 1984). In (b), the Graffiti™ unistroke (i.e., written with a single pen trace) alphabet, which restricts characters to a unique pre-specified way of writing that simplifies automatic recognition; the square dot indicates the starting position of the pen.

section, a number of applications exist where the pen can be more convenient than a keyboard. This is particularly so in the mobile computing space, where keyboards are not ergonomically feasible.

Handwriting recognition is fundamentally a pattern classification task; the objective is to take an input graphical mark, the handwritten signal collected via a digitizing device, and classify it as one of a prespecified set of symbols. These symbols correspond to the characters or words in a given language encoded in a computerized representation such as ASCII (see Fig. 42.35). In this field, the term online has been used to refer to systems devised for the recognition of patterns captured with digitizing devices that preserve the pen trajectory; the term offline refers to techniques that instead take as input a static two-dimensional image representation, usually acquired by means of a scanner.

Handwriting recognition systems can be further grouped according to the constraints they impose on the user with respect to writing style (see Fig. 42.36(a)). The more restricted the allowed handwritten input is, the easier the recognition task and the lower the required computational resources. At the most restricted end of the spectrum, the boxed-discrete style, users write one character at a time within predefined areas; this removes one difficult step, segmentation, from the recognition process. Very high levels of recognition accuracy can be achieved by requiring users to adhere to rules that restrict character shapes so as to minimize letter similarity (see Fig. 42.36(b)). Of course, such techniques require users to learn a "new" alphabet. At the least restricted end of the spectrum, the mixed style, users are allowed to write words or phrases the same way they do on paper—in their own personal style—whether they print, write in cursive, or use a mixture of the two.

Recognition of mixed-style handwriting is a difficult task due to ambiguity in segmentation (partitioning the word into letters) and large variation at the letter level. Segmentation is complex because it is often possible to wrongly break up letters into parts that are in turn meaningful (e.g., the cursive letter "d" can be subdivided into the letters "c" and "l"). Variability in letter shape is mostly due to co-articulation


(the influence of one letter on another), and the presence of ligatures, which frequently give rise to unintended ("spurious") letters being detected in the script.

In addition to the writing style constraints, the complexity of the recognition task is also determined by dictionary-size and writer-adaptation requirements. The size of the dictionary can vary from very small (for tasks such as state name recognition) to open (for tasks like proper name recognition). In open vocabulary recognition, any sequence of letters is a plausible recognition result, and this is the most difficult scenario for a recognizer. In the writer-adaptation dimension, systems capable of out-of-the-box recognition are called writer-independent, i.e., they can recognize the writing of many writers; this gives a good average performance across different writing styles. However, there is considerable improvement in recognition accuracy that can be obtained by customizing the letter models of the system to a writer's specific writing style; recognition in this case is called writer-dependent. Despite these challenges, significant progress has been made in the building of writer-independent systems capable of handling unconstrained text and using dictionary sizes of over 20,000 words [1,2,3]. Some of these systems are now commercially available. For a comprehensive survey of the basic concepts behind written language recognition algorithms see [14,21,37].

User Interfaces on Mobile Devices

In Fig. 42.37, examples of user interfaces for handwritten text input are presented that are representative of those found on today's mobile devices. Because of the limited CPU and memory resources available on these platforms, handwritten input is restricted to the boxed-discrete style. Additional highlights of the user interface on these text input methods are:

Special Input Area. Users are not allowed the freedom of writing anywhere on the screen. Instead, there is an area of the screen specially designated for the handwriting user interface, whether for text input or control. This design choice offers the following advantages:

FIGURE 42.37 Character-based text input method on today’s mobile devices. In (a), user interface for English character input on a cellular phone. In (b), user interface for Chinese character input on a 2-way pager.


• No toggling between edit/control and ink mode. Pen input inside the input method area is treated as ink to be recognized by the recognizer; pen input outside this area is treated as mouse events, e.g., selection, etc. Without this separation, special provisions, sometimes unnatural, have to be taken to distinguish between the two pen modes.

• Better user control. Within the specially designated writing window it is possible to have additional GUI elements that help the user with the input task. For instance, there might be buttons for common edit keys such as backspace, newline, and delete. Similarly, a list of recognition alternates can be easily displayed and selected from. This is particularly important because top-n recognition accuracy (a measure of how often the correct answer is among the highest ranked n results) is generally much higher than top-1 accuracy.

• Consistent UI metaphor. Despite its ergonomic limitations, an on-screen keyboard is generally available as one of the text input methods on the device. Using a special input area for handwriting makes the user interface of alternative text entry methods similar.

Modal Input. The possibilities of the user's input are selectively limited in order to increase recognition accuracy. Common modes include "digits," "symbols," "upper-case letters," and "lower-case letters" in English, or "traditional" versus "simplified" in Chinese. By limiting the number of characters against which a given input ink is matched, the opportunities for confusion and mis-recognition are decreased, and recognition accuracy is improved. Writing modes represent another tradeoff between making life simpler for the system and making it simpler for the user.

Natural Character Set. It is possible to use any character writing style commonly used in the given language; there is no need to learn a special alphabet. Characters can be multi-stroke, i.e., written with more than one pen trace.

Multi-boxed Input. When multi-stroke input is allowed, the end of writing is generally detected by use of a timer that is set after each stroke is completed; the input is deemed concluded if a set amount of time elapses before any more input is received in the writing area. This "timeout" scheme is sometimes confusing to users. Multiple boxes give better performance because a character in one box can be concluded if input is received in another, removing the need to wait for the timer to finish.

Of all the restrictions imposed on users by these character-based input methods, modality is the one where user feedback has been strongest: people want modeless input. The challenge is that distinguishing between letters that have very similar forms across modes is virtually impossible without additional information. In English orthography, for instance, there are letters for which the lower case version of the character is merely a smaller version of the upper case version; examples include "Cc," "Kk," "Mm," "Oo," "Ss," "Uu," "Ww," etc. Simple attempts at building modeless character recognizers can result in a disconcerting user experience because upper case letters, or digits, might appear inserted into the middle of lower case words. Such m1Xed CaSe w0rdS (mixed case words) look to users to be gibberish. In usability studies, the authors have further found that as the text data entry needs on wireless PIA devices shift from short address book or calendar items to longer notes or e-mail messages, users deem writing one letter at a time to be inconvenient and unnatural.
More Natural User Interfaces

One known way of dealing with the character confusion difficulties described in the section "User Interfaces on Mobile Devices" is to use contextual information in the recognition process. At the simplest level this means recognizing characters in the context of their surrounding characters and taking advantage of visual clues derived from word shape. At a higher level, contextual knowledge can be in the form of lexical constraints, e.g., a dictionary of known words in the language is used to restrict interpretations of the input ink. These ideas naturally lead to the notion of a word-based text input method. By "word" we mean a string of characters which, if printed in text using normal conventions, would be surrounded by white-space characters (see Fig. 42.38(a)).


FIGURE 42.38 Word-based text input method for mobile devices. In (a), a user interface prototype. In (b), an image of the mixed-case word “Wow,” where relative size information can be used for distinguishing among the letters in the “Ww” pair. In (c), an image of the digit string “90187” where ambiguity in the identity of the first three letters can be resolved after identifying the last two letters as digits.

Consider the notion of a mixed-case word recognition context, where the size and position of a character, relative to other characters in the word, is taken into account during letter identification (see Fig. 42.38(b)). Such additional information would allow us to disambiguate between the lower case and upper case versions of letters that otherwise are very much alike. Similarly, Fig. 42.38(b) illustrates that relative position within the word would enable us to correctly identify trailing punctuation marks such as periods and commas. A different kind of contextual information can be used to enforce some notion of "consistency" among the characters within a word. For instance, we could have a digit-string recognition context that favors word hypotheses where all the characters can be viewed as digits; in the image example of Fig. 42.38(c), the recognizer would thus rank the string "90187" higher than the string "gol87."

In addition to the modeless input enabled by a word-based input method, there is a writing throughput advantage over character-based ones. In Fig. 42.39 we show the results of a timing experiment where eight users were asked to transcribe a 42-word paragraph using our implementation of both kinds of input methods on a keyboardless PIA device. The paragraph was derived from a newspaper story and contained mixed-case words and a few digits, symbols, and punctuation marks. The length of the text was assumed to be representative of a long message that users might want to compose on such devices. For comparison purposes, users were also timed with a standard on-screen (software) keyboard. Each timing experiment was repeated three times. The median times were 158, 185, and 220 s for the keyboard, word-based, and character-based input methods, respectively. Thus, entering text with the word-based input method was, on average, faster than using the character-based method. Our current implementation of the word-based input method does not have, on average, a time advantage over the soft keyboard; however, the user who was fastest with the word-based input method (presumably someone for whom the recognition accuracy was very high and who thus had few corrections to do) was able to complete the task in 141 s, which is below the median soft keyboard time.


FIGURE 42.39 Boxplots of time to enter a 42-word message using three different text input methods on a PIA device: an on-screen QWERTY keyboard, a word-based handwriting recognizer, and a character-based handwriting recognizer. Median writing throughputs were 15.9, 13.6, and 11.4 words per minute, respectively.

FIGURE 42.40 Write-anywhere text input method for mobile devices. Example of an Address Book application with the Company field appearing with focus. Handwritten input is not restricted to a delimited area of the screen but rather can occur anywhere. The company name "Data Warehouse" has been written.

Furthermore, the authors believe that the time gap between these two input methods will be reduced in the case of European languages that have accents, since accents require additional key presses in the case of the soft keyboard. As one can expect, the modeless and time advantages of a word-based input method over a character-based one come at the expense of additional computational resources. Currently, the word-based recognition engine requires a 10× increase in MIPS and memory resources compared to the character-based engine. One should also say that, as evidenced by the range of the timing data shown in the above plots, there isn't a single input method that works best for every user. It is thus important to offer users a variety of input methods to experiment with and choose from.

Write-Anywhere Interfaces

In the same way that writing words, as opposed to writing one letter at a time, constitutes an improvement in terms of "naturalness" of the user experience, we must explore recognition systems capable of handling continuous handwriting such as phrases. For the kind of mobile devices we've been considering, with very limited screen real estate, this idea leads to the notion of a "write-anywhere" interface where the user is allowed to write anywhere on the screen, i.e., on top of any application and system element on the screen (see Fig. 42.40).


FIGURE 42.41 Inherent ambiguity in continuous handwriting recognition. In (a), a sample image of a handwritten word. In (b), possible recognition results; strings not in the English lexicon are in italics.

A write-anywhere text input method is also appealing because there is no special inking area covering up part of the application in the foreground; however, a special mechanism is needed to distinguish pen movement events intended to manipulate user interface elements such as buttons, scrollbars, and menus (i.e., edit/control mode) from pen events corresponding to longhand (i.e., ink mode). The solution typically involves a "tap and hold" scheme wherein the pen has to be held down, without dragging it, for a certain amount of time in order to get the stylus to act temporarily as a mouse. An additional user interface issue with a write-anywhere text input paradigm is that there are usually no input method control elements visible anywhere on the screen. For instance, access to recognition alternates might require a special pen gesture. As such, a write-anywhere interface will generally have more appeal to advanced users. Furthermore, recognition in the write-anywhere case is more difficult because there is no implicit information on the word separation, orientation, or size of the text.

Recognition-Aware Applications

Earlier in this section, the authors discussed how factors such as segmentation ambiguity, letter co-articulation, and ligatures make exact recognition of continuous handwritten input a very difficult task. To illustrate this point consider the image shown in Fig. 42.41 and the set of plausible interpretations given for it. Can we choose with certainty one of these recognition results as the "correct" one? Clearly, additional information, not contained within the image, is required to make such a selection. One such source of information that we have already mentioned is the dictionary, or lexicon, for constraining the letter strings generated by the recognizer. At a higher level, information from the surrounding words can be used to decide, for example, between a verb and a noun word possibility. It is safe to say that the more constraints that are explicitly available during the recognition process, the more ambiguity in the input can be automatically resolved. Less ambiguity results in higher recognition accuracy and thus improved user experience.

For many common applications on PIA devices, e.g., contacts, agenda, and web browser, it is possible to specify the words and patterns of words that can be entered in certain data fields. Examples of structured data fields are telephone numbers, zip codes, city names, dates, times, URLs, etc. In order for recognition-based input methods to take advantage of this kind of contextual information, the existing text input framework on PIA devices needs to be modified. Currently, no differentiation is made between text input made by tapping the "keyboard" and that made using handwriting recognition; i.e., applications are, for the most part, not aware of the recognition process. A possible improvement over the current state of the art for UI implementation would be for applications to make an encoding of the contextual constraints associated with a given field available to the recognition engine. One typically uses a grammar to define the permitted strings in a language, e.g., the language of valid telephone numbers. A grammar consists of a set of rules or productions specifying the sequences of characters or lexical items forming allowable strings in the defined language. Two common classes of grammars are BNF (context-free) grammars and regular grammars (see [6] for a formal treatment).
Grammars are widely used in the field of speech recognition and recently the W3C Voice









FIGURE 42.42 Example of an XML grammar defining telephone numbers and written as per the W3C Voice Working Group Specification. There are four “private” rule definitions that are combined to make the main rule called phone-num.

Browser Working Group has suggested an XML-based syntax for representing BNF-like grammars [7]. In Fig. 42.42 we show a fragment of a possible grammar for defining telephone number strings. In the extended text input framework that we are advocating, this grammar, together with the handwritten ink, should be passed along to the recognition engine when an application knows that the user is expected to enter a telephone number. Information about how the ink was collected, such as resolution and sampling rate of the capture device, whether writing guidelines or other writing size hints were used, spatial relationships to nearby objects in the application interface, etc., should also be made available to the recognition engine for improved recognition accuracy.
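As a toy illustration of applying a field-specific constraint to recognizer output, the C sketch below keeps only candidate strings that look like telephone numbers. The digit-count rule, the allowed punctuation, and the function names are invented purely for illustration; a real system would compile the field's grammar rather than hard-code a pattern.

```c
#include <ctype.h>

/* Hypothetical constraint check: accept strings built from digits and the
 * punctuation "(", ")", "-", and space, with 7 or 10 digits in total,
 * e.g., "945-1234" or "(914) 945-1234". */
static int is_phone_number(const char *s)
{
    int digits = 0;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s))
            digits++;
        else if (*s != '(' && *s != ')' && *s != '-' && *s != ' ')
            return 0;                     /* character outside the allowed set */
    }
    return digits == 7 || digits == 10;
}

/* Keep only the recognition candidates that satisfy the field constraint;
 * returns how many candidates survive. */
static int filter_candidates(const char **cand, int n, const char **kept)
{
    int m = 0;
    for (int i = 0; i < n; i++)
        if (is_phone_number(cand[i]))
            kept[m++] = cand[i];
    return m;
}
```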

Ink and the Internet

Digital ink does not always need to be recognized in order for it to be useful. Two daily life applications where users take full advantage of the range of graphical representations that are possible with a pen are messaging, as when we leave someone a post-it note with a handwritten message, and annotation, as when we circle some text in a printed paragraph or make a mark in an image inside a document. This subsection discusses Internet-related applications that will enable similar functionality. Both applications draw attention to the need for a standard representation of digital ink that is appropriate in terms of efficiency, robustness, and quality.

Ink Messaging

Two-way transmission of digital ink, possibly wireless, offers PIA users a compelling new way to communicate. Users can draw or write with a stylus on the PIA's screen to compose a note in their own handwriting. Such an ink note can then be addressed and delivered to other PIA users, e-mail users, or fax machines. The recipient views the message as the sender composed it, including text in any mix of languages and drawings (see Fig. 42.43). In the context of mobile-data communications it is important for the size of such ink messages to be small. There are two distinct modes for coding digital ink: raster scanning and curve tracing [8,9]. Facsimile coding algorithms belong to the first mode, and exploit the correlations within consecutive scan lines. Chain Coding (CC), belonging to the second mode, represents the pen trajectory as a sequence of transitions between successive points in a regular lattice. It is known that curve tracing algorithms result in a higher coding efficiency if the total trace length is not too long. Furthermore, use of a raster-based technique implies the loss of all time-dependent information.
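To make the curve-tracing idea concrete, the following C sketch performs plain 8-direction (Freeman) chain coding of a pen trace that has already been quantized to a regular lattice. It is a deliberately simplified relative of the multi-ring differential schemes discussed below, and the function name and encoding details are illustrative only.

```c
/* Encode a pen trace, given as lattice points (x[i], y[i]), as 8-direction
 * Freeman chain codes: 0 = east, 1 = northeast, ..., 7 = southeast.
 * Consecutive samples are assumed to be neighboring lattice points.
 * Returns the number of 3-bit symbols written to code[], or -1 on error. */
static int chain_encode(const int *x, const int *y, int n, unsigned char *code)
{
    /* Direction lookup indexed by (dy + 1) * 3 + (dx + 1); -1 marks "no move". */
    static const int dir[9] = { 5, 6, 7,
                                4,-1, 0,
                                3, 2, 1 };
    int m = 0;
    for (int i = 1; i < n; i++) {
        int dx = x[i] - x[i - 1];
        int dy = y[i] - y[i - 1];
        if (dx < -1 || dx > 1 || dy < -1 || dy > 1)
            return -1;                    /* samples are not lattice neighbors */
        int d = dir[(dy + 1) * 3 + (dx + 1)];
        if (d < 0)
            continue;                     /* repeated point: nothing to emit   */
        code[m++] = (unsigned char)d;
    }
    return m;
}
```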


FIGURE 42.43 Example of ink messaging application for mobile devices. Users can draw or write with a stylus on the device screen to compose an e-mail in their own handwriting; no automatic recognition is necessarily involved.

Message sizes of about 500 bytes have recently been reported for messages composed in a typical PIA screen size, using a CC-based algorithm known as multi-ring differential chain coding (MRDCC) [10]. MRDCC is attractive for transmission of ink messages in terms of data syntax, decoding simplicity, and transmission error control; however, MRDCC is lossy, i.e., the original pen trajectory cannot be fully recovered. If exact reconstructability is important, a lossless compression technique is required. This might be the case when the message recipient needs to run verification or recognition algorithms on the received ink, e.g., if the ink in the message corresponds to a signature that is to be used for computer authentication. One example of a lossless curve tracing algorithm proposed by the ITU is zone coding [11]. Our internal evaluation of zone coding, however, reveals there is ample room for improvement. Additional requirements for an ink messaging application include support for embedded ASCII text, support for embedded basic shapes (such as rectangles, circles, and lines), and support for different pen-trace attributes (such as color and thickness).

Ink and SMIL

SMIL, pronounced "smile," stands for Synchronized Multimedia Integration Language. It is a W3C recommendation [12] defining an XML-compliant language that allows a spatially and temporally synchronized description of multimedia presentations. In other words, it enables authors to choreograph multimedia presentations where audio, video, text, and graphics are combined in real time. A SMIL document can also interact with a standard HTML page. SMIL documents might become very common on the web thanks to streaming technologies. The basic elements in a SMIL presentation are (for a complete introduction see [13]): a root-layout, which defines things like the size and color of the background of the document; a region, which defines where and how a media element such as an image can be rendered, e.g., location, size, overlay order, scaling method; one or more media elements such as text, img, audio, and video; means for specifying a timeline of events, i.e., seq and par indicate a block of media elements that will all be shown sequentially or in parallel, respectively, dur gives an explicit duration, and begin delays the start of an element relative to when the document began or to the end of other elements; means for skipping some part of an audio or a video (clip-begin and clip-end); means for adapting the behavior of the presentation to the end-user system capabilities (switch); means for freezing a media element after its end (fill); and a means for hyperlinking (a).

Digital ink is not currently supported as a SMIL native media type. One option would be to convert the ink into a static image, say in GIF format, and render it as an img element; however, this would preclude the possibility of displaying the ink as continuous media (like an animation). Another option











FIGURE 42.44 Example of the role of digital ink in SMIL documents. In (a), a diagram or photo taken with a digital camera can be annotated with a pen; the digital ink can be coordinated with a spoken commentary. In (b), a corresponding short SMIL document fragment assuming the existence of an appropriate MIME content-type called “ink” and a subtype called “unipen” for representing the ink.

is using the SMIL generic media reference ref (see Fig. 42.44); this option requires the existence of an appropriate MIME content-type/subtype. In the near future, it is expected that a standard will be designed to allow SMIL documents to use animated or static digital ink content as a media component.

Extension of Pen-and-Paper Metaphor

Use of the pen-and-paper paradigm dates back to almost 3000 BC. Paper, as we know it today, dates back to around 200 AD. Hence, the notion of writing with a pen on paper is an extremely natural way of entering handwritten information. Archival and retrieval are two primary actions performed on handwritten information captured using traditional pen and paper. The problem, however, with regular pen and paper is that the process of retrieving information can be extremely inefficient. Retrieving information typically involves visually scanning the documents, which becomes inefficient when the amount of handwritten information grows large. One way to make the process efficient is to tag the information in a useful way. For example, a yellow sticker on pages that relate to a certain topic, or entering information about different topics into different notebooks, can make the process of looking for information on a topic efficient when using normal paper notebooks. The goal here is to extend the same functionality to electronic pen-and-paper systems.

Extending the pen-and-paper metaphor, one of the main applications for digital ink capture systems, aims to provide users with efficient ink archival/retrieval capabilities by giving them the tools to tag information captured on the devices in a useful way. The need for efficient ink archival/retrieval is accentuated by devices like the IBM ThinkPad TransNote [27] and Anoto [30], which provide users the capability of using normal or special paper for capturing handwritten information. With paper, users of such devices tend to capture more handwritten information, which in turn increases the need for efficient ink archival/retrieval capabilities.

Ink Archival and Retrieval

An example of a digital ink capture system that provides users the ability to efficiently archive and retrieve handwritten information is the ThinkPad TransNote system from IBM. The system combines a regular notebook PC with a digital notepad. Slider controls provided on the digital notepad allow users to assign handwritten notes to a particular page and to a specific topic. In addition, controls are provided on the digital notepad to mark blocks of handwritten ink as keywords. Functions of the sliders and controls can be modified depending on the needs of the application.


FIGURE 42.45 Ink retrieval using keywords. Example of an application that uses the ASCII tags associated with handwritten ink to retrieve information from handwritten documents.

FIGURE 42.46 Ink searching example. Users can search for an ink pattern inside a longer ink document, or collection of documents.

Ink management software on the notebook PC allows users to archive handwritten notes and retrieve them, using either the time of creation of the handwritten notes or the tags associated with keywords. The tags are typically text strings created using a handwriting recognition system. Figure 42.45 shows an example of a piece of the ink management software that displays blocks of ink marked as keywords in the middle column and their tags in the left column. Users can retrieve handwritten documents by clicking on the keywords or by typing a word in the search text box in the upper right-hand corner of the application. In the application shown in Fig. 42.45, all the tags are text strings; however, one can easily extend the retrieval paradigm to use graphical queries and retrieve documents containing graphics, using features extracted from the graphical query. An example of this is shown in Fig. 42.46.

Gesture Recognition

A gesture is a set of handwritten ink that implies a certain action. In many cases, a gesture can be used to represent an action much more efficiently than enumerating the action through a set of keyboard events. An example is the task of moving a portion of text from one position to another. Using a keyboard would involve selecting the portion of text to be moved, copying it into a clipboard, deleting the selection, moving the cursor to the place in the document where the user would like to place the text, and finally pasting it. Using a pen allows users to indicate the same action by drawing a selection area around the ink to be moved and an arrow indicating the position to move the selection to. An example of this is shown in Fig. 42.47.


FIGURE 42.47 Example of pen-and-paper-like editing. Users can erase by scribbling directly on the unwanted text, move text by circling and dragging, and transpose text by common gesturing.

FIGURE 42.48 Segmentation and recognition of on-line documents. Example of a typical handwritten page with text, tables, and drawings, and the desired segmentation interpretation.

Smart Ink

One can extend the gesture recognition system to allow users to associate more complex actions with groups of pen strokes, as shown in Fig. 42.48. The handwritten document on the left side is a typical handwritten page with text, tables, and drawings, and the one on the right side is a version of the same document after being automatically interpreted by a smart ink recognition scheme. This association allows users to work with handwritten documents in more efficient ways, turning an electronic pen into a more effective means of entering information than a keyboard and mouse. A successful implementation of these ideas led to the development of a tool, called SILK [35], which allows graphic designers to quickly sketch a user interface with an electronic pen and stylus. The tool addresses the needs of designers who prefer to sketch early interface ideas on paper or a whiteboard and concentrate on behavior. Existing interactive user interface construction tools make it


hard for a user-interface designer to illustrate the behavior of an interface; these tools focus on specifying widgets and making it easy to manipulate details such as colors, alignment, and fonts, i.e., they can show what the interface will look like, but make it hard to show what it will do.

Pen Input and Multimodal Systems

A multimodal interface is one that integrates multiple kinds of input simultaneously to achieve a desired result. With the increased availability of significant computing power on mobile devices and of efficient wireless connectivity enabling distributed systems, development of pen-based multimodal interfaces is becoming more and more feasible. The motivation is simple: create more robust, flexible, and user-friendly interfaces by integrating the pen with other input modalities such as speech. Higher robustness is achievable because cross-modal redundancy can be used to compensate for imperfect recognition on each individual mode. Higher flexibility is possible because users can choose, from among the various modes of achieving a task or issuing a command, the mode that is most appropriate at the time. Higher user-friendliness will result from having computer interfaces that better resemble the multi-modality naturally present in human communication. In this section we review some successful multimodal systems that take advantage of the pen to produce very synergistic interfaces, highlighting their common features.

Cohen et al. [15] combined speech and pen gestures to interact with a 2D representation, like a map, of the entities in a 3D scene such as the one generated with a battlefield simulator. An interactive map is displayed on a handheld device where the user can draw or speak to control the system. For example, while holding the pen at some location in the map, the user may say "XXXX platoon"; this command will result in the creation of a platoon simulation element labeled "XXXX" at the desired location. The user can then assign a task to the new platoon by uttering a command like "XXXX platoon follow this route" while drawing a line on the map.

Heyer et al. [16] combined speech, pen gestures, and handwriting recognition in a travel planning application. Users interact with the map of a city, possibly displayed on a PIA device, to find out information about hotels, restaurants, and tourist sites. This information is accessed from a public database through the Internet. Pen and voice may be used together by speaking a query such as "what is the distance from here to Fisherman's Wharf" while making a mark on the map. Pen-only gestures can also be used for control actions, such as moving the viewing area. Similarly, voice-only commands are allowed, as in "show me all hotels with a pool."

Tue Vo et al. [17] prototyped a multimodal calendar application called Jeanie. This is a very common application on PIA devices and one having several tasks that can be simplified by the multimodal method of pointing to or circling objects on the screen in addition to speaking commands. For example, a command combining spoken and handwritten input is "reschedule this on Tuesday," uttered while holding the pen on a meeting entry in the appointment list. An example of a pen-only command is drawing an X on a meeting entry to cancel it.

Suhm et al. [18] have explored the benefits of multimodal interaction in the context of error correction. Specifically, they have integrated handwriting recognition in an automatic dictation system. Users can switch from continuous speech to pen-based input to correct errors. This work capitalizes on the fact that words that might be confused in one modality (e.g., because they sound similar) are not necessarily confused in another (e.g., because their handwritten shapes differ).
Their study concluded that multimodal error correction is more accurate and faster than unimodal correction by re-speaking.

Multimodal applications such as these are generally built using a distributed "agent" framework. The speech recognizer, the handwriting recognizer, the gesture recognizer, the natural language understanding module, the database access module, etc., might each be a different agent: a computing process that provides a specific service and runs either locally on the PIA device or remotely. These agents cooperate and communicate in order to accomplish tasks for the user. One publicly available software environment offering facilitated agent communication is the open agent architecture (OAA) from SRI [19]. A special agent is needed for integrating information from all input sources to arrive at a correct understanding of complete multimodal commands. Such a unification agent is sometimes implemented using semantic frames, a knowledge representation scheme from the early A.I. days [36], consisting of


slots specifying pieces of information about the command. Recognition results from each modality agent are parsed into partially filled frames, which are then merged together to produce a combined interpretation. In the merging process information from different input modes is weighted, meaningless command hypotheses are filtered out, and additional feedback from the user might be requested.
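The following C sketch shows, in a minimal way, the slot-filling idea behind such a unification agent; the slot names, the conflict rule, and the types are invented purely for illustration and do not correspond to any of the systems cited above.

```c
#include <string.h>

/* A command frame with a few illustrative slots; NULL means "not yet filled". */
typedef struct {
    const char *action;   /* e.g., "reschedule"        */
    const char *object;   /* e.g., "meeting with John" */
    const char *when;     /* e.g., "Tuesday"           */
} Frame;

/* Fill one slot from two partially filled frames; flag a conflict if the two
 * modalities supplied different values. */
static const char *merge_slot(const char *a, const char *b, int *conflict)
{
    if (a == NULL) return b;
    if (b == NULL) return a;
    if (strcmp(a, b) != 0) *conflict = 1;
    return a;
}

/* Merge a frame parsed from speech with one parsed from pen input.
 * Returns 1 on success, 0 if the frames disagree and the user should be
 * asked for clarification. */
static int merge_frames(const Frame *speech, const Frame *pen, Frame *out)
{
    int conflict = 0;
    out->action = merge_slot(speech->action, pen->action, &conflict);
    out->object = merge_slot(speech->object, pen->object, &conflict);
    out->when   = merge_slot(speech->when,   pen->when,   &conflict);
    return !conflict;
}
```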

Summary

As more electronic devices with pen interfaces continue to become available for entering and manipulating information, applications need to become more effective at leveraging this method of input. The pen is a mode of input that is very familiar to most users, since everyone learns to write in school. Hence, users will tend to use it as a mode of input and control when it is available. Providing enhanced user interfaces that support effective use of the pen will make it easier for users to work with such devices. Section 42.6 has given an overview of the pen input devices available today, along with some of the applications that use the electronic pen either in isolation or in conjunction with other modes of input such as speech and the keyboard. The community has made great strides in addressing a number of the user-interface issues involved in capturing and manipulating information from electronic pens. A number of challenges still need to be addressed before such devices truly meet users' needs to a high level of satisfaction.

To Probe Further

Pen Computing. http://hwr.nici.kun.nl/pen-computing. A Web site hosted at Nijmegen University with links related to practical issues in pen and mobile computing.

Handhelds. http://handhelds.org. A Compaq-hosted Web site created to encourage and facilitate the creation of open source software for use on handheld and wearable computers.

Acknowledgment

The authors thank Thomas G. Zimmerman, Research Staff Member with the Human/Machine Interface Gadgets at IBM Research, for his input on the subsection on "Pen Input Hardware," and Carlos McEvilly, Research Staff Member with the Motorola Human Interface Labs, for proofreading this manuscript.

References

1. G. Seni, T. Anastasakos. Non-cumulative character scoring in a forward search for online handwriting recognition. In IEEE Conf. on Acoustics, Speech and Signal Processing, Istanbul, 2000.
2. K.S. Nathan, H.S.M. Beigi, J. Subrahmonia, G.J. Clary, M. Maruyama. Real-time on-line unconstrained handwriting recognition using statistical methods. In IEEE Conf. on Acoustics, Speech and Signal Processing, Michigan, 1995.
3. S. Jaeger, S. Manke, A. Waibel. NPEN++: An online handwriting recognition system. In Proc. Workshop on Frontiers in Handwriting Recognition, Amsterdam, The Netherlands, 2000.
4. www.mimio.com.
5. www.e-pen.com.
6. H.R. Lewis, C.H. Papadimitriou. Elements of the Theory of Computation. Prentice-Hall, Englewood Cliffs, NJ, 1981.
7. Speech Recognition Grammar Specification for the W3C Speech Interface Framework. W3C Working Draft. www.w3.org/TR/grammar-spec. 2001.
8. H. Freeman. Computer processing of line-drawing data. Computer Surveys. March 1974.
9. T.S. Huang. Coding of two-tone images. IEEE Trans. COM-25. November 1977.
10. J. Andrieux, G. Seni. On the coding efficiency of multi-ring and single-ring differential chain coding for telewriting application. To appear in IEE Proceedings—Vision, Image and Signal Processing.
11. ITU-T Recommendation T.150. Terminal Equipment and Protocols for Telematic Services. 1993.
12. Synchronized Multimedia Integration Language (SMIL) 1.0 Specification. W3C Recommendation. www.w3c.org/TR/REC-smil. 1998.


13. L. Hardman. A SMIL Tutorial. www.cwi.nl/~media/SMIL/Tutorial. 1998.
14. R. Plamondon, S.N. Srihari. On-line and off-line handwriting recognition: a comprehensive survey. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(1), 2000.
15. P.R. Cohen, D. McGee, S.L. Oviatt, L. Wu, J. Clow, R. King, S. Julier, L. Rosenblum. Multimodal interactions for 2D and 3D environments. IEEE Computer Graphics and Applications. July/August 1999.
16. A. Heyer, L. Julia. Multimodal maps: An agent-based approach. In Multimodal Human-Computer Communication, Lecture Notes in Artificial Intelligence 1374, Bunt/Beun/Borghuis, Eds. Springer, 1998.
17. M. Tue Vo, C. Wood. Building an application framework for speech and pen input integration in multimodal learning interfaces. In IEEE Conf. on Acoustics, Speech and Signal Processing (ICASSP), 1996.
18. B. Suhm, B. Myers, A. Waibel. Model-based and empirical evaluation of multimodal interactive error correction. In Proc. of the CHI 99 Conference, Pittsburgh, PA, May 1999.
19. www.ai.sri.com/~oaa.
20. C.C. Tappert. Adaptive on-line handwriting recognition. In 7th Intl. Conf. on Pattern Recognition, Montreal, Canada, 1984.
21. C.C. Tappert, C.Y. Suen, T. Wakahara. The state of the art in on-line handwriting recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12, 1990.
22. www.wacom.com/productinfo.
23. www.mutoh.com.
24. T. Zimmerman and F. Hoffmann, IBM Research, patent pending, 1995.
25. J. Subrahmonia, T. Zimmerman. Pen computing: challenges and applications. In IEEE Conf. on Pattern Recognition, Barcelona, Spain, 2000.
26. N. Yamaguchi, H. Ishikawa, Y. Iwamoto, A. Iida. Ultrasonic coordinate input apparatus. U.S. Patent 5,637,839, June 10, 1997.
27. www.ibm.com.
28. O. Kinrot and U. Kinrot. Interferometry: encoder measures motion through interferometry. In Laser Focus World, March 2000.
29. www.goulite.com.
30. www.anoto.com.
31. M. Lazzouni, A. Kazeroonian, D. Gholizadeh, O. Ali. Pen and paper information recording system. U.S. Patent 5,652,412, July 1997.
32. S. Nabeshima, S. Yamamoto, K. Agusa, T. Taguchi. Memo-pen: A new input device. In Proc. of the CHI 95 Conference, 1995.
33. www.cross-pcg.com.
34. www.erc.caltech.edu/research/reports/munich1full.html.
35. M.W. Newman, J. Landay. Sitemaps, storyboards, and specifications: a sketch of Web site design practice. In Designing Interactive Systems (DIS), New York, August 2000.
36. E. Charniak, D. McDermott. Introduction to Artificial Intelligence. Addison-Wesley, Reading, MA, 1987.
37. R. Plamondon, D. Lopresti, L. Schomaker, R. Srihari. Online handwriting recognition. In Wiley Encyclopedia of Electrical and Electronics Engineering, J.G. Webster, Ed. John Wiley & Sons, 1999.

42.7 What Makes a Programmable DSP Processor Special?

Ingrid Verbauwhede

Introduction

A programmable DSP processor is a processor "tuned" towards its application domain. Its architecture is very different from a general-purpose von Neumann architecture to accommodate the demands of real-time signal processing. When first developed in the beginning of the 1980s, the main application was filtering. Since then, the architectures have evolved together with the applications. Currently, the


most stringent demands for low-power embedded DSP processors come from wireless communication applications: second, 2.5, and third generation (2G, 2.5G, and 3G) cellular standards. The demand for higher throughput and higher quality of source and channel coding keeps growing while power consumption has to be kept as low as possible to increase the lifetime of the batteries. In this chapter section, first the application domain and its historical evolution will be described in the subsection on “DSP Application Domain.” Then, in “DSP Architecture,” the overall architecture will be described. In “DSP Data Paths,” the specifics of the DSP data paths will be given. In “DSP Memory and Address Calculation Units,” the memory architecture and its associated address generation units are described. In “DSP Pipeline,” the specifics of the DSP pipeline will be explained. Finally, in “Conclusions and Future Trends,” the conclusions will be given followed by some future trends.

DSP Application Domain

DSP processors were originally developed to implement traditional signal processing functions, mainly filters such as FIRs and IIRs [5]. These applications determined the main properties of the programmable DSP architecture: the inclusion of a multiply-accumulate unit (MAC) as a separate data path unit, and a Harvard or modified Harvard memory architecture instead of a von Neumann architecture.

Original Motivation: FIR Filtering

The fundamental properties of these applications were (and still are):

• Throughput-driven calculations and real-time operation. Signal processing applications, such as speech and video, can be represented as an "infinite stream" of data samples that need to be processed at a rate determined by the application [20]. The sample rate is a fundamental property of the application. It determines at what rate the consecutive samples arrive for processing. For speech processing, this is the rate of speech samples (kHz range); for video processing this might be the frame rate or the pixel rate (MHz range) [3]. The DSP has to process these samples at this given rate. Therefore, a DSP operates under worst-case conditions. This is fundamentally different from general-purpose processing on a microprocessor, which operates on an average-case basis but has unpredictable worst-case behavior.

• Large amounts of computation, few control flow operations. DSP processors were developed to process large amounts of data in a very repetitive mode. For instance, speech filtering, speech coding, pixel processing, etc., require similar operations on consecutive samples, pixels, frames, etc. The DSP processor has adapted to this by providing means of implementing these algorithms in a very efficient way. It includes zero-overhead looping, very compact instruction code, efficient parallel memory banks, and associated address generation units.

• Large amounts of data, usually organized in a regular or almost regular pattern. Because of the real-time processing and the associated "infinite" amount of data that is processed, DSP processors usually have several parallel memory banks; each bank has its own address generation unit, and parallel reads and writes are supported by the DSP.

• Embedded applications. DSP processors are developed for embedded applications, ranging from cellular phones and disk drives to cable modems. The result is that all the program code has to reside on the processor (no external memory, no cache hierarchy). Thus, the code size has to be minimized, as a result of which, even today, a lot of assembly code is written. Secondly, the power consumption has to be minimized, since many of these applications run from batteries or have tight cooling requirements, such as the use of cheap plastic packages or enclosed boxes.

Modern Applications: Digital Wireless Communications

New applications drive the design of new DSP processors. State-of-the-art DSP processors will have more than one MAC, acceleration for Viterbi decoding, specialized instructions for Turbo decoding, and so on. Indeed, DSP processors have become the main workhorse for wireless communications, for both the handsets and the base station infrastructure [22].


FIGURE 42.49 Fundamental building blocks in a communication system: speech CODEC (point A), channel CODEC, modulator and demodulator, transmitter, receiver, and synthesizer.

FIGURE 42.50 Relationship between the speech signal and the transmitted signal. At point A: 125 µs sampling period (8 kHz), 8 bits (µ-law) per sample, i.e., 64 kbps or 2,560 bits per 40 ms TDMA frame; after coding: speech 138 bits/40 ms (3.45 kbps) plus FEC 86 bits/40 ms (2.15 kbps), 5.6 kbps total. At point B: 42 kbps, with one TDMA frame shared by six users, one slot per user (~6.7 ms, 7 kbps).

Second generation (2G) cellular standards required the introduction of optimized instructions for speech processing and for communication algorithms used in the channel coding and modulation/demodulation. The fundamental components of a wireless system are shown in Fig. 42.49.

Speech Coding

The source coder/decoder in 2G cellular standards (GSM, IS-136, IS-95 CDMA, Japanese PDC) is mainly a speech coder/decoder. The main function of a speech coder is to remove the redundancy and compress the speech signal and hence reduce the bandwidth requirements for storage or transmission over the air. The required reduction in bit rate is illustrated in Fig. 42.50 for the Japanese PDC standard. At point A, a "toll quality" digital speech signal requires the sampling of the analog speech waveform at 8 kHz. Each sample requires 8 bits of storage (µ-law compressed), thus resulting in a bit rate of 64 kbits/s, or 2,560 bits for one 40 ms TDMA frame. This speech signal needs to be compressed to increase the capacity of the channel. One TDMA frame, which has a basic time period of 40 ms, is shared by six users. The bit rate at point B is 42 kbits/s. Thus, one user slot gets only 7 kbits/s. The 2,560 bits have to be reduced to 138 bits, to which 86 bits are added for forward error correction (FEC), resulting in a total of 5.6 kbits/s.

The higher the compression ratio and the higher the quality of the speech coder, the more calculations, usually expressed in MIPS, are required. This is illustrated in Fig. 42.51. The first generation GSM digital cellular standard employs the Regular Pulse Excitation-Long Term Prediction (RPE-LTP) algorithm and requires a few MIPS to implement it on a current generation DSP processor. For instance, it requires 2 MIPS on the Lode processor [21]. The Japanese half-rate coder, Pitch Synchronous Innovation-Code Excited Linear Prediction (PSI-CELP), requires at least ten times more MIPS.

Viterbi Decoding

The function of the channel codec is to add controlled redundancy to the bit stream on the encoder side and to decode, detect, and correct transmission errors on the receiver side. Thus, channel encoding and decoding is a form of error control coding. The most common decoding method for convolutional codes


FIGURE 42.51 MIPS requirement of several speech coders: number of operations (MIPS) versus bit rate (kbps) for the RPE-LTP, 11.2 kbps VSELP, 13 kbps VSELP, and PSI-CELP coders.

FIGURE 42.52 Viterbi trellis diagram and one butterfly.

is the Viterbi algorithm [4]. It is a dynamic programming technique, as it tries to emulate the encoder's behavior in creating the transmitted bit sequence. By comparing to the received bit sequence, the algorithm determines the difference between each possible path through the encoder and the received bit sequence. The decoder outputs the bit sequence that has the smallest deviation, called the minimum distance, compared to the received bit sequence.

Most practical convolutional encoders are rate 1/n, meaning that one input bit generates n coded output bits. A convolutional encoder of constraint length K can be represented by a finite state machine (FSM) with K − 1 memory bits. The FSM has 2^(K−1) possible states, also called trellis states. If the input is binary, two next states are possible starting from the current state and the input bit. The task of the Viterbi algorithm is to reconstruct the most likely sequence of state transitions based on the received bit sequence. This approach is called "maximum likelihood sequence estimation." These state transitions are represented by a trellis diagram. The kernel of the trellis diagram is the Viterbi butterfly, as shown in Fig. 42.52(b).

Next Generation Applications

Current generation DSP processors are shaped by 2G cellular standards, the main purpose of which is voice communication. 3G cellular standards will introduce new features: increased focus on data communication, e-mail, web browsing, banking, navigation, and so on. 2G standards can support short messages, such as the popular SMS messages in the GSM standard, but are limited to about 10 to 15 kbits/s. In the 2.5G cellular standards, provisions are made to support higher data rates. By combining GSM time slots, general packet radio service (GPRS) can support up to 115 kbits/s. But the 3G standards are being developed specifically for data services. Wideband CDMA (WCDMA) will support up to 2 Mbits/s in office environments, lowered to 144 kbits/s for highly mobile situations [6,13]. The increased focus on data services has large consequences for the channel codec design. The traditional Viterbi decoder does not provide a low enough bit error rate to support data services. Therefore, turbo

FIGURE 42.53 Turbo encoder and decoder.

Therefore, turbo codes are considered [2]. Turbo decoding (shown in Fig. 42.53) is a collaborative structure of soft-input/soft-output (SISO) decoders with the inclusion of interleaver memories between the decoders to scatter burst errors [2]. Either the soft-output Viterbi algorithm (SOVA) [7] or the maximum a posteriori (MAP) algorithm [1] can be used as the SISO decoders. Within a turbo decoder, the two decoders can operate on the same or on different codes. Turbo codes have been shown to provide coding performance to within 0.7 dB of the Shannon limit (after a number of iterations).

The log-MAP algorithm can be implemented in a manner very similar to the standard Viterbi algorithm. The most important difference between the algorithms is the use of a correction factor on the "new path metric" value (the alpha, beta, and log-likelihood ratio values in log-MAP). This correction factor depends on the difference between the values being compared in the add-compare-select butterfly (as shown in Fig. 42.52). This means that the Viterbi acceleration units that implement this add-compare-select operation need to be modified.

Turbo coding is one member of a large class of iterative decoding algorithms. Recently, low-density parity-check (LDPC) codes have gained renewed attention as another important class, one that potentially translates more easily into efficient implementations.

Other trends place an even larger burden on the DSP processor. The Japanese i-Mode system includes e-mail, web browsing, banking, location finding in combination with the car navigation system, and so on. Next generation phones will need to support video and image processing as well, and applications and upgrades will be downloadable from the Internet. At the same time, consumers are used to long talk times (a couple of hours) and very long standby times (days or weeks), and they will not accept a reduction of talk time or standby time in exchange for more features. These increased services therefore have to be delivered within the same power budget, because the battery size is not expected to grow, nor is battery technology expected to improve substantially.

DSP Architecture

The fundamental property of a DSP processor is that it uses a Harvard or modified Harvard architecture instead of a von Neumann architecture. This difference is illustrated in Fig. 42.54. A von Neumann architecture has one unified address space, i.e., data and program are assigned to the same memory space. In a Harvard architecture, the data memory map is split from the program memory map. This means that the address busses and data busses are doubled. Together with specialized address calculation units, this increases the memory bandwidth available for signal processing applications. This concept will be illustrated by the implementation of a simple FIR filter. The basic equation for an N-tap FIR filter is the following:

    y(n) = Σ c(i) · x(n − i),   where the sum runs over i = 0, 1, …, N − 1
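For readers who prefer code, a direct C rendering of this equation is sketched below. It is a minimal, illustrative implementation: the function name, argument layout, and the 32-bit accumulator are assumptions of this sketch rather than features of any particular DSP.

#include <stdint.h>

/* Direct-form N-tap FIR: y(n) = sum over i of c(i) * x(n - i).
 * x points at the newest sample x(n); x[-i] is the older sample x(n - i),
 * i.e., the delay line extends toward lower addresses. */
int32_t fir(const int16_t *c, const int16_t *x, int N)
{
    int32_t acc = 0;                   /* wide accumulator for the products */
    for (int i = 0; i < N; i++)
        acc += (int32_t)c[i] * x[-i];  /* one multiply-accumulate per tap   */
    return acc;
}

The discussion that follows counts how many instruction and memory cycles this inner loop costs on the different architectures.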

FIGURE 42.54 von Neumann architecture and Harvard/modified Harvard architecture.

FIGURE 42.55 Finite impulse response filter.

Expansion of this equation results in the following pseudo code statements:

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + … + c(N - 1)x(1 - N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + … + c(N - 1)x(2 - N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + … + c(N - 1)x(3 - N);
…
y(n) = c(0)x(n) + c(1)x(n - 1) + c(2)x(n - 2) + … + c(N - 1)x(n - (N - 1));

When this equation is executed in software or assembly code, output samples y(n) are computed in sequence. To implement this on a von Neumann architecture, the following operations are needed. Assume that the von Neumann machine has a multiply-accumulate instruction (not necessarily the case), and that pipelining allows the multiply-accumulate to execute in parallel with the read or write operations. Then one tap needs four cycles:

1. Read the multiply-accumulate instruction.
2. Read the data value from memory.
3. Read the coefficient from memory.
4. Write the data value to the next location in the delay line (because, to start the computation of the next output sample, all values are shifted by one location).

Thus, even if the von Neumann architecture includes a single-cycle multiply-accumulate unit, it will take four cycles to compute one tap. Implementing the same FIR filter on a Harvard architecture reduces the number of cycles to three, because it allows the fetch of the instruction in parallel with the fetch of one of the data items.


This was a fundamental property that distinguished the early DSP processors. On the TMS320C1x, released in the early '80s, it took 2N cycles for an N-tap filter (without the shift of the delay line) [5].

The modified Harvard architecture improves this idea even further. It is combined with a "repeat" instruction and a specialized addressing mode, the circular addressing mode. In this case, one multiply-accumulate instruction is fetched from program memory and kept in the one-instruction-deep instruction "cache." Then the data access cycles are performed in parallel: the coefficient is fetched from the program memory in parallel with the data sample being fetched from data memory. This architecture is found in all early DSP processors and is the foundation for all following DSP architectures. The number of memory accesses for one tap is reduced to two, and these occur in the same cycle. Thus, one tap can execute in one cycle and the multiply-accumulate unit is kept occupied every cycle.

Newer generations of DSP processors have even more memory banks, accompanying address generation units, and control hardware, such as the repeat instruction, to support multiple parallel accesses. The execution of a 32-tap FIR filter on the dual Mac architecture of the Lucent DSP16210, shown in Fig. 42.56, takes only 19 cycles. The corresponding pseudo code is the following:

do 14 {            //one instruction !
    a0 = a0 + p0 + p1
    p0 = xh * yh
    p1 = xl * yl
    y  = *r0++
    x  = *pt0++
}

This code can be executed in 19 clock cycles with only 38 bytes of instruction code. The inner loop takes one cycle to execute and, as can be seen from the assembly code, seven operations are executed in parallel: one addition, two multiplications, two memory reads, and two address pointer updates. Note that the second pointer update, *pt0++, updates a circular address pointer.

Two architectures that speed up the FIR calculation to 0.5 cycle per tap are shown in Fig. 42.56. The first one is the above mentioned Lucent DSP16210. The second one is an architecture presented in [9]; it has a multiply-accumulate unit that operates at double the frequency of the memory accesses. The difficult part in the implementation of this tight loop is the arrangement of the data samples in memory. To supply the parallel Mac data paths, two 32-bit data items are read from memory and stored in the X and Y registers, as shown in Fig. 42.56.

FIGURE 42.56 DSP architectures for 0.5 cycle per FIR tap: (a) Lucent DSP16210 architecture; (b) MAC at double frequency [14].

FIGURE 42.57 Dual Mac architecture with delay register of the Lode DSP core.

A similar split into lower and higher halves occurs in the Intel/ADI Frio core [10]. The data items are split into an upper half and a lower half and supplied to the two 16 × 16 multipliers in parallel, or to the left and right halves of the TEMP registers in Fig. 42.56(b). This requires a correct alignment of the data samples in memory, which is usually tedious work done by the programmer, since compilers are not able to handle it efficiently. Note that a similar problem exists when executing SIMD instructions on general-purpose microprocessors.

Memory accesses are a major energy drain. By rearranging the operations that compute the filter outputs, the number of memory accesses can be reduced. Instead of working on one output sample at a time, two or more output samples are computed in parallel. This is illustrated in the pseudo code below:

y(0) = c(0)x(0) + c(1)x(-1) + c(2)x(-2) + … + c(N-1)x(1-N);
y(1) = c(0)x(1) + c(1)x(0) + c(2)x(-1) + … + c(N-1)x(2-N);
y(2) = c(0)x(2) + c(1)x(1) + c(2)x(0) + … + c(N-1)x(3-N);

…
y(n) = c(0)x(n) + c(1)x(n-1) + c(2)x(n-2) + … + c(N-1)x(n-(N-1));

In the Lode architecture [21], a delay register is introduced between the two Mac units, as shown in Fig. 42.57. This halves the number of memory accesses. Two output samples are calculated in parallel, as in the pseudo code above. One data bus reads the coefficients, c(i), and the other data bus reads the data samples, x(n − i), from memory. The first Mac computes a multiply-accumulate for output sample y(n). The second Mac computes, in parallel, on y(n + 1), using a delayed value of the input sample. In this way, two output samples are computed at the same time. This concept of inserting a delay register can be generalized: when the data path has P Mac units, P − 1 delay registers can be inserted and only 2N/(P + 1) memory accesses are needed for one output sample. These delay registers are pipeline registers, and hence, if more delay registers are used, more initialization and termination cycles need to be introduced.

The idea of working on two output samples at one time is also present in the dual Mac processor of TI, the TI C55x. This processor has a dual Mac architecture with three 16-bit data busses. To supply both Macs with coefficients and data samples, the same principle of computing two output samples at the same time is used. One data bus carries the coefficient and supplies it to both Macs; the other two data busses carry two different data samples and supply them to the two different Macs.

A summary of the different approaches is given in Table 42.2. Note that most energy savings are obtained from reducing the number of memory accesses and, secondly, from reducing the number of instruction cycles. Indeed, the energy associated with the Mac operations can be considered "fundamental" energy: without it, no N-tap FIR filter can be implemented.

Modern processors have multiple address busses, multiple data busses, and multiple memory banks, including both single and dual port memory. They also include mechanisms to assign parts of the physical memory to either memory space, program or data.


TABLE 42.2 Data Accesses, Mac Operations, Instruction Cycles, and Instructions for an N-Tap FIR Filter

DSP                                        Data Memory    MAC          Instruction    Instructions
                                           Accesses       Operations   Cycles
von Neumann                                3N             N            4N             2N
Harvard                                    3N             N            3N             3N
Modified Harvard with modulo arithmetic    2N             N            N              2 (repeat instruction)
Dual Mac or double frequency Mac           2N             N            N/2            2 (same)
Dual Mac with 3 data busses                1.5N           N            N/2            2
Dual Mac with 1 delay register             N              N            N/2            2
Dual Mac with P delay registers            2N/(P + 1)     N            N/(P + 1)      2

FIGURE 42.58 Multiply accumulate unit.

For instance, on the C542 processor, the on-chip dual-access RAM can be assigned to the data space or to the data/program space by setting a specific control bit (the OVLY bit) in a specific control register (the PMST register) [19].
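To make the delay-register idea concrete, the following C sketch computes two output samples per pass over the coefficients, so that each coefficient and each data sample is fetched only once for the pair of outputs. The names and data types are illustrative and do not correspond to any particular instruction set.

#include <stdint.h>

/* Compute y(n) and y(n+1) together.  Each sample read feeds the first
 * accumulator directly and, one iteration later (the role of the delay
 * register), the second accumulator as well. */
void fir_pair(const int16_t *c, const int16_t *x, int N,
              int32_t *y_n, int32_t *y_n1)
{
    int32_t acc0 = 0, acc1 = 0;
    int16_t delayed = x[1];               /* x(n+1): first sample needed for y(n+1) */
    for (int i = 0; i < N; i++) {
        int16_t coeff  = c[i];            /* one coefficient read                   */
        int16_t sample = x[-i];           /* one data read: x(n-i)                  */
        acc0 += (int32_t)coeff * sample;  /* contribution to y(n)                   */
        acc1 += (int32_t)coeff * delayed; /* contribution to y(n+1)                 */
        delayed = sample;                 /* x(n-i) becomes x((n+1)-(i+1)) next time */
    }
    *y_n  = acc0;
    *y_n1 = acc1;
}

Per loop iteration there are two memory reads for two multiply-accumulates, which matches the "Dual Mac with 1 delay register" row of Table 42.2 (N data accesses per output sample).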

DSP Data Paths

The focus of the previous section was on the overall architecture of a DSP processor and its fundamental properties to increase the memory bandwidth. This will keep the data paths of the DSP operating every clock cycle. In this section, some essential properties of the DSP data paths will be described.

Multiply-Accumulate Unit

The unit that is most associated with the DSP is the Mac. It is shown in Fig. 42.58. The most important properties of the Mac unit are summarized below:

• The multiplier takes two 16-bit inputs and produces a 32-bit multiplier output. Internally, the multiplication might be implemented as a 17 × 17 bit multiplier. This way the multiplier can implement both two's complement and unsigned numbers.

FIGURE 42.59 Two data path variations to implement the add-compare-select operation.

• The product register is a pipelined register to speed up the calculation of the multiply-accumulate operation. As a result, the Mac operation effectively executes in one cycle on most processors, although the latency can be two cycles.

• The accumulator registers are usually 40 bits long. Eight bits are designated as "guard" bits [11]. This allows the accumulation of 2^8 products before there is a need for scaling, truncation, or saturation. These larger word lengths are very effective in implementing DSP functions such as filters. The disadvantage is that special registers such as these accumulators are very hard for a compiler to handle.

Viterbi Acceleration Unit

Convolutional decoding, and more specifically the Viterbi algorithm, has been recognized as one of the main, if not the most, MIPS-consuming applications in current and next generation standards. The key issue is to reduce the number of memory accesses and, secondly, the number of operations needed to implement the algorithm. The kernel of the algorithm is the Viterbi butterfly, as shown in Fig. 42.52. The basic equations executed in this butterfly are:

d(2i)     = min{ d(i) + a, d(i + s/2) − a }
d(2i + 1) = min{ d(i) − a, d(i + s/2) + a }

These equations are implemented by the "add-compare-select" (ACS) instruction and its associated data path unit. Indeed, one needs to add or subtract the branch metric from states i and i + s/2, compare the results, and select the minimum. In parallel, state 2i + 1 is updated. The butterfly arrangement is chosen because it reduces the number of memory accesses by half: the two states that use the same data to update the same two next states are combined.

DSP processors have special hardware and instructions to implement the ACS operation in the most efficient way. The Lode architecture [21] uses the two Mac units and the ALU to implement the ACS operation, as shown in Fig. 42.59(a). The dual Mac operates as a dual add/subtract unit. The ALU finds the minimum. The shortest distance is saved to memory and the path indicator, i.e., the decision bit, is saved in a special shift register A2. This results in four cycles per butterfly. The Texas Instruments TMS320C54x and the Matsushita processor described in [14,22] use a different approach that also results in four cycles per butterfly. This is illustrated in Fig. 42.59(b). The ALU and the accumulator are split into two halves (much like SIMD instructions), and the two halves operate independently. A special compare, select, and store unit (CSSU) compares the two halves, selects the chosen one, and writes the decision bit into a special register TRN. The processor described in [14] has two ACS units in parallel. One should note that without these specialized instructions and hardware, one butterfly requires 15 to 25 or more instructions.
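Written out in plain C, without the specialized hardware, one butterfly looks roughly as follows; the array names and the decision-bit storage are illustrative choices of this sketch.

/* One Viterbi butterfly: old states i and i+s/2 update new states 2i and
 * 2i+1, where a is the branch metric and s is the number of trellis states.
 * The decision bits record the surviving predecessor for the trace-back. */
void acs_butterfly(const int *d_old, int *d_new, unsigned char *decision,
                   int i, int s, int a)
{
    int p0 = d_old[i] + a;                 /* candidates for state 2i     */
    int p1 = d_old[i + s / 2] - a;
    d_new[2 * i]       = (p0 < p1) ? p0 : p1;
    decision[2 * i]    = (p0 < p1) ? 0 : 1;

    int q0 = d_old[i] - a;                 /* candidates for state 2i + 1 */
    int q1 = d_old[i + s / 2] + a;
    d_new[2 * i + 1]    = (q0 < q1) ? q0 : q1;
    decision[2 * i + 1] = (q0 < q1) ? 0 : 1;
}

Each call performs two additions, two subtractions, two comparisons, and two selections; this is the work that the ACS hardware described above collapses into a few cycles.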


TABLE 42.3 Number of Parallel Address Generation Units for a Few DSP Processors

Processor     Data Address Generation Units               Program Address Generation Units
C5x [18]      1 (ARAU)                                    1
C54x [19]     2 (DAGEN has two units: ARAU0, ARAU1)       1
Lode [21]     2 (ACU0, ACU1)                              1
Frio [10]     2                                           1

DSP Memory and Address Calculation Units

Besides the data paths optimized for signal processing and communication applications, DSP processors also have specialized address calculation units. As explained in the section "DSP Architecture," the parallel memory maps in the Harvard or modified Harvard architecture are essential for the data processing in DSP processors; however, to avoid overloading the regular data path units, specialized address generation units are included. In general, the number of address generation units is the same as the maximum number of parallel memory accesses that can occur in one cycle. A few examples are shown in Table 42.3. Older processors, such as the C5x with a modified Harvard architecture, have one address generation unit serving the data address bus, and one program address generation unit serving the program address bus. When the number of address busses goes up, so does the number of arithmetic units inside the address calculation units. For instance, the Frio [10] has two address busses served by two ALUs inside the data address generation unit.

The address generation units themselves are optimized to perform address arithmetic in an efficient way. This includes data paths with the correct word lengths. It also includes all the typical address modifications that are common in DSP applications. For instance, indirect addressing with a simple increment can easily be done and expressed in the instruction syntax. More advanced addressing modes include circular buffering, which especially suits filter operations, and bit-reversed addressing, which is especially useful for fast Fourier transforms, and so on. There exist many good instruction manuals that describe the detailed operation of these specialized addressing modes [11,18,19].
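As a point of comparison, the following C sketch spells out what circular (modulo) addressing does; on a DSP the wrap-around test is folded into the address post-modification and costs no data path cycles. The buffer length and names are illustrative.

#include <stdint.h>

#define DELAY_LEN 64                      /* illustrative delay-line length */

/* Write a sample into a circular delay line and return the next index.
 * The explicit wrap-around below is exactly what a circular addressing
 * mode performs implicitly during the post-modified memory access. */
int circular_write(int16_t buf[DELAY_LEN], int index, int16_t sample)
{
    buf[index] = sample;
    index = index + 1;
    if (index == DELAY_LEN)               /* modulo update of the pointer */
        index = 0;
    return index;
}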

DSP Pipeline

The pipeline of a DSP processor is different from the pipeline of a general purpose, control-oriented microprocessor. The basic slots of the DSP pipeline and the RISC pipeline are shown in Fig. 42.60. In a DSP processor, the memory access stage, in parallel with the address generation (usually "post-modification"), occurs before the execute stage. An example is described in [10]. In a RISC processor the memory access stage follows the execute stage [8], because the execute stage is used to calculate the address on the main ALU.

The fundamental reason for this difference in pipeline structure is that DSP processors are optimized for memory-intensive, number-crunching types of applications (e.g., FIR filters), while RISC-type processors, including micro-controllers and micro-processors, are optimized for complex decision making. This is explained in Figs. 42.61 and 42.62. Typical of real-time, compute-intensive applications are continuous memory accesses followed by operations in the data path units. A typical example is the execution of the FIR filter as shown in the FIR pseudo code above. On a DSP processor, the memory access and the multiply-accumulate operation are specified in one instruction and follow each other in the pipeline. The same operation on a RISC machine needs three instruction slots: the first instruction slot reads the value from memory, and only in the third instruction slot does the actual computation take place. If these delays are not obeyed, a data hazard will occur [8]. Fixing data hazards leads to a large instruction and cycle overhead.

Similarly, it can be argued that branches have a larger penalty on DSP processors than on RISC machines. The reason is explained in Fig. 42.62. If a data dependent branch needs to be executed, e.g., "branch if accumulator is zero," then this instruction cannot follow immediately after the instruction that sets the accumulator flag.

FIGURE 42.60 Basic pipeline architecture for a RISC and a DSP processor.

FIGURE 42.61 Memory-intensive number crunching on a RISC and a DSP (RISC: r0 = *p0 followed by a0 = a0 + r0; DSP: a0 = a0 + *p0 in a single instruction).

In the simple examples of Fig. 42.62, there need to be two and three instruction cycles, respectively, between the setting of the accumulator flag and its use in the decode stage, for the RISC and the DSP processor. Therefore, the RISC has an advantage for control dominated applications.

In practice these pipeline hazards are either hidden from the programmer by hardware solutions (e.g., forwarding or stalls) or they are visible to the programmer, who can optimize his code around them. A typical example is the "delayed branch" instruction in DSP processors. Because an instruction is fetched in the cycle before it is decoded, a regular branch instruction will incur an unnecessary fetch of the instruction that follows the branch in memory. To optimize the code in DSP processors, the delayed branch instruction is introduced. In this case, the instruction that follows the branch instruction in memory is executed before the actual branch takes place. Hence, a delayed branch instruction takes effectively one cycle to execute, while a regular branch takes two cycles.

FIGURE 42.62 Decision making (branch) on a RISC and a DSP.

The delayed branch is a typical example of how sensitive DSP programmers are to code size and code execution. Indeed, for embedded applications, minimum code size is a requirement.

Conclusions and Future Trends

DSP processors are a special type of processor, very different from the general-purpose microcontroller or microprocessor architectures. As argued in this chapter section, this is visible in all components of the processor: the overall architecture, the data paths, the address generation units, and even the pipeline structure.

The applications of the future will keep on driving the next generation of DSP processors. Several trends are visible. Clearly, there is a need to support higher level languages and compilers. Traditional compiler techniques do not produce efficient, optimized code for DSP processors. Compiler technology has only recently started to address the needs of low power embedded applications. But the architectures will also need changes to accommodate the compiler techniques. One drastic approach is the appearance of VLIW architectures, for which efficient compiler techniques are known. This results, however, in code size explosion associated with a large increase in power consumption. A hybrid approach might be a better solution. For example, the processor described in [10] has unified register files, yet it also makes an exception for the accumulators.

Another challenge is the increased demand for performance while reducing the power consumption. Next generation wireless portable applications will not only provide voice communication, but also video, text and data, games, and so on. On top of this, the applications will change and will need reconfiguration while in use. This will require a power-efficient way of runtime reconfiguration [16]. The systems on a chip that implement these wireless communication devices will include a large set of heterogeneous programmable and reconfigurable modules, each optimized to the application running on them. Several of these will be DSP processors, and they will be crucial for the overall performance.

Acknowledgments

The author thanks Mr. Katsuhiko Ueda of Matsushita Electric Co., Japan, for the interesting discussions and for providing several figures in this chapter section.


References

1. Bahl L., Cocke J., Jelinek F., Raviv J., "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Information Theory, vol. IT-20, pp. 284–287, March 1974.
2. Berrou C., Glavieux A., Thitimajshima P., "Near Shannon limit error-correcting coding and decoding: turbo-codes (1)," Proc. ICC '93, May 1993.
3. Catthoor F., De Man H., "Application-specific architectural methodologies for high-throughput digital signal and image processing," IEEE Transactions on ASSP, Feb. 1990.
4. Forney G., "The Viterbi algorithm," Proceedings of the IEEE, vol. 61, no. 3, pp. 268–278, March 1973.
5. Gass W., Bartley D., "Programmable DSPs," Chapter 9 in Digital Signal Processing for Multimedia Systems, Parhi K., Nishitani T. (Eds.), Marcel Dekker Inc., New York, 1999.
6. Gatherer A., Stetzler T., McMahan M., Auslander E., "DSP-based architectures for mobile communications: past, present, future," IEEE Communications Magazine, pp. 84–90, Jan. 2000.
7. Hagenauer J., Hoeher P., "A Viterbi algorithm with soft-decision outputs and its applications," Proc. Globecom '89, pp. 47.1.1–47.1.7, Nov. 1989.
8. Hennessy J., Patterson D., Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann Publ., San Francisco, CA, 1996.
9. Kabuo H., Okamoto M., et al., "An 80 MOPS peak high speed and low power consumption 16-bit digital signal processor," IEEE Journal of Solid-State Circuits, vol. 31, no. 4, pp. 494–503, 1996.
10. Kolagotla R., et al., "A 333 MHz dual-MAC DSP architecture for next-generation wireless applications," Proceedings ICASSP, Salt Lake City, UT, May 2001.
11. Lapsley P., Bier J., Shoham A., Lee E.A., DSP Processor Fundamentals: Architectures and Features, IEEE Press, 1996.
12. Lee E.A., "Programmable DSP architectures: Part I and Part II," IEEE ASSP Magazine, pp. 4–19, Oct. 1988, and pp. 4–14, Jan. 1989.
13. McMahan M.L., "Evolving cellular handset architectures but a continuing, insatiable desire for DSP MIPS," Texas Instruments Technical Journal, vol. 17, no. 1, Jan.–Mar. 2000; reprinted as Application Report SPRA650, March 2000.
14. Okamoto M., Stone K., et al., "A high performance DSP architecture for next generation mobile phone systems," 1998 IEEE DSP Workshop.
15. Oliphant M., "The mobile phone meets the internet," IEEE Spectrum, pp. 20–28, Aug. 1999.
16. Schaumont P., Verbauwhede I., Keutzer K., Sarrafzadeh M., "A quick safari through the reconfiguration jungle," Proceedings 38th Design Automation Conference, Las Vegas, NV, June 2001.
17. Strauss W., "Digital signal processing, the new semiconductor industry technology driver," IEEE Signal Processing Magazine, pp. 52–56, March 2000.
18. Texas Instruments, TMS320C5x User's Guide, document SPRU056B, Jan. 1993.
19. Texas Instruments, TMS320C54x DSP CPU Reference Guide, document SPRU131G, March 2001.
20. Verbauwhede I., Scheers C., Rabaey J., "Analysis of multidimensional DSP specifications," IEEE Transactions on Signal Processing, vol. 44, no. 12, pp. 3169–3174, Dec. 1996.
21. Verbauwhede I., Touriguian M., "Wireless digital signal processing," Chapter 11 in Digital Signal Processing for Multimedia Systems, Parhi K., Nishitani T. (Eds.), Marcel Dekker Inc., New York, 1999.
22. Verbauwhede I., Nicol C., "Low power DSPs for wireless communications," Proceedings ISLPED, pp. 303–310, Aug. 2000.


43
Data Security

Matt Franklin
University of California at Davis

43.1 Introduction
43.2 Unkeyed Cryptographic Primitives
     Random Oracle Model
43.3 Symmetric Key Cryptographic Primitives
     Symmetric Key Block Ciphers • Symmetric Key Stream Ciphers • Message Authentication Codes
43.4 Asymmetric Key Cryptographic Primitives
     Public Key Encryption Schemes • Digital Signature Schemes • Advanced Topics for Public Key Cryptography
43.5 Other Resources

43.1 Introduction

Cryptography is the science of data security. This chapter gives a brief survey of cryptographic practice and research. The chapter is organized along the lines of the principal categories of cryptographic primitives: unkeyed, symmetric key, and asymmetric key. For each of these categories, the chapter defines the important primitives, gives security models and attack scenarios, discusses constructions that are popular in practice, and describes current research activity in the area. Security is defined in terms of the goals and resources of the attacker.

43.2 Unkeyed Cryptographic Primitives

The main unkeyed cryptographic primitive is the cryptographic hash function. This is an efficient function from bit strings of any length to bit strings of some fixed length (say 128 or 160 bits). The description of the function is assumed to be publicly available. If H is a hash function, and if y = H(x), then y is called the "hash" or "hash value" of x. One desirable property of a cryptographic hash function is that it should be difficult to invert. This means that given a specific hash value y it is computationally infeasible to produce any x such that H(x) = y. Another desirable property is that it should be difficult to find "collisions." This means that it is computationally infeasible to produce two inputs x and x′ such that H(x) = H(x′). The attacker is assumed to know a complete specification of the hash function.

A cryptographic hash function can be used for establishing data integrity. Suppose that the hash of a large file is stored in a secure location, while the file itself is stored in an insecure location. It is infeasible for an attacker to modify the file without detection, because a re-hash of the modified file will not match the stored hash value (unless the attacker was able to invert the hash function). We will see other applications of cryptographic hash functions when we look at asymmetric cryptographic primitives in Section 43.4. Popular choices for cryptographic hash functions include MD-5 [1], RIPEMD-160 [2], and SHA-1 [3]. It is also common to construct cryptographic hash functions from symmetric key block ciphers [4].


Random Oracle Model

One direction of recent research is on the "random oracle model." This is a design methodology for protocols and primitives that make use of cryptographic hash functions. Pick a specific cryptographic hash function such as MD-5. Its designers may believe that it is difficult to invert MD-5 or to find collisions for it. However, this does not mean that MD-5 is a completely unpredictable function, with no structure or regularity whatsoever. After all, the complete specification of MD-5 is publicly available for inspection and analysis, unlike a truly random function that would be impossible to specify in a compact manner. Nevertheless, the random oracle model asserts that a specific hash function like MD-5 behaves like a purely random function. This is part of a methodology for proving security properties of cryptographic schemes that make use of hash functions. This assumption was introduced by Fiat and Shamir [5] and later formalized by Bellare and Rogaway [6]. It has been applied to the design and analysis of many schemes (see, e.g., the discussion of optimal asymmetric encryption padding in the subsection on "Chosen Ciphertext Security for Public Key Encryption").

Recently, a cautionary note was sounded by Canetti, Goldreich, and Halevi [7]. They demonstrate by construction that it is possible for a scheme to be secure in the random oracle model and yet have no secure instantiation whatsoever when any hash function is substituted. This is a remarkable theoretical result; however, the cryptographic community continues to base their designs on the random oracle model, and with good reason. Although it cannot provide complete assurance about the security of a design, a proof in the random oracle model provides confidence about the impossibility of a wide range of attacks. Specifically, it rules out common attacks where the adversary ignores the inner workings of the hash function and treats it as a "black box." The vast majority of protocol failures are due to this kind of black box attack, and thus the random oracle model remains an invaluable addition to the cryptographer's tool kit.

43.3 Symmetric Key Cryptographic Primitives

This section discusses the main symmetric key cryptographic primitives: block ciphers, stream ciphers, and message authentication codes.

Symmetric Key Block Ciphers

A symmetric key block cipher is a parameterized family of functions EK, where each EK is a permutation on the space of bit strings of some fixed length. The input to EK is called the "plaintext" block, the output is called the "ciphertext" block, and K is called the "key." The function EK is called an "encryption" function. The inverse of EK is called a "decryption" function, and is denoted DK. To encrypt a message that is longer than the fixed-length block, it is typical to employ a block cipher in a well-defined "mode of operation." Popular modes of operation include output feedback mode, cipher feedback mode, and cipher block chaining mode; see [8] for a good overview. In this way, the plaintext and ciphertext can be bit strings of arbitrary (and equal) length. New modes of operation are being solicited in connection with the development of the Advanced Encryption Standard (see subsection "Advanced Encryption Standard (AES)").

The purpose of symmetric key encryption is to provide data confidentiality. Security can be stated at a number of levels. It is always assumed that the attacker has access to a complete specification of the parameterized family of encryption functions and to a ciphertext of adequate length. Beyond that, the specific level of security depends on the goals and resources of the attacker. An attacker might attempt a "total break" of the cipher, which would correspond to learning the key K. An attacker might attempt a "partial break" of the cipher, which would correspond to learning some or all of the plaintext for a given ciphertext. An attacker might have no resources beyond a description of the block cipher and a sample ciphertext, in which case he is mounting a "ciphertext-only attack." An attacker might mount a "known-plaintext attack" if he is given a number of plaintext-ciphertext pairs to work with (input-output pairs for the encryption function). If the attacker is allowed to choose plaintexts and then see the corresponding ciphertexts, then he is engaged in a "chosen-plaintext attack."


Symmetric key block ciphers are valuable for data secrecy in a storage scenario (encryption by the data owner for an insecure data repository, and subsequent decryption by the data owner at a later time), or in a transmission scenario (across an insecure channel between a sender and receiver that have agreed on the secret key beforehand). Perhaps the most popular symmetric key block cipher for the past 25 years has been the Data Encryption Standard (DES) [9], although it may be near the end of its useful life. NIST recently announced the Advanced Encryption Standard (AES) block cipher, which we discuss in the subsection "Advanced Encryption Standard (AES)."

Most modern block ciphers have an "iterated" design, where a "round function" is repeated some fixed number of times (e.g., DES has 16 rounds). Many modern block ciphers have a "Feistel structure" [10], which is an iterated design of a particular type. Let (Lj−1, Rj−1) denote the output of the (j − 1)th round, divided into two halves for notational convenience. Then the output of the jth round is (Lj, Rj), where Lj = Rj−1 and Rj = Lj−1 xor f(Rj−1, Kj) for some function f. Here Kj is the jth "round key," derived from the secret key according to some fixed schedule. Note that a block cipher with a Feistel structure is guaranteed to be a permutation even if the function f is not invertible.
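The sketch below shows this structure in C. The round function f and the round keys are toy placeholders meant only to illustrate the data flow; this is not a secure cipher, and the swap conventions of real designs such as DES differ in minor details.

#include <stdint.h>

#define ROUNDS 16

/* Toy round function: any mapping of (half-block, round key) to a
 * half-block will do -- the Feistel structure does not require f to be
 * invertible for the whole cipher to be a permutation. */
static uint32_t f(uint32_t r, uint32_t k)
{
    uint32_t t = r ^ k;
    return (t << 3) | (t >> 29);           /* placeholder mixing only */
}

/* One encryption: L_j = R_{j-1},  R_j = L_{j-1} xor f(R_{j-1}, K_j). */
void feistel_encrypt(uint32_t *L, uint32_t *R, const uint32_t K[ROUNDS])
{
    for (int j = 0; j < ROUNDS; j++) {
        uint32_t newL = *R;
        uint32_t newR = *L ^ f(*R, K[j]);
        *L = newL;
        *R = newR;
    }
}

/* Decryption undoes the rounds by applying the round keys in reverse. */
void feistel_decrypt(uint32_t *L, uint32_t *R, const uint32_t K[ROUNDS])
{
    for (int j = ROUNDS - 1; j >= 0; j--) {
        uint32_t oldR = *L;
        uint32_t oldL = *R ^ f(*L, K[j]);
        *L = oldL;
        *R = oldR;
    }
}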
Differential Cryptanalysis

Differential cryptanalysis is a powerful statistical attack that can be applied to many symmetric key block ciphers and unkeyed cryptographic hash functions. The first publication on differential cryptanalysis is due to Biham and Shamir [11], but Coppersmith [12] has described how the attack was understood during the design of DES in the early 1970s. The central idea of differential cryptanalysis for block ciphers is to sample a large number of pairs of ciphertexts for which the corresponding plaintexts have a known fixed difference D (under the operation of bitwise exclusive-or). The difference D leads to a good "characteristic" if the XOR of the ciphertexts (or of an intermediate result during the computation of the ciphertext) can be predicted with a relatively large probability. By calculating the frequency with which every difference of plaintexts and every difference of ciphertexts coincides, it is possible to deduce some of the key bits through a statistical analysis of a sufficiently large sample of these frequencies. For a differential cryptanalysis of DES, the best attack that Biham and Shamir discovered requires 2^47 chosen plaintext pairs with a given difference. They note that making even slight changes to the S-boxes (the nonlinear substitution transformation at the heart of DES) can lead to a substantial weakening with respect to a differential attack.

Linear Cryptanalysis

Linear cryptanalysis is another powerful attack that can be applied to many symmetric key block ciphers and unkeyed cryptographic hash functions. Consider the block cipher as being a composition of linear and nonlinear functions. The goal of linear cryptanalysis is to discover linear approximations for the nonlinear components. These approximations can be folded into the specification of the block cipher, and then expanded to find an approximate linear expression for the ciphertext output bits in terms of plaintext input bits and secret key bits. If the approximations were in fact perfect, then enough plaintext-ciphertext pairs would yield a system of linear equations that could be solved for the secret key bits; however, even when the approximations are far from perfect, they enable a successful statistical search for the key, given enough plaintext-ciphertext pairs. This is a known-plaintext attack, unlike differential cryptanalysis, which is chosen-plaintext. Linear cryptanalysis was introduced by Matsui and Yamagishi [13]. Matsui applied linear cryptanalysis to DES [14]. In his best attack, 2^43 known plaintexts are required to break DES with an 85% probability. See Langford and Hellman [15] for close connections between differential and linear cryptanalysis.

Advanced Encryption Standard (AES)

In 1997, NIST began an effort to develop a new symmetric key encryption algorithm as a Federal Information Processing Standard (FIPS). The goal was to replace DES, which was widely perceived to be at the end of its usefulness.


A new algorithm was sought, with longer key and block sizes, and with increased resistance to newly revealed attacks such as linear cryptanalysis and differential cryptanalysis. The AES was to support 128-bit block sizes, and key sizes of 128, 192, or 256 bits. By contrast, DES supported a 64-bit block size and a key size of 56 bits. Fifteen algorithms were proposed by designers around the world. This was reduced to five finalists, announced by NIST in 1999: MARS, RC6, Rijndael, Serpent, and TwoFish. In 2000, Rijndael was selected as the Advanced Encryption Standard.

Rijndael has a relatively simple structure; however, unlike many earlier block ciphers (such as DES), it does not have a Feistel structure. The operation of Rijndael proceeds in rounds. Imagine that the block to be encrypted is written as a rectangular array of byte-sized words (four rows and four columns). First, each byte in the array is replaced by a different byte, according to a single fixed lookup table (S-box). Next, each row of the array undergoes a circular shift by a fixed amount. Next, a fixed linear transformation is applied to each column in the array. Last, the entire array is exclusive-ored with a "round key." All of the round keys are calculated by expanding the original secret key bits according to a simple key schedule. Note that the only nonlinear component is the S-box substitution step. Details of Rijndael's operation can be found in [16].

Symmetric Key Stream Ciphers

Stream ciphers compute ciphertext one character at a time, where the characters are often individual bits. By contrast, block ciphers compute ciphertext one block at a time, where the block is much larger (64 bits long for DES, 128 bits long for AES). Stream ciphers are often much faster than block ciphers. The typical operation of a stream cipher is to exclusive-or message bits with a "key stream." If the key stream were truly random, this would describe the operation of a "one-time pad." The key stream is not truly random, but it is instead derived from the short secret key.

A number of stream ciphers have been optimized for hardware implementation. The use of linear feedback shift registers is especially attractive for hardware implementation, but unfortunately these are not sufficiently secure when used alone. The Berlekamp–Massey algorithm [17] allows a hidden linear feedback shift register to be determined from a very short sequence of output bits. In practice, stream ciphers for hardware often combine linear feedback shift registers with nonlinear components to increase security. One approach is to apply a nonlinear function to the output of several linear feedback shift registers that operate in parallel ("nonlinear combination generator"). Another approach is to apply a nonlinear function to all of the states of a single linear feedback shift register ("nonlinear filter generator"). Still another approach is to have the output of one linear feedback shift register determine when a step should be taken in other linear feedback shift registers ("clock-controlled generator"). Some stream ciphers have been developed to be especially fast when implemented in software, e.g., RC5 [18]. Certain modes of operation for block ciphers can be viewed as symmetric key stream ciphers (output feedback mode and cipher feedback mode).
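The sketch below shows a single Fibonacci-style LFSR producing one keystream bit per step; the register length and tap positions are illustrative, and, as noted above, such a generator is not secure on its own.

#include <stdint.h>

/* 16-bit Fibonacci LFSR with an illustrative set of feedback taps.
 * Each call shifts the register once and returns one keystream bit. */
static uint16_t lfsr_state = 0xACE1u;      /* any nonzero seed */

int lfsr_step(void)
{
    /* feedback bit = xor of the tapped state bits */
    uint16_t bit = ((lfsr_state >> 0) ^ (lfsr_state >> 2) ^
                    (lfsr_state >> 3) ^ (lfsr_state >> 5)) & 1u;
    lfsr_state = (uint16_t)((lfsr_state >> 1) | (bit << 15));
    return lfsr_state & 1u;                /* one bit of keystream */
}

/* Stream "encryption": xor plaintext bytes with keystream bits. */
uint8_t stream_xor_byte(uint8_t in)
{
    uint8_t ks = 0;
    for (int i = 0; i < 8; i++)
        ks = (uint8_t)((ks << 1) | lfsr_step());
    return in ^ ks;
}

Recovering the internal state of such a register from a short run of its output is exactly what the Berlekamp–Massey algorithm does, which is why practical designs add the nonlinear combining, filtering, or clock-control stages described above.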

Message Authentication Codes

A message authentication code (MAC) is a keyed cryptographic hash function. It computes a fixed-length output (tag) from an input of any length (message). When both the sender and the receiver know the secret key, a MAC can be used to transmit information with integrity. Without knowing the key, it is very difficult for an attacker to modify the message and/or the tag so that the hash relation is maintained. The MAC in the symmetric key setting is the analog of the digital signature in the asymmetric key setting. The notion of message authentication in the symmetric key setting goes back to Gilbert, MacWilliams, and Sloane [19].

Security for MACs can be described with respect to different attack scenarios. The attacker is assumed to know a complete specification of the hash function, but not the secret key. The attacker might attempt to insert a new message that will fool the receiver, or the attacker might attempt to learn the secret key. The attacker might get to see some number of message-tag pairs, either for random messages or for messages chosen by the attacker.


One popular MAC is the CBC-MAC, which is derived from a block cipher (such as DES) run in cipher block chaining mode. Another approach is to apply an unkeyed cryptographic hash function after the message has been combined with the key according to some pre-packaging transform. Care must be taken with the choice of transform; one popular choice is HMAC [20]. The UMAC construction [21] has been optimized for extremely fast implementation in software, while maintaining provable security. Jutla [22] recently showed especially efficient methods for combining message authentication with encryption, by using simple variations on some popular modes of operation for symmetric key block ciphers.
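As an illustration, the following sketch computes a CBC-MAC over 16-byte blocks using a caller-supplied block cipher; the function pointer stands in for a real cipher such as DES or AES, the message length is assumed to be a multiple of the block size, and no length padding is shown.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 16   /* block size in bytes (e.g., AES); illustrative */

/* encrypt one BLK-byte block in place under the given key */
typedef void (*block_cipher_fn)(uint8_t block[BLK], const void *key);

/* CBC-MAC: run the cipher in CBC mode with a zero IV and keep only
 * the final chaining value as the tag. */
void cbc_mac(block_cipher_fn enc, const void *key,
             const uint8_t *msg, size_t len, uint8_t tag[BLK])
{
    memset(tag, 0, BLK);                       /* zero IV */
    for (size_t off = 0; off + BLK <= len; off += BLK) {
        for (int i = 0; i < BLK; i++)          /* chain in the next block */
            tag[i] ^= msg[off + i];
        enc(tag, key);                         /* encrypt the chaining value */
    }
}

Note that plain CBC-MAC as sketched is only safe when all messages have the same fixed length; practical standards add length padding or key-derivation steps to handle variable-length messages.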

43.4 Asymmetric Key Cryptographic Primitives

Two asymmetric key cryptographic primitives are discussed in this section: public key encryption schemes and digital signature schemes.

Public Key Encryption Schemes

A public key encryption scheme is a method for deriving an encryption function EK and a corresponding decryption function DK such that it is computationally infeasible to determine DK from EK. The encryption function EK is made public, so that anyone can send encrypted messages to the owner of the key. The decryption function DK is kept secret, so that only the owner of the key can read encrypted messages. The functions are inverses of each other, so that DK(EK(M)) = M for every message M. Unlike the symmetric key setting, there is no need for the sender and receiver to pre-establish a secret key before they can communicate securely.

Security for a public key encryption scheme relates to the resources and goals of the attacker. The attacker is assumed to have a complete description of the scheme, as well as the public encryption key EK. Thus, the attacker is certainly able to encrypt arbitrary messages ("chosen-plaintext attack"). The attacker might be able to decrypt arbitrary messages ("chosen-ciphertext attack," discussed in more detail in the subsection on "Chosen Ciphertext Security for Public Key Encryption"). The goal of the attacker might be to deduce the decryption function DK ("total break"), or simply to learn all or some information about the plaintext corresponding to a particular ciphertext ("partial break"), or merely to guess which of two plaintexts is encrypted by a given ciphertext ("indistinguishability").

The idea of public key encryption is due to Diffie and Hellman [23]. Most popular public key encryption schemes base their security on the hardness of some problem from number theory. The first public key encryption scheme proposed remains one of the most popular today: the RSA scheme due to Rivest, Shamir, and Adleman [24]. Other popular public key encryption schemes are based on the "discrete logarithm problem," including ElGamal [25] and elliptic curve variants [26].

For efficiency purposes, public key encryption is often used in a hybrid manner (called "key transport"). Suppose that a large message M is to be encrypted using a public encryption key EK. The sender chooses a random key k for a symmetric key block cipher such as AES. The sender then transmits EK(k), AESk(M). The first component enables the receiver to recover the symmetric key k, which can be used to decrypt the second component to recover M. The popular e-mail security protocol PGP uses this method (augmented with an integrity check). It is also possible to use a "key agreement protocol" to establish a secret key over an insecure public channel, and then to use the secret key in a symmetric key block cipher. The idea is due to Diffie and Hellman [22], and the original Diffie–Hellman key agreement protocol is still widely used in practice.
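A toy version of the Diffie–Hellman exchange is sketched below with deliberately tiny parameters so the arithmetic fits in 64-bit integers; real deployments use primes of at least 1024 bits and carefully chosen groups.

#include <stdint.h>
#include <stdio.h>

/* modular exponentiation by repeated squaring: g^e mod m */
static uint64_t modexp(uint64_t g, uint64_t e, uint64_t m)
{
    uint64_t result = 1 % m;
    g %= m;
    while (e > 0) {
        if (e & 1)
            result = (result * g) % m;
        g = (g * g) % m;
        e >>= 1;
    }
    return result;
}

int main(void)
{
    const uint64_t p = 2147483647;     /* toy prime (2^31 - 1) */
    const uint64_t g = 7;              /* toy base             */
    uint64_t a = 123456789;            /* Alice's secret exponent */
    uint64_t b = 987654321;            /* Bob's secret exponent   */

    uint64_t A = modexp(g, a, p);      /* Alice sends g^a mod p */
    uint64_t B = modexp(g, b, p);      /* Bob   sends g^b mod p */

    uint64_t kA = modexp(B, a, p);     /* both sides compute g^(ab) mod p */
    uint64_t kB = modexp(A, b, p);
    printf("shared secrets match: %d\n", kA == kB);
    return 0;
}

An eavesdropper sees p, g, g^a mod p, and g^b mod p; recovering the shared value from these is the Diffie–Hellman problem discussed in the subsection on new hardness assumptions.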

Digital Signature Schemes

A digital signature scheme is a method for deriving a signing function SK and a corresponding verification function VK, such that it is computationally infeasible to derive SK from VK. The verification function VK is made public, so that anyone can verify a signature made by the owner of the signing key. The signing function SK is kept secret, so that only the owner of the signing key can sign messages.


The signing function and verification function are related as follows: if the signature of a message M is SK(M), then it should be the case that VK(SK(M)) = "valid" for all messages M.

Security for a digital signature scheme depends on the goals and resources of the attacker [27]. The attacker is assumed to know a complete specification of the digital signature scheme, and the verification function VK. The attacker might also get to see message-signature pairs for random messages ("known message attack"), or for arbitrary messages chosen by the attacker ("chosen message attack"). The goal of the attacker might be to derive the signing function ("total break"), or to forge a signature on a particular message ("selected message forgery"), or to forge any new message–signature pair ("existential message forgery").

In practice, a signing function is applied not to the message itself, but rather to the hash of the message (i.e., to the output of an unkeyed cryptographic hash function applied to the message). The security of the signature scheme is then related to the security of the hash function. For example, if a collision can be found for the hash function, then an attacker can produce an existential message forgery under a chosen message attack (by finding a collision on the hash function, and then asking for the signature of one of the colliding inputs). One of the most popular digital signature schemes is RSA (based on the same primitive as RSA public key encryption, where SK = DK and VK = EK). Other popular digital signature schemes include the digital signature algorithm (DSA) [28] and ElGamal [25].

Advanced Topics for Public Key Cryptography

Chosen Ciphertext Security for Public Key Encryption

As discussed earlier, a number of definitions for the security of a public key encryption scheme have been proposed. Chosen ciphertext security is perhaps the strongest natural definition, and it has emerged as the consensus choice among cryptographers as the proper notion of security to try to achieve. This is not to say that chosen ciphertext security is necessary for all applications, but having a single encryption scheme that is chosen ciphertext secure allows it to be used in the widest possible range of applications.

The strongest version of the definition of chosen ciphertext security is due to Rackoff and Simon [29], building on a slightly weaker definition of Naor and Yung [30]. It can be described as a game between an adversary and a challenger. The challenger chooses a random public key and corresponding private key [EK, DK], and tells the public key EK to the adversary. The adversary is then allowed to make a series of decryption queries to the challenger, sending arbitrary ciphertexts to the challenger and receiving their decryptions in reply. After this stage, the adversary chooses two messages M0 and M1 whose encryptions he thinks will be particularly easy to distinguish between. The adversary sends M0 and M1 to the challenger. The challenger chooses one of these messages at random; call it Mb, where b is a random bit. The challenger encrypts Mb and sends the ciphertext C to the adversary. Now the adversary attempts to guess whether C is an encryption of M0 or M1. To help him with his guess, he is allowed to engage in another series of decryption queries with the challenger. The only restriction is that the adversary may never ask the challenger to directly decrypt C. At some point the adversary makes his guess for Mb. If the adversary can win this game with any nonnegligible advantage (i.e., with probability 1/2 plus 1/k^c, where k is the length of the private key and c is any positive constant), then we say that he has mounted a successful chosen ciphertext attack. If no adversary (restricted to the class of probabilistic polynomial time Turing machines) can mount a successful chosen ciphertext attack, then we say that the cryptosystem is chosen ciphertext secure.

This might seem like overkill for a definition of security. Unlimited access to a decryption oracle might seem like an unrealistically strong capability for the attacker. Merely distinguishing between two plaintexts might seem like an unrealistically weak goal for the attacker. Nevertheless, this definition has proven to be a good one for several reasons. First, it has been shown to be equivalent to other natural and strong definitions of security [31]. Second, Bleichenbacher [32] showed that a popular standard (RSA PKCS #1) was vulnerable to a chosen ciphertext attack in a practical scenario.


In the random oracle model, chosen ciphertext security can be achieved by combining a basic public key encryption scheme such as RSA with a simple "prepackaging" transform. Such a transform uses random padding and unkeyed cryptographic hash functions to scramble the message prior to encryption. The prepackaging transform is invertible, so that the message can be unscrambled after the ciphertext is decrypted. The optimal asymmetric encryption padding (OAEP) transform takes an m-bit message M and a random bit string R of length s, and outputs

OAEP(M, R) = ((M || 0^s) xor H(R)) || (R xor G((M || 0^s) xor H(R))).

Here G and H are unkeyed cryptographic hash functions that are assumed to have no exploitable weaknesses (random oracles). This can be viewed as a two-round Feistel structure (e.g., DES is a 16-round Feistel structure). Unpackaging the transform is straightforward. The OAEP transform is used extensively in practice, and has been incorporated in several standards. OAEP combined with RSA yields an encryption scheme that is secure against a chosen ciphertext attack [33,34]. Shoup [35] shows that OAEP+, a variation on OAEP, yields chosen ciphertext security when combined with essentially any public key encryption scheme:

OAEP+(M, R) = ((M || W(M, R)) xor H(R)) || (R xor G((M || W(M, R)) xor H(R))),

where G, H, and W are unkeyed cryptographic hash functions that behave like random oracles. Boneh [36] shows that even simpler prepackaging transforms (essentially one-round Feistel structure versions of OAEP and OAEP+) yield chosen ciphertext secure encryption schemes when combined with RSA or Rabin public key encryption. Without the random oracle model, chosen ciphertext security can be achieved using the elegant Cramer–Shoup cryptosystem [37]. This is based on the hardness of the Decision Diffie–Hellman problem (see the subsection "New Hardness Assumptions for Asymmetric Key Cryptography"). Generally speaking, constructions in the random oracle model are more efficient than those without it.

Threshold Public Key Cryptography

In a public key setting, the secret key (for decryption or signing) often needs to be protected from theft for long periods of time against a concerted attack. Physical security is one option for guarding highly sensitive keys, e.g., storing the key in a tamper-resistant device. Threshold public key cryptography is an attractive alternative for safeguarding critical keys. In a threshold public key cryptosystem, the secret key is never in one place. Instead, the secret key is distributed across many locations. Each location has a different "share" of the key, and each share of the key enables the computation of a "share" of the decryption or signature. Shares of a signature or decryption can then be easily combined to arrive at the complete signature or decryption, assuming that a sufficient number of shareholders contribute to the computation. This "sufficient number" is the threshold that is built into the system as a design parameter. Note that threshold cryptography can be combined with physical security, by having each shareholder use physical means to protect his individual share of the secret key.

Threshold cryptography was independently conceived by Desmedt [38], Boyd [39], and Croft and Harris [40], building on the fundamental notion of secret sharing [41,42]. Satisfactory threshold schemes have been developed for a number of public key encryption and digital signature schemes.
These threshold schemes can be designed so as to defeat an extremely strong attacker who is able to travel from shareholder to shareholder, attempting to learn or corrupt all shares of the secret key ("proactive security"). Efficient means are also available for generating shared keys from scratch by the shareholders themselves, so that no trusted dealer is needed to initialize the threshold scheme [43,44]. Shoup [45] recently proposed an especially simple and efficient scheme for threshold RSA.

New Hardness Assumptions for Asymmetric Key Cryptography

A trend has occurred in recent years toward the exploration of the cryptographic implications of new hardness assumptions. Classic assumptions include the hardness of factoring a product of two large primes, the hardness of extracting roots modulo a product of two large primes, and the hardness of computing discrete logarithms modulo a large prime (i.e., solving g^x = y mod p for x).

One classic assumption is the Diffie–Hellman assumption. Informally stated, this assumption is that it is difficult to compute g^(ab) mod p given g^a mod p and g^b mod p, where p is a large prime. This assumption underlies the Diffie–Hellman key agreement protocol.

© 2002 by CRC Press LLC

underlies the Diffie–Hellman key agreement protocol. The “Decisional Diffie–Hellman Assumption” has proven to be useful in recent years. Informally stated, this assumption is that it is difficult to distinguish a b ab a b c triples of the form (g mod p, g mod p, g mod p) and triples of the form (g mod p, g mod p, g mod p) for random a, b, c. Perhaps most notably, the Cramer–Shoup chosen ciphertext secure encryption scheme is based on this new assumption. The security of RSA is based on a root extraction problem related to the hardness of factoring: Given 1/e message M and modulus N = pq of unknown factorization and suitable exponent e, compute M mod N. Recently, a number of protocols and primitives have been based on a variant of this assumption called 1/e the “Strong RSA Assumption:” Given M and N, find e and M mod N for any suitable e. For example, a provably secure signature scheme can be based on this new assumption without the need for the random oracle assumption [46]. The RSA public key scheme is based on arithmetic modulo N, where N = pq is a product of two primes (factors known to the private key holder but not to the public). Recently, Paillier [47] has proposed a novel 2 public key encryption scheme based on arithmetic modulo p q. His scheme has nice “homomorphic” properties, which enable some computations to be performed directly on ciphertexts. For example, it is easy to compute the encryption of the sum of any number of encrypted values, without knowing how to decrypt these ciphertexts. This has many nice applications, such as for secure secret ballot election protocols. Lastly, the “Phi-Hiding Assumption” was introduced by Cachin, Micali, and Stadler [48]. This is a technical assumption related to prime factors of p − 1 and q − 1 in an RSA modulus N = pq. This assumption enables the construction of an efficient protocol for querying a database without revealing to the database what queries are being made. (Private Information Retrieval). Privacy Preserving Protocols Using the cryptographic primitives described in earlier sections, it is possible to design protocols for two or more parties to perform useful computational tasks while maintaining some degree of data confidentiality. Theoretical advances were well established with the “completeness theorems” of [49] and others; however, practical solutions have often required special-purpose protocols tailored to the particular problem. One important example—both historically and practically—is the problem of conducting a secret ballot election [50,51]. This can be viewed as a cryptographic protocol design problem among three types of parties: voters, talliers, and independent observers. All types of parties have different security requirements. Voters want to guarantee that their individual ballots are included in the final tally, and that the contents of the ballots remain secret. Talliers want to produce an accurate final count that includes all valid ballots counted exactly once, and no invalid ballots. Independent observers want to verify that the tally is conducted honestly. One of the best secret ballot election protocol currently known for largescale elections is probably [52], which is based on threshold public key encryption.
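The additive homomorphism mentioned above is easy to see in a toy implementation. The sketch below follows the common textbook presentation of Paillier's scheme, working modulo n^2 with n = pq and generator g = n + 1; it is not taken from the chapter or any standard API, the primes and helper names are illustrative choices, and a real deployment would use large random primes and a vetted library.

    import math
    import random

    def L(u, n):
        return (u - 1) // n

    def keygen(p, q):
        # Textbook Paillier with the simple generator g = n + 1 (needs Python 3.8+ for pow(x, -1, m)).
        n = p * q
        lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)      # lcm(p - 1, q - 1)
        mu = pow(L(pow(n + 1, lam, n * n), n), -1, n)
        return (n,), (lam, mu)

    def encrypt(pub, m):
        (n,) = pub
        r = random.randrange(2, n)
        while math.gcd(r, n) != 1:
            r = random.randrange(2, n)
        return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)

    def decrypt(pub, priv, c):
        (n,), (lam, mu) = pub, priv
        return (L(pow(c, lam, n * n), n) * mu) % n

    pub, priv = keygen(104729, 104723)          # toy primes; real keys use large random primes
    c1, c2 = encrypt(pub, 1234), encrypt(pub, 4321)
    # Multiplying ciphertexts adds plaintexts: the product decrypts to 1234 + 4321.
    assert decrypt(pub, priv, (c1 * c2) % (pub[0] ** 2)) == 5555

The last two lines show the property a homomorphic tallier exploits: encrypted ballots can be combined into an encryption of the vote total without ever decrypting an individual ballot.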

43.5 Other Resources

An excellent resource for further information is the CRC Handbook of Applied Cryptography [53], particularly the first chapter of that handbook, which has an overview of cryptography that is highly recommended. Ross Anderson's book on security engineering [54] is a recommended resource, especially for its treatment of pragmatic issues that arise when implementing cryptographic primitives in practice. See also the frequently asked questions list maintained by RSA Labs (www.rsa.com/rsalabs).

References

1. Rivest, R., The MD5 message-digest algorithm, Internet Request for Comments, RFC 1321, April 1992.
2. Dobbertin, H., Bosselers, A., and Preneel, B., RIPEMD-160: a strengthened version of RIPEMD, in Proc. Fast Software Encryption Workshop, Gollman, D., Ed., Springer-Verlag, LNCS, Heidelberg, 1039, 71, 1996.

3. FIPS 180-1, Secure hash standard, Federal Information Processing Standards Publication 180-1, U.S. Dept. of Commerce/N.I.S.T., National Technical Information Service, Springfield, VA, May 11, 1993. 4. Preneel, B., Cryptographic hash functions, European Trans. Telecomm. 5, 431, 1994. 5. Fiat, A. and Shamir, A., How to prove yourself: Practical solutions to identification and signature problems, in Advances in Cryptology—Crypto ’93, Springer-Verlag, LNCS, Heidelberg, 773, 480, 1994. 6. Bellare, M. and Rogaway, P., Random oracles are practical: a paradigm for designing efficient protocols, in Proc. ACM Conf. Comput. and Comm. Security, 62, 1993. 7. Canetti, R., Goldreich, O., and Halevi, S., The random oracle model revisited, in Proc. ACM Symp. Theory Comput., 1998. 8. Davies, D. and Price, W., Security for Computer Networks, 2nd ed., John Wiley & Sons, New York, 1989. 9. FIPS 46, Data encryption standard, Federal Information Processing Standards Publication 46, U.S. Dept. of Commerce/N.B.S., National Technical Information Service, Springfield, VA, 1977 (revised as FIPS 46-1: 1988; FIPS 46-2:1993). 10. Feistel, H., Notz, W., and Smith, J., Some cryptographic techniques for machine-to-machine data communications, in Proc. IEEE 63, 1545, 1975. 11. Biham, E. and Shamir, A., Differential cryptanalysis of DES-like cryptosystems, J. Cryptology, 4, 3, 1991. 12. Coppersmith, D., The Data Encryption Standard (DES) and its strength against attacks, IBM J. R&D, 38, 243, 1994. 13. Matsui, M. and Yamagishi, A., A new method for known plaintext attack of FEAL cipher, in Advances in Cryptology—Eurocrypt ’92, Springer-Verlag, LNCS, Heidelberg, 658, 81, 1993. 14. Matsui, M. Linear cryptanalysis method for DES cipher, in Advances in Cryptology—Eurocrypt ’93, Springer-Verlag LNCS 765, 386, 1994. 15. Langford, S. and Hellman, M., Differential-linear cryptanalysis, in Advances in Cryptology—Crypto ’94, Springer-Verlag, LNCS, Heidelberg, 839, 17, 1994. 16. National Institute of Standards and Technology, Advanced Encryption Standard (AES), http://csrc. nist.gov/encryption/aes/. 17. Massey, J., Shift-register synthesis and BCH decoding, IEEE Trans. Info. Th., 15, 122, 1969. 18. Rivest, R., The RC5 encryption algorithm, in Fast Software Encryption, Second International Workshop, Springer-Verlag, LNCS, Heidelberg, 1008, 86, 1995. 19. Gilbert, E., MacWilliams, F., Sloane, N., Codes which detect deception, Bell Sys. Tech. J., 53, 405, 1974. 20. Bellare, M., Canetti, R., and Krawczyk, H., Keying hash functions for message authentication, in Advances in Cryptology—Crypto ’96, Springer-Verlag, LNCS, Heidelberg, 1109, 1, 1996. 21. Black, J., Halevi, S., Krawczyk, H., Krovetz, T., and Rogaway, P., UMAC: Fast and secure message authentication, in Advances in Cryptology—CRYPTO ’99, Springer-Verlag, LNCS, Heidelberg, 1666, 216, 1999. 22. Jutla, C., Encryption modes with almost free message integrity, in Advances in Cryptology—Eurocrypt 2001, Springer-Verlag, LNCS, Heidelberg, 2045, 529, 2001. 23. Diffie, W. and Hellman, M., New directions in cryptography, IEEE Trans. Info. Th., 22, 644, 1976. 24. Rivest, R., Shamir, A., and Adleman, L., A method for obtaining digital signatures and public-key cryptosystems, Comm. ACM, 21, 120, 1978. 25. ElGamal, T., A public key cryptosystem and a signature scheme based on discrete logarithms, IEEE Trans. Info. Th., 31, 469, 1985. 26. Koblitz, N., Elliptic curve cryptosystems, Math. Comput., 48, 203, 1987. 27. 
Goldwasser, S., Micali, S., and Rivest, R., A digital signature scheme secure against adaptive chosenmessage attacks, SIAM J. Comput., 17, 281, 1988. 28. Kravitz, D., Digital signature algorithm, U.S. Patent #5,231,668, July 27, 1993. 29. Rackoff, C. and Simon, D., Non-interactive zero-knowledge proof of knowledge and chosen ciphertext attack, in Advances in Cryptology—Crypto ’91, Springer-Verlag, LNCS, Heidelberg, 576, 433, 1992. 30. Naor, M. and Yung, M. Public-key cryptosystems provably secure against chosen ciphertext attacks, in Proc. ACM Symp. Th. Comput., 33, 1989.

31. Dolev, D., Dwork, C., and Naor, M., Non-malleable cryptography, SIAM J. Comput., 30, 391, 2000. 32. Bleichenbacher, D., Chosen ciphertext attacks against protocols based on RSA encryption standard PKCS #1, in Advances in Cryptology—CRYPTO’98, Springer-Verlag, LNCS, Heidelberg, 1462, 1, 1998. 33. Bellare, M. and Rogaway, P., Optimal asymmetric encryption, in Advances in Cryptology—Eurocrypt ’94, Springer-Verlag LNCS 950, 92, 1995. 34. Fujisaki, E., Okamoto, T., Pointcheval, D., and Stern, J., RSA-OAEP is secure under the RSA assumption, Advances in Cryptology—Crypto 2001, Springer-Verlag, LNCS, Heidelberg, 2139, 260, 2001. 35. Shoup, V., OAEP reconsidered., Advances in Cryptology—Crypto 2001, Springer-Verlag, LNCS, Heidelberg, 2139, 239, 2001. 36. Boneh, D., Simplified OAEP for the Rabin and RSA functions, Advances in Cryptology—Crypto 2001, Springer-Verlag, LNCS, Heidelberg, 2139, 275, 2001. 37. Cramer, R. and Shoup, V., A practical public key cryptosystem provably secure against adaptive chosen ciphertext attack, in Advances in Cryptology—Crypto ’98, Springer-Verlag, LNCS, Heidelberg, 1462, 13, 1998. 38. Desmedt, Y., Society and group oriented cryptography: a new concept, in Advances in Cryptology —Crypto ’87, Springer-Verlag, LNCS, Heidelberg, 293, 120, 1988. 39. Croft, R. and Harris, S., Public-key cryptography and re-usable shared secrets, in Cryptography and Coding, Beker, H. and Piper, F., Eds., Clarendon Press, Oxford, 189, 1989. 40. Boyd, C., Digital multisignatures, in Cryptography and Coding, Beker, H. and Piper, F., Eds., Clarendon Press, Oxford, 241, 1989. 41. Shamir, A., How to share a secret, Comm. ACM, 22, 612, 1979. 42. Blakley, R., Safeguarding cryptographic keys, in Proc. AFIPS Nat’l Computer Conf., 313, 1979. 43. Pedersen, T., A threshold cryptosystem without a trusted party, in Advances in Cryptology—Eurocrypt ’91, Springer-Verlag, LNCS, Heidelberg, 547, 522, 1992. 44. Boneh, D. and Franklin, M., Efficient generation of shared RSA keys, J. ACM, to appear. 45. Shoup, V., Practical threshold signatures, in Advances in Cryptology—Eurocrypt 2000, Springer-Verlag, LNCS, Heidelberg, 1807, 207, 2000. 46. Cramer, R. and Shoup, V., Signature schemes based on the Strong RSA assumption, ACM Trans. Inf. Sys. Sec., to appear. 47. Paillier, P., Public key cryptosystems based on composite degree residuosity classes, in Advances in Cryptology—Eurocrypt ’99, Springer-Verlag, LNCS, Heidelberg, 1592, 223, 1999. 48. Cachin, C., Micali, S., and Stadler, M., Computationally private information retrieval with polylogarithmic communication, in Advances in Cryptology—EUROCRYPT ’99, Springer-Verlag, LNCS, Heidelberg, 1592, 402, 1999. 49. Goldreich, O., Micali, S., and Wigderson, A., How to play any mental game or a completeness theorem for protocols with honest majority, in Proc. ACM Symp. Th. Comput., 218, 1987. 50. Cohen, J. and Fisher, M., A robust and verifiable cryptographically secure election scheme, in Proc. IEEE Symp. Found. Comp. Sci., 372, 1985. 51. Benaloh, J. and Yung, M., Distributing the power of a government to enhance the privacy of voters, in Proc. ACM Symp. Princ. Distrib. Comput., 52, 1986. 52. Cramer, R., Schoenmakers, B., and Genarro, R., A secure and optimally efficient multi-authority election scheme, European Trans. Telecomm., 8, 481, 1997. 53. Menezes, A., van Oorschot, P., and Vanstone, S., Handbook of Applied Cryptography, CRC Press, Boca Raton, FL, 1997. 54. 
Anderson, R., Security Engineering: a Guide to Building Dependable Systems, John Wiley & Sons, New York, 2001.

XI Testing and Design for Testability

44 System-on-Chip (SoC) Testing: Current Practices and Challenges for Tomorrow  R. Chandramouli  Introduction • Current Test Practices • SoC Testing Complexity • Emerging Trends in SoC Test • Emerging SoC Test Standards • Summary

45 Testing of Synchronous Sequential Digital Circuits  U. Glaeser, Z. Stamenković, and H. T. Vierhaus  Introduction • Mixed-Level Test Generation • The Fogbuster Algorithm for Synchronous Circuits • Summary

46 Scan Testing Chouki Aktouf Introduction • Scan Testing • Future of Scan: A Brief Forecast

47 Computer-Aided Analysis and Forecast of Integrated Circuit Yield  Z. Stamenković and N. Stojadinović  Introduction • Yield Models • Critical Area Extraction • Yield Forecast • Summary

44 System-on-Chip (SoC) Testing: Current Practices and Challenges for Tomorrow

R. Chandramouli
Synopsys Inc.

44.1 Introduction
44.2 Current Test Practices
     Scan-Based Test
44.3 SoC Testing Complexity
     Core Delivery Model • Controllability and Observability of Cores • Test Integration • Defects and Performance
44.4 Emerging Trends in SoC Test
     Creation of Test-Ready Cores • Core Test Integration
44.5 Emerging SoC Test Standards
44.6 Summary

44.1 Introduction

Rapidly evolving submicron technology and design automation have enabled the design of electronic systems with millions of transistors integrated on a single silicon die, capable of delivering gigaflops of computational power. At the same time, increasing complexity and time-to-market pressures are forcing designers to adopt design methodologies with shorter ASIC design cycles. With the emergence of the system-on-chip (SoC) concept, traditional design and test methodologies are hitting the wall of complexity and capacity. Conventional design flows are unable to handle large designs made up of different types of blocks such as customized blocks, predesigned cores, embedded arrays, and random logic, as shown in Fig. 44.1. Many of today's test strategies have been developed with a focus on a single monolithic block of logic; however, in the context of SoC the test strategy should encompass multiple test approaches and provide a high level of confidence in the quality of the product. Design reuse is one of the key components of these methodologies. Larger designs are now shifting to the use of predesigned cores, creating a myriad of new test challenges. Since the end user of the core has little participation in the core's architectural and functional development, the core appears as a black box with known functionality and I/O. Although enabling designers to quickly build end products, core-based design requires test development strategies for the core itself and for the entire IC/ASIC with the embedded cores. This chapter begins with a discussion of some of the existing test methodologies and the key issues and requirements associated with the testing of SoC. It is followed by a discussion of some of the emerging approaches that address these issues.

FIGURE 44.1 Core access through wrapper isolation.

44.2 Current Test Practices

Current test practices consist primarily of ATE-based external test approaches. They range from manual test development to scan-based test. Most of the manual test development efforts depend on fault simulation to estimate the test coverage. Scan-based designs are becoming very common, although their capacity and capability to perform at-speed test are being increasingly affected by physical limitations.

Scan-Based Test

Over the past decade, there has been an increased use of the scan DFT methodology across a wide variety of designs. One of the key motivations for the use of scan is the resulting ability to automatically generate test patterns that verify the gate- or transistor-level structures of the scan-based design. Because test generation is computationally complex for sequential designs, most designs are reconfigured in test mode as combinational logic whose inputs and outputs come from and go to scannable memory elements (flip-flops) and primary I/O. Different types of scan design approaches include mux-D, clock scan, LSSD, and random access scan [1]; the differences lie in the design of the scannable memory elements and their clocking mechanisms. Two major classes of scan design are full scan and partial scan. In full scan, all of the memory elements are made scannable, while in partial scan only a fraction of the memory elements, chosen under certain overhead (performance and area) constraints, are mapped into scan elements. Because of its iterative nature, the partial scan technique has an adverse impact on the design cycle. Although full scan design has found wider acceptance and usage, partial scan is seen only in designs that have very stringent timing and die size requirements. A major drawback of scan is the inability to verify device performance at-speed; in general, most of the logic related to scan functionality is designed for lower speed.

Back-End Scan Insertion

Traditional scan implementation depended on the "over-the-wall" approach, where designers complete the synthesis and hand off the gate netlist to the test engineer for test insertion and automatic test pattern generation (ATPG). Some electronic design automation (EDA) tools today help design and test engineers speed the testability process by automatically adding test structures at the gate level. Although this technique is easier than manual insertion, it still takes place after the design has been simulated and synthesized to strict timing requirements. After the completed design is handed over for test insertion,
many deficiencies in the design may cause iteration back into module implementation, with the attendant risks to timing closure, design schedule, and stability. These deficiencies may be violations of full-scan design rules (e.g., improper clock gating or asynchronous signals on sequential elements not handled correctly). In some cases, clock domain characteristics in lower-level modules can cause compatibility problems with top-level scan chain requirements. In addition, back-end scan insertion can cause broken timing constraints or violate vendor-specific design rules that cannot be adequately addressed by reoptimization. If back-end scan insertion is used on a large design, the reoptimization process to fix timing constraints violated by inserting scan can take days. If timing in some critical path is broken in even a small part of the overall design, and the violated constraint could not be fixed by reoptimization, the entire test process would have to iterate back into synthesis to redesign the offending module. Thus, back-end test, where traditionally only a small amount of time is budgeted compared to the design effort, would take an inordinately long time. Worse, because these unanticipated delays occur at the very end of the design process, the consequences are magnified because all the other activities in the project are converging, and each of these will have some dependency on a valid, stable design database.

RT-Level Scan Synthesis

Clearly, the best place to insert test structures is at the RT-level, while timing budgets are being worked out. Because synthesis methodologies for SoCs tend to follow hierarchical design flows, where subfunctions within the overall circuit are implemented earliest and then assembled into higher-level blocks as they are completed, DFT should be implemented hierarchically as well. Unfortunately, traditional full-scan DFT tools and methodologies have worked only from the top level of fully synthesized circuits, and have been very much a back-end process. The only way to simultaneously meet all design requirements—function, timing, area, power, and testability—is to account for these during the very earliest phases of the design process, and to ensure that these requirements are addressed at every step along the way. A tool that works with simulation and synthesis to insert test logic at this level will ensure that the design is testable from the start. It also ensures that adequate scan structures are inserted to meet the coverage requirements that most companies demand—usually greater than 95%. Achieving such high coverage is usually difficult once a design has been simulated and synthesized. Tools that automatically insert test structures at the RT-level have other benefits as well. Provided that they are truly automatic and transparent to the user, a scan synthesis tool makes it easy for the designer to implement test without having to learn the intricacies of test engineering. Inserting scan logic before synthesis also means that designers on different teams, working on different blocks of a complex design, can individually insert test logic and know that the whole device will be testable when the design is assembled. This is especially important for companies that use intellectual property (IP) and have embraced design reuse. If blocks are reused in subsequent designs, testability is ensured because it was built in from the start. A truly automated scan synthesis tool can also be used on third-party IP, to ensure that it is testable.
One of the key strengths of scan design is diagnosability. The user is able to set the circuit to any state and observe the new states by scanning in and out of the scan chains. For example, when the component/ system fails on a given input vector, the clock can be stopped at the erring vector and a test clock can be used to scan out the error state of the machine. The error state of the machine is then used to isolate the defects in the circuit/system. In other words, the presence of scan enables the designer to get a “snap shot” of the system at any given time, for purposes of system analysis, debug, and maintenance.
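The controllability and observability that scan provides can be illustrated with a toy model. The Python sketch below is not any particular EDA tool's behavior; the class, its method names, and the example next-state function are all hypothetical. It simply mimics how a mux-D full-scan chain is shifted to set an arbitrary internal state, clocked once in functional mode, and then shifted out to expose the captured state.

    from typing import Callable, List

    class ScanChain:
        # Toy model of full scan: in test mode the flip-flops act as one shift
        # register; in functional mode they capture the combinational next state.
        def __init__(self, n_ffs: int,
                     next_state: Callable[[List[int], List[int]], List[int]]):
            self.ffs = [0] * n_ffs
            self.next_state = next_state          # combinational logic under test

        def shift(self, pattern: List[int]) -> List[int]:
            # Scan enable = 1: serially load 'pattern' and return the bits shifted out.
            out = []
            for bit in pattern:
                out.append(self.ffs[-1])
                self.ffs = [bit] + self.ffs[:-1]
            return out

        def capture(self, primary_inputs: List[int]) -> None:
            # Scan enable = 0 for one clock: flip-flops capture the logic response.
            self.ffs = self.next_state(primary_inputs, self.ffs)

    # Hypothetical 3-flip-flop design whose next state XORs the state with the inputs.
    chain = ScanChain(3, lambda pi, st: [p ^ s for p, s in zip(pi, st)])
    chain.shift([1, 0, 1])                 # controllability: set any internal state
    chain.capture([0, 1, 1])               # one functional clock
    print(chain.shift([0, 0, 0]))          # observability: unload the captured state

The same shift/capture/shift rhythm is what makes the "snapshot" debugging described above possible: stopping the functional clock and scanning out reveals the machine's error state.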

44.3 SoC Testing Complexity

With the availability of multiple millions of gates per design, more and more designers are opting to use IPs to take advantage of that level of integration. The sheer complexity and size of such devices are forcing them to adopt the concept of IP reuse; however, existing design methodologies do not provide a cohesive or comprehensive approach to support reuse. The result is that many of these designs are created using ad hoc methodologies that are localized and specific to the design. Test reuse is the ability to provide access to the individual IPs embedded in the SoC so that the test for each IP can be applied and observed at the chip level. This reuse becomes more complex when the IPs come from multiple sources with different test methods, and it becomes difficult to achieve plug-and-play capability in the test domain. Without a standard, the SoC design team is faced with multiple challenges: a test model for the delivery of cores, the controllability and observability of cores from the chip I/O, and finally testing the entire chip with embedded IPs, user-defined logic, and embedded memories.

Core Delivery Model

Core test is an evolving industry-wide issue, so no established standards are available to guide the testing of cores and core-based designs. Cores are often delivered as RTL models, which enable end users to optimize the cores for the targeted application; however, the test practices that exist in the "soft core" based design environment are very ad hoc. To a large extent the situation depends on whether the "soft core" model is delivered to the end user without any DFT built into the core itself. The core vendors provide only functional vectors that verify the core functionality, and these vectors are valid only at the core I/O level; they have to be mapped to the chip I/O level in order to verify the core functionality at the chip level. Functional testing has its merits and demerits, but the use of functional tests as manufacturing tests without fault simulation cannot provide a product with deterministic quality. At the same time, any extensive fault simulation results not only in increased resources but also in an extended test development time to satisfy a given quality requirement.
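To make the fault-simulation point concrete, the sketch below estimates the single stuck-at coverage of a small vector set on a toy gate netlist. The netlist, the patterns, and the function names are invented for illustration; production fault simulators are event-driven and vastly more efficient, but they answer the same question about a functional vector set.

    def simulate(netlist, pi_values, fault=None):
        # Evaluate a topologically ordered gate netlist; 'fault' pins one net to 0 or 1.
        v = dict(pi_values)
        if fault and fault[0] in v:
            v[fault[0]] = fault[1]
        ops = {"AND": lambda xs: int(all(xs)),
               "OR":  lambda xs: int(any(xs)),
               "NOT": lambda xs: int(not xs[0])}
        for out, gate, ins in netlist:
            v[out] = ops[gate]([v[i] for i in ins])
            if fault and out == fault[0]:
                v[out] = fault[1]
        return v

    def stuck_at_coverage(netlist, primary_inputs, outputs, patterns):
        # Fraction of single stuck-at faults whose effect reaches an output for at
        # least one pattern: the figure fault simulation attaches to a vector set.
        nets = list(primary_inputs) + [g[0] for g in netlist]
        faults = [(n, sv) for n in nets for sv in (0, 1)]
        detected = set()
        for pat in patterns:
            pi = dict(zip(primary_inputs, pat))
            good = simulate(netlist, pi)
            for f in faults:
                if any(good[o] != simulate(netlist, pi, fault=f)[o] for o in outputs):
                    detected.add(f)
        return len(detected) / len(faults)

    netlist = [("n1", "AND", ["a", "b"]), ("n2", "NOT", ["c"]), ("y", "OR", ["n1", "n2"])]
    print(stuck_at_coverage(netlist, ["a", "b", "c"], ["y"], [(1, 1, 1), (0, 1, 0)]))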

Controllability and Observability of Cores

A key problem in testing cores is the ability to control and observe the core I/O when the core is embedded within a larger design. Typically, an ASIC or IC is tested using the parallel I/O, or a smaller subset of serial ports if boundary scan is used. For an embedded core, an ideal approach would be to have direct access to its I/O. A typical I/O count for a core is on the order of 300–400 signals. Using a brute-force approach, all of these signals could be brought out to the chip I/O, requiring a minimum of 300 extra multiplexers. The overhead in such a case is not only the multiplexers, but also the extra routing area for connecting the core I/O to the chip I/O and, most of all, the performance degradation of at least one gate delay on every core I/O; for most performance-driven products this is unacceptable. Another approach would be to access the core I/O using functional (parallel) vectors. In order to set each core I/O to a known value, it may be necessary to apply many thousands of clocks at the chip I/O, because the chip is a sequential state machine and has to be cycled through hundreds of states before the desired value appears on a core I/O signal.

Test Integration

Yet another issue is the integration of test with multiple cores, potentially from multiple sources. Along with the ability to integrate one or more cores on an ASIC come other design challenges such as layout and power constraints and, very importantly, testing the embedded core(s) and the interconnect logic. The test complexity arises from the fact that each core could be designed with different clocking, timing, and power requirements. Test becomes a bottleneck in such an environment, where the designer has to develop a test methodology either for each core or for the entire design; in either case, it will impact the overall design cycle. Even if the individual cores are delivered with embedded test, the end user will have to ensure testing of the interconnects between the multiple cores and the user logic. Although functional testing can verify most of these and can be backed by fault simulation, it would be a return to resource-intensive ways of assuring quality. Because many cores are physically distinct blocks at the layout level, manufacturing test of the cores has to be done independently of other logic in the design. This means that the core must be isolated from the rest of the logic and then tested as an independent entity. Conventional approaches to isolation and test impact both performance and test overhead, and when multiple cores are implemented, the isolation-and-test approach makes it necessary to also test the interconnects between the cores and the rest of the logic.

Defects and Performance

Considerable design and test challenges are associated with the SoC concept. Test challenges arise from both the technology and the design methodology. At the technology level, increasing densities have given rise to newer defect types and to the dominance of interconnect delays over transistor delays as geometries shrink. Because designers are using predesigned functional blocks, testing involves not only the individual blocks, but also the interconnect between them as well as the user-created logic (glue logic). The ultimate test objective is the ability to manufacture the product at its specified performance (frequency) with the lowest DPM (defective parts per million). As geometries shrink and device densities increase, current product quality cannot be sustained through conventional stuck-at fault testing alone [2]. When millions of devices are packed in a single die, newer defect types are created. Many of these cannot be modeled as stuck-at faults, because they do not manifest themselves as stuck-at-like behavior. Most deep submicron processes use multiple layers, so one of the predominant defect types is shorts between adjacent layers (metal layers), or even between adjacent lines (poly or metal lines). Some of these can be modeled as bridging faults, which behave as the Boolean ANDing or ORing of the adjacent lines, depending on the technology. Others do not manifest themselves as logical faults, but behave as delay faults due to the resistive nature of certain shorts. Unlike stuck-at faults, the possible bridging faults are computationally complex to enumerate, since most of them depend on the physical layout. Hence, most practical test development is targeted toward stuck-at faults, although there has been considerable research in the analysis and test of bridging faults. At the deep submicron level, interconnect delays dominate gate delays, and this affects the ability to test at speed the interconnect (I/O) between the various functional blocks in the SoC design environment. Since manufacturing test should be intertwined with performance testing, it is necessary to test the interaction between the various functional blocks at-speed. The testing of interconnects involves not only the propagation of signals between blocks, but propagation at the specified timing. Current approaches to test do not in general support at-speed test because of a lack of accurate simulation models, limited tester capabilities, and very little vendor support. Traditional testing, which is usually performed at lower speed, can therefore pass devices that go on to fail at the system level.
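As a small illustration of why bridging defects need their own treatment, the following sketch compares a fault-free two-gate circuit with the same circuit under a wired-AND (or wired-OR) bridge between two internal nets. The circuit and net names are invented for the example; the point is only that the input combinations that expose the bridge are not, in general, the ones a stuck-at test set would target.

    def good_circuit(a, b, c):
        n1 = a & b                      # net n1
        n2 = b ^ c                      # net n2
        return n1 | n2                  # primary output

    def bridged_circuit(a, b, c, wired_and=True):
        n1, n2 = a & b, b ^ c
        # Bridge defect between n1 and n2: both nets resolve to one shared value.
        n1 = n2 = (n1 & n2) if wired_and else (n1 | n2)
        return n1 | n2

    # Input combinations that expose the bridge at the output
    tests = [(a, b, c)
             for a in (0, 1) for b in (0, 1) for c in (0, 1)
             if good_circuit(a, b, c) != bridged_circuit(a, b, c)]
    print(tests)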

44.4 Emerging Trends in SoC Test

Two major capabilities are needed to address the major test challenges that were described earlier in this chapter: (1) making the core test-ready and (2) integration of test-ready cores and user logic at the chip level.

Creation of Test-Ready Cores

Each core is made test-ready by building a wrapper around it as well as by inserting appropriate DFT structures (scan, BIST, etc.) to test the core logic itself. The wrapper is generally a scan chain, similar to the boundary scan chain [3], that helps the controllability and observability of the core I/O. The wrapper chain enables access to the core logic for providing the core test vectors from the chip boundary (Fig. 44.1). The wrapper also helps in isolating the core while it is being tested, independent of the surrounding cores and logic. One of the key motivations for the wrapper is test reuse: when a test-ready core is delivered, the chip designer does not have to recreate the core test vectors but simply reuses them. The wrapper also isolates the core electrically from other cores, so that signals coming from other cores do not affect the core and vice versa.

FIGURE 44.2 Core isolation techniques (direct multiplexer access, dedicated or shared scan-register wrapper/collar, and transparency).

FIGURE 44.3 An example of transparency: an ALU computing C = A + B passes A through unchanged (C = A) when B = 0.

Core Isolation

Many different approaches (Fig. 44.2) can be used to isolate a core from other cores. One common approach is to use multiplexers at each I/O of the core. The multiplexers can be controlled by a test mode so that an external vector source can be connected directly to the core I/O during test. This approach is advantageous where the core has no internal DFT structure and comes only with functional vectors, which can then be applied directly from an external source to the core I/O; however, it breaks down when the number of core I/O exceeds the number of chip I/O, and it also impacts physical routing. In contrast, other approaches minimize the routing density by providing serial access to the core I/O. The serial access can be through dedicated scan registers at the core I/O or through shared registers, where the sharing can be between multiple cores. The scan register is called a wrapper or a collar, and it isolates the core from all other cores and logic during test mode. Each wrapper cell is built with a flip-flop and a multiplexer that isolates one pin. Wrapper-based isolation clearly has an impact on the overall area of the core; sharing existing register cells at the core I/O helps minimize this impact, and trade-offs exist between shared and dedicated wrappers with respect to core fault coverage and core interconnect fault coverage. Access to cores can also be accomplished using the concept of "transparency" through existing logic. In this case, the user leverages existing functionality of a logic block to gain access to the inputs of the core and, similarly, a path from the core outputs to the chip I/O through another logic block. Figure 44.3 shows an example of "transparency" in a logic block. Although this approach involves no hardware overhead, detecting transparency is not a simple automation process, and the existence of transparency cannot be predicted a priori.
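The wrapper cell just described (a flip-flop plus a multiplexer per core pin) can be mimicked in a few lines. The Python below is a behavioral toy, not the IEEE P1500 wrapper definition: the class names, the single test-mode flag, and the shift ordering are simplifications chosen only for illustration.

    class WrapperCell:
        # One boundary wrapper cell (flip-flop + multiplexer) on a core input.
        def __init__(self):
            self.ff = 0
        def shift(self, scan_in):
            scan_out, self.ff = self.ff, scan_in
            return scan_out
        def core_input(self, functional_value, test_mode):
            # In test mode the core pin sees the scanned value, not the chip logic.
            return self.ff if test_mode else functional_value

    class Wrapper:
        # A serial chain of wrapper cells around a core's inputs.
        def __init__(self, n_pins):
            self.cells = [WrapperCell() for _ in range(n_pins)]
        def scan_load(self, pattern):
            for bit in pattern:                       # one shift clock per bit
                carry = bit
                for cell in self.cells:
                    carry = cell.shift(carry)
        def apply(self, functional_values, test_mode=True):
            return [c.core_input(v, test_mode) for c, v in zip(self.cells, functional_values)]

    w = Wrapper(4)
    w.scan_load([1, 0, 1, 1])
    # The core now sees the scanned bits (the first bit shifted in ends at the last
    # cell), regardless of what the surrounding chip logic drives:
    print(w.apply([0, 0, 0, 0]))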

FIGURE 44.4 An SoC includes multiple cores with memory and user logic.

FIGURE 44.5 A high-level architecture for SoC test integration: wrapped cores and user logic on a test bus, accessed through an 1149.1 TAP and on-chip test and BIST controllers.

It becomes evident that the isolation approach and the techniques used to make a core test-ready depend on various design constraints such as area, timing, and power, as well as on the test coverage needs for the core and the core interconnect faults.

Core Test Integration

Testing SoC devices with multiple blocks (Fig. 44.4), each block embedding different test techniques, could become a nightmare without an appropriate test manager that can control the testing of the various blocks. Some of these blocks may be user-designed and others predesigned cores. Given the current lack of any test interface standard in the SoC environment, it becomes very complex to control and observe each of the blocks. Two key issues must be addressed: sequencing of the test operations [4] among the various blocks, and optimization of the test interface between the various blocks and the chip I/O. Both depend very much on the test architecture: whether test controllers are added to individual blocks or shared among many, and whether the blocks have adopted differing design-for-test (DFT) methodologies. A high-level view of the SoC test architecture is shown in Fig. 44.5, where the embedded test in each of the cores is integrated through a test bus, which is connected to a 1149.1 TAP controller for external access. Many devices use boundary scan with the IEEE 1149.1 TAP controller not only to manage in-chip test, but also to aid board- and system-level testing. Some of the test issues pertain to the use of a centralized TAP controller versus a controller in each block with a common test bus to communicate between the various test controllers; in other words, it is the question of a centralized versus a distributed controller architecture. Each choice has implications for the design of the test functionality within each block. Besides testing each core through its individual access mechanism, such as the core isolation wrapper, the complete testing of the SoC also requires an integrated test that exercises the interconnects between the cores and the user-defined logic (UDL). The solution requires first connecting the test facilities of all the cores and the UDL, and then connecting them to a central controller. The chip-level controller needs to be connected either to a boundary scan controller or to a system interface. When multiple cores with different test methodologies are present, test scheduling becomes necessary to meet chip-level test requirements such as test time, power dissipation, and noise level during test.
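Test scheduling under a power budget can be illustrated with a deliberately simple greedy policy. Everything in the sketch below is hypothetical: the core names, their test times and power figures, and the policy itself, which merely groups compatible tests into parallel sessions. Real SoC test schedulers solve a much richer optimization problem, but the sketch shows the trade-off the text refers to between test time and power dissipation.

    def schedule_tests(cores, power_budget):
        # Greedy sketch: group core tests into parallel sessions whose summed test
        # power stays under the budget; a session lasts as long as its longest test.
        remaining = sorted(cores.items(), key=lambda kv: kv[1][0], reverse=True)
        sessions = []
        while remaining:
            session, used = [], 0
            for item in remaining[:]:
                name, (t, p) = item
                if used + p <= power_budget:
                    session.append(name)
                    used += p
                    remaining.remove(item)
            sessions.append(session)
        total_time = sum(max(cores[n][0] for n in s) for s in sessions)
        return sessions, total_time

    # Hypothetical cores: name -> (test time, test power)
    cores = {"cpu": (12, 700), "dsp": (9, 500), "sram_bist": (6, 300), "rom_bist": (2, 100)}
    print(schedule_tests(cores, power_budget=1000))

With the numbers above, the greedy pass runs the CPU and SRAM BIST together, then the DSP and ROM BIST, giving a total test time of 21 instead of 29 units for fully serial testing while never exceeding the power budget.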

44.5 Emerging SoC Test Standards

One of the main problems in designing test for SoC is the lack of any viable standard that helps manage the huge complexity described in this chapter. A complete manufacturing test of an SoC involves the reuse of the test patterns that come with the cores, and test patterns created for the logic outside the cores, and this needs to be done in a predictable manner. The Virtual Socket Interface Alliance (VSIA) [5], an industry consortium of over 150 electronics companies, has formed a working group to develop standards for exchanging test data between core developers and core integrators, as well as test access standards for cores. The IEEE Test Technology Committee has also formed a working group called P1500 to define core test access standards. As part of the IEEE standardization effort, the P1500 group [6,7] is defining a wrapper technology that isolates the core from the rest of the chip when manufacturing test is performed. Both the VSIA and IEEE standards are designed to enable core test reuse. The standards will define a test access mechanism that enables access to the cores from the chip level for test application. Besides the access mechanism, the P1500 group is also defining a core test description language called core test language (CTL). CTL describes all the necessary information about the test aspects of a core such that the test patterns of the core can be reused and the logic outside the core can be tested in the presence of the core. CTL can describe the test information for any arbitrary core and any arbitrary DFT technique used in the core. Furthermore, CTL is independent of the type of tests (stuck-at, Iddq, delay tests) used to test the core. CTL makes this all possible by using protocols as the common denominator to make all the different scenarios look uniform: regardless of what hardware exists in a design, each core has a configuration that needs to be described, and the method to get in and out of the configuration is described by the protocol. Different DFT methods simply require different protocols. If tools are built around this central concept, namely CTL, then plug-and-play of different cores can be achieved on an SoC for test purposes. To make CTL a reality, it is important that tools are created to help core providers package their cores with CTL, and that tools be developed that work off CTL and integrate the cores for successful manufacturing test of the SoC. Documenting the test operation of the core in CTL reduces the test complexity and enables automation tools to use a black-box approach when integrating test at the SoC level. Black-boxed designs are delivered with documentation that describes the fault coverage of the test patterns, the patterns themselves, the different configurations of the core, and other pertinent test information for the system integrator. The system integrator uses this information (described in CTL), possibly with the help of tools, to translate the test patterns described in CTL at the boundary of the core to the chip I/O that is accessible by the tester. Furthermore, the system integrator would use the CTL to provide information on the boundary of the core to create patterns for the user-defined logic (UDL) of the SoC outside the core. The methods used for system integration depend on the CTL information of the core and the core's embedded environment. All these tasks can be automated if CTL is used consistently across all black-box cores incorporated in the design.
Figure 44.6 shows the tasks during the integration process.

FIGURE 44.6 Core integration tasks with CTL: the CTL for a core drives DFT on the SoC, translation of the core tests, and UDL testing, producing the SoC test patterns.

SoC methodologies require synchronization between the core providers, system integrators, and EDA tool developers. CTL brings all of them together in a consistent manner. Looking ahead we can see the industry will create many other tools and methodologies for test around CTL.

44.6 Summary

Cores are the building blocks of the newly emerging core-based IC/ASIC design methodology and a key component of the SoC concept. This methodology lets designers quickly build customized ICs or ASICs for innovative products in fast-moving markets such as multimedia, telecom, and electronic games. Along with design, the core-based methodology brings new test challenges, such as the implementation of a transparent test methodology, test access to the core from the chip I/O, and the ability to test the core at-speed. Emerging standards for core test access from both VSIA and IEEE P1500 will bring much-needed guidance to SoC test methodologies. Successful implementation of SoC test depends on automation and test transparency. A ready-to-test core (with embedded test and accompanying CTL, which describes the test attributes, the test protocol, and the test patterns) provides complete transparency to the SoC designer. Automation enables the SoC designer to integrate the test at the SoC level and generate manufacturing vectors for the chip.

References

1. Alfred L. Crouch, Design for Test, Upper Saddle River, NJ: Prentice-Hall, 1999.
2. R. Aitken, "Finding defects with fault models," in Proc. International Test Conference, pp. 498–505, 1995.
3. K.P. Parker, The Boundary Scan Handbook, Boston, MA: Kluwer Academic Publishers, 1992.
4. Y. Zorian, "A distributed BIST control scheme for complex VLSI devices," VTS'93: The 11th IEEE VLSI Test Symposium, pp. 4–9, April 1993.
5. Virtual Socket Interface Alliance, Internet Address: http://www.vsi.org
6. IEEE P1500, "Standards for embedded core test," Internet Address: http://grouper.ieee.org/groups/1500/index.html
7. Y. Zorian, "Test requirements for embedded core-based systems and IEEE P1500," in Proc. International Test Conference, pp. 191–199, 1997.

To Probe Further

Core-based design is an emerging trend, with the result that test techniques for such designs are still in an evolutionary phase. While waiting for viable test standards, the industry has been experimenting with multiple test architectures to enable manufacturability of SoC designs. Readers interested in these developments can refer to the following literature.
1. Digest of papers of the IEEE International Workshop on Testing Embedded Core-Based System-Chips, Amisville, VA, IEEE TTTC, 1997, 1998, 1999, 2000, 2001.
2. Proceedings of the International Test Conference, 1998, 1999, 2000.
3. F. Beenker, B. Bennetts, L. Thijssen, Testability Concepts for Digital ICs—The Macro Test Approach, vol. 3 of Frontiers in Electronic Testing, Kluwer Academic Publishers, Boston, USA, 1995.

45 Testing of Synchronous Sequential Digital Circuits

U. Glaeser
Halstenbach ACT GmbH

Z. Stamenković
University of Niš

H. T. Vierhaus
Brandenburgische Technische Universität

45.1 Introduction
45.2 Mixed-Level Test Generation
     Basic Concepts • Switch-Level Test Generation • Modified FAN Approach • Robustness Check for Pattern Pairs • Inter-Level Communications • Reconvergency Analysis • Merging of Test Pattern Pairs • Comparative Results
45.3 The Fogbuster Algorithm for Synchronous Circuits
     General Approach in Comparison with Other Algorithms • Test Generation Technique • Fault Propagation and Propagation Justification • The Over Specification Problem in Sequential Test Generation • Detection of State Repetitions in Test Generation • Use of Global Set and Reset Signals in ATPG • Experimental Results
45.4 Summary

45.1 Introduction

Automatic test generation for combinational logic based on the FAN algorithm [1,2], which relies on the D-algorithm [3], has reached a high level of maturity. FAN has also been modified for test generation in synchronous sequential circuits [4,5]. Because the shortcomings of the static stuck-at fault model in the detection of opens, dynamic faults, and bridging faults [6,7] became evident, interest has focused on refined fault modeling using either switch-level structures or dynamic gate-level fault models. The authors have shown [8–10] that the fault coverage of stuck-at-based test patterns for transistor faults is potentially as high as 80% or above if the circuit consists only of simple 2- and 3-input fully complementary static CMOS gate primitives (AND, NAND, OR, NOR), but may drop to 60% or below if complex gates and pass-transistor networks are used. The first solution to this problem is switch-level test generation [11], which is inherently slower than gate-level test generation. The need for test generation based on real transistor structures is also demonstrated by industrial work [12], which reported the first mixed-level test generation approach; advanced work in this area was reported more recently in [13–15]. The main problem associated with such methods is adequate fault modeling, based on the transistor circuitry, for structures other than primitive logic gates. Cox and Rajski [16] have shown that, by using a transition fault model in ATPG, transistor faults in fully complementary CMOS complex gates can also be covered by gate-level networks.

This method, however, lacks the applicability to general pass-transistor networks and is also inefficient because the resulting networks have a very large number of gates. Delay fault testing [17–21] based on gate or path delay fault models has recently emerged as the potentially most powerful method for functional testing. In order to achieve a high fault coverage, delay fault testing requires a detailed timing characterization of the circuit elements. Although this is usually satisfied for gate-level ATPG based on logic primitives, it is difficult to compute timing properties in a full-custom design style employing complex gates, transistor networks, and bidirectional switch-level macros. If no explicit timing information is available, a transition fault model [22] is the most convenient choice. Such a model checks for possible high-low and low-high transitions of all internal circuit nodes at gate-level. The system clock then sets the timing limit for transitions. Recently, the detection of defects beyond functional faults by methods such as built-in overcurrent measurements (Iddq-testing) [23,24] has received considerable attention. Despite their potential coverage of transistor faults and bridging faults, these methods are static by nature and therefore they are a complement to, instead of a replacement for, dynamic testing. The work introduced here is aimed at the generation of efficient test sets for conventional voltagebased tests as well as for Iddq tests. The method is based on the transition fault model and on available structural information at transistor-level and gate-level. Our software also supports various other fault models. The basic approach relies on relatively few but efficient modifications to the FAN algorithm in combination with adapted local switch-level test generation. The advantage over other approaches is that switch-level structures are only addressed where truly necessary, and fault propagation is essentially handled at gate-level. Test sets are kept short by using robust multipattern sequences where possible. A sequential test pattern generation approach is presented in the second part of this paper. It is based on the new FOGBUSTER-algorithm. As opposed to the BACK-algorithm [25], which was built on basic theoretical work [26,27], FOGBUSTER uses a forward propagation and backward justification technique, which is in general more efficient than the exclusive reverse time processing BACK uses. The advantage of all these test pattern generators over simulation-based approaches [28,29], which are generally much faster than these techniques, is that they are complete, i.e., for any given testable fault a test pattern is generated assuming sufficient time. The overall approach is summarized in Table 45.1. Although MILEF (mixed-level FAN) is able to generate test patterns for combinational circuits using a modified FAN-algorithm, SEMILET can generate test patterns for synchronous sequential circuits using the FOGBUSTER-algorithm. The rest of this chapter is organized as follows: The first part (Section 45.2) describes the mixed-level ATPG approach for combinational logic. The second part is devoted to sequential ATPG (Section 45.3). Results for the ISCAS ’85 (combinational) and the ISCAS ’89 (sequential) benchmark circuits are presented and compared to other approaches. The chapter ends with a summary (Section 45.4).

TABLE 45.1 The Relation between Test Generation Approaches

ATPG Tool                                          MILEF                                          SEMILET
Circuit behavior                                   Combinational circuits or full scan circuits   Synchronous sequential circuits
Algorithm                                          Modified FAN                                   FOGBUSTER
Test generator for embedded switch-level macros    CTEST                                          CTEST
Voltage-based fault models                         Stuck-at, Stuck-open, Transition               Stuck-at
Current-based fault models                         Stuck-at, Stuck-on                             Stuck-at, Stuck-on

45.2 Mixed-Level Test Generation

The mechanism of the mixed-level test generation technique is as follows. An extraction algorithm is used to divide the circuit into relatively small parts (1-stage complex gates, for instance) and to extract primitive gates (NAND, NOR, INVERTER). The original FAN-algorithm can generate test patterns only for circuits that consist of primitive logic gates, because FAN makes exclusive use of controlling and noncontrolling values. Thus the FAN-algorithm has to be modified to handle circuits that consist of gates described by their logic behavior. These modifications and the inter-level communication between gate- and switch-level are described in the following. Finally, additional heuristics such as the robustness check supporting robust stuck-open test generation and the reconvergency analysis decreasing the number of backtracks between the two hierarchies are shown. The performance increase due to these heuristics is demonstrated by experimental results. The overall MILEF approach is depicted in Fig. 45.1. If the circuit contains complex gates or transistor networks, the logic extractor is called. Primitive gates can be handled directly by the FAN-algorithm. For complex gates and transistor networks, the CMOS test pattern generator (CTEST) computes local test patterns. These local patterns are then globally applied and propagated. Propagation of fault effects over a switch-level macro is done at gate-level.

Basic Concepts

The MILEF system developed at GMD initially concentrated on the once popular stuck-open test [30]. MILEF has continuously been developed to cover other, potentially more significant faults in an efficient way, and it is not restricted to a specific fault model. It is designed to also handle dynamic faults, with the exception of explicit consideration of timing limits: only transitions that are either impossible or delayed beyond the duration of a clock period are detected (gross delay faults). The main objective in MILEF is to handle only the necessary structures at the switch-level and to perform all general path-tracing operations at the gate-level, thus obtaining an acceptable overall performance of the test generation system. At the switch-level, a path-oriented ATPG approach as required for stuck-open tests is applied. At the gate-level, MILEF uses a transition fault model, which includes the restriction to single input transitions, i.e., the Hamming distance between init and test pattern is 1. Robust pattern pair requirements can be used as an option (see subsection "Robustness Check for Pattern Pairs"). With these extensions, stuck-open faults and stuck-at faults in primitive gates are safely covered. Stuck-on faults and local bridging faults are excited but not safely propagated.
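The pattern-pair idea used here (an init pattern and a test pattern at Hamming distance 1 whose single input transition toggles the macro's output) can be enumerated exhaustively for a small macro. The Python sketch below only illustrates that local condition; the AND-NOR function chosen as the example and the function name are invented for this text and are not part of MILEF or CTEST.

    from itertools import product

    def single_transition_pairs(f, n_inputs):
        # Enumerate (init, test) input pairs at Hamming distance 1 whose single
        # input transition changes the output of Boolean function f.
        pairs = []
        for init in product((0, 1), repeat=n_inputs):
            for i in range(n_inputs):
                test = list(init)
                test[i] ^= 1                          # exactly one input toggles
                test = tuple(test)
                if f(*init) != f(*test):              # the output makes a transition
                    pairs.append((init, test))
        return pairs

    # Example complex gate: OUT = not((I1 and I2) or (I3 and I4))
    and_nor = lambda i1, i2, i3, i4: int(not ((i1 and i2) or (i3 and i4)))
    print(len(single_transition_pairs(and_nor, 4)), "candidate pattern pairs")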

FIGURE 45.1 Functionality of the MILEF ATPG system.

FIGURE 45.2 Logic extraction of switch-level macros.

The input formats of MILEF are ISCAS benchmarks (for circuits consisting exclusively of primitive gates) and SPICE for circuits including complex gates, transmission gates, and bus structures. Transistor-level information is reduced to a simplified switch-level structure for the test generation. All transistor-level networks are first analyzed by an advanced logic extraction program, which recognizes structures representing primitive gates such as NANDs, NORs, and inverters [31]. Transistor-level networks that contain only simple gates are shifted to the gate-level before any explicit test generation and are therefore not dealt with at the switch-level. Local test generation is done for nontrivial macros only and can mostly be limited to one-stage complex gates and smaller networks. The example circuit in Fig. 45.2 consists of a number of primitive gates and one transistor netlist. This transistor netlist is divided into two parts by the extractor. The first part can be identified as a 2-input NAND and is therefore shifted to gate-level by the extractor; test generation for this gate is done exclusively at gate-level. The second part, representing an AND-NOR function, could not be mapped to any primitive gate; consequently, test generation for this gate is done at the switch-level. Fault propagation over this switch-level macro during test generation is done at gate-level by using the 0- and 1-cubes of the macro. The idea of extracting gate-level modules from switch-level circuits was already proposed in [32]. To handle sequential circuits, the extraction algorithm can also identify several sequential elements, for instance flip-flops and D-latches.
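A heavily reduced view of what the extractor decides can be written down in a few lines. The sketch below classifies a macro by matching its truth table against primitive-gate templates; this is only a stand-in for the structural, transistor-netlist pattern matching that a real extractor performs, and all names in it are invented for illustration.

    from itertools import product

    def truth_table(f, n):
        return tuple(f(*bits) for bits in product((0, 1), repeat=n))

    # Templates of primitive gates the extractor promotes to gate level.
    PRIMITIVES = {
        "INV":   truth_table(lambda a: int(not a), 1),
        "NAND2": truth_table(lambda a, b: int(not (a and b)), 2),
        "NOR2":  truth_table(lambda a, b: int(not (a or b)), 2),
    }

    def classify_macro(f, n):
        # A macro that matches a primitive template is handled at gate level;
        # anything else stays a switch-level macro (CTEST territory in the text).
        tt = truth_table(f, n)
        for name, template in PRIMITIVES.items():
            if tt == template:
                return name
        return "switch-level macro"

    print(classify_macro(lambda a, b: int(not (a and b)), 2))                           # NAND2
    print(classify_macro(lambda a, b, c, d: int(not ((a and b) or (c and d))), 4))      # stays switch-level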

Switch-Level Test Generation

The local test patterns for the switch-level macros remaining after the extraction are generated by CTEST (CMOS test pattern generator) [30,33]. CTEST generates transition test pattern pairs with a Hamming distance of 1 between initialization and test. Therefore, the robust stuck-open test condition is satisfied for local test patterns, since the extracted switch-level macros have no internal path reconvergencies. To satisfy this condition globally in as many cases as possible, MILEF avoids static hazards at the inputs of the switch-level macro containing the actual fault whenever possible. CTEST has a reasonable performance for circuits up to the size of 200 transistors. This is sufficient for the MILEF approach, because the switch-level macros left by the extractor are relatively small, in general not exceeding 20 transistors [31]. The time spent on switch-level test generation measured in experimental results was less than 5% of the overall test generation time, even when the extraction process left around 30% of the gates as complex gates; the CPU time spent on the global path-tracing techniques of the FAN-algorithm was immensely larger. In local mode, CTEST can also handle circuits including bidirectional transistors [34] and small sequential circuits such as flip-flops and latches.

Modified FAN Approach

MILEF is based on extensions of the de-facto standard FAN [1] for test generation at the gate-level. The original FAN-algorithm is not able to generate test patterns for circuits including switch-level macros, because FAN is based exclusively on controlling and noncontrolling values of the gates, which in general do not exist for switch-level macros. Consequently, the modified FAN in MILEF also handles Boolean functions of switch-level macros, i.e., multiple cubes [3]. Thus the main FAN functions such as implication, unique sensitization, and multiple backtrace are modified to handle switch-level macros. The modified FAN approach is described as follows (see also Fig. 45.1): As a preprocess of the test generation, gate-level netlists are analyzed for path reconvergencies. These have to be identified for the reconvergency analysis described in the subsection "Merging of Test Pattern Pairs." Most path reconvergencies in practical circuits are of a local nature, so this procedure is quite useful. Results are also transferred to the local switch-level test generator for specific macros in order to obtain globally applicable local test patterns from the beginning (see the subsection "Reconvergency Analysis"). Furthermore, redundancies discovered during this initial step are included in the formal testability analysis. The constraint list is used later in conjunction with global implications in the FAN algorithm. The second preprocessing step for the gate-level test generation is the formal testability analysis. In particular, the formal controllability and observability analysis guiding the path-searching process in FAN was modified in order to also accommodate switch-level macros. Modified testability measures that can optionally be used are, for instance, COP [35], LEVEL [36], SCOAP [37], and EXTEST [8]. We obtained the best experimental results in MILEF by using COP. Furthermore, the initial FAN functions for implication, sensitization, and multibacktrace [1] are modified for handling switch-level macro cells described by multiple cubes [3]. Information for propagation over switch-level macros is described in a 9-valued logic, i.e., multiple cubes are used, and the good and faulty machines are described separately [10]. As MILEF is based on the transition fault model, pattern sequences instead of single vectors are generated. If possible, pattern pairs are merged into longer sequences to minimize initialization efforts and to save on the overall test length. The path searching performed by MILEF differs from approaches known from gate delay fault test generation in three ways:
• Initialization is excited at a particular gate and the test for a particular gate is propagated to primary outputs. The transition may not be observable directly as a transition at one output, which means it can also be observable because of a transition at an output in the faulty case. For example, in Fig. 45.3, the stuck-open fault at the n-transistor with the gate connected to A1 in g1 will cause a stable 1 at C and thus a rising transition at D, while in the "good case" D has a "0" value.
• Hazard analysis concentrates on static hazards that can invalidate the test by switching other paths or elements to "conducting" (see subsection "Robustness Check for Pattern Pairs").
• For the generation of only overcurrent-related test vectors in combination with Iddq measurements, the propagation of faults to primary circuit outputs can be omitted.
Omitting this propagation results in a simplified path-tracing process and fewer patterns.

Three phases of test generation are used in MILEF. In the first phase, test patterns are computed for all easily detectable faults; no backtracking is allowed, and neither dynamic fault sensitization [2] nor dynamic learning [4] is used in this step. The user may specify a backtrack limit for the second phase, in which the extensions of SOCRATES [2] are used. In the third phase, dynamic learning [4] is used for redundancy identification and for generating test patterns for hard-to-detect faults.
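The three-phase strategy amounts to a driver loop that retries undetected faults with progressively more expensive search options. The following Python sketch illustrates only this control flow; the generate_test callable, its option flags, and the return conventions are assumptions made for the example, not the actual MILEF interfaces.

    # Illustrative three-phase ATPG driver (not the actual MILEF code).
    # generate_test(fault, backtrack_limit, socrates_extensions, dynamic_learning)
    # is assumed to return a pattern sequence, the string "redundant", or None (aborted).

    def three_phase_atpg(fault_list, generate_test, user_backtrack_limit=10):
        phases = [
            # Phase 1: easy faults, no backtracking, no dynamic techniques.
            dict(backtrack_limit=0, socrates_extensions=False, dynamic_learning=False),
            # Phase 2: user-defined backtrack limit with the SOCRATES extensions.
            dict(backtrack_limit=user_backtrack_limit, socrates_extensions=True,
                 dynamic_learning=False),
            # Phase 3: dynamic learning for hard-to-detect faults and redundancy
            # identification.
            dict(backtrack_limit=user_backtrack_limit, socrates_extensions=True,
                 dynamic_learning=True),
        ]
        tests, redundant, remaining = [], [], list(fault_list)
        for options in phases:
            still_undetected = []
            for fault in remaining:
                result = generate_test(fault, **options)
                if result == "redundant":
                    redundant.append(fault)
                elif result is None:
                    still_undetected.append(fault)   # retry in the next phase
                else:
                    tests.append(result)
            remaining = still_undetected
        return tests, redundant, remaining           # remaining = aborted faults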


FIGURE 45.3 Stuck-open test in gate g1.

FIGURE 45.4 Communication between MILEF (MIxed LEvel FAN) and CTEST (CMOS Test); the interface functions are get_cubes, put_constraints, get_D_cubes, generate_test, reject, and simulate_pattern.

Robustness Check for Pattern Pairs

The notion of "robustness" differs slightly between authors. We follow the robustness definition of Reddy et al. described in [38], which avoids static hazards in the input cone of the fault location. Since MILEF has no information on timing conditions, a worst-case analysis for static hazard occurrences is performed. Starting from the fault location and working backward to the primary inputs of the circuit, every constant signal is checked for a static hazard. If both a rising and a falling transition are found at the inputs of a gate, the output of the gate is marked hazardous. If, for a given pattern, the static hazard is propagated to the fault location, that pattern is marked nonrobust. By using a fault simulator that takes timing conditions into account, such as FEHSIM [39], some of these nonrobust patterns may subsequently be identified as robust.
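The worst-case static-hazard check can be pictured as a marking pass over the circuit: a gate output that is constant in the pattern pair is marked hazardous if its inputs see both a rising and a falling transition (or an already hazardous signal). The Python sketch below is only illustrative and not MILEF code; the gate and signal representations are assumptions, and for simplicity it sweeps the whole circuit in topological order instead of tracing backward only through the input cone of the fault location.

    # Illustrative worst-case static-hazard marking (not the actual MILEF code).
    # Signal values of a pattern pair: '0', '1' (stable), 'R' (rising), 'F' (falling),
    # 'H' (potentially hazardous).

    def hazard_marking(gates_in_topological_order, values):
        """gates: iterable of (output_name, input_names); values: dict signal -> value."""
        for output, inputs in gates_in_topological_order:
            in_vals = [values[i] for i in inputs]
            rising = any(v in ('R', 'H') for v in in_vals)
            falling = any(v in ('F', 'H') for v in in_vals)
            # Worst case: with unknown timing, opposite transitions may overlap
            # and produce a static hazard at a constant gate output.
            if rising and falling and values[output] in ('0', '1'):
                values[output] = 'H'
        return values

    def pattern_is_robust(values, fault_location_inputs):
        # The pattern pair is marked nonrobust if a hazard reaches the fault location.
        return all(values[s] != 'H' for s in fault_location_inputs)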

Inter-Level Communications

In the present version, MILEF works on two levels of hierarchy: the gate-level and the switch-level. Communication procedures between the gate-level and the switch-level are performed systematically during several steps of the program execution. The communication functions shown in Fig. 45.4 are described as follows (a minimal interface sketch is given after the list):
• get_cubes: The logic function of a switch-level macro is computed, and the corresponding values are prepared for use at the gate-level.
• put_constraints: Constraints are computed at the gate-level for a block (e.g., by the reconvergency analysis described in the following section) and stored for constraint-driven ATPG at the switch-level.
• get_D_cubes: Propagation cubes are computed and prepared for propagating fault effects over a switch-level macro during test generation at the gate-level.
• generate_test: The previously determined local I/O values of a switch-level macro are used locally as an init-pattern at the switch-level, and the computed local test pattern is used at the gate-level for global propagation and application.

• reject: The actual local test pattern is not applicable at the gate-level, e.g., because the fault is redundant or aborted; an inter-level backtrack is performed. If possible, a new local test pattern is generated for the same fault condition.
• simulate_pattern: The local switch-level fault simulator is called to simulate the actual init/test-pattern pair locally. A robustness check is included.
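To make the division of labor concrete, the following Python sketch models the six calls as an abstract interface between the gate-level engine and the switch-level engine. The method names mirror the communication functions above, but the signatures and data types are assumptions made for illustration, not the actual MILEF/CTEST interfaces.

    from abc import ABC, abstractmethod

    class SwitchLevelEngine(ABC):
        """Hypothetical view of the switch-level (CTEST) side of the communication."""

        @abstractmethod
        def get_cubes(self, macro):
            """Return the Boolean function of a switch-level macro as multiple cubes
            for use at the gate-level."""

        @abstractmethod
        def put_constraints(self, macro, constraints):
            """Store gate-level constraints (e.g., from the reconvergency analysis)
            for constraint-driven ATPG at the switch-level."""

        @abstractmethod
        def get_D_cubes(self, macro):
            """Return propagation cubes used to propagate fault effects over the
            macro during gate-level test generation."""

        @abstractmethod
        def generate_test(self, macro, fault, local_io_values):
            """Generate a local init/test-pattern pair for a fault inside the macro,
            starting from the given local I/O values."""

        @abstractmethod
        def reject(self, macro, fault):
            """Called when the local pattern is not applicable at the gate-level;
            try to produce a new local test pattern for the same fault."""

        @abstractmethod
        def simulate_pattern(self, macro, init_pattern, test_pattern):
            """Fault-simulate the init/test-pattern pair locally, including a
            robustness check."""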

Reconvergency Analysis

The reconvergency analysis is based on a learning technique applied at reconvergent paths. It is executed as a preprocess of the test generation. The learning technique is similar to SOCRATES [2]. The algorithm is described as follows:

    for all switch-level macros M
        for all inputs S of M that are part of a reconvergency in M
            assign all signals of the circuit to value X;
            assign S to value 0;                                              (1)
            implication;                                                      (2)
            for all inputs I of M that were set to a value 0 or 1 by the implication
                store_constraint(S = 0 => I = value of I);                    (a)
            assign all signals of the circuit to value X;
            assign S to value 1;                                              (1)
            implication;                                                      (2)
            for all inputs I of M that were set to a value 0 or 1 by the implication
                store_constraint(S = 1 => I = value of I);                    (a)

Every input of a switch-level macro M that is part of a reconvergency in M, i.e., that lies on one of the reconvergent paths reconverging in M, is set to both values 0 and 1 (1). By performing the implication of the modified FAN algorithm (2), every value resulting from this assignment is computed. If any assignment is made at any other input of M, a local constraint at M has been found. This constraint and its contraposition are stored by the function store_constraint (a). Only simple constraints (dependencies between two signals) can be detected by this technique, e.g., if input A is 1, input B must be 1.

By using these constraints at the switch-level and performing constraint-based switch-level test generation, a large number of inter-level backtracks can be avoided, and thus the CPU time of switch-level test generation can be reduced. Since the constraints result from a simple FAN implication, no backtrack is required during test generation for computing the constraint behavior. For example, assume the constraint (a = 0 => b = 0) is detected by the reconvergency analysis. A simple implication with the starting condition a = 0 will then detect that b has to be assigned to 0. Since the same implication function is also performed during the test generation process whenever a is set to 0, b is set to 0 by the implication anyway; thus no CPU time is saved in the test generation at the gate-level, and the benefit lies in the avoided inter-level backtracks at the switch-level. Experimental results showing the efficiency of the reconvergency analysis are given in Table 45.2.

TABLE 45.2  Inter-Level Backtracks in Test Generation

ISCAS Benchmark Circuit     Inter-Level Backtracks,        Inter-Level Backtracks,         Saved
Extracted from Layout       no Reconvergency Analysis      with Reconvergency Analysis     Backtracks (%)
lay432                               19                              0                        100
lay880                               29                              2                         93.1
lay1355                              65                              2                         96.9
lay2670                             254                             31                         87.8
lay3540                             147                             89                         39.5
lay5315                             336                             40                         88.1
lay6288                             542                              0                        100
lay7552                             221                             46                         79.2

FIGURE 45.5 Reconvergency analysis of a simple circuit (signals A, B1, B2, C, D, E; an AND gate; switch-level macro SLM).

A simple example is shown in Fig. 45.5. By performing the reconvergency analysis, a constraint between the inputs of the macro SLM is detected: if B2 is set to 0, the implication sets D to 0. Thus the constraint (B2 = 0 => D = 0) is found and stored at SLM using the function put_constraints. Table 45.2 compares the number of inter-level backtracks without the reconvergency analysis to test generation with the reconvergency analysis. Depending on the structure of the benchmark circuits, between 39% and 100% of the inter-level backtracks could be saved. The average saving for the ISCAS '85 benchmark circuits was about 85%. These results are also important for hierarchical test generation at higher levels of abstraction (RT-level, behavior-level). The authors believe that a simple constraint analysis on each abstraction level of the circuit could save significant CPU time. This requires a constraint-driven test generation technique at the lower abstraction levels.
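As a further illustration of the constraint-learning loop given above, the following Python sketch derives simple two-signal constraints (and their contrapositions) by running an implication procedure from each relevant macro input. The implication callable and the macro attributes are placeholders for the modified FAN implication and the netlist data structures, not the actual MILEF code.

    # Illustrative sketch of the reconvergency analysis (not the actual MILEF code).
    # implication(assignments) stands in for the modified FAN implication: given a
    # dict of forced signal values, it returns the dict of all implied signal values.

    def reconvergency_analysis(macros, implication):
        constraints = []                                  # list of (trigger, implied) pairs
        for macro in macros:
            for s in macro.reconvergent_inputs:           # inputs on reconvergent paths into the macro
                for value in (0, 1):                      # try S = 0 and S = 1              (1)
                    implied = implication({s: value})     # modified FAN implication         (2)
                    for i in macro.inputs:
                        if i != s and implied.get(i) in (0, 1):
                            # Store the constraint and its contraposition.                   (a)
                            constraints.append(((s, value), (i, implied[i])))
                            constraints.append(((i, 1 - implied[i]), (s, 1 - value)))
        return constraints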

Merging of Test Pattern Pairs

The test pattern set computed by MILEF is not minimal. A test set compaction method is implemented to reduce the number of test patterns: whenever possible, every generated test pattern is reused as the initialization pattern for the next undetected fault. Experimental results with the ISCAS '85 and ISCAS '89 benchmark circuits have shown that between 30% and 45% of the stuck-open patterns can be saved with this method. The method is described in detail in [40].
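A greedy version of this compaction idea can be sketched as follows; the generate_pair and detects helpers are hypothetical stand-ins for the MILEF pattern generator and fault simulator, not the actual interfaces.

    # Hypothetical sketch of test-pattern-pair merging (greedy compaction).
    # generate_pair(fault, init_hint) -> (init_pattern, test_pattern) or None
    # detects(pattern_pair, fault)    -> True if the pattern pair detects the fault

    def merge_pattern_pairs(fault_list, generate_pair, detects):
        sequence = []                                   # merged overall test sequence
        previous_test = None
        remaining = list(fault_list)
        while remaining:
            fault = remaining.pop(0)
            pair = generate_pair(fault, init_hint=previous_test)
            if pair is None:
                continue                                # redundant or aborted fault
            init, test = pair
            if previous_test is not None and init == previous_test:
                sequence.append(test)                   # initialization vector merged away
            else:
                sequence.extend([init, test])
            previous_test = test
            # Fault-simulate the pair and drop all additionally detected faults.
            remaining = [f for f in remaining if not detects((init, test), f)]
        return sequence

Whenever the previous test vector can serve as the initialization of the next pair, one vector is saved, which is the mechanism behind the pattern reduction reported above.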

Comparative Results

MILEF was applied to gate-level benchmark circuits, to mixed gate-level netlists, and to pure transistor netlists. Computations were performed on a Sun SPARC2 with a general limit of 10 backtracks per fault. The computational effort for fault simulation, which is small in comparison to the test generation effort, is not included in the CPU times of the tables. MILEF was operated in the following modes:
1. Single-input robust transition fault generation, including full stuck-open coverage and stuck-on/local bridging fault excitation.
2. Stuck-at test at the gate-level, including input/output stuck-at tests for switch-level macros.
3. Iddq test patterns derived from stuck-at patterns, with no propagation to the outputs.
The fault coverage FC is computed by

fault coverage = detected faults / total number of faults

Table 45.3 gives results for switch-level circuits containing primitive and complex gates. The layouts of the ISCAS '85 circuits [41] were synthesized by MCNC, Raleigh, NC. For mixed-level netlists containing nontrivial switch-level primitives, no standard benchmarks could be used. The circuits in Table 45.3 contain several types of complex gates; the number of complex gates was about 30–40% of the total number of gates in each circuit. This evaluation includes extraction and test generation on two levels with robustness checking.
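For concreteness, the coverage percentages in Tables 45.3 through 45.5 follow directly from this ratio; the numbers in the small example below are made up purely for illustration.

    def fault_coverage(detected_faults, total_faults):
        """Fault coverage FC = detected faults / total number of faults, in percent."""
        return 100.0 * detected_faults / total_faults

    # Example (illustrative numbers only): 983 detected out of 1000 faults -> 98.3%.
    print(round(fault_coverage(983, 1000), 1))   # 98.3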


TABLE 45.3  MILEF Performance on Pure Switch-Level Netlists (Robust Single-Input Transition Fault Model), Mode 1

Circuit    Faults Redundant + Aborted    Robust Test Patterns    Non-Robust Test Patterns    Robust Fault Coverage (%)    CPU Time/s
lay432              46 + 0                       511                       63                         89.5                     41
lay499               8 + 0                      1148                      219                         85.8                    503
lay880               0 + 4                       491                       43                         97.2                     30
lay1355              5 + 1                      1528                      295                         85.4                    586
lay1908              8 + 6                      1056                       93                         95.6                    146
lay2670             65 + 22                     1319                       64                         95.1                    234
lay3540            121 + 3                      1858                      166                         92.6                    487
lay5315             33 + 2                      3036                      205                         96.3                    317
lay6288              2 + 1                      1282                       63                         91.1                   4017
lay7552             46 + 45                     3079                      258                         95.7                   1742

TABLE 45.4  Performance of MILEF (Stuck-at Fault Model), Mode 2

Circuit    Faults Redundant + Aborted    Test Patterns    Fault Coverage (%)    CPU Time/s
C432                 4 + 0                     77                99.2                 9
C499                 8 + 0                     90                98.9                34
C880                 0 + 0                    120               100                   2
C1355                8 + 0                    107                99.5                61
C1908                9 + 0                    142                99.5                42
C2670              117 + 0                    266                95.7                76
C3540              137 + 0                    222                96.0                58
C5315               59 + 0                    286                98.9                30
C6288               34 + 0                     41                99.6               133
C7552              131 + 0                    350                98.3               179

TABLE 45.5  Performance of MILEF (Stuck-at Patterns with Simplification for Overcurrent Tests), Mode 3

Circuit    Faults Redundant + Aborted    Test Patterns    Fault Coverage (%)    CPU Time/s
C432                 0 + 0                     14               100
C499                 0 + 0                     44               100
C880                 0 + 0                     22               100
C1355                0 + 0                     91               100
C1908                0 + 0                     45               100
C2670               13 + 12                    57                99.5
C3540                1 + 0                     66                99.9
C5315                1 + 0                     67                99.9
C6288               17 + 1                     59                99.9
C7552                4 + 0                    106                99.9