MICROPROCESSORS A PROGRAMMER'S VIEW
ROBERT B. K. DEWAR / MATTHEW SMOSNA
MICROPROCESSORS: A PROGRAMMER'S VIEW
Robert B. K. Dewar
Matthew Smosna
Computer Science Department
Courant Institute, New York University

McGraw-Hill Publishing Company
New York  St. Louis  San Francisco  Auckland  Bogota  Caracas  Hamburg  Lisbon  London  Madrid  Mexico  Milan  Montreal  New Delhi  Oklahoma City  Paris  San Juan  Sao Paulo  Singapore  Sydney  Tokyo  Toronto
MICROPROCESSORS: A PROGRAMMER'S VIEW

Copyright © 1990 by McGraw-Hill, Inc. All rights reserved. Printed in the United States of America. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means, or stored in a data base or retrieval system, without the prior written permission of the publisher.

1 2 3 4 5 6 7 8 9 0   DOC DOC   9 5 4 3 2 1 0

ISBN 0-07-016638-2 (soft)
ISBN 0-07-016639-0 (hard)

This book was set in Adobe Garamond and Helvetica by the authors using Xerox Ventura Publisher, Adobe Illustrator, and Corel Draw.
The editor was David M. Shapiro.
R. R. Donnelley & Sons Company was printer and binder.

Library of Congress Cataloging-in-Publication Data

Dewar, Robert B. K.
  Microprocessors: A Programmer's View / Robert B. K. Dewar, Matthew Smosna
    p. cm.
  Includes bibliographical references.
  ISBN 0-07-016638-2 (soft). ISBN 0-07-016639-0 (hard)
  1. Microprocessors - Programming. I. Smosna, Matthew. II. Title.
  QA76.6.D515  1990
  005.26 - dc20                                             89-77320
Figures 3.5 (p. 90), 4.1 (p. 105), 4.3 (p. 117), 4.4 (p. 118), 11.1 (p. 343), 11.2 (p. 344), 11.3 (p. 346), 11.4 (p. 347), 11.5 (p. 363), and 11.6 (p. 364) were reprinted with the permission of Intel, Inc., Copyright © Intel Corporation. The terms i386, i386 DX, i376, i860, and i486 are trademarks of the Intel Corporation.

Figures 6.1 (p. 166), 6.5 (p. 198), 7.1 (p. 204), 7.2 (p. 205), and 7.4 (p. 217) were reprinted with the permission of Motorola, Inc.

Figures 9.1 (p. 266), 9.2 (p. 268), 9.3 (p. 269), 9.4 (p. 273), 9.6 (p. 288), 9.7 (p. 289), 9.8 (p. 295), and 9.9 (p. 297) were reprinted with the permission of MIPS Computer Systems, Inc.

Figures 10.1 (p. 303), 10.2 (p. 306), 10.3 (p. 308), and 10.5 (p. 318) were reprinted with the permission of Sun Microsystems, Inc. © Copyright Sun Microsystems, Inc., 1989. All rights reserved.
To my parents, Michael and Mary Dewar

To my father, Stanislaw Smosna
ABOUT THE AUTHORS
Robert B. K. Dewar is a Professor of Computer Science and past chair of the department at the Courant Institute of Mathematical Sciences at New York University. He has been involved with computers for over twenty-five years and has written major software systems including real-time operating systems for Honeywell on early microprocessors and a series of compilers. The SPITBOL compiler, which he originally wrote nearly twenty years ago for mainframe computers, has now been ported to most major microprocessors, including most recently the SPARC. He wrote the back end and run-time library for the Realia COBOL compiler for the IBM PC, and more recently has been involved with the Ada language, for which he was one of the language reviewers. He has also been involved in the design and implementation of the Alsys Ada compilers for the IBM PC and other microprocessors.

Matthew Smosna is a Research Scientist at the Courant Institute of Mathematical Sciences at New York University. He has worked on several implementations of the SETL system (SETL is a set theoretic language developed at NYU), and is currently involved in the implementation of a new Ada compiler for the IBM RP3 (an experimental parallel processor). His main field of research is compiler technology, with an emphasis on code generation techniques. He has taught graduate and undergraduate compiler courses at several universities, including NYU, and is currently writing a textbook on compiler design, based on the class notes, for McGraw-Hill.
CONTENTS

Preface  xvii

Chapter 1   Microprocessors  1
    What Is a Microprocessor?  2
        The User-Level View of a Microprocessor  3
        The System-Level View of a Microprocessor  4
        CISC and RISC Microprocessors  5
    Registers, Addressing, and the Instruction Formats  5
        Register Sets  6
        Addressing Modes  7
        Designing Instruction Formats  8
    Data Representation  9
        Representation of Characters  9
        Representation of Integers  10
        Packed Decimal  14
        Floating-Point Values  15
    Memory Organization  16
        Big-Endian vs Little-Endian Byte Ordering  18
        Big-Endian vs Little-Endian Bit Ordering  19
        The Alignment Issue  21
    Procedure Calls  23
        The Call Instruction  23
        Building a Stack Frame  24
        Why a Frame Pointer Is Needed  25
        Hardware Support for Stack Frames  26
        Accessing Non-Local Variables  27
    Addressing Modes  27
        Direct Memory Addressing  28
        Indexed Addressing  29
        Based Addressing  30
        Base Plus Index Addressing  31
        Indirect Addressing  33
        Indirect Addressing with Indexing  33
        Even More Complicated Addressing Modes  35
    Memory Mapping  37
    Virtual Memory  39
    Memory Caching  39
    Tasking  41
    Exceptions  42
        Hardware Support for Exceptions  43

Chapter 2   Introduction to the 80386  45
    Register Structure  45
    Special Registers and Instructions  48
    Maintaining Compatibility with the 8086/88  49
    The User Instruction Set  50
        Basic Data Movement Instructions  51
        Basic Arithmetic and Logical Operations  51
        Multiplication and Division Instructions  53
        Decimal Arithmetic  56
        String Instructions  57
        Shift Instructions  59
        The Set on Condition Instructions  59
        Summing Up  60
    Registers and the Run-time Stack  61
        Why EBP Is Needed  61
        Instructions That Make Use of ESP and EBP  63
    Instruction Timing  70
        Pipelining and Instruction Timings  71
        Timing the ENTER Instruction  72

Chapter 3   Addressing and Memory on the 80386  75
    Memory on the 386  75
        Alignment Requirements  76
        Byte and Bit Ordering  76
    Addressing Using 16-Bit Addressing Modes  77
    Addressing Modes  80
        Direct Addressing  80
        Based Addressing  81
        Based Addressing with Displacement  82
        Double Indexing  84
        Double Indexing with Scaling  84
    Segmentation on the 80386  85
        Historical Aspects  85
        The Global Descriptor Table  87
        Levels of Protection  90
        Protection Mechanisms  90
        Operating System Structure  91
        Validating Parameters  93
        Is All This Worthwhile?  95
    The 80386 Instruction Formats  96

Chapter 4   Tasking, Virtual Memory, and Exceptions on the 80386  103
    Tasking  103
        The Local Descriptor Table  104
        Context Switching  104
        Shared Memory  106
        The System Stack  109
        Virtual Machine Support  109
        But Is It All Worth It?  111
    Virtual Memory Management  112
        Virtual Segmentation  112
        Segment-Swapping Algorithms  113
        Paging  115
        The Format of Virtual Addresses  116
        Handling Page Faults  120
        Virtual Segmentation and Virtual Paging  123
        An Anecdote on Paging and Protection  124
        Paging and Virtual 8086 Mode  125
    Exceptions on the 386  126
        A Sad Story  126
        How an Exception Is Handled  128
        Asynchronous Interrupts  129
        Writing Exception Handlers  130
        Fault Traps  131
        Debugging Support  132

Chapter 5   Microprocessors and Floating-Point Arithmetic  135
    Floating-Point Implementations  136
    Floating-Point Operations: A Programmer's Nightmare  137
    The Ada Approach  138
    The IEEE Floating-Point Standard  139
        Basic Formats  139
        Rounding Modes  140
        Extended Precision Formats  144
        Overflow and Infinite Values  147
        Not a Number (NaNs)  148
        Handling of Underflow  148
        Specialized Operations  150
    Implementing the IEEE Standard  151
    The Intel 387 Chip  152
        The 387 and the IEEE Standard  152
        The Register Set of the 387  153
        The Instruction Set of the 387  154
        Executing 387 Instructions  157
        Coprocessor Emulation  159
        Context Switching  160
    The Weitek Chipset: An Alternative Approach  161
        Memory Mapped Access  161
        The Weitek Instruction Set  162

Chapter 6   The 68030  163
    The 68030 User Programming Model  165
    The 68030 User Register Set  166
        Special Purpose Registers  167
    The 68030 Linear Address Space  168
        C, Pointers, and the Linear Address Space  168
        Data and the Linear Address Space  169
        Byte Ordering  170
        Bit Ordering  171
    The 68030 User-Level Instruction Set  172
        Data Movement Instructions  172
        Integer Arithmetic Instructions  173
        Logical and Shift Instructions  174
        Bit Field Instructions  174
        Program Control Instructions  177
        Decimal Instructions  178
        The CAS2 (Compare and Double Exchange) Instruction  180
    The 68030 Addressing Modes  183
        Addressing Modes and Instruction Sizes  184
        Simple Data Movement  185
        Postincrement and Predecrement Modes  186
        The Register Indirect with Displacement Mode  189
        The Register Indirect with Index Modes  190
        The Memory Indirect Addressing Modes  192
        PC Relative Addressing Modes and Position Independent Code  195
        Restrictions on the Use of the Addressing Modes  196
    Floating-Point on the 68030  196
    Instruction Formats on the 68030  198
    Conclusion  201

Chapter 7   The 68030 Supervisor State  203
    The Supervisor State Registers  204
    The Privileged Instruction Set  206
    Addressing on the 68030  208
    Caching on the 68030  209
        Cache Organization  209
        Cache Performance  210
        Cache Control  211
    The 68030 Memory Management Unit  214
        The Address Translation Cache  215
        The 68030 Paging Mechanism  216
        The Structure of a 68030 Page Table  220
        Transparent Translation  221
    Context Switching  222
    Trace Control  222
    Exceptions  223
        Trap Processing  223
        Interrupt Processing  224
        Reserved Exceptions  225

Chapter 8   An Introduction to RISC Architectures  229
    CISC Architectures  230
    The IBM 360 Series  233
    What Is RISC?  236
        One Instruction per Clock Cycle  237
        Pipelining  238
        Simplified Memory Addressing  247
        Avoiding Microcoding  248
        Register-to-Register Operations  249
        Simple Instruction Formats  250
    Register Sets in RISC Machines  251
    CISC, RISC, and Programming Languages  254
    The First RISC Processors  257
        The CDC 6600  257
        The IBM 801 Project  258
        The Berkeley RISC and Stanford MIPS Projects  264
    Summary  264

Chapter 9   The MIPS Processors  265
    The MIPS Chip  266
        Register Structure of the CPU  267
        The Instruction Pipeline  268
        The Stall Cycle  270
    The Instruction Set  272
        The Instruction Formats  272
        The Load and Store Instructions  273
        The Computational Instructions  277
        Immediate Instructions  281
        The Jump and Branch Instructions  282
        Procedure Call Instructions  283
        The Coprocessor Instructions  283
        Special Instructions  283
    Addressing Modes  284
        Direct Addressing  284
        Indexed and Base/Index Addressing  285
        Base Plus Offset Addressing  286
    Memory Management on the MIPS  287
        The Address Space  287
        The Instruction and Data Caches  289
        The Translation Lookaside Buffer  290
    Floating-Point Operations on the MIPS  294
        Instruction Scheduling  296
        Trap Handling and Overlapped Execution  297
    Exception Handling on the MIPS  298
        Hardware Interrupts  300
    Conclusion  300

Chapter 10  The SPARC Architecture  301
    General Organization  302
    The SPARC Signals  303
    The IU Register Set  304
        The User Register Set  304
        The System Register Set  306
        Register Windows  307
        Managing the Register File  310
    SPARC Addressing Modes  315
    The SPARC Instruction Set  317
        The Call Instruction Format  318
        The General Instruction Format  318
        The SETHI Instruction Format  326
        The Conditional Branch Instructions  327
    Exceptions  331
    Floating-Point on the SPARC  333
        Floating-Point Registers  334
        Overlapped Multiplication and Addition  334
    The SPARC Implementations  338
    Conclusion  339

Chapter 11  The Intel i860  341
    A Summary of the i860  342
    Basic Structure of the i860  342
    Instruction Formats  344
    The Processor Status Registers  345
        Extended Processor Status Register  347
        Debugging Support  350
    Memory Management  350
    The i860 Cache  351
    The Integer Core Instruction Set  352
        Load and Store  352
        Integer Addition and Subtraction  353
        Multiplication and Division on the i860  354
        The Shift Instructions  354
        The Logical Instructions  355
        Control Transfer (Branch and Jump) Instructions  355
        A Digression on Ada - Access Before Elaboration  356
    Floating-Point Operations on the i860  357
        Floating-Point Load and Store  357
        Floating-Point Addition  360
        Floating-Point Multiplication  361
        Adding and Multiplying at the Same Time  362
        Using Dual Instruction Mode  365
        Floating-Point Division  367
        Floating-Point Square Root  368
        IEEE 754 Compatibility  369
    The i860 Graphics Unit  370
        Graphics Pixel Data Type  370
        Graphics Instructions  371
    Exceptions  373
    Context Switching  374
    Programming Model  374

Chapter 12  The IBM RISC Chips  377
    The IBM RIOS Architecture  380
        The Branch Unit  380
        The Arithmetic-Logic Unit  383
        The Floating-Point Unit  384
        Register Renaming  385
        Data Cache  385
    The RIOS Instruction Set  386
        Bit Field Instructions  387
        Complex Instructions  387
        Floating-Point Instructions  388
        Branch Instructions  390
        Condition Flag Instructions  392
    Memory Addressing  393
        Addressing Modes  393
        Direct Addressing  394
        Operand Alignment  395
        Big-Endian Ordering  396
    Memory Management  397
        Paging Mechanisms  398
        Hardware Locking  400
    An Example: Matrix Multiplication  403
        Scheduling Comparison Instructions  407
        Hand-Coded Routines  407
    Summary  408

Chapter 13  The INMOS Transputer  409
    The Transputer and Occam  410
    The Structure of the Transputer  411
        Instruction Format  412
        Register Structure  413
        Memory Structure  414
        Loading Values From Memory  416
        Extending Operand Values  417
        The Remaining Basic Instructions  420
        The Extended Instruction Set (Operate)  425
        Using the Evaluation Stack  426
    Communication Between Transputers  427
        Internal and External Channels  427
        Process Control  432
        Interrupt Handling  433
        Error Handling  434
        Possible Network Arrangements  434
        Other Network Arrangements  434
    Conclusion  435

Chapter 14  The Future of Microprocessor Design  437
    Developments in Instruction-Set Designs  437
    The CISC Chips Fight Back  438
    CISC vs RISC  441

Glossary  443
Bibliography  455
Index  457
PREFACE

The introduction of microprocessors some ten years ago was an important milestone in the use of computers. The early microcomputers had limited power, but there are many tasks which are satisfied by this limited power, such as control of washing machines, automobile ignition systems, and computer games. As a result, the average house is likely to have dozens of devices that would be regarded as powerful computers by the standards of the early developers in the field.

More recently, the technology has advanced to the point where microprocessors have achieved very substantial computing power, challenging much larger systems, and this book examines and compares these powerful microprocessor architectures. What we attempt to do in writing this book is to look at these processors from a software point of view. You will find few schematic diagrams in the book, since we are not interested in the hardware level design. You will, on the other hand, find many assembly language programming examples, showing the significance of the architectural variations between the processors we examine.

The challenge of describing what a programmer needs to know about the architectural features of microprocessors has been made even more difficult, but also more entertaining, by a basic split in the architectural philosophies influencing microprocessor design. Until recently, microprocessor development has shown a trend to ever more complex hardware, including specialized features intended to support the use of high-level languages and operating systems. As VLSI techniques have allowed architects to pack more transistors on a chip, they have been able to produce microprocessors with capabilities going well beyond the mainframes of only a decade ago.

In a recent sharp reaction to this trend, a number of designers have proposed, designed and implemented a class of processors known as RISC processors, or Reduced Instruction Set Computers. Reduced instruction set computers are streamlined processors with a simplified instruction set. This simplified instruction set allows a hardware designer to use specialized techniques to increase the performance of a machine in a manner that is peculiar to these architectures. The RISC view is basically that more is not better if the efficiency cost is too high.

Mainstream architectural designs have been dubbed CISC, for Complex Instruction Set Computers, by RISC proponents. The implicit criticism in this acronym suggests that these processors are much too complicated. Some advocates of CISC designs have retorted that the term CISC should be taken to refer to Complete Instruction Set Computers. It all depends on the point of view.

In this book, we examine most of the important microprocessors, including both RISC and CISC processors. We certainly don't attempt to describe every representative feature of every processor in complete detail (the book would be too heavy to carry around if we did), but we do attempt to cover the most interesting points, and the RISC vs CISC debate is a unifying theme that runs through the book. The importance of RISC processors is well established; Wall Street is almost as familiar with the term as the computer science establishment. In this book we attempt to provide a perspective on the issues and to give a basis for looking into the future to see where this design controversy might lead.

The text is based on a "special topics" graduate course taught at New York University in the spring semester of 1989 by Robert B. K. Dewar. Matthew Smosna began taping the lectures, transcribing and typesetting them, and finally organized the notes in the first version of this book. With the help of our reviewers' comments, we then made several passes through that version, making many technical corrections and additions; the result is the book that you are now reading.

Selected chapters of the text were read by several of our friends and colleagues at New York University, including Fritz Henglein (now at Rijksuniversiteit Utrecht); Yvon Kermarrek, New York University; Cecilia Panicali, New York University; and Jay Vandekopple, Marymount College. Jim Demmel, New York University, reviewed Chapter 5. Stephen P. Morse, the principal designer of the 8086, helped by telling the inside story of the design of this processor. Dan Prener, IBM Research, helped us to better understand the RIOS. Marc Auslander and Peter Oden, also of IBM Research, shared their recollections of the 801 Project. Our special thanks go to Richard Kenner and Ed Schonberg, both of New York University, who read the whole manuscript, and in some cases read some chapters several times.

We would also like to thank our official reviewers: John Hennessy, Stanford University; Kevin Kitagawa, Sun Microsystems; Daniel Tabak, George Mason University; and Safwat G. Zaky, University of Toronto.

Our proximity to McGraw-Hill in New York City led us to come in unusually close contact with several members of the McGraw-Hill staff. Our sponsoring editor, David Shapiro, provided an enormous amount of daily support. Joe Murphy, senior editing manager, assisted the authors in the art of book design, which was done entirely by the authors using desktop publishing on Compaq PCs (we don't just talk about microprocessors!). Jo Satloff, our copy editor, did a wonderful job editing what was at times a very rough manuscript. Ingrid Reslmaier, editorial assistant, helped with a multitude of miscellaneous tasks and telephone calls.

Finally, in the best tradition of book authors, we wish to thank our wives Karin and Liz for putting up with us and providing invaluable support during the very busy year of 1989, during which we prepared this book.

Robert B. K. Dewar
Matthew Smosna
CHAPTER
1
MICROPROCESSORS
Microprocessors have revolutionized the use of computers at all levels of society. We are rapidly reaching the point where every kitchen appliance and every child's toy will contain a fairly sophisticated processor. In recent years, microprocessor technology has advanced to the point that performance levels rivalling those of mainframes can be achieved. This book addresses the subject of these high-end microprocessors.

Two important events led to the greatly increased usage of microprocessors. First, the introduction of the IBM PC led to the widespread use of personal computers based on the Intel series of microprocessors. Second, a number of companies, including Sun and Apollo, marketed workstations based on the Motorola microprocessors. This popularized the notion in engineering circles that it was often more effective to have a reasonably powerful workstation on your desk than a small share of a powerful mainframe.

It looked for a while as though the microprocessor products of Intel and Motorola would dominate the marketplace in high-performance personal computers and workstations. However, requirements for ever increasing performance, particularly for engineering workstations, combined with continuing work in design and implementation of microprocessor architectures, have recently led to an explosion of alternative architectures.
These alternative architectures are based on new concepts of microprocessor design, collectively referred to as reduced instruction set computers (RISC). In the past, large instruction sets were generally considered an advantage; manufacturers would proudly advertise "over 200 distinct instructions" in their glossy brochures. RISC advocates have turned this idea on its head by proclaiming that when it comes to microprocessor instruction design, "less is more."

The idea behind the RISC architectural philosophy is that by simplifying and reducing instruction sets, by eliminating non-essential instructions, the remaining instructions can be made to run very much faster. By non-essential instructions we mean those executed so infrequently that replacing them with sequences of simpler instructions does not have any noticeable impact on efficiency. The fundamental observation that inspired the original RISC research was that only a small part of the instruction set of most other processors was commonly executed; a large number of instructions were executed rather infrequently.

The existing philosophy, which has recently been described as Complex Instruction Set Computers (CISC) in a somewhat derisive fashion, is by no means dead, and indeed virtually all personal computers, including those made by IBM, Apple, Atari, Commodore, and Amiga, still use CISC chips. When it comes to workstations, RISC chips are making significant inroads, although many workstations are still powered by CISC chips.

The continuing controversy between the CISC and RISC camps is a fierce one. A recent New York Times article describing a conference on the West Coast sounded more like coverage of a boxing match than a scientific meeting.¹ The winner in the judgment of the reporter was RISC, but certainly not by a knockout.

In this book we look at a number of representative CISC and RISC microprocessor designs, as well as some which do not clearly fall into either category (the line between the two philosophies is not always completely clear from a technical point of view). Our intention is to understand the strengths and weaknesses of the two approaches, and to begin to guess how the argument will eventually be settled.

¹ "Computer Chip Starts Angry Debate," New York Times, September 17, 1989.

WHAT IS A MICROPROCESSOR?

One of the distinguishing characteristics of the microprocessor is that it is usually implemented in VLSI. This means that, unlike minicomputers and mainframes, the complete machinery of the computer is present on a single chip, or possibly a very small number of chips. Floating-point operations, for example, are often implemented using a separate coprocessor chip.

As the architectural features of microprocessors have become more sophisticated, they have become less distinctive as a separate category of machine. With the most advanced microprocessors now being used in workstations, the gap between minicomputer and microcomputer has become somewhat blurred, at least from a programmer's point of view.

Another important characteristic of microprocessors is that they are relatively inexpensive commodity items. They can be bought off the shelf (the price range is typically $5 to $800), and computers are then built around the microprocessors by second-party manufacturers. Unlike the IBM 370, where the processor is simply one inseparable part of a complete computer system, the microprocessor is a separate chip that appears in many different hardware environments. For instance, the Intel 8088 chip is used in IBM PCs, but it also turns up as the controller chip for advanced automobile ignition systems. Very few computers using the 8088 are manufactured by Intel, and indeed Intel is not really in the business of manufacturing computers.

One consequence of this approach is that there is no such thing as an Intel 8088 computer or a Motorola 68030 computer. In both cases, and in the case of most of the other microprocessors we will look at in this book, there are a great variety of computers using these chips, and two computers using the same chip may be quite incompatible in many respects. While they may share the same basic instruction set, such issues as memory access, input/output devices, and even the way floating-point computations are performed may vary from one computer to another.

The actual cost of producing a microprocessor is very small, probably just a few dollars. Of course, this figure does not take into account the fact that designing a new microprocessor may cost tens of millions of dollars, which must be recovered in the selling price. However, it does mean that in applications where sufficient numbers of chips can be sold, it becomes feasible to mass produce what are in effect extremely sophisticated computers at remarkably low prices. These chips appear not only in automobile ignition systems, but in microwave ovens, washing machines, televisions, and many other items usually not thought of as requiring the power of a computer.

An interesting application for the near future is in high-definition television (HDTV). HDTV requires sophisticated real-time data compression and decompression algorithms, which need powerful processing capabilities. Within a few years, every living room will probably have more processing power available than the typical large computer center of a few years past. The microprocessor makes such a commitment to large-scale computing practical.
The User-Level View of a Microprocessor

When experienced programmers open up a manual describing a new microprocessor for the very first time, there are certain questions that they have learned to ask. What is the register structure like? What data types are supported by the machine? Are there any interesting or unusual instructions? Does it support tasking or virtual memory? How are interrupts handled? In the first chapter of this book, we will cover some of these general issues to set the scene for looking at individual designs.

Three common places to jump in and look at a new microprocessor are the register set, the instruction set, and the addressing modes. These aspects of a computer architecture are particularly important to assembly language programmers and compiler writers, who must understand this part of the processor perfectly in order to take advantage of the machine. Ideally, high-level language programmers need to know nothing about the inner workings of a processor for which their programs are compiled. To a large extent this is true in practice; a C programmer can move C programs from one processor to another without knowing details of the different architectures. However, it is often useful to know what's going on, especially when things go wrong. It's much the same situation as driving a car: it is possible to switch from driving one car to another without being a mechanic, but if the engine suddenly conks out, it is useful to be able to look under the hood and know what is there and how it works. In that spirit, we hope to be able to provide a description of microprocessor architectures which will allow you to judge the impact of the instruction set, addressing modes, and many aspects of the hardware that influence how software is written for these machines.
The System-Level View of a Microprocessor

In addition to the user or applications view of a microprocessor, an operating system designer must understand those features of the processor that are intended for implementing system tasks, including

• Tasking and process management.
• Memory management and cache control.
• Exceptions (traps and interrupts).
• Coprocessor and floating-point unit support.

These are the basic issues we will look at as we examine several microprocessors from the point of view of someone designing a complete system.

The issues of tasking and process management have to do with the support provided in the hardware that allows two or more tasks with separate threads of control to execute on the processor as if each of them were executing simultaneously. Since a single processor can execute only one task at a time, this is achieved by allowing each task to control the processor for a few cycles, with the operating system switching between the different tasks so that they seem to be executing at the same time.

Memory management and cache control both have to do with how an operating system controls a task's use of memory. Most microprocessors have hardware support for virtual memory, allowing a task's addressable memory to be larger than the real memory on a machine, as well as allowing several tasks to share that memory. Caches are small (but fast) memories that hold copies of the data in the most frequently used memory locations.

Exceptions are events that cause the normal execution of a program to be interrupted. They can occur due to an internal event such as an attempt to divide by zero (a trap), or due to an external event such as a keyboard stroke (an interrupt).

Finally, the ability to perform floating-point computations in hardware is provided by almost all microprocessors, either on the chip itself or using a separate coprocessor chip. Floating-point support is particularly important given the fact that microprocessors are commonly used to build workstations for scientific and engineering use.

For each microprocessor, we will describe how these features are supported and the great variety of approaches used in the design of these processors.
CISC and RISC Microprocessors

A question which is commonly asked by both applications programmers and operating system designers is: To what extent does the processor provide specialized instructions that aid in solving the problem at hand? RISC designs generally provide only a minimal set of instructions from which more complex instructions can be constructed. On CISC processors we often find elaborate instructions intended to simplify the programming of frequently occurring specialized operations.

The basic CISC philosophy is to provide an extensive set of instructions covering all sorts of special-purpose needs. Implicit in this attitude seems to be: "You don't have to use them if you don't need them." The contrasting minimal RISC approach is that these fancy instructions are not really used often enough to justify the cost of these extra instructions (the extra complexity in implementing the hardware), and that this complexity tends to slow down the more commonly executed instructions.

In practice, the dividing line is not so clear. For example, floating-point division is an extremely complicated operation that, like other complicated operations, can be programmed using simpler instructions. However, nearly all RISC processors include a floating-point division instruction because in this particular case it is used often enough to justify its inclusion. On the other hand, the sophisticated system-level instructions appearing on some CISC processors, such as those that handle tasking, do not seem to be important enough to be universally included in RISC designs.
REGISTERS, ADDRESSING, AND INSTRUCTION FORMATS

Two of the major issues in the design of an architecture are the register structure of the machine and the set of addressing modes provided by the hardware. Deciding on the structure of the register set generally involves deciding how many registers a processor should have, and the degree to which any or all of the registers should have specialized functions. The latter issue is that of register uniformity, that is, to what extent is one register similar or identical to another register. In the design of a set of addressing modes, the designer must decide which particular addressing modes will be useful, and how they will be specified in the instruction.

Along with the design of the instruction set, both of these issues have a significant impact on the design of the instruction formats of a machine, that is, the exact way in which all of the bits are laid out for each instruction in an instruction set, and their intended use. Some of the bits in an instruction must be used to define the opcode. Other bits are used to define the registers or the memory addresses, or both, that participate in the operation. Another set of bits must be used to specify the addressing modes. For example, when a machine allows direct addressing, that is, the ability to directly reference a memory location as an operand, space must be allocated so that the bit pattern that defines the memory address can be fit into the instruction format.

We will begin by looking at some of the issues involved in the design of register sets and addressing modes, and their impact on the final instruction format of a machine.
Register Sets

The number of registers which are to be included on a machine is a fundamental parameter that has a significant effect on the instruction formats of a processor. The trade-off is very simple: the more registers there are, the more bits are required in the instruction format to reference those registers. For example, on a machine with 32 general-purpose registers, 5 bits in an instruction will be used up each time a register appears as part of the instruction. If a designer wishes to allow register-to-register operations in which some operation is applied to two registers and the result is placed in a third register, then a 16-bit instruction format cannot be used, because 15 of the 16 bits would be taken up, leaving only a single bit for the opcode. There is no hope of being able to fit three of these register operands into a 16-bit instruction format.

The issue of keeping instructions short is one that often comes up. Many of the CISC designs date from the days when memory was relatively expensive, and code density (the number of bytes required to program a given function) was an important consideration. Another factor favoring compact instructions was the concern with execution speed. Instructions must be loaded from memory into the processor to be executed, and the more bytes that are needed for the instructions, the more time this takes.

In more recent RISC designs, there is much less concern over instruction density. In the first place, memory is now much cheaper, and we are no longer horrified by programs that occupy several megabytes of code. Second, modern architectural techniques, including instruction lookahead and caching, have reduced the penalty for loading longer instructions, so it is now feasible to have larger numbers of registers than were previously practical.

Although keeping data in registers generally speeds up processing considerably, the point of diminishing returns is reached fairly rapidly. If one plots the speed of a program against the number of registers which are available, the curve flattens out; that is, after a while a compiler cannot make use of more registers. It seems to be generally agreed that no more than 32 registers are needed at any one point. Also, having a large number of registers is not without some cost, since at least some of them will have to be stored when the processor switches from one task to the next.

The other fundamental issue in the design of a register set is that of register uniformity. The term general register was first used in conjunction with the IBM 360 architecture, referring to a set of registers that can all be used in identical ways.

Why don't all machines have uniform register sets? The main reason is that it is tempting to design instructions in which certain registers have been designated for special purposes. By doing so, it becomes unnecessary to allocate space in the instruction for the register; the use of that register is implied in the instruction. For example, the XLAT instruction on the 386 (used for translating character sets) assumes that one particular register (EBX) points to a translation table and that another particular register (AL) contains the character to be translated. XLAT is only 1 byte long. If it had been designed so that both registers needed to be specified explicitly, it would have needed more bytes. Since code density was a major design point for the 8086, an ancestor of the 386, this lack of uniformity seems like a reasonable trade-off.

Even RISC processors occasionally break the tradition of register uniformity under the same pressures. For example, most RISC processors have a procedure call instruction that stores the return point into one specially designated register. Why choose a particular register to store the return point? In a standard RISC processor with a 32-bit instruction format and a 32-bit address space, it is desirable to have a call instruction with the largest possible range of addressability. Using a dedicated register to hold the return address frees up almost all of the 32 bits in the instruction format to hold the address; allowing different registers to be specified would reduce the range of addressability by 5 bits (for a 32-register machine).

Non-uniform registers are a particular menace to compiler writers. In writing the code generator for a compiler, you want to be able to treat the set of registers as a pool of interchangeable resources. Compilers typically are written so that there is a routine whose responsibility is to allocate registers. It is much easier to write this routine if the compiler does not need to deal with requests such as: "I need a register, and it has to be either special register SI or DI; none of the others will do." If every instruction has its own idiosyncratic set of register requirements, then the problem of allocating register use in an optimal manner becomes very much more complicated, and typically the result is that it simply isn't attempted.
Addressing Modes

In choosing a set of addressing modes, the issues of complexity versus utility arise just as in the case of register set size. As we shall see in a later section of this chapter that describes the relationship between high-level programming languages and addressing modes, the CISC tradition has been to include increasingly complex addressing modes that directly support the use of high-level languages. For example, a compiler writer will recognize one addressing mode as the one to be used for addressing variables local to a (recursively callable) procedure, and another addressing mode as the one to be used for accessing global variables.

Just as increasing the number of registers on a machine increases the size of an instruction, increasing the number of addressing modes may have the same effect. That trade-off has been resolved differently on different machines. In particular, we will see that the 68030 has a rich variety of addressing modes, which results in an instruction size that can vary widely, while the RISC processors all carefully restrict them to a small but important set so that they will all fit into a 32-bit instruction format.

Whether an addressing mode is important or not is quite application-dependent. The first high-level languages to be used extensively in the United States were FORTRAN and COBOL. Both languages have an essentially static view of data, which means that a smaller and simpler set of addressing modes is necessary. For example, it is not so important to provide double indexing, the ability to add two registers in a single instruction to form an address, since FORTRAN array accesses do not need this kind of addressing. In Europe, on the other hand, ALGOL 60 was much more popular. ALGOL 60 has a much more complicated addressing structure, involving the use of a stack to manage recursion. Some of the early European machines had more complex addressing mechanisms reflecting this emphasis. On the home doorstep, Burroughs was a great fan of ALGOL and built machines that reflected this attraction.

These days, stack-based languages, including C, Ada, and Pascal, are in common use, and furthermore, they all support dynamic storage allocation. Modern CISC designs especially reflect anticipated use of more complicated addressing modes that arise from the use of a stack and dynamically allocated data.
Designing Instruction Formats

In designing instruction formats, there are two extreme positions. One approach is to have a very small number of formats and fit all the instructions into this small set. The other approach is to design an optimal format for each instruction. Roughly speaking, RISC designers take the first approach, and CISC designers tend more to the second, although even in CISC processors there will be a degree of uniformity in that very similar instructions might as well have very similar formats.

A fundamental decision has to do with the size of the opcode, that is, the number of bits reserved for indicating the particular operation to be performed. Obviously, if more bits are used, then more distinct instructions can be supported. The cleanest approach is to use a fixed number of opcode bits for all instructions. Interestingly, although RISC processors do have uniform instruction sets, they are not quite that uniform, whereas there have been some CISC designs in the past (notably the IBM 360) which always used an 8-bit opcode.

In practice, a designer will recognize that certain operations are much more common than others and react by adjusting the number of opcode bits appropriately. For example, if we have determined that only 4 bits are necessary to represent the most commonly used instructions, then 16 possible bit patterns are available. Fifteen of these are used for the most common 15 instructions, and the sixteenth is used to indicate all of the other instructions. Additional bits then need to be allocated elsewhere in the instruction format so that these less common instructions can be distinguished. In the CISC designs, this sort of principle is carried to extremes. For example, the number of opcode bits in 80386 instructions ranges from 5 to 19. On the other hand, RISC machines tend to have fewer instructions, so fewer opcode bits are needed.

Since various operations need different numbers and kinds of operands, space can be saved if the layout of instructions is specialized to the particular needs of the instruction. Furthermore, once this typical CISC philosophy is followed, there is no particular requirement that different instructions have similar operand structures. For example, the CAS2 instruction on the 68030 (a very complicated beast which we will dissect in detail in Chapter 6) takes six operands, but they are all registers, so the entire instruction with its operands can be fit into a specialized 48-bit format.

On the other hand, RISC designs strongly favor a small number of uniform instruction formats, preferably all of the same size. The regularity of these formats simplifies the instruction-decoding mechanism, and means that a technique known as pipelining can be used. One aspect of pipelining is that several instructions will typically be in the pipeline, allowing the overlapped decoding and execution of several instructions. This kind of overlapped decoding becomes much more difficult for the numerous and complex instruction formats of CISC processors.
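To make the bit-budget arithmetic of the last two sections concrete, here is a small sketch in C of a purely hypothetical 32-bit register-to-register format: a fixed opcode field plus three 5-bit register fields, enough for 32 general registers, with bits to spare. The format, the field positions, and the encode routine are illustrative assumptions of ours, not the encoding of any processor discussed in this book.

    #include <stdio.h>

    /* A hypothetical 32-bit register-to-register format, used only to
       illustrate the bit budget discussed above: a 7-bit opcode and three
       5-bit register fields still leave spare bits, whereas the same three
       register fields alone would nearly exhaust a 16-bit word.            */
    static unsigned encode(unsigned opcode, unsigned rd, unsigned rs1, unsigned rs2)
    {
        return (opcode & 0x7F) << 25   /* 7-bit opcode  : bits 31..25 */
             | (rd     & 0x1F) << 20   /* 5-bit dest reg: bits 24..20 */
             | (rs1    & 0x1F) << 15   /* 5-bit source 1: bits 19..15 */
             | (rs2    & 0x1F) << 10;  /* 5-bit source 2: bits 14..10 */
    }                                  /* bits 9..0 remain unused here */

    int main(void)
    {
        /* "ADD r3, r1, r2" in the hypothetical format, opcode 0x01. */
        printf("encoded instruction: %08X\n", encode(0x01, 3, 1, 2));
        return 0;
    }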
DATA REPRESENTATION

The issue of data representation has been complicated by the variety of conventions used in different manufacturers' hardware. Machines have had different word lengths, different character sets, different ways of storing integers and floating-point values, etc. Most microprocessors, on the other hand, have a similar view of how various data types should be stored. This is one area where CISC and RISC designers have few disagreements. Since the data representations are so similar, we will treat them here in Chapter 1, with the understanding that they will apply with only minor modifications to all of the remaining chapters of this book.
Through
the years, the
methods used
have varied widely, but
to represent characters
number of bits and then and characters The number of bits
the basic approach has always been the same: c hoose a fixed
designate a correspondence between bit patterns
chosen limits the
.
number of distinct characters that can be represented. For used on a number of earlier machines such as the CDC 6600,
total
example, 6-bit codes,
allow for 64 characters. This
is
enough
uppercase
to include the
One
selection of special characters, but not the lowercase letters.
CDC
heard a
lowercase;
it
salesman proclaim in the mid 1970s that really isn't
an
issue."
Times have
8-bit codes allowing lowercase letters
is
now
letters, digits,
and
a
of the authors once
"None of our customers need
certainly changed,
and the use of 7- or
universal.
Although IBM has persisted in the use of their own EBCDIC code for character computer world has standardized on the use of the ISO
representation, the rest of the
(International Standards Organization) code This exists in several national variants, .
a nd the variant used in the for
United States
is
called ASCII, the
Information Interchange. All the microcomputers that
the code for character representation. This
which
is
a
The
fits
well with the basic
use of ASCII is
a set
is
usually not
assumed
in the design
CISC designs, fancy instructions
and movjng strings of 8-bit characters, However, there of most processors that sents. In ASCII, this
is
One example
is
is
is
of the processor design. for scanning,
nothing
is
.
A single EDIT
the
EDIT
instruction
know about
on the IBM 370, which
instruction, for example, can convert the integer
None
of the microprocessors
we
character
in a single
COBOL 123456
character string $123,456.00, with the resulting output being represented in characters.
,
not the processor's concern.
processors do have instructions that
nstruction implements the kind of picture conversion that appears in
grams
comparing
in the instruction set
concerned with what particular character 01000001 repre-
the code for uppercase A, but that
Some mainframe
,
.
of instructions for manipulating arbitrary 8-bit quantities,
including in the case of some
i
memory organization
sequence of 8-bit bytes, each of which can be separately addressed
Instead, there
codes.
American Standard Code will look at use ASCII as
we
EBCDIC
describe in this text, not even the
processors, have instructions of that level of complexity,
and
completely neutral with respect to the choice of character
pro-
to the
CISC
their instruction sets are
sets
.
It
would thus be
quite
».
10
MICROPROCESSORS
implement an EBCDIC-oriented system on
possible to
even
IBM
a microprocessor, although not
has indulged in such strange behavior.
We should note that 8 bits
not enough for representing characters
is
languages like Japanese and Chinese.
Not only do such languages
larger character sets, but even in English, the increased use
do
full set
of characters, but
using a variety of fonts. Both requirements lead to the need for larger
it
character sets
character
other
much
of desktop publishing and
fancy displays means that one wants not only to represent the also to
sets for
require very
—
sets.
in the future
Luckily,
we
of 16- or 32-bit quantities, so Japanese
is
the
will
probably see increasing use of 16, or even 32,
most microprocessors
home
are equally at
bit
manipulating strings
in this respect they are built for the future.
main focus of these
efforts since
Japan
is
prominent
so
in the
computer field. The issue of character sets is perceived as an international problem that needs a smooth international solution. Japan itself is most interested in having international standards to solve such problems. At a recent meeting at which the issue of representing Japanese characters in Ada was discussed, the Japanese delegate to the relevant ISO committee explained that Japan is concerned with complaints from other countries over non-tariff barriers to imports.
A Japanese
standard that
tionally accepted can be regarded as being a non-tariff barrier.
It is
is
not interna-
interesting that an
international political conflict can ultimately affect the representation of character
codes on microprocessors!
Representation of Integers

These days, everyone agrees that storing integers in binary format is a good idea. But it hasn't always been so! Early on there was quite a constituency of decimal machines, especially in the days when tubes were used to build computers. In those days, it was cheaper to build one 10-state tube than to build four binary-state tubes.² In most scientific programming, integers are stored in binary format, which allows for more efficient handling of computations. Even in commercial applications written in COBOL, the COMPUTATIONAL format allows programmers to specify the use of binary format for quantities that will be used for extensive computations.

² The first author's uncle worked for Plessey's (a large computer firm in England) at one time and was involved with their very first computer. This machine (called the PEP) had registers for representing numeric quantities consisting of a row of decimal devices, followed by a binary device, a decimal device, and a 12-state device (it was called a duo-decatron). Some might have forgotten that that is a reasonable format for pounds, shillings (which went up to 20) and pence (which went up to 12), because the British long ago changed to a decimal money system. This is a remarkable case of hardware that really knew what its domain was going to be!

For unsigned integers, the binary representation is obvious. Successive bits indicate powers of 2, and the most significant bit is first (at the left). For example, the decimal integer 130 is 10000010 when it is stored as an 8-bit binary value and 00000000 10000010 when it is stored as a 16-bit binary value.

From time to time, some mathematicians have tried to persuade the world that we should write integers the other way around (least significant digit first). Even Alan Turing, the famous British computer scientist, always wrote numbers the "wrong" way, but he did not manage to convince computer scientists to follow his lead! It is more a matter of convention that we regard the most significant bit as being on the left, because, of course, left and right have no real significance on a silicon chip. Nevertheless, the convention is so universal that we always think of integers being stored this way.

There are several ways of representing signed integers, but all the microprocessors we look at use the two's complement approach, so this is the only representation that we need to look at in detail. The two's complement representation can be quite confusing, even for those who know it quite well. In looking at instruction set architectures, we need to have a clear understanding of when and why we need separate instructions for signed and unsigned numbers.

To keep our examples simple, we will for the moment assume that unsigned and signed integers are represented using four bits. In a 4-bit register, unsigned values range from 0 to 15, and signed two's complement values range from minus 8 to plus 7, as shown in Table 1.1. Bit patterns starting with a zero bit on the left represent the same values in both the unsigned and signed case. Bit patterns starting with a one bit are interpreted as negative numbers in the signed case. The starting, or leftmost, bit is called the sign bit.

TABLE 1.1
The representation of signed and unsigned 4-bit values

    Bit Pattern    Unsigned Value    Signed Value
    0000                 0               +0
    0001                 1               +1
    0010                 2               +2
    0011                 3               +3
    0100                 4               +4
    0101                 5               +5
    0110                 6               +6
    0111                 7               +7
    1000                 8               -8
    1001                 9               -7
    1010                10               -6
    1011                11               -5
    1100                12               -4
    1101                13               -3
    1110                14               -2
    1111                15               -1

Whether or not a bit pattern whose sign bit is set to 1 is to be regarded as negative (i.e., whether the number is to be regarded as signed or unsigned) is something that is up to the programmer; you cannot look in a register, see the sign bit set, and know that a negative number is present. For example, suppose that a register contains the bit pattern 1101. This may represent either 13 or minus 3, and it is the logic of the program which determines how it is to be interpreted.
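As a small illustration of this point (a C sketch of ours, not taken from the book), the same 4-bit pattern 1101 prints as 13 or as minus 3 depending entirely on how the program chooses to interpret it; the conversion simply applies the rule visible in Table 1.1 that a pattern with the sign bit set stands for its unsigned value minus 16.

    #include <stdio.h>

    int main(void)
    {
        unsigned pattern = 0xD;                   /* the 4-bit pattern 1101   */

        int as_unsigned = (int)pattern;           /* treat it as unsigned: 13 */
        int as_signed   = (pattern & 0x8)         /* sign bit set?            */
                            ? (int)pattern - 16   /* yes: value is 13 - 16    */
                            : (int)pattern;

        printf("pattern 1101: unsigned %d, signed %d\n", as_unsigned, as_signed);
        return 0;
    }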
For these potentially negative numbers, the signed and unsigned interpretations always differ by 2^k, where k is the number of bits, that is, by 16 in the case of 4-bit numbers. This is important to note, because it explains why the operations of addition and subtraction work for both signed and unsigned values. Consider the operation:

    0010 + 1101 = 1111

If the numbers are regarded as unsigned, this is adding 2 to 13 to get 15. If the numbers are regarded as signed, this same addition is adding +2 to -3 to get -1. What is really happening is that the normal binary addition is addition mod 2^k, that is, factors of 16 are simply ignored. Since the signed and unsigned values differ by 16, the resulting bit patterns are the same in the unsigned and signed case. In designing instruction sets, we only need one set of addition and subtraction instructions, which can then be used for signed or unsigned operands at the programmer's choice.

The one difference between signed and unsigned addition arises in detecting overflow. Consider the addition:

    0111 + 0111 = 1110

Considered as unsigned, this adds 7 to 7 to give a result of 14. However, if the operands are interpreted as signed, we are adding +7 to +7 and getting a result of -2, which is clearly wrong. The mathematical result has the wrong sign and is 16 different from the true result. What we have here is an addition that from the signed point of view causes arithmetic overflow.

A programmer will often want to be able to detect arithmetic overflow for signed values. The programming language Ada requires that these overflows be detected, since an overflow can raise an exception known as a CONSTRAINT_ERROR, which can be handled by the program. Processors take one of two possible approaches to satisfying this requirement. Either they provide two sets of addition and subtraction instructions, which differ only in the detection of overflow, or they provide one set of instructions which set two separate flags: a carry flag which detects unsigned overflow, and a separate signed overflow flag for the signed case.
there are
two
single length, then the resulting bit patterns are, like addition
same
for the
0001
x
unsigned and signed =
1111
1
1
cases.
1111
For unsigned operands,
we have + times —
cases. If the result
and subtraction, the
we have
1
times 15 giving a result of 15. For the signed case,
giving a result of — 1 As with addition and subtraction, the overflow .
conditions are different, but the resulting bit patterns are the same, so only one single-length multiplication operation
Many machines
is
required.
also provide a multiplication instruction
which
length result. In this case, the signed and unsigned cases are different:
0001
x
1111
=
00001111 (unsigned case)
0001
x
1111
=
11111111
(signed case)
gives a double-
3
DATA REPRESENTATION
This means that
double length result multiply instruction
if a
is
provided,
it
1
should be
provided in two forms, signed and unsigned. Given only one of these two possible forms, the result for the other can be obtained with only moderate effort, but
much more convenient
certainly
Division
is
to
it is
have both.
different in the signed
and unsigned
cases even
where
all
operands are
single length:
If a
1110 - 1111
= 0000
(unsigned case)
1110 - 1111
= 0010
(signed case)
machine provides divide
should be provided.
It is
instructions, then separate signed
and unsigned forms
quite difficult to simulate one of these results given only the
other instruction. For example, simulating unsigned division given only a signed divide instruction
is
unpleasant.
The final operation
to be considered
signed and unsigned operands
is
comparison. Here again the situation with
is
obviously different:
1110 > 0001
(as unsigned values)
1110 < 0001
(as signed values)
As with addition and subtraction, there are two approaches that can be taken. Either two sets of comparison instructions must be provided, or a single set of comparison instructions is used which sets two sets of flags, and then there are two sets of conditional branch instructions, one giving the effect of unsigned comparisons, and the other for signed comparisons.
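The same points can be checked with a small C fragment, again purely illustrative; the 4-bit patterns are simply widened to 16 bits so that ordinary C arithmetic can be used:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t a = 0x0001, b = 0xFFFF;   /* the 0001 and 1111 patterns, widened */

        /* Double-length multiply: unsigned and signed products differ. */
        uint32_t uprod = (uint32_t)a * (uint32_t)b;                  /* 0000ffff    */
        int32_t  sprod = (int32_t)(int16_t)a * (int32_t)(int16_t)b;  /* ffffffff, -1 */

        /* Division and comparison differ too (the 1110 and 0001 patterns). */
        uint16_t c = 0xFFFE, d = 0xFFFF, e = 0x0001;
        printf("products:  unsigned %08x  signed %08x\n", uprod, (uint32_t)sprod);
        printf("quotients: unsigned %u  signed %d\n",
               (uint32_t)c / (uint32_t)d, (int32_t)(int16_t)c / (int32_t)(int16_t)d);
        printf("compare:   unsigned c %s e   signed c %s e\n",
               (uint32_t)c > (uint32_t)e ? ">" : "<",
               (int16_t)c > (int16_t)e ? ">" : "<");
        return 0;
    }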
SIGN-EXTENSION. To move an unsigned number to a larger field involves extending the value with zero bits on the left. For instance, if an 8-bit memory location contains an unsigned value in the range 0 to 255, then this value can be loaded into a 32-bit register by supplying 24 zero bits on the left. If a signed value must be extended in size, then the sign bit must be copied into the extra bits on the left. This process is called sign extension. For example, if the 4-bit pattern 1100 must be extended to 8 bits, then the result is 11111100. There are various approaches to providing sign extension capabilities. Some processors have specific instructions for sign extending values. If there are no specific instructions, then sign extension is usually achieved using the arithmetic right shift instruction, which propagates sign bits, as in the following example:

    Byte value in memory:                     10101010
    Load into 32-bit register zero extended:  00000000 00000000 00000000 10101010
    Shift left 24 bits:                       10101010 00000000 00000000 00000000
    Shift right arithmetic 24 bits:           11111111 11111111 11111111 10101010
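The shift sequence above can be written as a short C sketch; note that this relies on the compiler using an arithmetic shift for signed right shifts, which is the usual but not guaranteed behavior, so treat it as illustration rather than portable code:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t r = 0xAAu;            /* byte 10101010 loaded zero-extended          */
        int32_t  s;

        r <<= 24;                      /* 10101010 00000000 00000000 00000000         */
        s = (int32_t)r >> 24;          /* arithmetic shift propagates the sign bit    */
                                       /* (typical two's complement compilers only)   */
        printf("%08x\n", (uint32_t)s); /* prints ffffffaa                             */
        return 0;
    }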
ADDRESS ARITHMETIC. One important use of unsigned arithmetic is in computing addresses. On the 32-bit microprocessors discussed in this book, address arithmetic uses 32-bit unsigned addition and subtraction.

Unsigned arithmetic has "wrap-around" semantics, which means that carries are ignored. An important consequence is that the effect of signed offsets can be achieved without signed arithmetic. For instance, if an addressing mode provides for the addition of an offset, then adding an offset of all one-bits has the effect of subtracting one. Even though the address arithmetic is unsigned, the offsets can be regarded as signed, since signed and unsigned addition gives the same results.

For the same reason, sign extension of offsets also makes sense, even though the address arithmetic is unsigned. A common arrangement is to provide short offset fields which are then sign extended before being added into the address. For example, an 8-bit offset field is first sign extended to 32 bits, and then the result is added to the address with an unsigned addition. We stress that address arithmetic is unsigned since arithmetic overflow is not relevant for address computation; we don't want any kind of overflow error conditions to be signalled as a result of computing addresses.

MULTIPLE-PRECISION ARITHMETIC. Arithmetic on integers that can't fit into registers can be handled naturally by software routines on the processor, using algorithms similar to those used by most humans to do arithmetic on long numbers. Most processors we will look at have some support for assisting in writing such routines. For addition and subtraction, a carry indication and special versions of the add and subtract operations that include the carry from a previous stage are needed, and we find such instructions even on most RISC processors. For multiplication and division, we need double-length operations, and some RISC machines don't even have single-length multiply and divide, so we don't necessarily get much help when it comes to multiple-precision multiply and divide.
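A software routine of this kind might look roughly like the following C sketch; the function name is invented, and a real routine would of course use the processor's add-with-carry instruction rather than recomputing the carry by hand:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two multi-word unsigned integers, least significant word first,
       propagating the carry from each stage the way ADC would be used. */
    static void multi_add(const uint32_t *a, const uint32_t *b,
                          uint32_t *sum, int nwords)
    {
        uint32_t carry = 0;
        for (int i = 0; i < nwords; i++) {
            uint32_t s = a[i] + b[i];
            uint32_t c = s < a[i];            /* carry out of the word addition   */
            sum[i] = s + carry;
            carry  = c + (sum[i] < s);        /* carry out of adding the old carry */
        }
        /* a carry left over here would mean the result needs one more word */
    }

    int main(void)
    {
        uint32_t a[2] = { 0xFFFFFFFFu, 0x00000001u };   /* low word first */
        uint32_t b[2] = { 0x00000001u, 0x00000002u };
        uint32_t s[2];
        multi_add(a, b, s, 2);
        printf("%08x %08x\n", s[1], s[0]);              /* prints 00000004 00000000 */
        return 0;
    }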
Packed Decimal

With current design techniques, it is more reasonable to store decimal data as a sequence of four binary bits than in a single 10-state device. This is a very standard data format that is called packed decimal. If you have 4 binary bits per decimal character with the obvious binary encoding, then the decimal integer 13 looks like 0001 0011 in binary.

This format is important, because computer languages intended for commercial processing must be able to deal with numbers in decimal format if a program is going to do mostly I/O operations and relatively little arithmetic. The conversion of binary to decimal (and vice versa) is a rather expensive operation whether it is done in hardware or in software. Adding two packed decimal numbers, on the other hand, is less efficient than adding two's complement integers, but not terribly so. Multiplication and division of packed decimal numbers is not nearly as efficient, but since these operations may not be performed as frequently as addition and subtraction, this may not be an important concern. If all that is done is a little bit of addition and subtraction and a small amount of other arithmetic, it may be attractive to store integers in decimal format, since it will greatly improve the efficiency of input/output operations.

When arithmetic is done on integers in this packed decimal format, it is nice if the hardware provides instructions that support this format. Full-scale CISC machines like the IBM 370 have instructions that add two packed decimal numbers, each with 16 digits, giving a 16-digit result. All of this is done in a single hardware instruction.

Of course, all microprocessors are capable of operating on packed decimal numbers using software. Even on some of the RISC processors that have absolutely no specialized support for packed decimal, the speeds of these software-supported operations are not much slower than the hardware instructions on the IBM mainframes. In the case of the 80386 and 68030, we do not have full-blown decimal arithmetic, but there is a small set of instructions to assist in writing software routines of this type.
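As an illustration of what such a software routine involves, here is a rough C sketch of packed decimal addition, one digit at a time with a decimal carry; the names and the two-byte operands are invented for the example:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two packed decimal numbers, two digits per byte, least significant
       byte first, the way a software routine does it without hardware help. */
    static void bcd_add(const uint8_t *a, const uint8_t *b, uint8_t *sum, int nbytes)
    {
        int carry = 0;
        for (int i = 0; i < nbytes; i++) {
            int lo = (a[i] & 0x0F) + (b[i] & 0x0F) + carry;
            carry = lo > 9;
            if (carry) lo -= 10;
            int hi = (a[i] >> 4) + (b[i] >> 4) + carry;
            carry = hi > 9;
            if (carry) hi -= 10;
            sum[i] = (uint8_t)((hi << 4) | lo);
        }
    }

    int main(void)
    {
        uint8_t a[2] = { 0x13, 0x00 };   /* decimal 13, low byte first */
        uint8_t b[2] = { 0x29, 0x00 };   /* decimal 29                 */
        uint8_t s[2];
        bcd_add(a, b, s, 2);
        printf("%02x%02x\n", s[1], s[0]);   /* prints 0042 */
        return 0;
    }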
Floating-Point Values

The formats used to represent floating-point numbers have been as numerous as the variety of machines which have supported them. One unpleasant consequence of this variety is that it has created an incompatible mess of hardware where floating-point calculations have yielded slightly, or in some cases completely, different results as they were moved from one machine to another.

The IEEE P754 standard for floating-point arithmetic, approved and published in 1985, attempted to remedy this situation by specifying a uniform method for storing and operating on floating-point data. Although it has been widely recognized as specifying a highly desirable approach, it still has not been universally adopted. Too much hardware has been built using proprietary formats such as those of IBM and DEC. However, in the microprocessor world, the IEEE standard appeared at a critical point just as Intel was designing the first commercial floating-point coprocessor chip, the 8087. This chip is not quite 100% compatible with the standard, because there were a few last-minute changes in the standard that were made just after the 8087 was designed. However, all subsequent microprocessor floating-point chips, including the 80287 and 80387 follow-ons to the 8087, are compatible with the IEEE standard.

The details of how floating-point values are stored and manipulated are quite complex. We devote the whole of Chapter 5 to this subject, reflecting the fact that floating-point calculations are extremely important in the microprocessor world. In the case of engineering workstations, floating-point performance is critical, and more and more mundane applications like high-definition television and top-end video games rely on efficient and accurate floating-point operations.
MEMORY ORGANIZATION

Almost all microprocessors organize memory into 32-bit words, each of which is divided into four 8-bit bytes. These bytes can be individually addressed, so for some purposes one can equally well regard the memory as being logically composed of a sequence of 8-bit bytes. The two ways in which the various processors differ are the order in which successive bytes of multiple-byte quantities are stored and whether such quantities must be aligned on specific boundaries.
Big-Endian vs Little-Endian Byte Ordering

The organization of memory into bytes means that the ordering of these bytes needs to be addressed. As English speakers, we normally think of data as being arranged left to right, rather than right to left. When we think of successive bytes in memory, we think of lower-numbered bytes as being to the left of higher-numbered bytes. For example, we think of a 32-bit number in memory occupying bytes 0 to 3 laid out as:

    | byte 0 | byte 1 | byte 2 | byte 3 |
    |<------------- 32 bits ----------->|

When a number is stored in a register, we think of the most significant bit being on the left and the least significant bit being on the right, because this is the way numbers are represented in English:

    high |<------------- 32 bits ------------->| low

It is natural to assume that when a 32-bit number is loaded from memory, the high-order bit of the number is the leftmost bit of byte 0 and the low-order bit of the number is the rightmost bit of byte 3:

    | byte 0 | byte 1 | byte 2 | byte 3 |
    high <---------- 32 bits --------> low

This picture corresponds to big-endian byte ordering, where the "big end," or the most significant byte, is stored in the lowest addressed byte in memory. Many processors do indeed store multibyte quantities in memory in this manner.

However, the apparent naturalness of this ordering is, of course, simply dependent on our writing customs. Arabic is written right to left, but numbers are still written left to right. (Train schedules in the Casablanca station, for instance, have familiar times, but the departure is on the right of the board and the destination is on the left, which is most confusing for Western readers!) Arab readers might therefore find it more natural to write the above picture in the following manner:

    | byte 3 | byte 2 | byte 1 | byte 0 |
    high <---------- 32 bits --------> low

and might therefore naturally expect to find the high-order bit in the leftmost bit of byte 3 and the low-order bit in the rightmost bit of byte 0. This picture corresponds to little-endian byte ordering, where the "little end" of the number is stored in the lowest memory byte. The reason we mention Arabic here is to emphasize that there is nothing inherently natural in choosing one ordering over the other.

You may hear people call little-endian ordering "backwards," since they are determined to think of memory as being organized left to right, and so they think of the little-endian picture as:

    | byte 0 | byte 1 | byte 2 | byte 3 |
    low <----------- 32 bits --------> high

However, at the hardware level, there is no left and right, and even the convention of thinking of the register as having the most significant byte on the left is purely arbitrary.
For various historical reasons, both kinds of byte ordering are found in currently available microprocessors. We will find four different approaches:

•  Processors like the Intel 80386, which always use little-endian byte addressing.

•  Processors like the Motorola 68030, which always use big-endian byte addressing.

•  Processors like the MIPS 2000, where a signal at reset time determines whether big- or little-endian addressing is to be used, and the mode is then never subsequently changed.

•  Processors like the Intel i860, where there is a software instruction to change backwards and forwards between the two modes while a program is running.

From a programming point of view, it generally does not matter very much which type of addressing we have, although there are times when we certainly have to be aware of the endian mode. In particular, if we transfer a binary file containing binary data from a PC, which is 386-based, to a Sun-3, which is 68030-based, we often have considerable trouble. The data formats of the two processors, that is, the way integers and characters are represented, are generally identical except for the annoying difference in endianness. Furthermore, there is no set algorithm for the conversion when binary data is passed between machines; it is data dependent. Consider the case of a record containing a 4-byte field, F1, followed by two 2-byte fields, F2 and F3 (see Figure 1.1). From this picture, we can see that the pattern of byte swapping required to convert from one format to the other is dependent on a detailed knowledge of the data layout.

There are a few cases where one of the orderings is more convenient than the other. For example, if a dump of memory is displayed byte by byte left to right, big-endian ordering is more convenient (but Arabic-speaking programmers might find the situation exactly reversed). Generally, it doesn't matter which ordering is used, but unfortunately, there is no hope of agreement, since both camps are well established and each regards the other as being hopelessly backwards.

FIGURE 1.1  Converting a record from big- to little-endian.
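The conversion of Figure 1.1 can be sketched in C as follows; the struct layout and field values are hypothetical, and a real conversion routine would also have to worry about alignment and padding:

    #include <stdint.h>
    #include <stdio.h>

    /* Convert the record of Figure 1.1 between big- and little-endian form.
       The swapping pattern depends on knowing the layout: the 4-byte field
       is reversed as a group of four, each 2-byte field as a group of two. */
    struct rec {
        uint32_t f1;
        uint16_t f2;
        uint16_t f3;
    };

    static uint32_t swap32(uint32_t x)
    {
        return (x >> 24) | ((x >> 8) & 0x0000FF00u)
             | ((x << 8) & 0x00FF0000u) | (x << 24);
    }

    static uint16_t swap16(uint16_t x)
    {
        return (uint16_t)((x >> 8) | (x << 8));
    }

    static void swap_record(struct rec *r)
    {
        r->f1 = swap32(r->f1);
        r->f2 = swap16(r->f2);
        r->f3 = swap16(r->f3);
    }

    int main(void)
    {
        struct rec r = { 0x11223344u, 0x5566, 0x7788 };
        swap_record(&r);
        printf("%08x %04x %04x\n", r.f1, r.f2, r.f3);   /* 44332211 6655 8877 */
        return 0;
    }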
Big-Endian vs Little-Endian Bit Ordering

When a binary value is stored in a register, we normally think of the most significant bit as being on the left, just as we think of bytes as being laid out left to right. Although this is an arbitrary convention, it is well established, and the pictures and diagrams throughout this book, and indeed throughout the reference manuals for all the processors we discuss in this book, use this convention.

However, there remains an issue of whether the bits in the register are numbered from left to right or from right to left. The left-to-right ordering means that bit 0 is on the left (the most significant bit), and bit 31 is on the right (the least significant bit):

    0                              31
    |<---------- 32 bits --------->|
    procedure THINK is
       X, Y, Z : INTEGER;
       A       : array (1..100) of INTEGER;
    begin
       X := A(Y);
    end THINK;

The addressing of A(Y) involves both using the frame pointer as a base pointer and using Y as an index (see Figure 1.6). Computing the address of A(Y) involves three elements: the base address, in this case the frame pointer; the starting offset, which is typically known at compile time; and the index value, which may need scaling. Some processors provide this type of base-index addressing, sometimes called double indexing, since, as we observed before, the functions of base registers and index registers are similar. On such processors, the fetching of A(Y) corresponds to a single load instruction. Other processors not providing this double indexing feature may require a sequence of instructions in which the necessary indexing address, that is, the sum of the frame pointer and the scaled index, must be computed and placed in an index register so that single indexing can be used.

FIGURE 1.6  Use of based plus index addressing to address arrays allocated within stack frames.

It is also important to note that if a compiler needs to generate code to access an element of an array allocated in a stack frame, there are really three
components involved: the frame pointer, the starting offset of the array, and the (possibly scaled) index. As we shall see when we compare the 386 and the 68030 to the RISC chips, only the CISC processors provide an addressing mode which allows one to access such an array element in a single instruction. Some RISC chips do have the double indexing, but none of them allow a programmer to add two registers as well as a constant displacement to form an address. This is a consequence of the instruction formats, which are in turn a consequence of the decision to use pipelining, and is consistent with the philosophy of keeping things simple.
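As a rough illustration, the three components can be combined in C as shown below; the frame pointer value, offset, and scale are made-up numbers, and a compiler would emit this as one or several machine instructions rather than as C:

    #include <stdint.h>
    #include <stdio.h>

    /* The three components of the address of A(Y) in procedure THINK:
       the frame pointer (base), the compile-time offset of A within the
       frame, and the scaled index. On a processor with base-plus-index
       addressing this is folded into a single load instruction. */
    int main(void)
    {
        uint32_t frame_pointer = 0x7FFF0000u;   /* base register              */
        uint32_t offset_of_A   = 16;            /* known at compile time      */
        uint32_t y             = 5;             /* index value                */
        uint32_t scale         = 4;             /* bytes per INTEGER element  */

        /* A is declared (1..100), so the index is adjusted by one. */
        uint32_t address = frame_pointer + offset_of_A + (y - 1) * scale;
        printf("address of A(Y) = %08x\n", address);
        return 0;
    }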
Indirect Addressing

When parameters are passed to procedures, the value passed and stored for use by the called procedure is often the address of the actual parameter, rather than a copy of the value of the parameter. In some cases, a programming language may require the use of this method of passing parameters (e.g., the VAR parameters of Pascal). In other cases, the method of passing parameters, called call by reference, is optional. Consider the case of the FORTRAN procedure:

    SUBROUTINE QSIMPLE (I)
    I = I + 1
    END

Within QSIMPLE, the value stored for the parameter is not the value of I, but the address of I. This means that when I is referenced, there is an extra step of fetching the address of I and then dereferencing it (see Figure 1.7(a)). Obviously the reference to I can be achieved by first using an instruction to load the address of I into a base register and then using based addressing (with an offset of zero) to access the actual value of I. However, some processors provide an addressing mode called indirect addressing which in a single instruction first fetches the pointer to I and then uses this pointer to fetch (or store) I. Of course, this still takes an extra memory data reference, but an extra instruction is not required.
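The C equivalent is sketched below; C passes the address explicitly through a pointer parameter, which makes the extra dereference visible in the source:

    #include <stdio.h>

    /* The C version of QSIMPLE: the caller passes the address of I, and every
       reference inside the routine goes through that pointer, which is the
       extra memory reference described above. */
    static void qsimple(int *i)
    {
        *i = *i + 1;       /* fetch the pointer, then fetch/store through it */
    }

    int main(void)
    {
        int i = 41;
        qsimple(&i);       /* pass the address of I, not its value */
        printf("%d\n", i); /* prints 42 */
        return 0;
    }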
Indirect Addressing with Indexing

If the parameter being passed is an array, then indirect addressing and indexed addressing must be combined to access an element of the array. If in the above FORTRAN example the parameter had been an array:

    SUBROUTINE QARRAY (D)
    DIMENSION D(100)
    D(I) = D(I) + 1
    END

then accessing D(I) would involve getting the address of D and then indexing it with the subscript I (see Figure 1.7(b)).

FIGURE 1.7  Indirect addressing of a simple variable, and an array.

As this gets more complicated, the issue of whether to provide a single addressing mode that handles this case becomes more contentious. Only one of the processors we look at (the Motorola 68030) has this addressing mode built in. On other processors, a sequence of two or more instructions is needed to access an indirect array element.
INDIRECT ADDRESSING WITH BASING. In our examples of indirect addressing so far, the pointer has been allocated statically. However, in a stack-based language, the pointer word itself may be allocated on the stack, and thus base addressing is required to access it. Written in Pascal instead of FORTRAN, QSIMPLE is:

    procedure QSIMPLE (var I : INTEGER);
    begin
       I := I + 1;
    end;

then the parameter passed for I would be a pointer to I, and this pointer would be stored in the stack frame for QSIMPLE (see Figure 1.8). Now addressing I involves first adding an offset to a base pointer to get the pointer to I, and then using this pointer to access the value of I. Again we could do this with a sequence of instructions using simpler addressing modes, but some processors have this addressing mode built in.

FIGURE 1.8  Using indirect addressing with based indexing.
INDIRECT ADDRESSING WITH BASING AND INDEXING. For the grand finale, consider the case where an array is passed as a parameter in a stack-based language. Suppose that we had written procedure QARRAY in Ada instead of FORTRAN:

    procedure QARRAY (D : in INTARRAY) is
    begin
       ...
       D(I) := D(I) + 1;
       ...
    end QARRAY;

then the parameter passed for D would be a pointer to the array, and this pointer would be stored in the stack frame for QARRAY (see Figure 1.9). Now the access to an element of D involves three steps: first we use based addressing to get the pointer to D; then we dereference this pointer; finally we use base plus index addressing, using the pointer as the base and the subscript as the index. This is getting quite complicated, and relatively few processors (just one among our examples, the Motorola 68030) have a specialized addressing mode allowing a single instruction to be used for this access. On other processors, accessing an element of D may take up to four instructions.

FIGURE 1.9  One use of indirect addressing with basing and indexing.
Even More Complicated Addressing Modes

It is possible to write data structures and accesses in high-level languages corresponding to arbitrarily complicated addressing sequences:

    type
       A    = array (1..10) of INTEGER;
       REC2 = record
          ...
          AA : A;
          ...
       end record;
       REC1 = record
          ...
          Q : ^REC2;
          ...
       end record;
    var
       X : array (1..10) of ^REC1;
       G, I, J : INTEGER;

       G := X(I)^.Q^.AA(J);

We won't even attempt to draw a picture of the memory access corresponding to this expression! You can imagine that a processor might be built with an amazing addressing mode exactly corresponding to the required access sequence. However, not even the most ardent CISC advocate would expect to see a processor go this far in providing specialized addressing modes! How far is far enough? This is an important point in designing microprocessors. One of the important factors differentiating CISC and RISC designs is precisely that of addressing modes. RISC processors tend to concentrate on providing a relatively small, uniform, and highly efficient set of addressing modes from which complex addressing paths can be constructed as a sequence of instructions when needed, whereas CISC designs tend to include a complex set of addressing modes intended to take care of common high-level language situations such as those we have described here.

The Motorola 68030 goes further than the examples here and includes some even more complicated modes whose use is difficult to explain in terms of programming language features. Whether this is an appropriate design choice is one of the questions to be answered as the CISC and RISC designers battle things out in the marketplace.
MEMORY MANAGEMENT

At the hardware level, the main memory of a microprocessor can be regarded as a vector of 8-bit bytes, where the vector subscript is the memory address. In earlier machines, and in some simple machine designs today, the logical view of memory is identical to this hardware view; when an instruction references a memory location, it corresponds to fetching or storing the data from the designated locations in physical memory.

Although this view of memory results in a very simple organization from both a software and hardware point of view, it is quite unsatisfactory for a number of reasons:

•  If several programs are running on the same processor in a multi-programmed manner, then we have to make sure that they do not conflict in their use of memory. If programs reference physical memory directly, then this avoidance of conflicts would have to be done at the program level.

•  Physical memory is limited in size. If programs address physical memory directly, then they are subject to the same limitations. Furthermore, the amount of physical memory varies from one machine to another, and we would prefer that these variations not affect the way programs are written.

•  Compared to the speed of processors, memories are rather slow. If a program really has to access memory every time it executes a load or store instruction, or for that matter on every instruction, since the instruction itself has to be fetched from memory, then access to the memory would become a bottleneck that would limit the overall execution speed unacceptably.

To address these problems, the microprocessors we discuss in this book all provide for memory management. This phrase refers to a combination of hardware and operating system features which provide for efficient memory access by separating the notion of logical and physical memory accesses.
Memory Mapping

To solve the problem of separate programs interfering with one another, some kind of memory mapping facility is provided by the hardware. This automatically performs a mapping function on all addresses used by a program so that the addresses used within a program do not correspond directly to physical addresses. The simplest approach is to simply relocate all addresses by a constant, as shown in Figure 1.10. By providing a limit register, which indicates the length of the logical memory for a given program, this scheme also allows the hardware to check that a program does not reference memory outside its own logical region.

FIGURE 1.10  Memory management through the use of relocation.

This simple base/limit approach has two limitations. First, the memory for a given program must be contiguous. It is always more difficult to allocate large, variable-sized, contiguous chunks of memory than to allocate in small fixed-sized blocks. Secondly, there is no way that two programs can share memory. Although the general idea is to separate the logical address space of separate programs, there are cases in which we do want to share memory. In particular, if two programs are using the same code, then the code itself can certainly be shared.
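A sketch of the translation, in C and with invented numbers, looks like this; on a real processor the check and the addition are done by the memory management hardware, not by code:

    #include <stdint.h>
    #include <stdio.h>

    /* The base/limit translation of Figure 1.10: every logical address is
       relocated by a constant base, and the limit register lets the hardware
       reject references outside the program's own region. */
    static int translate(uint32_t logical, uint32_t base, uint32_t limit,
                         uint32_t *physical)
    {
        if (logical >= limit)
            return 0;                  /* would trap to the operating system */
        *physical = base + logical;
        return 1;
    }

    int main(void)
    {
        uint32_t phys;
        if (translate(0x00001000u, 0x00400000u, 0x00100000u, &phys))
            printf("physical address %08x\n", phys);      /* 00401000 */
        if (!translate(0x00200000u, 0x00400000u, 0x00100000u, &phys))
            printf("reference outside logical region\n");
        return 0;
    }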
A more flexible scheme divides the logical address space of a program, its virtual address space, into a sequence of fixed-length chunks called pages. These virtual pages are individually mapped into corresponding physical pages, which need not be contiguous in physical memory, allowing a simpler, more efficient allocation of physical memory.

The mechanism for mapping the pages is typically quite complex, and involves lookup structures stored in memory. Since it would be unacceptably slow to search these structures for every memory reference, the processor has a small piece of the table stored locally in a translation lookaside buffer (TLB). The approach on a memory reference is to first look in the TLB, and hope that the necessary translation entry is found there. If not, the main memory tables are consulted. The details of how these translation tables are stored and accessed, and the decision of how much of this process is in the hardware, and how much is left up to the operating system, vary considerably from one processor to another. We will see several quite different schemes as we look at the various processors.

Virtual Memory

Given that we implement a mapping scheme with fixed-size pages, it is a small step to implement the concept of virtual memory. This allows a program to reference a virtual memory space that is not limited by the size of physical memory. All we have to do is to add information, typically a single bit, to the page translation tables that says "page not present." The operating system maintains these not-present pages on disk. The translation process sees this bit and traps to the operating system. The operating system gets the trap, reads the required page into memory, swapping out some other page, and fixes up the page table entries to indicate that the new page is now present. Execution of the program can then continue.

This approach is called demand paging, since pages are swapped in on demand, that is, when they are referenced. Obviously the execution speed becomes painfully slow if every memory reference results in a disk read, but what we hope is that in practice, the great majority of references are to pages which are present, so the overhead of page swapping is minimal.

To minimize this overhead, the operating system must make appropriate decisions as to which pages to swap out when new pages are demanded. There are many algorithms designed to optimize these decisions. Most are based on some variation of the least recently used (LRU) principle, which suggests that the appropriate page to discard is the one which was least recently accessed.

Most paging hardware provides some limited support to assist in implementing such algorithms. In particular, we usually find two bits in the page tables, one set when a page is accessed, and the other, called the dirty bit, set when a page is modified. The latter information is important, since pages that have not been modified do not need to be written back to disk (the old image on disk is still valid).
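To tie these pieces together, here is a deliberately simplified C sketch of a single-level page table with present, accessed, and dirty bits; real processors use multi-level tables and a TLB in front of them, and the layout here is invented:

    #include <stdint.h>
    #include <stdio.h>

    /* A flat page table for 4K pages. Each entry holds a physical frame
       number (in the top bits) plus the present, accessed, and dirty flags.
       A flat table like this one costs 4 megabytes, which is itself one
       reason real processors use multi-level tables. */
    #define PAGE_BITS 12
    #define PAGE_SIZE (1u << PAGE_BITS)
    #define PRESENT   0x1u
    #define ACCESSED  0x2u
    #define DIRTY     0x4u

    static uint32_t page_table[1u << (32 - PAGE_BITS)];

    static int translate(uint32_t vaddr, int is_store, uint32_t *paddr)
    {
        uint32_t *entry = &page_table[vaddr >> PAGE_BITS];
        if (!(*entry & PRESENT))
            return 0;                /* page fault: trap to the operating system */
        *entry |= ACCESSED;          /* the bits the LRU and write-back logic use */
        if (is_store)
            *entry |= DIRTY;
        *paddr = (*entry & ~(PAGE_SIZE - 1)) | (vaddr & (PAGE_SIZE - 1));
        return 1;
    }

    int main(void)
    {
        uint32_t p;
        page_table[0x12345] = (0x00077u << PAGE_BITS) | PRESENT;  /* map one page */
        if (translate(0x12345ABCu, 1, &p))
            printf("%08x maps to %08x\n", 0x12345ABCu, p);
        return 0;
    }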
Memory Caching

To avoid the problem of referencing the relatively slow main memory on every memory reference instruction, and to obtain the instructions to be executed, high-performance microprocessors use memory caches. These are small, very fast memories, either on the microprocessor chip itself, or on separate chips intimately connected to it. Since these memories are relatively small, it is economically feasible to use much more expensive, much faster hardware, resulting in the ability to access memory within the cache much more rapidly than main memory. A memory reference then becomes a two-step operation. First the cache is checked to see if the desired memory location is present. If so, then it can be accessed in the cache, completely avoiding the relatively slow main memory. If not, then main memory must be accessed. As with the TLB and page table accesses, we hope that most of the time the memory we want is in the cache, so that the overhead of the slow main memory is minimized.

How often will we find the data in the cache? This obviously depends on both the size of the cache and the pattern of references in the program. Some cases are clearly favorable; for example, when we execute a tight loop, the instructions of the loop can
generally be expected to be found in the cache, a situation we refer to as a cache hit. On the other hand, following a linked list which roams all around the memory space is a bad case which will result in cache misses.

A cache is organized into lines, where a line is a contiguous sequence of bytes on an appropriate boundary. For example, if a cache is 4K bytes long, it might be organized as 256 lines, each containing a 16-byte chunk of memory. The choice of line size is an important design parameter. If it is too small, then the number of references to main memory is increased. If it is too large, then there are fewer lines, and we are less likely to find the memory we want in the cache. Typical choices are in the 16-byte range, although we will see caches where this parameter varies considerably.

One important design consideration is that it must be possible to search the cache very efficiently, since this search takes place on every memory reference. Going back to our example of a 4K cache divided into 256 lines of 16 bytes each, the task is to quickly determine if any of the 256 lines contains the data we are looking for. At one extreme, a fully associative cache can be constructed, where the data we want can be stored in any of the 256 lines, so that at least from a logical point of view, all 256 lines have to be checked. Obviously we cannot search the lines serially, so we need some rather elaborate hardware to search all possibilities in parallel. It is possible to construct such hardware, but quite difficult to keep the performance high enough to avoid significantly slowing down memory references.

At the other extreme, a directly addressed cache is organized so that a given memory location can only be stored in one particular cache line. One way of doing this is to use a portion of the address to specify the cache line. For example, for 32-bit addresses cached in our 4K cache with 256 lines, the address could be divided into three fields:

    | 20 bits (tag) | 8 bits (line number) | 4 bits (byte within line) |

The 8-bit field indicates which cache line is addressed. The 4-bit field is the byte within the cache line, and the issue is whether that particular cache line contains the address we want, that is whether the top 20 bits match. If not, then we immediately know that the data we want is not in the cache. Referencing a directly addressed cache is much easier than referencing a fully associative cache.

However, there is a significant disadvantage. In the directly addressed case, two memory locations that correspond to the same cache line can never be in the cache simultaneously. This considerably increases the probability of encountering unfortunate cases. In the case of our example 4K cache, a string copy from one array to another where the arrays are separated by a multiple of 2^12 bytes would result in none of the data ever being present in the cache, since we would be bouncing backwards and forwards between two memory locations which needed the same cache line.

A compromise is to design a set-associative cache. This essentially consists of a collection of separate directly addressed caches. A given memory location can be stored in any of the separate caches. For example, we could organize our 4K cache as a 4-way set associative cache, where each section of the cache had 64 16-byte lines. The address is now interpreted as follows:

    | 22 bits (tag) | 6 bits (line number) | 4 bits (byte within line) |

The 6-bit field is the line number within any of the four parts of the cache. This means that a given memory location can be stored in any of four different cache lines. In the string copying example, the source and destination arrays can be cached in separate sections of the cache, avoiding the conflict. Searching a multiway set associative cache is much easier at the hardware level since the number of parallel searches that must be performed is much smaller.

When a cache miss occurs, a decision must be made on which data to evict from the cache. For a directly addressed cache, there is only one alternative, so there is no problem. For a set associative cache, or a fully associative cache, there is a choice. There is no time to execute fancy algorithms, so typical hardware makes this decision using rather simple approaches. One approach which works quite well in practice is simply to make a random choice.
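For the example cache sizes used above, the field extraction can be sketched in C as follows (the address value is arbitrary); in hardware this splitting is just a matter of wiring, not arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    /* Splitting a 32-bit address for the 4K-byte, 16-byte-line cache used as
       the example: direct mapped (256 lines) versus 4-way set associative
       (64 sets of 4 lines). */
    int main(void)
    {
        uint32_t addr = 0x1234ABCDu;

        /* direct mapped: 20-bit tag, 8-bit line number, 4-bit byte in line */
        uint32_t byte_dm = addr & 0xF;
        uint32_t line_dm = (addr >> 4) & 0xFF;
        uint32_t tag_dm  = addr >> 12;

        /* 4-way set associative: 22-bit tag, 6-bit set number, 4-bit byte */
        uint32_t byte_sa = addr & 0xF;
        uint32_t set_sa  = (addr >> 4) & 0x3F;
        uint32_t tag_sa  = addr >> 10;

        printf("direct: tag=%05x line=%02x byte=%x\n", tag_dm, line_dm, byte_dm);
        printf("4-way : tag=%06x set=%02x byte=%x\n", tag_sa, set_sa, byte_sa);
        return 0;
    }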
TASKING

The fundamental idea behind tasking is that a machine can have two or more processes with separate threads of control that are both executing in a multiprogramming sense. Each of those tasks owns the processor and the registers when it is executing. Of course, a machine that has only one processor, one program counter, and one set of registers can execute only one program at a time. But it is possible to effectively simulate true multiprogramming by executing some instructions for one task and then, for whatever reason, switching back to the next task and executing its instructions. In switching from one task to another, an operation known as a context switch, the operating system needs to save the state of the executing task, called the machine state.

The machine state is essentially everything that is in the processor. It does not include the state of memory, because each task has its own memory. The machine state includes the instruction pointer, the flags that show the results of condition tests that might have been set, and the registers. Once the machine state has been saved (with the state of memory still preserved), the processor state is completely removed so that some other task can use the processor. Later on, the operating system will arrange to put all of the original task's information back into the processor and the processor will be ready to execute the instruction that it was just about to execute when the context switch occurred.

When a task is temporarily suspended, a data structure called a task control block (TCB) is used to store the machine state information for the inactive task. A context switch saves the machine state for the current task in its TCB and then uses the TCB for some other task that is ready to execute to restore its machine state. With most processors, the context switch is accomplished by software, using a sequence of instructions to save the current machine state and a corresponding sequence of instructions to restore the current machine state. The TCB is an operating system data object whose structure is determined by the operating system software.
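One plausible shape for a TCB, sketched as a C declaration, is shown below; the exact contents and layout are chosen by the operating system, not by the hardware, and the field names here are invented:

    #include <stdint.h>

    /* A possible task control block for a 32-bit processor: the machine
       state the operating system saves at a context switch. Memory is not
       saved here, because each task has its own memory. */
    struct task_control_block {
        uint32_t instruction_pointer;   /* where to resume execution           */
        uint32_t flags;                 /* condition-test results and the like */
        uint32_t registers[8];          /* the general registers               */
    };

A context switch then amounts to copying the processor state into the current task's block and reloading the processor from the block of the task chosen to run next.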
There are situations in which the context switching time is a critical consideration. This typically arises in real-time systems, which must rapidly switch attention between input/output devices. In such situations, context switching represents yet another possible target for the eager CISC designer, always ready to implement specialized instructions to help with common operations.

Most of the processors covered in this book do not have any hardware instructions to support tasking, but there are two exceptions, the Intel 80386 and the INMOS Transputer, where the processor provides hardware support for tasking. Particularly in the case of the Transputer, this hardware support makes very rapid context switching practical.
EXCEPTIONS

An exception is an interruption of the normal flow of instruction processing. There are two situations in which this occurs. The first, which we call a trap, occurs when the processor recognizes that the execution of an instruction has caused an error of some kind. The second, which we call an interrupt, occurs when a device external to the processor signals that a certain event should be brought to its attention. The terminology in this area varies widely between processors, in a rather random manner. This is one area where we prefer to adopt a consistent terminology at the expense of not always matching every manufacturer's idiosyncratic usage.

The instructions on some processors signal an error condition (such as overflow) by setting a status flag which can be tested in a subsequent instruction. Integer overflow is handled this way on most of the processors we will look at. The alternative approach is to generate a trap, which causes a sudden transfer of control, much like a procedure call in that the calling location (i.e., the instruction causing the problem) is saved, but differing from a normal procedure call in that traps typically cause a transition into supervisor or protected state. This means that trap conditions are handled by the operating system. In some cases, this is obviously appropriate. For example, in a virtual memory environment, a trap will occur if the required page is not present. Clearly the operating system needs to handle this condition, and furthermore it certainly needs to be in supervisor state to do it. In other cases such as divide by zero, it is not so clear that the condition should be handled by the operating system. If the handling should logically be done by the application program (for instance, if an Ada program needs to handle the CONSTRAINT_ERROR exception that results from a division by zero), then the operating system can transfer control back to the application program as needed.

The second situation in which the flow of instructions is suddenly modified occurs when an external interrupt occurs, typically signalling the completion of an input/output operation. Again, this interrupt functions similarly to a procedure call, except that it makes a transition to supervisor state. Input/output handling is clearly the province of the operating system. In the general case, it is quite likely that the task that is interrupted has nothing at all to do with the interrupt; it may well belong to some other task or some other currently running program in a multiprogramming system.

The distinction between traps and interrupts is not always precise. Traps are generally synchronous, since they occur in conjunction with the execution of particular instructions. Interrupts, on the other hand, are generally asynchronous, since they can occur at any point in time. However, there are intermediate cases. For example, some floating-point coprocessors have overlapped execution, so that a floating-point overflow, which from one point of view is a trap, since it is caused by a specific instruction on the coprocessor, behaves like an interrupt, since it occurs asynchronously on the main processor a number of instructions after the one that led to the overflow.

Hardware Support for Exceptions

All the processors we will look at have hardware support for interrupts and traps. This includes a mechanism (usually available only in supervisor state) to turn the processor interrupts on and off to control whether hardware interrupts are recognized. When an exception occurs, the machine state of the executing task must not be logically altered. This is particularly important in the case of an asynchronous interrupt; we can't have registers or flags disappearing without warning anywhere an interrupt might occur! At the very least, the hardware must save the instruction pointer, leaving the saving of registers and flags to the interrupt routine itself. Alternatively, the hardware designer may arrange to save considerably more of the machine state automatically. We will see a full variety of possibilities here as we study a range of processors.

Another respect in which processors differ is the extent to which they separate exceptions. At one extreme, every separate condition, including each interrupt from a separate device, is automatically handled by a separate exception handler. The Intel 80386 is an example of such an organization. At the other extreme, there is a single exception handler which must handle all traps and interrupts and has to check various status flags and registers to see what is going on. The i860 is arranged in this simpler manner; it is simpler for the hardware designer, but, of course, the operating systems programmer who has to write the exception routine may not see things quite the same way.

The handling of hardware interrupts poses some special problems, because several interrupts can occur at the same time and some interrupts require very rapid attention. Most processors have some kind of interrupt priority logic that assigns interrupts to various priority levels. The decision as to which device to attach to which priority is one that is made not by the designer of the processor but by the system designer who puts together a processor using a specific chip. For example, on the IBM PC, the timer has the highest priority, to ensure that no timer interrupts are lost and that the time of day stays accurate.

Other facilities often include the ability to temporarily mask an interrupt. The operating system uses this in organizing the handling of interrupts. You generally don't want the same device to interrupt again before you have finished handling the previous interrupt. By masking the device until the handling is complete, a second interrupt is inhibited until it is once again convenient to handle it.

Another consideration in handling device interrupts is that instructions that take a very long time to execute are problematic if interrupts occur only between instructions. There are two approaches to this problem. One is to make long instructions interruptible, so that they can be interrupted and then resume from where they left off. One example is the string instructions of the 80386, which are designed this way. The second approach is to divide up a complex instruction into a sequence of component instructions. For example, the Transputer floating-point instructions include Begin Square Root, Continue Square Root, and Finish Square Root. To compute a square root, these three instructions are issued in sequence (they are never used separately). By using this approach, a hardware interrupt does not have to wait until the entire square root computation is complete.
CHAPTER 2

INTRODUCTION TO THE 80386

The 386 is an example, perhaps the example, of a CISC architecture. Describing a complex instruction set is a complex operation, so we devote three chapters to it. In this chapter we look at the instruction set from an application programmer's point of view, and in later chapters describe its support for operating systems.
REGISTER STRUCTURE
A starting point for looking at any microprocessor architecture
is its
register structure,
of CISC whose design can be
this affects the instruction set to a large degree, particularly in the case
s ince
processors.
The
traced back to
of the 386
386,
as
we
some of the
register set,
we
shall see, has earliest Intel
an unusual
microprocessors. Before looking at the details
will describe
some of
this heritage.
p redecessor of the 8086, was an 8-bit machine, and 8-bit registers, reflected this organization.
some of
register set
It
its
The
Intel
8080, the
register structure, a set
was possible
of eigh t
for very limited purposes to
form 16-bit registers, but it was nevertheless an most other respects. When the 8086 was designed, the issue of compatibility with the 8080 was an important one. On the one hand, Intel marketing was interested in guaranteeing total
join
8-bit
these 8-bit registers to
machine
in
45
— INTRODUCTION TO THE
46
80386
compatibility with the 8080, and the sales force gave the impression that the 8086
would be upwards compatible. Customers were then writing 8080 programs, and had an interest in protecting their software investments. On the other hand, the redesign was seen by the engineering group as an. opportunity for enhancements. At
clearly
memory from 64K to 128K was important. seem to have been a number of conflicting desires and requirements. Some constituencies had thoughts of major surgery, and there was talk of eliminating the very
Beyond
least,
this,
doubling the addressable
there
the non-symmetrical nature of the 8080. If taken seriously, this
would have meant a was concerned about maintaining its customer base, that time under siege from Zilog (the manufacturer of the Z80, a popular
complete redesign. Since
which was
at
Intel
replacement for the 8080), these requirements were important to management.
i
The principal designer of the 8086, Stephen Morse, steered an interesting course n the middle of these conflicting requirements On the one hand, the 8086 is q uite .
On
compatible with the basic structure of the 8080.
the other hand, the design
was
not constrained by an absolute compatibility requirement, and in particular, the
beyond the 128K that had been originally envisioned, to an address space of one megabyte, which at that time seemed huge for a microprocessor. One important benefit, at least in retrospect, even if it was not a deliberately intended effect, was that the attempt to maintain a reasonable level of compatibility helped to reduce the design work required, and therefore contributed to the important goal of getting something out fast. At the same time the final result was much more than Intel management's original concept of a slightly beefed-up 8080 To resolve the compatibility issue, a translation program was created which c onverted 8080 assembly language to 8086 assembly langua ge. In practice this program generated horrible code, and no one in the engineering department at Intel ever addressing was extended
far
.
expected
it
to be used.
On
now
the other hand, the sales force could
customers
talk to
them "Don't worry, all you have to do is to feed your code through our translator program which fixes up the "minor" discrepancies between the 8080 and
and
tell
8086, and you'll never
know
that the architecture has changed." This kind of discrep-
ancy between what engineering thinks and what the
sometimes less
it
results
is
not
uncommon a
it is
more
or
conscious deception to keep customers locked into a manufacturer's product.
Returning to the affected all
sales force says
from confusion and wishful thinking, sometimes
by the
registers
register set
register structure
of the 8086 are 16
of the 8086, we will see that
its
design
of the 8080. With the exception of the
bits
wide
Some
(see Figure 2.1).
is
strongly
flags register,
of the registers have
curious names, which reflects their special uses. Each of the 16-bit registers AX, BX,
CX, and
DX
is
divided up into two 8-bit components that can be used
individual registers.
being the top 8 in
bits
an attempt to
registers
The AX register, of AX and the
map
for example,
latter
is
divided into
being the lower 8
the register structure of the 8-bit
of t he 16-bit 8086. If you look only at these eight
of the 8086 looks
just like that
was
they were
This was done partly into the
bottom four
registers, the register structure
of the 8080.
In addition to duplicating the register structure
instructions
bits.
8080
as if
AH and AL, the former
also provided. In general there are
two
of the 8080, a sets
full set
of 8-bit
of instructions on the 8086.
47
REGISTER STRUCTURE
16
bits
AH BH CH DH
AX: BX:
CX: DX:
AL BL
CL DL
SI:
Dl:
(Stack Pointer)
SP: BP:
CS DS SS ES BP:
(Instruction Pointer)
FL:
(Flags Register)
FIGURE 2.1 The
register structure
There
one
is
of the Intel 8086.
bit in every
opcode
that determines
whether an instruction
version of the instruction or the 8-bit version. This bit for
word bk, and
The i
ncluded
it is
set to
1
as part
called the W-bit,
is
the 16-bit
which stands
for the 16-bit case.
structure of the instruction formats of the
8086
is
such that the W-bit
is
of the opcode part of an instruction. Following the opcode in various
numbers.
places there are 3-bit fields that are operand/ register register
operands indicated by one of these
while
the
if
is
W-bit
is
fields will
If the
be interpreted
W-bit
is set,
the
as a 16-bit register,
off it will be interpreted as an 8-bit register.
Notice that having the W-bit in the opcode commits one from an architectural point of view to having
all
operands in an instruction be the same length. Looking
the kind of instructions that are available
MOV
AL, BL
MOV
AX, BX
;
on the 8086,
8-bit register
copied
it is
at
possible to write
to 8-bit register
or
but
it is
not possible to write
MOV What would low order 8
AX, BL the last instruction bits
of the
BX
mean?
register
A reasonable
interpretation
should be copied into the
AX
would be
register,
that the
with either
48
INTRODIK
1
:
ION
TO THE
sign or zero extension.
the is
W-bit that
is
80386
But neither of these reasonable interpretations is permitted since MOV opcode specifies that the size of all register operands
part of the
either 8 bits or 16 bits.
A
more
general architectural design
types and operands into the operands themselves, (the gives a quite general
mixing of operands of different
VAX
types.
is
to put designators of
uses this approach). This
But
that,
of course, takes
m ore bits and more logic, b ecause every operand would require such a bit. All of this
is
just a
matter of whether
it is
possible to
fit
to extend an 8-bit value into 16 bits. This can, of course, be It is
very simple to program that operation.
what
It is
into the
opcode the
programmed
written by zero-extending
if
ability
necessary.
BL
(if that's
required), using the sequence
is
MOV MOV
AL, BL
AH,
and Instructions
Special Registers
on the 8086 can be distinguished from every other register in some is something for which the 8086 is well known. This architecture is thus at the opposite design extreme from machines with uniform register sets. Enumerating all the specialized uses of the registers would take too long and be too messy, so we will simply give some examples. Multiplication on most machines involves putting a result into a register pair, since the result of an w-bit by w-bit multiplication will in general require 2n bits. The 8086 has a 16- by 16-bit multiply, which yields a 32-bit result. The solution on the 8086 is to require that operands and the result be placed in specific registers. This Each of the
registers
way. This lack of orthogonality
multiplication specifically requires that the multiplicand be put into AX, with the 32-bit result
put into the
DX:AX
The
CX
register
instruction j
ump
if
CX
(LOOP) is
is
pair.
another register with a special use.
that automatically decrements the
not equal to zero. To execute a loop
MOV
CX, 15
LOOP
LP
1
CX
The 8086
register
5 times, the
has a loop
and then executes
code
a
is
LP:
This
is
very
much
the kind of instruction that
for use in a very specific situation. Since the
not have enough
room
is
"mission-oriented," that
such
intended
normal format of a jump
jump
instructions test
to designate a register (most conditional
special bits within a status register
is,
instruction does
as the carry flag
and the overflow
flag),
the
operands are usually implied rather than being explicitly specified in the instruction. In this sense, the choice to use CX rather than another register as the basis of whether
jump is somewhat arbitrary. The XLAT instruction makes special use of both the BX register and the AL register. The memory location whose address is formed by adding the contents of the BX and AL registers is loaded into the AL register. One obvious use of this instruction
or not to
is
for translating character sets
— hence
the name.
REGISTER STRUCTURE
The i
index registers
SI
and DI have
49
uses in connection with string
special
nstructions that copy a sequence of bytes from one location to another. ("S" stands for
source,
and "D"
for destination.)
We will
call stack.
special uses in conjunction with the
discuss the special uses of each of these registers in detail later on.
we go
Before
BP and SP have
further, let us describe the register structure
about the operand formats.
The 386
has exactly the
same
of the 386 and
talk
structure as the 8086, except
that each of the registers is 32 bits wide and each register is renamed by putting an "E" on the front (you may think of the "E" as meaning "extended"). The bottom 16 bits of each register has a name that corresponds to the old 8086 names, the right-hand half of this picture is identical in all respects to the 8086 register model (Figure 2.2).
Maintaining Compatibility with the 8086/88 register structure of the 386 would seem rather peculiar if we did not understand 8086 origins (see Figure 2.2) At the right, you can see a structure that looks identical to the 8086 and is completely compatible with it, but the registers are extended to 32-bits. The 16-bit CX register on the 8086, for example, becomes a 32-bit extended register called ECX on the 386. The problem is, the instruction formats of the 386 have to be pretty much the same as the 286 because the compatibility requirement is very strong. Recall that on the 8086 and 80286 in the opcode byte (the general form of an instruction is that there is an 8-bit opcode followed by other fields) there is a W-bit that on the 286 says whether to use 8 or 1 6 bits. There isn't room in the 8-bit opcode field to fit an extra bit in saying, "Please use 32 bits." If the 386 were being designed from scratch, it would probably have been preferable to have three possible designators so that 8-bit, 16-bit, and 32-bit references could be freely mixed. But there just is not enough room in the existing
The its
.
instruction formats.
The
mode
trick that
is
used to solve
problem
this
for the processor that can be set to
W W
is
the following.
There
is
an overall
put the machine into either 16-bit
mode
or
is set to 0, the processor always uses 8-bit operations, regardless of mode. If the mode, but if is set to 1 then the processor uses either 32-bit or 1 6-bit operations depending on the mode. In the 32-bit mode there is a choice between 32-bit and 8-bit operands, while in the 1 6-bit mode there is a choice between 1 6-bit and 8-bit operands. I n order to write code that is compatible with the IBM PC or to run PC-compatible code, the processor will operate in 16-bit mode. In this mode, none of the code
32-bit
,
ever uses the upper half of the 32-bit registers t
—
all
machine,
it
must operate
in 32-bit
work in such a way To operate the 386 as a 32-bit
the instructions
hat they are blind and oblivious to the higher-order
bits.
mode. Eight-bit operations
course, since characters are important whatever the
word
are
still
available,
of
size.
There is one trick that gives a programmer a little more flexibility. There is an operand prefix byte (it has a special coding, 66 hex, which is different from any opcode value) that directs the processor to change modes for the next instruction. That allows you to mix some 16-bit mode instructions into 32-bit code, or vice versa. If you have code that heavily mixes 16- and 32-bit instructions, then the code will be covered with these prefixes (wasting time and space).
FIGURE 2.2 The user register set of the 80386: the 32-bit registers EAX, EBX, ECX, EDX, ESI, EDI, ESP (Extended Stack Pointer), and EBP, with their 16-bit and 8-bit subregisters; the segment registers CS, DS, SS, ES, FS, and GS; the Extended Instruction Pointer EIP; and the extended flags register EFL.
There is no practical way to flip the current mode of the processor back and forth between 16- and 32-bit operation.
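The interaction of the mode, the W bit, and the prefix byte can be summarized in a few lines of C. The sketch below is our own simplification, not Intel's decoding logic, and the function name is ours:

    #include <stdio.h>

    /* A simplified sketch (ours, not Intel's) of how the 386 picks an
       operand size from the default mode, the W bit, and the 66 hex prefix. */
    static int operand_size(int default_32bit_mode, int w_bit, int has_66_prefix)
    {
        /* The 66 hex prefix flips the effective mode for one instruction. */
        int use_32bit = has_66_prefix ? !default_32bit_mode : default_32bit_mode;

        if (w_bit == 0)
            return 8;                 /* W = 0 always means an 8-bit operation */
        return use_32bit ? 32 : 16;   /* W = 1 picks 16 or 32 bits by the mode */
    }

    int main(void)
    {
        printf("%d\n", operand_size(1, 1, 0));  /* 32-bit mode, W = 1      -> 32 */
        printf("%d\n", operand_size(1, 1, 1));  /* 32-bit mode, prefixed   -> 16 */
        printf("%d\n", operand_size(0, 0, 0));  /* 16-bit mode, W = 0      ->  8 */
        return 0;
    }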
This mechanism is rather clumsy, probably not what would have been chosen if the design were started from scratch. If the design were not constrained by compatibility considerations, the 16-bit operations might have been omitted, or at least a more usable mechanism devised for mixing the three operand lengths.
THE USER INSTRUCTION SET

In this section, we will give a brief overview of the general design of the 386 instruction set. We will not describe every single instruction in detail — such a description can be found in the Intel 80386 Programmer's Reference Manual and in many other books on the 386. What we want to do is to get a general idea of the instructions that are available and concentrate on unusual instructions that exhibit the CISC philosophy of providing specialized instructions for common high-level programming constructs.
Basic Data Movement Instructions

The 386 move instructions allow you to move data between registers, and between registers and memory, but not directly between different memory locations:

    MOV  reg1, mem     ; load reg1 from memory
    MOV  mem, reg1     ; store the value in reg1 into memory
    MOV  reg1, reg2    ; copy a value from reg2 into reg1
The simplicity of this description hides the fact that the memory references implied by the mem operands actually allow a programmer to use a relatively rich set of addressing modes in defining a memory address.
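As a preview of those addressing modes, the address arithmetic that the 386 can perform inside a single memory operand has the general form base + index * scale + displacement, with a scale of 1, 2, 4, or 8. The little C sketch below (the function name and example values are ours) just spells out that arithmetic:

    #include <stdint.h>
    #include <stdio.h>

    /* A sketch (ours) of the address arithmetic behind a 386 memory operand:
       effective address = base + index * scale + displacement. */
    static uint32_t effective_address(uint32_t base, uint32_t index,
                                      uint32_t scale, int32_t displacement)
    {
        return base + index * scale + (uint32_t)displacement;
    }

    int main(void)
    {
        /* e.g. the address of element 7 of an array of 4-byte items at 0x1000 */
        printf("%#x\n", effective_address(0x1000, 7, 4, 0));   /* prints 0x101c */
        return 0;
    }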
Basic Arithmetic and Logical Operations
The most commonly used instructions take two operands, one of which is a register; the other can be a register or a memory location. In assembly language, one format for the addition instruction is

    ADD  EAX, K

This instruction adds the contents of memory location K to the contents of the EAX register, leaving the result in EAX. The addition is a 32-bit addition that can be regarded as unsigned or two's complement. Three flags are set by the result:

• CF, the carry flag, is set if there is an unsigned overflow.
• OF, the overflow flag, is set if there is a signed overflow.
• ZF, the zero flag, is set if the result is all zero bits.
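To make the three conditions concrete, here is a small C sketch of how they relate to a 32-bit addition; the function and structure names are ours, not anything defined by Intel:

    #include <stdint.h>
    #include <stdio.h>

    /* A rough sketch (ours) of how CF, OF, and ZF are determined by a
       32-bit addition such as ADD EAX, K. */
    struct add_flags { int cf, of, zf; };

    static struct add_flags add32(uint32_t a, uint32_t b, uint32_t *result)
    {
        uint32_t r = a + b;
        struct add_flags f;
        f.cf = r < a;                         /* unsigned overflow          */
        f.of = (~(a ^ b) & (a ^ r)) >> 31;    /* signed overflow            */
        f.zf = (r == 0);                      /* result is all zero bits    */
        *result = r;
        return f;
    }

    int main(void)
    {
        uint32_t r;
        struct add_flags f = add32(0xFFFFFFFFu, 1u, &r);   /* wraps to zero */
        printf("CF=%d OF=%d ZF=%d r=%u\n", f.cf, f.of, f.zf, r);
        return 0;
    }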
Unlike many of the other processors that we will look at, the 386 permits operations to memory as well as operations from memory:

    ADD  K, EAX

computes the same sum, but the result is stored back into memory location K. The same instruction format can also be used for operations between the registers:

    ADD  EAX, EBX

This instruction computes the sum of EAX and EBX, placing the sum in EAX. A large number of two-operand instructions share this basic instruction format, with the result always replacing the contents of the left operand:
    ADC   op1, op2    ; addition including CF
    SUB   op1, op2    ; subtraction
    SBB   op1, op2    ; subtraction including CF
    CMP   op1, op2    ; comparison (like subtraction, but no result stored)
    AND   op1, op2    ; logical AND
    OR    op1, op2    ; logical OR
    XOR   op1, op2    ; logical exclusive OR
    TEST  op1, op2    ; bit test (like AND, but no result stored)
    MOV   op1, op2    ; copy operand 2 to operand 1
    LEA   op1, op2    ; place address of operand 2 in operand 1
The ADC and SBB instructions are useful for multiple-precision addition and subtraction, since they include the carry flag from the previous operation, so, for example, a typical triple-precision (96-bit) addition can be written as:

    ADD  EAX, EDX    ; add low-order words
    ADC  EBX, ESI    ; add next word with carry from previous
    ADC  ECX, EDI    ; ECX:EBX:EAX = ECX:EBX:EAX + EDI:ESI:EDX
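For readers more comfortable with C, the following sketch (ours) shows the same 96-bit addition done with 32-bit words and an explicit carry; the ADC instructions above get the carry propagation for free:

    #include <stdint.h>

    /* A sketch (ours) of a 96-bit addition using 32-bit words and an
       explicit carry.  Word [0] is the low-order word, as with EAX above.
       Usage: add96(a, b) leaves a + b in a. */
    void add96(uint32_t a[3], const uint32_t b[3])
    {
        uint32_t carry = 0;
        for (int i = 0; i < 3; i++) {
            uint64_t sum = (uint64_t)a[i] + b[i] + carry;
            a[i]  = (uint32_t)sum;           /* low 32 bits of the partial sum */
            carry = (uint32_t)(sum >> 32);   /* carry into word i + 1          */
        }
    }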
The comparison instruction, CMP, behaves exactly like a subtraction but does not store a result. It does, however, set the OF, CF, and ZF flags, from which a full set of both signed and unsigned comparison conditions can be deduced. A complete set of jumps is available to test these conditions:
    JMP  lbl    ; unconditional jump
    JA   lbl    ; jump above (greater than, unsigned)
    JAE  lbl    ; jump above or equal (unsigned)
    JB   lbl    ; jump below (less than, unsigned)
    JBE  lbl    ; jump below or equal (unsigned)
    JE   lbl    ; jump equal (same for signed or unsigned)
    JNE  lbl    ; jump not equal (same for signed or unsigned)
    JG   lbl    ; jump greater than (signed)
    JGE  lbl    ; jump greater than or equal (signed)
    JL   lbl    ; jump less than (signed)
    JLE  lbl    ; jump less than or equal (signed)
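The reason two families of conditional jumps are needed is that the same bit pattern orders differently as a signed number and as an unsigned number. The small C fragment below (ours) shows the point:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t a = 0xFFFFFFFFu;   /* 4294967295 unsigned, -1 as a signed value */
        uint32_t b = 1u;

        /* CMP sets the flags once; the unsigned jumps (JA, JB, ...) and the
           signed jumps (JG, JL, ...) then read different flags. */
        printf("unsigned: a > b is %d\n", a > b);                    /* prints 1 */
        printf("signed:   a > b is %d\n", (int32_t)a > (int32_t)b);  /* prints 0 */
        return 0;
    }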
Operations that take only a single operand can be used with either a register or a memory operand:

    INC  op    ; increment operand by 1
    DEC  op    ; decrement operand by 1
    NEG  op    ; negate operand
    NOT  op    ; invert operand bits

The operations described so far can operate on 8-bit operands (using one of the 8-bit registers, AL, BL, ...), 16-bit operands (using one of the 16-bit registers, AX, BX, ...), or 32-bit operands (using one of the 32-bit registers, EAX, EBX, ...). The following instructions are one of the few cases where operands of different lengths can be mixed:

    MOVSX  op1, op2    ; move with sign extension
    MOVZX  op1, op2    ; move with zero extension

The motivation behind the inclusion of these instructions in the instruction set is to allow the second operand to be shorter than the first and either sign- or zero-extended to fill the larger operand. For example, op1 can be EAX and op2 can be a byte in memory. In this case MOVSX loads a byte from memory, sign-extending it to fill 32 bits.
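In C terms, the difference between the two is simply the difference between widening a signed byte and an unsigned byte. The sketch below is ours:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint8_t byte = 0x80;   /* 128 unsigned, -128 as a signed byte */

        int32_t  sx = (int8_t)byte;   /* MOVSX-style: sign-extended -> -128 */
        uint32_t zx = byte;           /* MOVZX-style: zero-extended ->  128 */

        printf("MOVSX gives %d, MOVZX gives %u\n", sx, zx);
        return 0;
    }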
Multiplication and Division Instructions
We will complete the picture of integer arithmetic by describing the set of multiply and divide instructions. The basic multiply instruction takes only one operand:

    MUL   op1    ; unsigned multiplication
    IMUL  op1    ; signed multiplication

The second operand is always the accumulator (AL, AX, or EAX, depending on the length of the operand). The result always goes in the extended accumulator (AX, DX:AX, or EDX:EAX). This specialized use of registers keeps the instructions shorter, since the instruction need not specify one of the operands. On the other hand, it complicates life for the assembler programmer and particularly for a compiler writer, because it means that multiplication must be treated in a special way compared to addition and subtraction and that EAX must be treated differently from the other registers. Division is similarly specialized:
    DIV   op1    ; unsigned division
    IDIV  op1    ; signed division

The dividend is always in the extended accumulator. The remainder and quotient are stored back in the two halves of the extended accumulator. For example, in the 32-bit form, EDX:EAX is divided by the 32-bit operand, with the remainder stored in EDX and the quotient stored in EAX.

On the 8086, this was the complete set of multiply and divide instructions. The 386 has some additional instructions to perform multiplication:

    IMUL  op1, op2               ; single-length multiply
    IMUL  op1, op2, immediate

The first form performs a single-length multiplication (8-, 16-, or 32-bit), putting the result in the left operand as usual. It is interesting to note that there is no MUL in this format — none is needed, since, as in the case of addition and subtraction, the signed and unsigned results are the same if only the low-order bits are generated. This multiply instruction corresponds to the normal multiplication required in high-level languages like C or FORTRAN, so it is highly convenient for a compiler.
The second format is highly idiosyncratic. It multiplies op2, which can be a register or memory, by the immediate operand and places the resulting single-length product in op1, which must be a register. There are no other three-operand instructions of this type in the instruction set. Why on earth did this instruction get added? This is a good example of another mission-oriented CISC instruction. Consider the case of indexing an array, where the elements of the array are 32 bytes long. The following instruction is just what is needed:

    IMUL  EBX, I, 32

EBX now contains the byte offset into the array whose subscript is I. Is it worth having this special instruction? That is always the $64,000 question!
On the one hand, array indexing is a common operation. On the other hand, RISC advocates would argue that a decent compiler can eliminate nearly all such multiplication instructions using a standard optimization called strength reduction. Consider the following loop:

    for I in 1 .. 100 loop
        S := S + Q(I).VAL;
    end loop;
Let us assume that Q is an array of records where each record is 32 bytes long and the VAL field is in the first 4 bytes of each record. Naive code for this loop can make nice use of the special IMUL instruction:

         MOV   ECX, 1          ; use ECX to hold I
    LP:  IMUL  EAX, ECX, 32    ; get offset in EAX
         MOV   EBX, Q[EAX]     ; load VAL field
         ADD   S, EBX          ; add VAL to S
         INC   ECX             ; increment I
         CMP   ECX, 100        ; test against limit
         JNE   LP              ; loop until I = 100

A clever compiler using strength reduction would replace I by 32*I, generating the following code:
         MOV   ECX, 32         ; get 32 * I in ECX
    LP:  MOV   EBX, Q[ECX]     ; load VAL field
         ADD   S, EBX          ; add to S
         ADD   ECX, 32         ; add 32 to 32 * I
         CMP   ECX, 3200       ; compare against adjusted limit
         JNE   LP              ; loop until 32 * I = 32 * 100

This code is clearly much more efficient since it does not need to make use of the fancy multiply instruction. If it is true that compilers can always get rid of these multiplications, then the RISC advocates have a point. In practice not all of them can be eliminated, so the situation is clouded. There are also many "stupid" compilers in the world, so it can also be argued that relying on clever compilers is somewhat unrealistic.
DOUBLE-LENGTH MULTIPLY AND DIVIDE. Not all processors provide the double-length forms of multiply and divide, but as we have seen the 386 is an example of a processor that has both. Looking at high-level languages, one might wonder whether these instructions are of any use. Among all the commonly used high-level languages, only COBOL gives access to them. This is done using statements such as:
    MULTIPLY SINGLE-ONE BY SINGLE-TWO GIVING DOUBLE-RESULT.
    DIVIDE DOUBLE-DIVIDEND BY SINGLE-DIVISOR GIVING SINGLE-QUOTIENT.

There are three reasons for providing these instructions. First, it tends to be more or less free in the hardware. To multiply 32 bits by 32 bits, the standard hardware algorithms require 32 steps of shifting and adding. A 64-bit result is naturally developed without any extra work. Similarly, a 32-bit division involves 32 steps and can naturally deal with a double-length dividend.
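In C, with a 64-bit integer type standing in for the EDX:EAX register pair, the double-length operations compute roughly the following. This is a sketch of the semantics (ours), not a description of the hardware:

    #include <stdint.h>

    /* MUL: 32 x 32 -> 64-bit product.
       Usage: uint64_t p = mul32x32(a, b); */
    uint64_t mul32x32(uint32_t a, uint32_t b)
    {
        return (uint64_t)a * b;
    }

    /* DIV: 64-bit dividend / 32-bit divisor -> 32-bit quotient and remainder.
       (The real instruction faults if the quotient does not fit in 32 bits.) */
    void div64by32(uint64_t dividend, uint32_t divisor,
                   uint32_t *quotient, uint32_t *remainder)
    {
        *quotient  = (uint32_t)(dividend / divisor);
        *remainder = (uint32_t)(dividend % divisor);
    }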
In addition, there are two programming situations in which these double-length instructions are useful. First, consider multiple-precision arithmetic. When you learn multiplication in grade school, you are taught that multiplying a single digit by a single digit can give a result of up to two digits, in the form of the multiplication tables up to 10 by 10 (the 10 times table is redundant, but it is easy, and we teach it to reinforce the notion of multiplication by 10 being equivalent to moving the decimal point). When grade-school students are taught the 9 times table, they learn that nine 9s are 81 — you don't learn that nine 9s are 1 and the carry doesn't matter! That's because if you want to do long multiplication by hand you need that carry digit. The same principle applies to programming multiple-precision multiplications. When multiplying 10 words by 10 words, you need the one word by one word giving two words as the component instruction in the algorithm. Similarly, multiple-precision division, which is much more complicated, also requires the double-length divide.

A second situation arises in computing expressions of the form B * C / D with integer operands. With double-length operations, the result of the multiplication can temporarily overflow into double length, with the division then bringing the quotient
back into single-length range. At DISC, a typesetting company in Chicago (owned by the brother of the first author), the primary application repeatedly evaluates expressions of this type for scaling graphics and type on the screen and printer. For this scaling, it is important that double-length results be permitted.

The original version of the DISC application was written in assembly language on a processor providing double-length results, so there was no problem. The most recent version of the software is written in C and runs on the 386. Although the 386 provides the double-length operations, C does not provide access to them, so there is a choice of catastrophes. The results of all arithmetic operations can either be left in single precision (in which case it is possible to get an overflow on the multiplication and hence wrong results) or everything can be converted to 64 bits:

    a = int((long(b) * long(c)) / long(d))

This implies a multiplication of two 64-bit values, and there certainly is not an instruction to do that. Consequently, a C compiler will generate a call to a time-consuming software multiply routine, followed by another call to an even more time-consuming 64-bit division routine, even though all of this could have been done in assembly language in two instructions. At DISC, they finally had to resort to doing these scaling operations with a small assembler routine. Even with the extra overhead of the call, the application was speeded up by nearly 20%. That is still sort of sad, isn't it? The machine they use has the right instructions, but C does not give access to them. There isn't always perfect communication between language designers and hardware designers.
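For what it is worth, with the 64-bit integer types that later C compilers supply, the scaling can at least be written so that the intermediate product cannot overflow; whether the compiler then uses the two-instruction sequence or a pair of library routines is still up to the compiler. The function below is our own illustration, not DISC's code:

    #include <stdint.h>

    /* A sketch (ours) of the b * c / d scaling with a 64-bit intermediate,
       so the product cannot overflow before the division brings it back
       into 32-bit range.  Usage: int32_t a = scale(b, c, d); */
    int32_t scale(int32_t b, int32_t c, int32_t d)
    {
        return (int32_t)(((int64_t)b * c) / d);
    }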
GETTING BOTH THE QUOTIENT AND REMAINDER. Another feature of the divide instruction on the 386 is that it provides both the quotient and the remainder in the same instruction. Again, this is almost free at the hardware level — think about how to do a long division. At the end of a division, the remainder is left as a
consequence of doing the division. Once again, among high-level languages, only COBOL gives direct access to this instruction:

    DIVIDE A BY B GIVING C REMAINDER D.

It may be a little verbose, but at least we can do it. In Ada, we have to write

    C := A / B;
    D := A rem B;

and hope that our compiler is clever enough to notice that it only needed to do one division. We will probably be disappointed — even if the compiler recognizes and eliminates common subexpressions, it may well miss this case, because the common expression is not at the source level, only at the level of the generated code.
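C, for its part, offers div() in the standard library, which returns the quotient and the remainder from a single call; whether a particular compiler maps it, or a / and % pair, onto one IDIV is again a matter of luck. The example is ours:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        div_t qr = div(17, 5);   /* one call yields both results */
        printf("quotient = %d, remainder = %d\n", qr.quot, qr.rem);
        return 0;
    }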
Decimal Arithmetic

The decimal arithmetic operations provide a nice example of the CISC design philosophy in action. Let's consider one of them, Decimal Adjust after Addition (DAA), in detail. It performs the following sequence of computations:

    if ((AL and 0FH) > 9) or (AF = 1) then
        AL ← AL + 6; AF ← 1
    if (AL > 9FH) or (CF = 1) then
        AL ← AL + 60H; CF ← 1
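Written out in C over an 8-bit value and two flag variables, the adjustment looks roughly like this. It is our sketch, following the rule as stated above rather than Intel's exact specification:

    #include <stdint.h>

    /* A sketch (ours) of the DAA adjustment applied to AL after a packed-BCD
       addition; af and cf stand for the auxiliary-carry and carry flags.
       Usage: al = daa(al, &af, &cf); */
    uint8_t daa(uint8_t al, int *af, int *cf)
    {
        if ((al & 0x0F) > 9 || *af) {
            al += 6;           /* correct the low decimal digit  */
            *af = 1;
        }
        if (al > 0x9F || *cf) {
            al += 0x60;        /* correct the high decimal digit */
            *cf = 1;
        }
        return al;
    }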