Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture [1]

The Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1, describes the basic architecture and p


English Pages 466 Year 2006


Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 1: Basic Architecture

NOTE: The Intel® 64 and IA-32 Architectures Software Developer's Manual consists of five volumes: Basic Architecture, Order Number 253665; Instruction Set Reference A-M, Order Number 253666; Instruction Set Reference N-Z, Order Number 253667; System Programming Guide, Part 1, Order Number 253668; System Programming Guide, Part 2, Order Number 253669. Refer to all five volumes when evaluating your design needs.

Order Number: 253665-021 September 2006

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL’S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. INTEL PRODUCTS ARE NOT INTENDED FOR USE IN MEDICAL, LIFE SAVING, OR LIFE SUSTAINING APPLICATIONS.

Intel may make changes to specifications and product descriptions at any time, without notice.

Developers must not rely on the absence or characteristics of any features or instructions marked “reserved” or “undefined.” Improper use of reserved or undefined features or instructions may cause unpredictable behavior or failure in developer's software code when running on an Intel processor. Intel reserves these features or instructions for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from their unauthorized use.

The Intel® 64 architecture processors may contain design defects or errors known as errata. Current characterized errata are available on request.

Hyper-Threading Technology requires a computer system with an Intel® processor supporting Hyper-Threading Technology and an HT Technology enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information, see http://www.intel.com/technology/hyperthread/index.htm; including details on which processors support HT Technology.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations. Intel® Virtualization Technology-enabled BIOS and VMM applications are currently in development.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel® 64 architecture. Processors will not operate (including 32-bit operation) without an Intel® 64 architecture-enabled BIOS. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

Enabling Execute Disable Bit functionality requires a PC with a processor with Execute Disable Bit capability and a supporting operating system. Check with your PC manufacturer on whether your system delivers Execute Disable Bit functionality.

Intel, Pentium, Intel Xeon, Intel NetBurst, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo, Intel Core 2 Extreme, Intel Pentium D, Itanium, Intel SpeedStep, MMX, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other names and brands may be claimed as the property of others.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an ordering number and are referenced in this document, or other Intel literature, may be obtained from: Intel Corporation, P.O. Box 5937, Denver, CO 80217-9808, or call 1-800-548-4725, or visit Intel’s website at http://www.intel.com

Copyright © 1997-2006 Intel Corporation

CONTENTS

CHAPTER 1 ABOUT THIS MANUAL
1.1 INTEL® 64 AND IA-32 PROCESSORS COVERED IN THIS MANUAL
1.2 OVERVIEW OF VOLUME 1: BASIC ARCHITECTURE
1.3 NOTATIONAL CONVENTIONS
1.3.1 Bit and Byte Order
1.3.2 Reserved Bits and Software Compatibility
1.3.2.1 Instruction Operands
1.3.3 Hexadecimal and Binary Numbers
1.3.4 Segmented Addressing
1.3.5 A New Syntax for CPUID, CR, and MSR Values
1.3.6 Exceptions
1.4 RELATED LITERATURE

CHAPTER 2 INTEL® 64 AND IA-32 ARCHITECTURES
2.1 BRIEF HISTORY OF INTEL® 64 AND IA-32 ARCHITECTURE
2.1.1 16-bit Processors and Segmentation (1978)
2.1.2 The Intel® 286 Processor (1982)
2.1.3 The Intel386™ Processor (1985)
2.1.4 The Intel486™ Processor (1989)
2.1.5 The Intel® Pentium® Processor (1993)
2.1.6 The P6 Family of Processors (1995-1999)
2.1.7 The Intel® Pentium® 4 Processor Family (2000-2006)
2.1.8 The Intel® Xeon® Processor (2001-2006)
2.1.9 The Intel® Pentium® M Processor (2003-Current)
2.1.10 The Intel® Pentium® Processor Extreme Edition (2005-Current)
2.1.11 The Intel® Core™ Duo and Intel® Core™ Solo Processors (2006-Current)
2.1.12 The Intel® Xeon® Processor 5100 Series and Intel® Core™2 Processor Family (2006-Current)
2.2 MORE ON SPECIFIC ADVANCES
2.2.1 P6 Family Microarchitecture
2.2.2 Intel NetBurst® Microarchitecture
2.2.2.1 The Front End Pipeline
2.2.2.2 Out-Of-Order Execution Core
2.2.2.3 Retirement Unit
2.2.3 Intel® Core™ Microarchitecture
2.2.3.1 The Front End
2.2.3.2 Execution Core
2.2.4 SIMD Instructions
2.2.5 Hyper-Threading Technology
2.2.5.1 Some Implementation Notes
2.2.6 Multi-Core Technology
2.2.7 Intel® 64 Architecture
2.2.8 Intel® Virtualization Technology (Intel® VT)
2.3 INTEL® 64 AND IA-32 PROCESSOR GENERATIONS

CHAPTER 3 BASIC EXECUTION ENVIRONMENT
3.1 MODES OF OPERATION
3.1.1 Intel® 64 Architecture
3.2 OVERVIEW OF THE BASIC EXECUTION ENVIRONMENT
3.2.1 64-Bit Mode Execution Environment
3.3 MEMORY ORGANIZATION
3.3.1 IA-32 Memory Models
3.3.2 Paging and Virtual Memory
3.3.3 Memory Organization in 64-Bit Mode
3.3.4 Modes of Operation vs. Memory Model
3.3.5 32-Bit and 16-Bit Address and Operand Sizes
3.3.6 Extended Physical Addressing in Protected Mode
3.3.7 Address Calculations in 64-Bit Mode
3.3.7.1 Canonical Addressing
3.4 BASIC PROGRAM EXECUTION REGISTERS
3.4.1 General-Purpose Registers
3.4.1.1 General-Purpose Registers in 64-Bit Mode
3.4.2 Segment Registers
3.4.2.1 Segment Registers in 64-Bit Mode
3.4.3 EFLAGS Register
3.4.3.1 Status Flags
3.4.3.2 DF Flag
3.4.3.3 System Flags and IOPL Field
3.4.3.4 RFLAGS Register in 64-Bit Mode
3.5 INSTRUCTION POINTER
3.5.1 Instruction Pointer in 64-Bit Mode
3.6 OPERAND-SIZE AND ADDRESS-SIZE ATTRIBUTES
3.6.1 Operand Size and Address Size in 64-Bit Mode
3.7 OPERAND ADDRESSING
3.7.1 Immediate Operands
3.7.2 Register Operands
3.7.2.1 Register Operands in 64-Bit Mode
3.7.3 Memory Operands
3.7.3.1 Memory Operands in 64-Bit Mode
3.7.4 Specifying a Segment Selector
3.7.4.1 Segmentation in 64-Bit Mode
3.7.5 Specifying an Offset
3.7.5.1 Specifying an Offset in 64-Bit Mode
3.7.6 Assembler and Compiler Addressing Modes
3.7.7 I/O Port Addressing

CHAPTER 4 DATA TYPES
4.1 FUNDAMENTAL DATA TYPES
4.1.1 Alignment of Words, Doublewords, Quadwords, and Double Quadwords
4.2 NUMERIC DATA TYPES
4.2.1 Integers
4.2.1.1 Unsigned Integers
4.2.1.2 Signed Integers
4.2.2 Floating-Point Data Types
4.3 POINTER DATA TYPES
4.3.1 Pointer Data Types in 64-Bit Mode
4.4 BIT FIELD DATA TYPE
4.5 STRING DATA TYPES
4.6 PACKED SIMD DATA TYPES
4.6.1 64-Bit SIMD Packed Data Types
4.6.2 128-Bit Packed SIMD Data Types
4.7 BCD AND PACKED BCD INTEGERS
4.8 REAL NUMBERS AND FLOATING-POINT FORMATS
4.8.1 Real Number System
4.8.2 Floating-Point Format
4.8.2.1 Normalized Numbers
4.8.2.2 Biased Exponent
4.8.3 Real Number and Non-number Encodings
4.8.3.1 Signed Zeros
4.8.3.2 Normalized and Denormalized Finite Numbers
4.8.3.3 Signed Infinities
4.8.3.4 NaNs
4.8.3.5 Operating on SNaNs and QNaNs
4.8.3.6 Using SNaNs and QNaNs in Applications
4.8.3.7 QNaN Floating-Point Indefinite
4.8.4 Rounding
4.8.4.1 Rounding Control (RC) Fields
4.8.4.2 Truncation with SSE and SSE2 Conversion Instructions
4.9 OVERVIEW OF FLOATING-POINT EXCEPTIONS
4.9.1 Floating-Point Exception Conditions
4.9.1.1 Invalid Operation Exception (#I)
4.9.1.2 Denormal Operand Exception (#D)
4.9.1.3 Divide-By-Zero Exception (#Z)
4.9.1.4 Numeric Overflow Exception (#O)
4.9.1.5 Numeric Underflow Exception (#U)
4.9.1.6 Inexact-Result (Precision) Exception (#P)
4.9.2 Floating-Point Exception Priority
4.9.3 Typical Actions of a Floating-Point Exception Handler

CHAPTER 5 INSTRUCTION SET SUMMARY
5.1 GENERAL-PURPOSE INSTRUCTIONS

5.1.1 Data Transfer Instructions
5.1.2 Binary Arithmetic Instructions
5.1.3 Decimal Arithmetic Instructions
5.1.4 Logical Instructions
5.1.5 Shift and Rotate Instructions
5.1.6 Bit and Byte Instructions
5.1.7 Control Transfer Instructions
5.1.8 String Instructions
5.1.9 I/O Instructions
5.1.10 Enter and Leave Instructions
5.1.11 Flag Control (EFLAG) Instructions
5.1.12 Segment Register Instructions
5.1.13 Miscellaneous Instructions
5.2 X87 FPU INSTRUCTIONS
5.2.1 x87 FPU Data Transfer Instructions
5.2.2 x87 FPU Basic Arithmetic Instructions
5.2.3 x87 FPU Comparison Instructions
5.2.4 x87 FPU Transcendental Instructions
5.2.5 x87 FPU Load Constants Instructions
5.2.6 x87 FPU Control Instructions
5.3 X87 FPU AND SIMD STATE MANAGEMENT INSTRUCTIONS
5.4 MMX™ INSTRUCTIONS
5.4.1 MMX Data Transfer Instructions
5.4.2 MMX Conversion Instructions
5.4.3 MMX Packed Arithmetic Instructions
5.4.4 MMX Comparison Instructions
5.4.5 MMX Logical Instructions
5.4.6 MMX Shift and Rotate Instructions
5.4.7 MMX State Management Instructions
5.5 SSE INSTRUCTIONS
5.5.1 SSE SIMD Single-Precision Floating-Point Instructions
5.5.1.1 SSE Data Transfer Instructions
5.5.1.2 SSE Packed Arithmetic Instructions
5.5.1.3 SSE Comparison Instructions
5.5.1.4 SSE Logical Instructions
5.5.1.5 SSE Shuffle and Unpack Instructions
5.5.1.6 SSE Conversion Instructions
5.5.2 SSE MXCSR State Management Instructions
5.5.3 SSE 64-Bit SIMD Integer Instructions
5.5.4 SSE Cacheability Control, Prefetch, and Instruction Ordering Instructions
5.6 SSE2 INSTRUCTIONS
5.6.1 SSE2 Packed and Scalar Double-Precision Floating-Point Instructions
5.6.1.1 SSE2 Data Movement Instructions
5.6.1.2 SSE2 Packed Arithmetic Instructions
5.6.1.3 SSE2 Logical Instructions
5.6.1.4 SSE2 Compare Instructions
5.6.1.5 SSE2 Shuffle and Unpack Instructions
5.6.1.6 SSE2 Conversion Instructions
5.6.2 SSE2 Packed Single-Precision Floating-Point Instructions
5.6.3 SSE2 128-Bit SIMD Integer Instructions
5.6.4 SSE2 Cacheability Control and Ordering Instructions
5.7 SSE3 INSTRUCTIONS
5.7.1 SSE3 x87-FP Integer Conversion Instruction
5.7.2 SSE3 Specialized 128-bit Unaligned Data Load Instruction
5.7.3 SSE3 SIMD Floating-Point Packed ADD/SUB Instructions
5.7.4 SSE3 SIMD Floating-Point Horizontal ADD/SUB Instructions
5.7.5 SSE3 SIMD Floating-Point LOAD/MOVE/DUPLICATE Instructions
5.7.6 SSE3 Agent Synchronization Instructions
5.8 SUPPLEMENTAL STREAMING SIMD EXTENSIONS 3 (SSSE3) INSTRUCTIONS
5.8.1 Horizontal Addition/Subtraction
5.8.2 Packed Absolute Values
5.8.3 Multiply and Add Packed Signed and Unsigned Bytes
5.8.4 Packed Multiply High with Round and Scale
5.8.5 Packed Shuffle Bytes
5.8.6 Packed Sign
5.8.7 Packed Align Right
5.9 SYSTEM INSTRUCTIONS
5.10 64-BIT MODE INSTRUCTIONS
5.11 VIRTUAL-MACHINE EXTENSIONS

CHAPTER 6 PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS
6.1 PROCEDURE CALL TYPES . . . 6-1
6.2 STACKS . . . 6-1
6.2.1 Setting Up a Stack . . . 6-2
6.2.2 Stack Alignment . . . 6-3
6.2.3 Address-Size Attributes for Stack Accesses . . . 6-3
6.2.4 Procedure Linking Information . . . 6-4
6.2.4.1 Stack-Frame Base Pointer . . . 6-4
6.2.4.2 Return Instruction Pointer . . . 6-4
6.2.5 Stack Behavior in 64-Bit Mode . . . 6-5
6.3 CALLING PROCEDURES USING CALL AND RET . . . 6-5
6.3.1 Near CALL and RET Operation . . . 6-5
6.3.2 Far CALL and RET Operation . . . 6-6
6.3.3 Parameter Passing . . . 6-7
6.3.3.1 Passing Parameters Through the General-Purpose Registers . . . 6-7
6.3.3.2 Passing Parameters on the Stack . . . 6-7
6.3.3.3 Passing Parameters in an Argument List . . . 6-8
6.3.4 Saving Procedure State Information . . . 6-8
6.3.5 Calls to Other Privilege Levels . . . 6-8
6.3.6 CALL and RET Operation Between Privilege Levels . . . 6-10
6.3.7 Branch Functions in 64-Bit Mode . . . 6-11
6.4 INTERRUPTS AND EXCEPTIONS . . . 6-13
6.4.1 Call and Return Operation for Interrupt or Exception Handling Procedures . . . 6-14


6.4.2 Calls to Interrupt or Exception Handler Tasks . . . 6-17
6.4.3 Interrupt and Exception Handling in Real-Address Mode . . . 6-17
6.4.4 INT n, INTO, INT 3, and BOUND Instructions . . . 6-18
6.4.5 Handling Floating-Point Exceptions . . . 6-18
6.4.6 Interrupt and Exception Behavior in 64-Bit Mode . . . 6-19
6.5 PROCEDURE CALLS FOR BLOCK-STRUCTURED LANGUAGES . . . 6-19
6.5.1 ENTER Instruction . . . 6-20
6.5.2 LEAVE Instruction . . . 6-26

CHAPTER 7 PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS
7.1 PROGRAMMING ENVIRONMENT FOR GP INSTRUCTIONS . . . 7-1
7.2 PROGRAMMING ENVIRONMENT FOR GP INSTRUCTIONS IN 64-BIT MODE . . . 7-2
7.3 SUMMARY OF GP INSTRUCTIONS . . . 7-3
7.3.1 Data Transfer Instructions . . . 7-3
7.3.1.1 General Data Movement Instructions . . . 7-3
7.3.1.2 Exchange Instructions . . . 7-5
7.3.1.3 Exchange Instructions in 64-Bit Mode . . . 7-7
7.3.1.4 Stack Manipulation Instructions . . . 7-7
7.3.1.5 Stack Manipulation Instructions in 64-Bit Mode . . . 7-9
7.3.1.6 Type Conversion Instructions . . . 7-9
7.3.1.7 Type Conversion Instructions in 64-Bit Mode . . . 7-10
7.3.2 Binary Arithmetic Instructions . . . 7-10
7.3.2.1 Addition and Subtraction Instructions . . . 7-11
7.3.2.2 Increment and Decrement Instructions . . . 7-11
7.3.2.3 Increment and Decrement Instructions in 64-Bit Mode . . . 7-11
7.3.2.4 Comparison and Sign Change Instruction . . . 7-11
7.3.2.5 Multiplication and Divide Instructions . . . 7-12
7.3.3 Decimal Arithmetic Instructions . . . 7-12
7.3.3.1 Packed BCD Adjustment Instructions . . . 7-12
7.3.3.2 Unpacked BCD Adjustment Instructions . . . 7-13
7.3.4 Decimal Arithmetic Instructions in 64-Bit Mode . . . 7-14
7.3.5 Logical Instructions . . . 7-14
7.3.6 Shift and Rotate Instructions . . . 7-14
7.3.6.1 Shift Instructions . . . 7-14
7.3.6.2 Double-Shift Instructions . . . 7-16
7.3.6.3 Rotate Instructions . . . 7-17
7.3.7 Bit and Byte Instructions . . . 7-19
7.3.7.1 Bit Test and Modify Instructions . . . 7-19
7.3.7.2 Bit Scan Instructions . . . 7-19
7.3.7.3 Byte Set on Condition Instructions . . . 7-19
7.3.7.4 Test Instruction . . . 7-20
7.3.8 Control Transfer Instructions . . . 7-20
7.3.8.1 Unconditional Transfer Instructions . . . 7-20
7.3.8.2 Conditional Transfer Instructions . . . 7-22
7.3.8.3 Control Transfer Instructions in 64-Bit Mode . . . 7-24
7.3.8.4 Software Interrupt Instructions . . . 7-24


7.3.8.5 Software Interrupt Instructions in 64-bit Mode and Compatibility Mode . . . 7-25
7.3.9 String Operations . . . 7-25
7.3.9.1 Repeating String Operations . . . 7-26
7.3.10 String Operations in 64-Bit Mode . . . 7-27
7.3.10.1 Repeating String Operations in 64-bit Mode . . . 7-27
7.3.11 I/O Instructions . . . 7-27
7.3.12 I/O Instructions in 64-Bit Mode . . . 7-28
7.3.13 Enter and Leave Instructions . . . 7-28
7.3.14 Flag Control (EFLAG) Instructions . . . 7-28
7.3.14.1 Carry and Direction Flag Instructions . . . 7-28
7.3.14.2 EFLAGS Transfer Instructions . . . 7-29
7.3.14.3 Interrupt Flag Instructions . . . 7-30
7.3.15 Flag Control (RFLAG) Instructions in 64-Bit Mode . . . 7-30
7.3.16 Segment Register Instructions . . . 7-30
7.3.16.1 Segment-Register Load and Store Instructions . . . 7-30
7.3.16.2 Far Control Transfer Instructions . . . 7-31
7.3.16.3 Software Interrupt Instructions . . . 7-31
7.3.16.4 Load Far Pointer Instructions . . . 7-31
7.3.17 Miscellaneous Instructions . . . 7-31
7.3.17.1 Address Computation Instruction . . . 7-32
7.3.17.2 Table Lookup Instructions . . . 7-32
7.3.17.3 Processor Identification Instruction . . . 7-32
7.3.17.4 No-Operation and Undefined Instructions . . . 7-32

CHAPTER 8 PROGRAMMING WITH THE X87 FPU
8.1 X87 FPU EXECUTION ENVIRONMENT . . . 8-1
8.1.1 x87 FPU in 64-Bit Mode and Compatibility Mode . . . 8-2
8.1.2 x87 FPU Data Registers . . . 8-2
8.1.2.1 Parameter Passing With the x87 FPU Register Stack . . . 8-5
8.1.3 x87 FPU Status Register . . . 8-6
8.1.3.1 Top of Stack (TOP) Pointer . . . 8-6
8.1.3.2 Condition Code Flags . . . 8-6
8.1.3.3 x87 FPU Floating-Point Exception Flags . . . 8-7
8.1.3.4 Stack Fault Flag . . . 8-9
8.1.4 Branching and Conditional Moves on Condition Codes . . . 8-9
8.1.5 x87 FPU Control Word . . . 8-10
8.1.5.1 x87 FPU Floating-Point Exception Mask Bits . . . 8-11
8.1.5.2 Precision Control Field . . . 8-11
8.1.5.3 Rounding Control Field . . . 8-12
8.1.6 Infinity Control Flag . . . 8-12
8.1.7 x87 FPU Tag Word . . . 8-12
8.1.8 x87 FPU Instruction and Data (Operand) Pointers . . . 8-13
8.1.9 Last Instruction Opcode . . . 8-14
8.1.9.1 Fopcode Compatibility Sub-mode . . . 8-14
8.1.10 Saving the x87 FPU’s State with FSTENV/FNSTENV and FSAVE/FNSAVE . . . 8-15
8.1.11 Saving the x87 FPU’s State with FXSAVE . . . 8-17


8.2 X87 FPU DATA TYPES . . . 8-18
8.2.1 Indefinites . . . 8-19
8.2.2 Unsupported Double Extended-Precision Floating-Point Encodings and Pseudo-Denormals . . . 8-20
8.3 X87 FPU INSTRUCTION SET . . . 8-22
8.3.1 Escape (ESC) Instructions . . . 8-22
8.3.2 x87 FPU Instruction Operands . . . 8-22
8.3.3 Data Transfer Instructions . . . 8-22
8.3.4 Load Constant Instructions . . . 8-24
8.3.5 Basic Arithmetic Instructions . . . 8-25
8.3.6 Comparison and Classification Instructions . . . 8-26
8.3.6.1 Branching on the x87 FPU Condition Codes . . . 8-28
8.3.7 Trigonometric Instructions . . . 8-29
8.3.8 Pi . . . 8-30
8.3.9 Logarithmic, Exponential, and Scale . . . 8-31
8.3.10 Transcendental Instruction Accuracy . . . 8-31
8.3.11 x87 FPU Control Instructions . . . 8-32
8.3.12 Waiting vs. Non-waiting Instructions . . . 8-33
8.3.13 Unsupported x87 FPU Instructions . . . 8-34
8.4 X87 FPU FLOATING-POINT EXCEPTION HANDLING . . . 8-34
8.4.1 Arithmetic vs. Non-arithmetic Instructions . . . 8-35
8.5 X87 FPU FLOATING-POINT EXCEPTION CONDITIONS . . . 8-36
8.5.1 Invalid Operation Exception . . . 8-36
8.5.1.1 Stack Overflow or Underflow Exception (#IS) . . . 8-37
8.5.1.2 Invalid Arithmetic Operand Exception (#IA) . . . 8-38
8.5.2 Denormal Operand Exception (#D) . . . 8-39
8.5.3 Divide-By-Zero Exception (#Z) . . . 8-40
8.5.4 Numeric Overflow Exception (#O) . . . 8-40
8.5.5 Numeric Underflow Exception (#U) . . . 8-41
8.5.6 Inexact-Result (Precision) Exception (#P) . . . 8-42
8.6 X87 FPU EXCEPTION SYNCHRONIZATION . . . 8-43
8.7 HANDLING X87 FPU EXCEPTIONS IN SOFTWARE . . . 8-45
8.7.1 Native Mode . . . 8-45
8.7.2 MS-DOS* Compatibility Sub-mode . . . 8-45
8.7.3 Handling x87 FPU Exceptions in Software . . . 8-46

CHAPTER 9 PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY
9.1 OVERVIEW OF MMX TECHNOLOGY . . . 9-1
9.2 THE MMX TECHNOLOGY PROGRAMMING ENVIRONMENT . . . 9-2
9.2.1 MMX Technology in 64-Bit Mode and Compatibility Mode . . . 9-2
9.2.2 MMX Registers . . . 9-3
9.2.3 MMX Data Types . . . 9-4
9.2.4 Memory Data Formats . . . 9-4
9.2.5 Single Instruction, Multiple Data (SIMD) Execution Model . . . 9-4
9.3 SATURATION AND WRAPAROUND MODES . . . 9-5
9.4 MMX INSTRUCTIONS . . . 9-6


9.4.1 Data Transfer Instructions . . . 9-8
9.4.2 Arithmetic Instructions . . . 9-8
9.4.3 Comparison Instructions . . . 9-9
9.4.4 Conversion Instructions . . . 9-9
9.4.5 Unpack Instructions . . . 9-9
9.4.6 Logical Instructions . . . 9-10
9.4.7 Shift Instructions . . . 9-10
9.4.8 EMMS Instruction . . . 9-10
9.5 COMPATIBILITY WITH X87 FPU ARCHITECTURE . . . 9-10
9.5.1 MMX Instructions and the x87 FPU Tag Word . . . 9-11
9.6 WRITING APPLICATIONS WITH MMX CODE . . . 9-11
9.6.1 Checking for MMX Technology Support . . . 9-11
9.6.2 Transitions Between x87 FPU and MMX Code . . . 9-12
9.6.3 Using the EMMS Instruction . . . 9-12
9.6.4 Mixing MMX and x87 FPU Instructions . . . 9-13
9.6.5 Interfacing with MMX Code . . . 9-13
9.6.6 Using MMX Code in a Multitasking Operating System Environment . . . 9-14
9.6.7 Exception Handling in MMX Code . . . 9-14
9.6.8 Register Mapping . . . 9-14
9.6.9 Effect of Instruction Prefixes on MMX Instructions . . . 9-15

CHAPTER 10 PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)
10.1 OVERVIEW OF SSE EXTENSIONS . . . 10-1
10.2 SSE PROGRAMMING ENVIRONMENT . . . 10-3
10.2.1 SSE in 64-Bit Mode and Compatibility Mode . . . 10-4
10.2.2 XMM Registers . . . 10-4
10.2.3 MXCSR Control and Status Register . . . 10-5
10.2.3.1 SIMD Floating-Point Mask and Flag Bits . . . 10-6
10.2.3.2 SIMD Floating-Point Rounding Control Field . . . 10-6
10.2.3.3 Flush-To-Zero . . . 10-7
10.2.3.4 Denormals-Are-Zeros . . . 10-7
10.2.4 Compatibility of SSE Extensions with SSE2/SSE3/MMX and the x87 FPU . . . 10-8
10.3 SSE DATA TYPES . . . 10-8
10.4 SSE INSTRUCTION SET . . . 10-9
10.4.1 SSE Packed and Scalar Floating-Point Instructions . . . 10-9
10.4.1.1 SSE Data Movement Instructions . . . 10-10
10.4.1.2 SSE Arithmetic Instructions . . . 10-11
10.4.2 SSE Logical Instructions . . . 10-13
10.4.2.1 SSE Comparison Instructions . . . 10-13
10.4.2.2 SSE Shuffle and Unpack Instructions . . . 10-14
10.4.3 SSE Conversion Instructions . . . 10-15
10.4.4 SSE 64-Bit SIMD Integer Instructions . . . 10-16
10.4.5 MXCSR State Management Instructions . . . 10-17
10.4.6 Cacheability Control, Prefetch, and Memory Ordering Instructions . . . 10-17
10.4.6.1 Cacheability Control Instructions . . . 10-18
10.4.6.2 Caching of Temporal vs. Non-Temporal Data . . . 10-18


10.4.6.3 PREFETCHh Instructions . . . 10-19
10.4.6.4 SFENCE Instruction . . . 10-20
10.5 FXSAVE AND FXRSTOR INSTRUCTIONS . . . 10-20
10.6 HANDLING SSE INSTRUCTION EXCEPTIONS . . . 10-21
10.7 WRITING APPLICATIONS WITH THE SSE EXTENSIONS . . . 10-21

CHAPTER 11 PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2)
11.1 OVERVIEW OF SSE2 EXTENSIONS . . . 11-1
11.2 SSE2 PROGRAMMING ENVIRONMENT . . . 11-3
11.2.1 SSE2 in 64-Bit Mode and Compatibility Mode . . . 11-4
11.2.2 Compatibility of SSE2 Extensions with SSE, MMX Technology and x87 FPU Programming Environment . . . 11-4
11.2.3 Denormals-Are-Zeros Flag . . . 11-5
11.3 SSE2 DATA TYPES . . . 11-5
11.4 SSE2 INSTRUCTIONS . . . 11-6
11.4.1 Packed and Scalar Double-Precision Floating-Point Instructions . . . 11-6
11.4.1.1 Data Movement Instructions . . . 11-8
11.4.1.2 SSE2 Arithmetic Instructions . . . 11-8
11.4.1.3 SSE2 Logical Instructions . . . 11-9
11.4.1.4 SSE2 Comparison Instructions . . . 11-10
11.4.1.5 SSE2 Shuffle and Unpack Instructions . . . 11-10
11.4.1.6 SSE2 Conversion Instructions . . . 11-12
11.4.2 SSE2 64-Bit and 128-Bit SIMD Integer Instructions . . . 11-15
11.4.3 128-Bit SIMD Integer Instruction Extensions . . . 11-16
11.4.4 Cacheability Control and Memory Ordering Instructions . . . 11-16
11.4.4.1 FLUSH Cache Line . . . 11-17
11.4.4.2 Cacheability Control Instructions . . . 11-17
11.4.4.3 Memory Ordering Instructions . . . 11-17
11.4.4.4 Pause . . . 11-18
11.4.5 Branch Hints . . . 11-18
11.5 SSE, SSE2, AND SSE3 EXCEPTIONS . . . 11-18
11.5.1 SIMD Floating-Point Exceptions . . . 11-19
11.5.2 SIMD Floating-Point Exception Conditions . . . 11-19
11.5.2.1 Invalid Operation Exception (#I) . . . 11-20
11.5.2.2 Denormal-Operand Exception (#D) . . . 11-21
11.5.2.3 Divide-By-Zero Exception (#Z) . . . 11-22
11.5.2.4 Numeric Overflow Exception (#O) . . . 11-22
11.5.2.5 Numeric Underflow Exception (#U) . . . 11-22
11.5.2.6 Inexact-Result (Precision) Exception (#P) . . . 11-23
11.5.3 Generating SIMD Floating-Point Exceptions . . . 11-23
11.5.3.1 Handling Masked Exceptions . . . 11-23
11.5.3.2 Handling Unmasked Exceptions . . . 11-25
11.5.3.3 Handling Combinations of Masked and Unmasked Exceptions . . . 11-26
11.5.4 Handling SIMD Floating-Point Exceptions in Software . . . 11-26
11.5.5 Interaction of SIMD and x87 FPU Floating-Point Exceptions . . . 11-26
11.6 WRITING APPLICATIONS WITH SSE/SSE2 EXTENSIONS . . . 11-27


11.6.1 General Guidelines for Using SSE/SSE2 Extensions . . . 11-27
11.6.2 Checking for SSE/SSE2 Support . . . 11-28
11.6.3 Checking for the DAZ Flag in the MXCSR Register . . . 11-28
11.6.4 Initialization of SSE/SSE2 Extensions . . . 11-29
11.6.5 Saving and Restoring the SSE/SSE2 State . . . 11-30
11.6.6 Guidelines for Writing to the MXCSR Register . . . 11-30
11.6.7 Interaction of SSE/SSE2 Instructions with x87 FPU and MMX Instructions . . . 11-31
11.6.8 Compatibility of SIMD and x87 FPU Floating-Point Data Types . . . 11-32
11.6.9 Mixing Packed and Scalar Floating-Point and 128-Bit SIMD Integer Instructions and Data . . . 11-32
11.6.10 Interfacing with SSE/SSE2 Procedures and Functions . . . 11-34
11.6.10.1 Passing Parameters in XMM Registers . . . 11-34
11.6.10.2 Saving XMM Register State on a Procedure or Function Call . . . 11-34
11.6.10.3 Caller-Save Requirement for Procedure and Function Calls . . . 11-35
11.6.11 Updating Existing MMX Technology Routines Using 128-Bit SIMD Integer Instructions . . . 11-35
11.6.12 Branching on Arithmetic Operations . . . 11-36
11.6.13 Cacheability Hint Instructions . . . 11-36
11.6.14 Effect of Instruction Prefixes on the SSE/SSE2 Instructions . . . 11-37

CHAPTER 12 PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3
12.1 SSE3/SSSE3 PROGRAMMING ENVIRONMENT AND DATA TYPES . . . 12-1
12.1.1 SSE3/SSSE3 in 64-Bit Mode and Compatibility Mode . . . 12-1
12.1.2 Compatibility of SSE3/SSSE3 with MMX Technology, the x87 FPU Environment, and SSE/SSE2 Extensions . . . 12-2
12.1.3 Horizontal and Asymmetric Processing . . . 12-2
12.2 OVERVIEW OF SSE3 INSTRUCTIONS . . . 12-3
12.3 SSE3 INSTRUCTIONS . . . 12-3
12.3.1 x87 FPU Instruction for Integer Conversion . . . 12-4
12.3.2 SIMD Integer Instruction for Specialized 128-bit Unaligned Data Load . . . 12-4
12.3.3 SIMD Floating-Point Instructions That Enhance LOAD/MOVE/DUPLICATE Performance . . . 12-4
12.3.4 SIMD Floating-Point Instructions Provide Packed Addition/Subtraction . . . 12-5
12.3.5 SIMD Floating-Point Instructions Provide Horizontal Addition/Subtraction . . . 12-5
12.3.6 Two Thread Synchronization Instructions . . . 12-7
12.4 WRITING APPLICATIONS WITH SSE3 EXTENSIONS . . . 12-7
12.4.1 Guidelines for Using SSE3 Extensions . . . 12-7
12.4.2 Checking for SSE3 Support . . . 12-7
12.4.3 Enable FTZ and DAZ for SIMD Floating-Point Computation . . . 12-9
12.4.4 Programming SSE3 with SSE/SSE2 Extensions . . . 12-9
12.5 OVERVIEW OF SSSE3 INSTRUCTIONS . . . 12-9
12.6 SSSE3 INSTRUCTIONS . . . 12-10
12.6.1 Horizontal Addition/Subtraction . . . 12-10
12.6.2 Packed Absolute Values . . . 12-11
12.6.3 Multiply and Add Packed Signed and Unsigned Bytes . . . 12-12
12.6.4 Packed Multiply High with Round and Scale . . . 12-12


12.6.5 Packed Shuffle Bytes . . . 12-12
12.6.6 Packed Sign . . . 12-13
12.6.7 Packed Align Right . . . 12-13
12.7 WRITING APPLICATIONS WITH SSSE3 EXTENSIONS . . . 12-13
12.7.1 Guidelines for Using SSSE3 Extensions . . . 12-13
12.7.2 Checking for SSSE3 Support . . . 12-14
12.8 SSE3/SSSE3 EXCEPTIONS . . . 12-14
12.8.1 Device Not Available (DNA) Exceptions . . . 12-14
12.8.2 Numeric Error flag and IGNNE# . . . 12-15
12.8.3 Emulation . . . 12-15

CHAPTER 13 INPUT/OUTPUT
13.1 I/O PORT ADDRESSING . . . 13-1
13.2 I/O PORT HARDWARE . . . 13-1
13.3 I/O ADDRESS SPACE . . . 13-2
13.3.1 Memory-Mapped I/O . . . 13-2
13.4 I/O INSTRUCTIONS . . . 13-3
13.5 PROTECTED-MODE I/O . . . 13-4
13.5.1 I/O Privilege Level . . . 13-4
13.5.2 I/O Permission Bit Map . . . 13-5
13.6 ORDERING I/O . . . 13-7

CHAPTER 14 PROCESSOR IDENTIFICATION AND FEATURE DETERMINATION
14.1 USING THE CPUID INSTRUCTION . . . 14-1
14.1.1 Notes on Where to Start . . . 14-1
14.1.2 Identification of Earlier IA-32 Processors . . . 14-2

APPENDIX A EFLAGS CROSS-REFERENCE
A.1 EFLAGS AND INSTRUCTIONS . . . A-1

APPENDIX B EFLAGS CONDITION CODES
B.1 CONDITION CODES . . . B-1

APPENDIX C FLOATING-POINT EXCEPTIONS SUMMARY
C.1 OVERVIEW . . . C-1
C.2 X87 FPU INSTRUCTIONS . . . C-2
C.3 SSE INSTRUCTIONS . . . C-4
C.4 SSE2 INSTRUCTIONS . . . C-7
C.5 SSE3 INSTRUCTIONS . . . C-11
C.6 SSSE3 INSTRUCTIONS . . . C-12


APPENDIX D GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS
D.1 MS-DOS COMPATIBILITY SUB-MODE FOR HANDLING X87 FPU EXCEPTIONS . . . D-2
D.2 IMPLEMENTATION OF THE MS-DOS COMPATIBILITY SUB-MODE IN THE INTEL486, PENTIUM, AND P6 PROCESSOR FAMILY, AND PENTIUM 4 PROCESSORS . . . D-3
D.2.1 MS-DOS Compatibility Sub-mode in the Intel486 and Pentium Processors . . . D-3
D.2.1.1 Basic Rules: When FERR# Is Generated . . . D-4
D.2.1.2 Recommended External Hardware to Support the MS-DOS Compatibility Sub-mode . . . D-5
D.2.1.3 No-Wait x87 FPU Instructions Can Get x87 FPU Interrupt in Window . . . D-8
D.2.2 MS-DOS Compatibility Sub-mode in the P6 Family and Pentium 4 Processors . . . D-10
D.3 RECOMMENDED PROTOCOL FOR MS-DOS* COMPATIBILITY HANDLERS . . . D-11
D.3.1 Floating-Point Exceptions and Their Defaults . . . D-12
D.3.2 Two Options for Handling Numeric Exceptions . . . D-12
D.3.2.1 Automatic Exception Handling: Using Masked Exceptions . . . D-12
D.3.2.2 Software Exception Handling . . . D-14
D.3.3 Synchronization Required for Use of x87 FPU Exception Handlers . . . D-15
D.3.3.1 Exception Synchronization: What, Why and When . . . D-16
D.3.3.2 Exception Synchronization Examples . . . D-17
D.3.3.3 Proper Exception Synchronization . . . D-17
D.3.4 x87 FPU Exception Handling Examples . . . D-18
D.3.5 Need for Storing State of IGNNE# Circuit If Using x87 FPU and SMM . . . D-22
D.3.6 Considerations When x87 FPU Shared Between Tasks . . . D-23
D.3.6.1 Speculatively Deferring x87 FPU Saves, General Overview . . . D-24
D.3.6.2 Tracking x87 FPU Ownership . . . D-25
D.3.6.3 Interaction of x87 FPU State Saves and Floating-Point Exception Association . . . D-25
D.3.6.4 Interrupt Routing From the Kernel . . . D-28
D.3.6.5 Special Considerations for Operating Systems that Support Streaming SIMD Extensions . . . D-28
D.4 DIFFERENCES FOR HANDLERS USING NATIVE MODE . . . D-29
D.4.1 Origin with the Intel 286 and Intel 287, and Intel386 and Intel 387 Processors . . . D-29
D.4.2 Changes with Intel486, Pentium and Pentium Pro Processors with CR0.NE[bit 5] = 1 . . . D-30
D.4.3 Considerations When x87 FPU Shared Between Tasks Using Native Mode . . . D-30

APPENDIX E GUIDELINES FOR WRITING SIMD FLOATING-POINT EXCEPTION HANDLERS
E.1 TWO OPTIONS FOR HANDLING FLOATING-POINT EXCEPTIONS . . . E-1
E.2 SOFTWARE EXCEPTION HANDLING . . . E-1
E.3 EXCEPTION SYNCHRONIZATION . . . E-3
E.4 SIMD FLOATING-POINT EXCEPTIONS AND THE IEEE STANDARD 754 . . . E-4
E.4.1 Floating-Point Emulation . . . E-4
E.4.2 SSE/SSE2/SSE3 Response To Floating-Point Exceptions . . . E-6
E.4.2.1 Numeric Exceptions . . . E-7


E.4.2.2 Results of Operations with NaN Operands or a NaN Result for SSE/SSE2/SSE3 Numeric Instructions . . . E-7
E.4.2.3 Condition Codes, Exception Flags, and Response for Masked and Unmasked Numeric Exceptions . . . E-12
E.4.3 Example SIMD Floating-Point Emulation Implementation . . . E-21


FIGURES
Figure 1-1. Bit and Byte Order . . . 1-4
Figure 1-2. Syntax for CPUID, CR, and MSR Data Presentation . . . 1-7
Figure 2-1. The P6 Processor Microarchitecture with Advanced Transfer Cache Enhancement . . . 2-7
Figure 2-2. The Intel NetBurst Microarchitecture . . . 2-10
Figure 2-3. The Intel Core Microarchitecture Pipeline Functionality . . . 2-13
Figure 2-4. SIMD Extensions, Register Layouts, and Data Types . . . 2-16
Figure 2-5. Comparison of an IA-32 Processor Supporting Hyper-Threading Technology and a Traditional Dual Processor System . . . 2-17
Figure 2-6. IA-32 Processors that Support Dual-Core . . . 2-19
Figure 3-1. IA-32 Basic Execution Environment for Non-64-bit Modes . . . 3-4
Figure 3-2. 64-Bit Mode Execution Environment . . . 3-7
Figure 3-3. Three Memory Management Models . . . 3-9
Figure 3-4. General System and Application Programming Registers . . . 3-15
Figure 3-5. Alternate General-Purpose Register Names . . . 3-16
Figure 3-6. Use of Segment Registers for Flat Memory Model . . . 3-18
Figure 3-7. Use of Segment Registers in Segmented Memory Model . . . 3-19
Figure 3-8. EFLAGS Register . . . 3-21
Figure 3-9. Memory Operand Address . . . 3-28
Figure 3-10. Memory Operand Address in 64-Bit Mode . . . 3-29
Figure 3-11. Offset (or Effective Address) Computation . . . 3-31
Figure 4-1. Fundamental Data Types . . . 4-1
Figure 4-2. Bytes, Words, Doublewords, Quadwords, and Double Quadwords in Memory . . . 4-2
Figure 4-3. Numeric Data Types . . . 4-4
Figure 4-4. Pointer Data Types . . . 4-9
Figure 4-5. Pointers in 64-Bit Mode . . . 4-10
Figure 4-6. Bit Field Data Type . . . 4-10
Figure 4-7. 64-Bit Packed SIMD Data Types . . . 4-11
Figure 4-8. 128-Bit Packed SIMD Data Types . . . 4-12
Figure 4-9. BCD Data Types . . . 4-13
Figure 4-10. Binary Real Number System . . . 4-16
Figure 4-11. Binary Floating-Point Format . . . 4-16
Figure 4-12. Real Numbers and NaNs . . . 4-18
Figure 6-1. Stack Structure . . . 6-2
Figure 6-2. Stack on Near and Far Calls . . . 6-7
Figure 6-3. Protection Rings . . . 6-9
Figure 6-4. Stack Switch on a Call to a Different Privilege Level . . . 6-10
Figure 6-5. Stack Usage on Transfers to Interrupt and Exception Handling Routines . . . 6-16
Figure 6-6. Nested Procedures . . . 6-22
Figure 6-7. Stack Frame After Entering the MAIN Procedure . . . 6-23
Figure 6-8. Stack Frame After Entering Procedure A . . . 6-23
Figure 6-9. Stack Frame After Entering Procedure B . . . 6-24
Figure 6-10. Stack Frame After Entering Procedure C . . . 6-25
Figure 7-1. Operation of the PUSH Instruction . . . 7-7
Figure 7-2. Operation of the PUSHA Instruction . . . 7-8


Figure 7-3. Operation of the POP Instruction . . . 7-8
Figure 7-4. Operation of the POPA Instruction . . . 7-9
Figure 7-5. Sign Extension . . . 7-10
Figure 7-6. SHL/SAL Instruction Operation . . . 7-15
Figure 7-7. SHR Instruction Operation . . . 7-15
Figure 7-8. SAR Instruction Operation . . . 7-16
Figure 7-9. SHLD and SHRD Instruction Operations . . . 7-17
Figure 7-10. ROL, ROR, RCL, and RCR Instruction Operations . . . 7-18
Figure 7-11. Flags Affected by the PUSHF, POPF, PUSHFD, and POPFD Instructions . . . 7-29
Figure 8-1. x87 FPU Execution Environment . . . 8-3
Figure 8-2. x87 FPU Data Register Stack . . . 8-4
Figure 8-3. Example x87 FPU Dot Product Computation . . . 8-5
Figure 8-4. x87 FPU Status Word . . . 8-6
Figure 8-5. Moving the Condition Codes to the EFLAGS Register . . . 8-10
Figure 8-6. x87 FPU Control Word . . . 8-11
Figure 8-7. x87 FPU Tag Word . . . 8-13
Figure 8-8. Contents of x87 FPU Opcode Registers . . . 8-15
Figure 8-9. Protected Mode x87 FPU State Image in Memory, 32-Bit Format . . . 8-16
Figure 8-10. Real Mode x87 FPU State Image in Memory, 32-Bit Format . . . 8-16
Figure 8-11. Protected Mode x87 FPU State Image in Memory, 16-Bit Format . . . 8-17
Figure 8-12. Real Mode x87 FPU State Image in Memory, 16-Bit Format . . . 8-17
Figure 8-13. x87 FPU Data Type Formats . . . 8-19
Figure 9-1. MMX Technology Execution Environment . . . 9-2
Figure 9-2. MMX Register Set . . . 9-3
Figure 9-3. Data Types Introduced with the MMX Technology . . . 9-4
Figure 9-4. SIMD Execution Model . . . 9-5
Figure 10-1. SSE Execution Environment . . . 10-3
Figure 10-2. XMM Registers . . . 10-4
Figure 10-3. MXCSR Control/Status Register . . . 10-6
Figure 10-4. 128-Bit Packed Single-Precision Floating-Point Data Type . . . 10-8
Figure 10-5. Packed Single-Precision Floating-Point Operation . . . 10-10
Figure 10-6. Scalar Single-Precision Floating-Point Operation . . . 10-10
Figure 10-7. SHUFPS Instruction, Packed Shuffle Operation . . . 10-14
Figure 10-8. UNPCKHPS Instruction, High Unpack and Interleave Operation . . . 10-15
Figure 10-9. UNPCKLPS Instruction, Low Unpack and Interleave Operation . . . 10-15
Figure 11-1. Streaming SIMD Extensions 2 Execution Environment . . . 11-3
Figure 11-2. Data Types Introduced with the SSE2 Extensions . . . 11-5
Figure 11-3. Packed Double-Precision Floating-Point Operations . . . 11-7
Figure 11-4. Scalar Double-Precision Floating-Point Operations . . . 11-7
Figure 11-5. SHUFPD Instruction, Packed Shuffle Operation . . . 11-11
Figure 11-6. UNPCKHPD Instruction, High Unpack and Interleave Operation . . . 11-11
Figure 11-7. UNPCKLPD Instruction, Low Unpack and Interleave Operation . . . 11-12
Figure 11-8. SSE and SSE2 Conversion Instructions . . . 11-13
Figure 11-9. Example Masked Response for Packed Operations . . . 11-24
Figure 12-1. Asymmetric Processing in ADDSUBPD . . . 12-2
Figure 12-2. Horizontal Data Movement in HADDPD . . . 12-3
Figure 12-3. Horizontal Data Movement in PHADDD . . . 12-10


Figure 13-1. Memory-Mapped I/O . . . 13-3
Figure 13-2. I/O Permission Bit Map . . . 13-6
Figure D-1. Recommended Circuit for MS-DOS Compatibility x87 FPU Exception Handling . . . D-7
Figure D-2. Behavior of Signals During x87 FPU Exception Handling . . . D-8
Figure D-3. Timing of Receipt of External Interrupt . . . D-9
Figure D-4. Arithmetic Example Using Infinity . . . D-13
Figure D-5. General Program Flow for DNA Exception Handler . . . D-26
Figure D-6. Program Flow for a Numeric Exception Dispatch Routine . . . D-27
Figure E-1. Control Flow for Handling Unmasked Floating-Point Exceptions . . . E-6


TABLES
Table 2-1. Key Features of Most Recent IA-32 Processors . . . 2-21
Table 2-2. Key Features of Most Recent Intel 64 Processors . . . 2-21
Table 2-3. Key Features of Previous Generations of IA-32 Processors . . . 2-23
Table 3-1. Instruction Pointer Sizes . . . 3-12
Table 3-2. Addressable General Purpose Registers . . . 3-17
Table 3-3. Effective Operand- and Address-Size Attributes . . . 3-25
Table 3-4. Effective Operand- and Address-Size Attributes in 64-Bit Mode . . . 3-26
Table 3-5. Default Segment Selection Rules . . . 3-29
Table 4-1. Signed Integer Encodings . . . 4-6
Table 4-2. Length, Precision, and Range of Floating-Point Data Types . . . 4-7
Table 4-3. Floating-Point Number and NaN Encodings . . . 4-8
Table 4-4. Packed Decimal Integer Encodings . . . 4-14
Table 4-5. Real and Floating-Point Number Notation . . . 4-17
Table 4-6. Denormalization Process . . . 4-20
Table 4-7. Rules for Handling NaNs . . . 4-22
Table 4-8. Rounding Modes and Encoding of Rounding Control (RC) Field . . . 4-24
Table 4-9. Numeric Overflow Thresholds . . . 4-29
Table 4-10. Masked Responses to Numeric Overflow . . . 4-29
Table 4-11. Numeric Underflow (Normalized) Thresholds . . . 4-30
Table 5-1. Instruction Groups and IA-32 Processors . . . 5-1
Table 6-1. Exceptions and Interrupts . . . 6-14
Table 7-1. Move Instruction Operations . . . 7-4
Table 7-2. Conditional Move Instructions . . . 7-5
Table 7-3. Bit Test and Modify Instructions . . . 7-19
Table 7-4. Conditional Jump Instructions . . . 7-22
Table 8-1. Condition Code Interpretation . . . 8-8
Table 8-2. Precision Control Field (PC) . . . 8-12
Table 8-3. Unsupported Double Extended-Precision Floating-Point Encodings and Pseudo-Denormals . . . 8-21
Table 8-4. Data Transfer Instructions . . . 8-23
Table 8-5. Floating-Point Conditional Move Instructions . . . 8-24
Table 8-6. Setting of x87 FPU Condition Code Flags for Floating-Point Number Comparisons . . . 8-27
Table 8-7. Setting of EFLAGS Status Flags for Floating-Point Number Comparisons . . . 8-28
Table 8-8. TEST Instruction Constants for Conditional Branching . . . 8-29
Table 8-9. Arithmetic and Non-arithmetic Instructions . . . 8-35
Table 8-10. Invalid Arithmetic Operations and the Masked Responses to Them . . . 8-38
Table 8-11. Divide-By-Zero Conditions and the Masked Responses to Them . . . 8-40
Table 9-1. Data Range Limits for Saturation . . . 9-6
Table 9-2. MMX Instruction Set Summary . . . 9-7
Table 9-3. Effect of Prefixes on MMX Instructions . . . 9-15
Table 10-1. PREFETCHh Instructions Caching Hints . . . 10-20
Table 11-1. Masked Responses of SSE/SSE2/SSE3 Instructions to Invalid Arithmetic Operations . . . 11-20
Table 11-2. SSE and SSE2 State Following a Power-up/Reset or INIT . . . 11-30

Table 11-3. Effect of Prefixes on SSE, SSE2, and SSE3 Instructions
Table 13-1. I/O Instruction Serialization
Table A-1. Codes Describing Flags
Table A-2. EFLAGS Cross-Reference
Table B-1. EFLAGS Condition Codes
Table C-1. x87 FPU and SIMD Floating-Point Exceptions
Table C-2. Exceptions Generated with x87 FPU Floating-Point Instructions
Table C-3. Exceptions Generated with SSE Instructions
Table C-4. Exceptions Generated with SSE2 Instructions
Table C-5. Exceptions Generated with SSE3 Instructions
Table E-1. ADDPS, ADDSS, SUBPS, SUBSS, MULPS, MULSS, DIVPS, DIVSS, ADDPD, ADDSD, SUBPD, SUBSD, MULPD, MULSD, DIVPD, DIVSD, ADDSUBPS, ADDSUBPD, HADDPS, HADDPD, HSUBPS, HSUBPD
Table E-2. CMPPS.EQ, CMPSS.EQ, CMPPS.ORD, CMPSS.ORD, CMPPD.EQ, CMPSD.EQ, CMPPD.ORD, CMPSD.ORD
Table E-3. CMPPS.NEQ, CMPSS.NEQ, CMPPS.UNORD, CMPSS.UNORD, CMPPD.NEQ, CMPSD.NEQ, CMPPD.UNORD, CMPSD.UNORD
Table E-4. CMPPS.LT, CMPSS.LT, CMPPS.LE, CMPSS.LE, CMPPD.LT, CMPSD.LT, CMPPD.LE, CMPSD.LE
Table E-5. CMPPS.NLT, CMPSS.NLT, CMPPS.NLE, CMPSS.NLE, CMPPD.NLT, CMPSD.NLT, CMPPD.NLE, CMPSD.NLE
Table E-6. COMISS, COMISD
Table E-7. UCOMISS, UCOMISD
Table E-8. CVTPS2PI, CVTSS2SI, CVTTPS2PI, CVTTSS2SI, CVTPD2PI, CVTSD2SI, CVTTPD2PI, CVTTSD2SI, CVTPS2DQ, CVTTPS2DQ, CVTPD2DQ, CVTTPD2DQ
Table E-9. MAXPS, MAXSS, MINPS, MINSS, MAXPD, MAXSD, MINPD, MINSD
Table E-10. SQRTPS, SQRTSS, SQRTPD, SQRTSD
Table E-11. CVTPS2PD, CVTSS2SD
Table E-12. CVTPD2PS, CVTSD2SS
Table E-13. #I - Invalid Operations
Table E-14. #Z - Divide-by-Zero
Table E-15. #D - Denormal Operand
Table E-16. #O - Numeric Overflow
Table E-17. #U - Numeric Underflow
Table E-18. #P - Inexact Result (Precision)

CHAPTER 1
ABOUT THIS MANUAL

The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture (order number 253665) is part of a set that describes the architecture and programming environment of Intel® 64 and IA-32 architecture processors. Other volumes in this set are:

• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B: Instruction Set Reference (order numbers 253666 and 253667).
• The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A & 3B: System Programming Guide (order numbers 253668 and 253669).

The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, describes the basic architecture and programming environment of an Intel 64 and IA-32 processor. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, describe the instruction set of the processor and the opcode structure. These volumes apply to application programmers and to programmers who write operating systems or executives. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A & 3B, describe the operating-system support environment of Intel 64 and IA-32 processors. These volumes target operating-system and BIOS designers. In addition, the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, addresses the programming environment for classes of software that host operating systems.

1.1 INTEL® 64 AND IA-32 PROCESSORS COVERED IN THIS MANUAL

This manual set includes information pertaining primarily to the most recent Intel 64 and IA-32 processors, which include:

• Pentium® processors
• P6 family processors
• Pentium® 4 processors
• Pentium® M processors
• Intel® Xeon® processors
• Pentium® D processors
• Pentium® processor Extreme Editions
• 64-bit Intel® Xeon® processors
• Intel® Core™ Duo processor
• Intel® Core™ Solo processor
• Dual-Core Intel® Xeon® processor LV
• Intel® Core™2 Duo processor
• Intel® Xeon® processor 5100 series

P6 family processors are IA-32 processors based on the P6 family microarchitecture. This includes the Pentium® Pro, Pentium® II, Pentium® III, and Pentium® III Xeon® processors.

The Pentium® 4, Pentium® D, and Pentium® processor Extreme Editions are based on the Intel NetBurst® microarchitecture. Most early Intel® Xeon® processors are based on the Intel NetBurst® microarchitecture.

The Intel® Core™ Duo, Intel® Core™ Solo, and dual-core Intel® Xeon® processor LV are based on an improved Pentium® M processor microarchitecture.

The Intel® Xeon® processor 5100 series, Intel® Core™2 Duo, and Intel® Core™2 Extreme processors are based on Intel® Core™ microarchitecture.

P6 family, Pentium® M, Intel® Core™ Solo, and Intel® Core™ Duo processors, the dual-core Intel® Xeon® processor LV, and early generations of Pentium 4 and Intel Xeon processors support IA-32 architecture. The Intel® Xeon® processor 5100 series, Intel® Core™2 Duo, and Intel® Core™2 Extreme processors, and newer generations of the Pentium 4 and Intel Xeon processor families support Intel® 64 architecture.

IA-32 architecture is the instruction set architecture and programming environment for Intel’s 32-bit microprocessors. Intel® 64 architecture is the instruction set architecture and programming environment that is the superset of Intel’s 32-bit and 64-bit architectures. It is compatible with the IA-32 architecture.

1.2 OVERVIEW OF VOLUME 1: BASIC ARCHITECTURE

A description of this manual’s content follows:

Chapter 1 — About This Manual. Gives an overview of all five volumes of the Intel® 64 and IA-32 Architectures Software Developer’s Manual. It also describes the notational conventions in these manuals and lists related Intel manuals and documentation of interest to programmers and hardware designers.

Chapter 2 — Intel® 64 and IA-32 Architectures. Introduces the Intel 64 and IA-32 architectures along with the families of Intel processors that are based on these architectures. It also gives an overview of the common features found in these processors and a brief history of the Intel 64 and IA-32 architectures.

Chapter 3 — Basic Execution Environment. Introduces the models of memory organization and describes the register set used by applications.

Chapter 4 — Data Types. Describes the data types and addressing modes recognized by the processor; provides an overview of real numbers and floating-point formats and of floating-point exceptions.

Chapter 5 — Instruction Set Summary. Lists all Intel 64 and IA-32 instructions, divided into technology groups.

Chapter 6 — Procedure Calls, Interrupts, and Exceptions. Describes the procedure stack and mechanisms provided for making procedure calls and for servicing interrupts and exceptions.

Chapter 7 — Programming with General-Purpose Instructions. Describes basic load and store, program control, arithmetic, and string instructions that operate on basic data types, general-purpose and segment registers; also describes system instructions that are executed in protected mode.

Chapter 8 — Programming with the x87 FPU. Describes the x87 floating-point unit (FPU), including floating-point registers and data types; gives an overview of the floating-point instruction set and describes the processor’s floating-point exception conditions.

Chapter 9 — Programming with Intel® MMX™ Technology. Describes Intel MMX technology, including MMX registers and data types; also provides an overview of the MMX instruction set.

Chapter 10 — Programming with Streaming SIMD Extensions (SSE). Describes SSE extensions, including XMM registers, the MXCSR register, and packed single-precision floating-point data types; provides an overview of the SSE instruction set and gives guidelines for writing code that accesses the SSE extensions.

Chapter 11 — Programming with Streaming SIMD Extensions 2 (SSE2). Describes SSE2 extensions, including XMM registers and packed double-precision floating-point data types; provides an overview of the SSE2 instruction set and gives guidelines for writing code that accesses SSE2 extensions. This chapter also describes SIMD floating-point exceptions that can be generated with SSE and SSE2 instructions. It also provides general guidelines for incorporating support for SSE and SSE2 extensions into operating system and applications code.

Chapter 12 — Programming with SSE3 and Supplemental SSE3. Describes SSE3 extensions; provides an overview of the SSE3 instruction set and Supplemental SSE3, and gives guidelines for writing code that accesses these extensions.

Chapter 13 — Input/Output. Describes the processor’s I/O mechanism, including I/O port addressing, I/O instructions, and I/O protection mechanisms.

Chapter 14 — Processor Identification and Feature Determination. Describes how to determine the CPU type and features available in the processor.

Appendix A — EFLAGS Cross-Reference. Summarizes how the IA-32 instructions affect the flags in the EFLAGS register.

Appendix B — EFLAGS Condition Codes. Summarizes how conditional jump, move, and ‘byte set on condition code’ instructions use condition code flags (OF, CF, ZF, SF, and PF) in the EFLAGS register.

Appendix C — Floating-Point Exceptions Summary. Summarizes exceptions raised by the x87 FPU floating-point and SSE/SSE2/SSE3 floating-point instructions.

Appendix D — Guidelines for Writing x87 FPU Exception Handlers. Describes how to design and write MS-DOS* compatible exception handling facilities for FPU exceptions (includes software and hardware requirements and assembly-language code examples). This appendix also describes general techniques for writing robust FPU exception handlers.

Appendix E — Guidelines for Writing SIMD Floating-Point Exception Handlers. Gives guidelines for writing exception handlers for exceptions generated by SSE/SSE2/SSE3 floating-point instructions.

1.3 NOTATIONAL CONVENTIONS

This manual uses specific notation for data-structure formats, for symbolic representation of instructions, and for hexadecimal and binary numbers. This notation is described below.

1.3.1 Bit and Byte Order

In illustrations of data structures in memory, smaller addresses appear toward the bottom of the figure; addresses increase toward the top. Bit positions are numbered from right to left. The numerical value of a set bit is equal to two raised to the power of the bit position. Intel 64 and IA-32 processors are “little endian” machines; this means the bytes of a word are numbered starting from the least significant byte. See Figure 1-1.

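[Figure: a data structure drawn as rows of four bytes, Byte 3 through Byte 0 from left to right, with byte boundaries at bits 24/23, 16/15, and 8/7 and bit 0 at the right. Rows are labeled with bit offsets 28, 24, 20, 16, 12, 8, 4, and 0, running from the highest address at the top to the lowest address at the bottom; the horizontal axis is the byte offset.]

Figure 1-1. Bit and Byte Order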
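The byte ordering described above can be checked with a short sketch. It is written in Python rather than assembly, and the doubleword value 12345678H and the helper name bit_value are illustrative choices, not taken from this manual.

```python
import struct

# Illustrative 32-bit doubleword (an assumption for this sketch).
value = 0x12345678

# "<I" packs an unsigned 32-bit integer in little-endian order,
# the order in which the bytes of a word are numbered above.
memory = struct.pack("<I", value)

# Byte 0 (lowest address) holds the least significant byte.
bytes_by_offset = list(memory)

def bit_value(position):
    """Numerical value of a set bit: two raised to the bit position."""
    return 1 << position
```

Here bytes_by_offset lists the stored bytes from the lowest address upward, so the least significant byte comes first.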


1.3.2 Reserved Bits and Software Compatibility

In many register and memory layout descriptions, certain bits are marked as reserved. When bits are marked as reserved, it is essential for compatibility with future processors that software treat these bits as having a future, though unknown, effect. The behavior of reserved bits should be regarded as not only undefined, but unpredictable. Software should follow these guidelines in dealing with reserved bits:

• Do not depend on the states of any reserved bits when testing the values of registers that contain such bits. Mask out the reserved bits before testing.
• Do not depend on the states of any reserved bits when storing to memory or to a register.
• Do not depend on the ability to retain information written into any reserved bits.
• When loading a register, always load the reserved bits with the values indicated in the documentation, if any, or reload them with values previously read from the same register.

NOTE
Avoid any software dependence upon the state of reserved bits in Intel 64 and IA-32 registers. Depending upon the values of reserved register bits will make software dependent upon the unspecified manner in which the processor handles these bits. Programs that depend upon reserved values risk incompatibility with future processors.
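The guidelines above amount to two habits: mask before testing, and preserve reserved bits when writing. The following Python sketch illustrates both under an assumed register layout. The layout (bits 0 through 3 defined, all others reserved) and the names DEFINED_MASK, RESERVED_MASK, defined_bits, and updated_value are hypothetical and do not describe any real Intel register.

```python
# Hypothetical 32-bit register layout, for illustration only:
# bits 0-3 are defined fields, every other bit is reserved.
DEFINED_MASK = 0x0000000F
RESERVED_MASK = 0xFFFFFFFF & ~DEFINED_MASK

def defined_bits(register_value):
    """Mask out the reserved bits before testing a register value."""
    return register_value & DEFINED_MASK

def updated_value(previous_value, new_defined):
    """Build the value to load into the register.

    Reserved bits are reloaded with the values previously read from
    the same register; only the defined bits take the new setting.
    """
    return (previous_value & RESERVED_MASK) | (new_defined & DEFINED_MASK)
```

A caller would read the register, compare only defined_bits(value) against the expected setting, and write back updated_value(value, new_setting), so the software never depends on what the reserved bits happen to contain.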

1.3.2.1 Instruction Operands

When instructions are represented symbolically, a subset of the IA-32 assembly language is used. In this subset, an instruction has the following format:

label: mnemonic argument1, argument2, argument3

where:

• A label is an identifier which is followed by a colon.
• A mnemonic is a reserved name for a class of instruction opcodes which have the same function.
• The operands argument1, argument2, and argument3 are optional. There may be from zero to three operands, depending on the opcode. When present, they take the form of either literals or identifiers for data items. Operand identifiers are either reserved names of registers or are assumed to be assigned to data items declared in another part of the program (which may not be shown in the example).

When two operands are present in an arithmetic or logical instruction, the right operand is the source and the left operand is the destination.

For example:

LOADREG: MOV EAX, SUBTOTAL

In this example, LOADREG is a label, MOV is the mnemonic identifier of an opcode, EAX is the destination operand, and SUBTOTAL is the source operand. Some assembly languages put the source and destination in reverse order.

1.3.3 Hexadecimal and Binary Numbers

Base 16 (hexadecimal) numbers are represented by a string of hexadecimal digits followed by the character H (for example, 0F82EH). A hexadecimal digit is a character from the following set: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, and F.

Base 2 (binary) numbers are represented by a string of 1s and 0s, sometimes followed by the character B (for example, 1010B). The “B” designation is only used in situations where confusion as to the type of number might arise.
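As a small illustration of this notation, the Python sketch below converts literals written in the manual's style to integers. The function name parse_manual_number is hypothetical, and treating an unsuffixed string as decimal is an assumption of the sketch; the manual itself relies on context when the B suffix is omitted.

```python
def parse_manual_number(text):
    """Convert a literal written in the manual's notation to an int.

    A trailing H marks hexadecimal and a trailing B marks binary.
    Anything else is read as decimal, which is an assumption made
    for this sketch only.
    """
    text = text.strip().upper()
    if text.endswith("H"):
        return int(text[:-1], 16)
    if text.endswith("B"):
        return int(text[:-1], 2)
    return int(text, 10)
```

Because H is not a hexadecimal digit, the H suffix is checked first; a hexadecimal literal whose last digit is B (such as 0BH) is therefore still read correctly.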

1.3.4 Segmented Addressing

The processor uses byte addressing. This means memory is organized and accessed as a sequence of bytes. Whether one or more bytes are being accessed, a byte address is used to locate the byte or bytes in memory. The range of memory that can be addressed is called an address space.

The processor also supports segmented addressing. This is a form of addressing where a program may have many independent address spaces, called segments. For example, a program can keep its code (instructions) and stack in separate segments. Code addresses would always refer to the code space, and stack addresses would always refer to the stack space. The following notation is used to specify a byte address within a segment:

Segment-register:Byte-address

For example, the following segment address identifies the byte at address FF79H in the segment pointed to by the DS register:

DS:FF79H

The following segment address identifies an instruction address in the code segment. The CS register points to the code segment and the EIP register contains the address of the instruction.

CS:EIP

1.3.5 A New Syntax for CPUID, CR, and MSR Values

Obtain feature flags, status, and system information by using the CPUID instruction, by checking control register bits, and by reading model-specific registers. We are moving toward a new syntax to represent this information. See Figure 1-2.

[Figure: three annotated examples of the syntax.
CPUID Input and Output: the expression names CPUID, the input values for the EAX and ECX registers (if only one value is given, EAX is implied), the output register and feature flag or field name with bit position(s) (in the example, ECX and SSE), and the value (or range) of the output.
Control Register Values: the expression gives an example CR name, the feature flag or field name with bit position(s) (in the example, OSFXSR), and the value (or range) of the output.
Model-Specific Register Values: the expression gives an example MSR name (IA32_MISC_ENABLES), the feature flag or field name with bit position(s) (in the example, ENABLEFOPCODE), and the value (or range) of the output.]

Figure 1-2. Syntax for CPUID, CR, and MSR Data Presentation

1.3.6 Exceptions

An exception is an event that typically occurs when an instruction causes an error. For example, an attempt to divide by zero generates an exception. However, some exceptions, such as breakpoints, occur under other conditions. Some types of exceptions may provide error codes. An error code reports additional information about the error. An example of the notation used to show an exception and error code is shown below:

#PF(fault code)

This example refers to a page-fault exception under conditions where an error code naming a type of fault is reported. Under some conditions, exceptions that produce error codes may not be able to report an accurate code. In this case, the error code is zero, as shown below for a general-protection exception:

#GP(0)

1.4 RELATED LITERATURE

Literature related to Intel 64 and IA-32 processors is listed on-line at:

http://developer.intel.com/products/processor/index.htm

Some of the documents listed at this web site can be viewed on-line; others can be ordered. The literature available is listed by Intel processor and then by the following literature types: applications notes, data sheets, manuals, papers, and specification updates.

See also:

• The data sheet for a particular Intel 64 or IA-32 processor
• The specification update for a particular Intel 64 or IA-32 processor
• AP-485, Intel Processor Identification and the CPUID Instruction, Order Number 241618
• Intel® 64 and IA-32 Architectures Optimization Reference Manual, Order Number 248966


CHAPTER 2
INTEL® 64 AND IA-32 ARCHITECTURES

The exponential growth of computing power and ownership has made the computer one of the most important forces shaping business and society. Intel 64 and IA-32 architectures have been at the forefront of the computer revolution and are today the preferred computer architectures, as measured by computers in use and the total computing power available in the world.

2.1 BRIEF HISTORY OF INTEL® 64 AND IA-32 ARCHITECTURE

The following sections provide a summary of the major technical evolutions from IA-32 to Intel 64 architecture, starting from the Intel 8086 processor and continuing to the latest Intel Core 2 Duo and Intel Xeon processor 5100 series. Object code created for processors released as early as 1978 still executes on the latest processors in the Intel 64 and IA-32 architecture families.

2.1.1 16-bit Processors and Segmentation (1978)

The IA-32 architecture family was preceded by 16-bit processors, the 8086 and 8088. The 8086 has 16-bit registers and a 16-bit external data bus, with 20-bit addressing giving a 1-MByte address space. The 8088 is similar to the 8086 except it has an 8-bit external data bus.

The 8086/8088 introduced segmentation to the IA-32 architecture. With segmentation, a 16-bit segment register contains a pointer to a memory segment of up to 64 KBytes. Using four segment registers at a time, 8086/8088 processors are able to address up to 256 KBytes without switching between segments. The 20-bit addresses that can be formed using a segment register and an additional 16-bit pointer provide a total address range of 1 MByte.
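The sizes quoted above follow from simple arithmetic, sketched here in Python. The address-forming rule used in physical_address_8086 (segment value shifted left by four bits, plus the 16-bit offset, wrapping at 20 bits) is general 8086 background that this section does not spell out, so treat it as an assumption of the sketch; the function and constant names are illustrative.

```python
SEGMENT_SIZE = 1 << 16    # a 16-bit pointer spans 64 KBytes
ADDRESS_SPACE = 1 << 20   # 20-bit addresses span 1 MByte

def physical_address_8086(segment, offset):
    """Form a 20-bit address from a 16-bit segment and 16-bit offset.

    Assumed rule (not stated in this section): the segment value is
    shifted left by four bits and added to the offset, and the result
    wraps at 20 bits.
    """
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    return ((segment << 4) + offset) & (ADDRESS_SPACE - 1)

# Four segment registers in use at once, each covering 64 KBytes.
addressable_without_switching = 4 * SEGMENT_SIZE
```

With these definitions, addressable_without_switching works out to 256 KBytes and ADDRESS_SPACE to 1 MByte, matching the figures in the text.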

2.1.2 The Intel® 286 Processor (1982)

The Intel 286 processor introduced protected mode operation into the IA-32 architecture. Protected mode uses the segment register content as selectors or pointers into descriptor tables. Descriptors provide 24-bit base addresses with a physical memory size of up to 16 MBytes, support for virtual memory management on a segment swapping basis, and a number of protection mechanisms. These mechanisms include:

• Segment limit checking
• Read-only and execute-only segment options
• Four privilege levels

2.1.3 The Intel386™ Processor (1985)

The Intel386 processor was the first 32-bit processor in the IA-32 architecture family. It introduced 32-bit registers for use both to hold operands and for addressing. The lower half of each 32-bit Intel386 register retains the properties of the 16-bit registers of earlier generations, permitting backward compatibility. The processor also provides a virtual-8086 mode that allows for even greater efficiency when executing programs created for 8086/8088 processors.

In addition, the Intel386 processor has support for:

• A 32-bit address bus that supports up to 4-GBytes of physical memory
• A segmented-memory model and a flat memory model
• Paging, with a fixed 4-KByte page size providing a method for virtual memory management
• Support for parallel stages
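The numbers in the list above can be related with a little arithmetic. The Python sketch below shows how a fixed 4-KByte page size divides a 32-bit address into a page number and an offset within the page. It deliberately ignores how the processor actually walks its paging structures, which later volumes describe, and the names PAGE_SIZE, OFFSET_BITS, and split_linear_address are illustrative.

```python
PAGE_SIZE = 4 * 1024                       # fixed 4-KByte page
OFFSET_BITS = PAGE_SIZE.bit_length() - 1   # 12 bits select a byte in a page
ADDRESS_BITS = 32                          # 32-bit addressing

def split_linear_address(address):
    """Return (page number, offset within page) for a 32-bit address."""
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address must fit in 32 bits")
    return address >> OFFSET_BITS, address & (PAGE_SIZE - 1)

# Number of 4-KByte pages in a 4-GByte address range.
PAGE_COUNT = (1 << ADDRESS_BITS) // PAGE_SIZE
```

Under these assumptions a 4-GByte range holds 2 to the power 20 pages, and the low 12 bits of an address pass through unchanged as the offset.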

2.1.4 The Intel486™ Processor (1989)

The Intel486™ processor added more parallel execution capability by expanding the Intel386 processor’s instruction decode and execution units into five pipelined stages. Each stage operates in parallel with the others on up to five instructions in different stages of execution.

In addition, the processor added:

• An 8-KByte on-chip first-level cache that increased the percent of instructions that could execute at the scalar rate of one per clock
• An integrated x87 FPU
• Power saving and system management capabilities

2.1.5 The Intel® Pentium® Processor (1993)

The introduction of the Intel Pentium processor added a second execution pipeline to achieve superscalar performance (two pipelines, known as u and v, together can execute two instructions per clock). The on-chip first-level cache doubled, with 8 KBytes devoted to code and another 8 KBytes devoted to data. The data cache uses the MESI protocol to support more efficient write-back cache in addition to the write-through cache previously used by the Intel486 processor. Branch prediction with an on-chip branch table was added to increase performance in looping constructs.

In addition, the processor added:

• Extensions to make the virtual-8086 mode more efficient and allow for 4-MByte as well as 4-KByte pages
• Internal data paths of 128 and 256 bits that add speed to internal data transfers
• A burstable external data bus, increased to 64 bits
• An APIC to support systems with multiple processors
• A dual processor mode to support glueless two processor systems

A subsequent stepping of the Pentium family introduced Intel MMX technology (the Pentium Processor with MMX technology). Intel MMX technology uses the single-instruction, multiple-data (SIMD) execution model to perform parallel computations on packed integer data contained in 64-bit registers. See Section 2.2.4, “SIMD Instructions.”
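To make the SIMD idea concrete, the Python sketch below emulates one packed-integer operation on a 64-bit value: eight independent 8-bit additions with wraparound, in the style of a packed byte add. It is a software illustration only, not how the hardware is programmed, and the function name packed_add_bytes is hypothetical.

```python
def packed_add_bytes(a, b):
    """Add the eight unsigned 8-bit lanes of two 64-bit values.

    Each lane wraps around independently, so a carry out of one
    byte never spills into the neighboring byte.
    """
    result = 0
    for lane in range(8):
        shift = lane * 8
        lane_sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        result |= (lane_sum & 0xFF) << shift
    return result
```

The loop exists only because Python has no packed data type; the point of the SIMD model is that a single instruction performs all eight lane additions at once.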

2.1.6 The P6 Family of Processors (1995-1999)

The P6 family of processors was based on a superscalar microarchitecture that set new performance standards; see also Section 2.2.1, “P6 Family Microarchitecture.” One of the goals in the design of the P6 family microarchitecture was to exceed the performance of the Pentium processor significantly while using the same 0.6-micrometer, four-layer, metal BICMOS manufacturing process. Members of this family include the following:

• The Intel Pentium Pro processor is three-way superscalar. Using parallel processing techniques, the processor is able on average to decode, dispatch, and complete execution of (retire) three instructions per clock cycle. The Pentium Pro introduced dynamic execution (micro-data flow analysis, out-of-order execution, superior branch prediction, and speculative execution) in a superscalar implementation. The processor was further enhanced by its caches. It has the same two on-chip 8-KByte 1st-Level caches as the Pentium processor and an additional 256-KByte Level 2 cache in the same package as the processor.

• The Intel Pentium II processor added Intel MMX technology to the P6 family processors along with new packaging and several hardware enhancements. The processor core is packaged in the single edge contact cartridge (SECC). The Level 1 data and instruction caches were enlarged to 16 KBytes each, and Level 2 cache sizes of 256 KBytes, 512 KBytes, and 1 MByte are supported. A half-clock speed backside bus connects the Level 2 cache to the processor. Multiple low-power states such as AutoHALT, Stop-Grant, Sleep, and Deep Sleep are supported to conserve power when idling.

• The Pentium II Xeon processor combined the premium characteristics of previous generations of Intel processors. This includes: 4-way, 8-way (and up) scalability and a 2 MByte 2nd-Level cache running on a full-clock speed backside bus.

• The Intel Celeron processor family focused on the value PC market segment. Its introduction offers an integrated 128 KBytes of Level 2 cache and a plastic pin grid array (P.P.G.A.) form factor to lower system design cost.

• The Intel Pentium III processor introduced the Streaming SIMD Extensions (SSE) to the IA-32 architecture. SSE extensions expand the SIMD execution model introduced with the Intel MMX technology by providing a new set of 128-bit registers and the ability to perform SIMD operations on packed single-precision floating-point values. See Section 2.2.4, “SIMD Instructions.”

• The Pentium III Xeon processor extended the performance levels of the IA-32 processors with the enhancement of a full-speed, on-die Advanced Transfer Cache.
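The SSE entry in the list above mentions 128-bit registers holding packed single-precision values. As a rough illustration, the Python sketch below treats a 16-byte value as four 32-bit floating-point lanes and adds two such values lane by lane. The function name packed_add_single is hypothetical, and the arithmetic is done by Python's own floats and then stored back as single precision, so this models the data layout rather than the processor's exact behavior.

```python
import struct

LANES = "<4f"  # four little-endian single-precision values = 128 bits

def packed_add_single(a, b):
    """Add two 16-byte values as four single-precision lanes each."""
    lanes_a = struct.unpack(LANES, a)
    lanes_b = struct.unpack(LANES, b)
    sums = [x + y for x, y in zip(lanes_a, lanes_b)]
    return struct.pack(LANES, *sums)

x = struct.pack(LANES, 1.0, 2.0, 3.0, 4.0)
y = struct.pack(LANES, 0.5, 0.5, 0.5, 0.5)
z = packed_add_single(x, y)
```

Each packed value occupies 16 bytes, which is the 128-bit width the text gives for the new register set.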

2.1.7 The Intel® Pentium® 4 Processor Family (2000-2006)

The Intel Pentium 4 processor family is based on Intel NetBurst microarchitecture; see Section 2.2.2, “Intel NetBurst® Microarchitecture.”

The Intel Pentium 4 processor introduced Streaming SIMD Extensions 2 (SSE2); see Section 2.2.4, “SIMD Instructions.” The Intel Pentium 4 processor 3.40 GHz, supporting Hyper-Threading Technology, introduced Streaming SIMD Extensions 3 (SSE3); see Section 2.2.4, “SIMD Instructions.”

Intel 64 architecture was introduced in the Intel Pentium 4 Processor Extreme Edition supporting Hyper-Threading Technology and in the Intel Pentium 4 Processor 6xx and 5xx sequences.

Intel® Virtualization Technology (Intel® VT) was introduced in the Intel Pentium 4 processor 672 and 662.

2.1.8 The Intel® Xeon® Processor (2001-2006)

Intel Xeon processors (with the exception of the dual-core Intel Xeon processor LV and the Intel Xeon processor 5100 series) are based on the Intel NetBurst microarchitecture; see Section 2.2.2, “Intel NetBurst® Microarchitecture.” As a family, this group of IA-32 processors (more recently Intel 64 processors) is designed for use in multi-processor server systems and high-performance workstations.

The Intel Xeon processor MP introduced support for Hyper-Threading Technology; see Section 2.2.5, “Hyper-Threading Technology.”

The 64-bit Intel Xeon processor 3.60 GHz (with an 800 MHz System Bus) was used to introduce Intel 64 architecture. The Dual-Core Intel Xeon processor includes dual core technology. The Intel Xeon processor 70xx series includes Intel Virtualization Technology.

The Intel Xeon processor 5100 series introduces power-efficient, high performance Intel Core microarchitecture. This processor is based on Intel 64 architecture; it includes Intel Virtualization Technology and dual-core technology.


2.1.9

The Intel® Pentium® M Processor (2003-Current)

The I nt el Pent ium M processor fam ily is a high perform ance, low power m obile processor fam ily wit h m icroarchit ect ural enhancem ent s over previous generat ions of I A- 32 I nt el m obile processors. This fam ily is designed for ext ending bat t ery life and seam less int egrat ion wit h plat form innovat ions t hat enable new usage m odels ( such as ext ended m obilit y, ult ra t hin form - fact ors, and int egrat ed wireless net working) . I t s enhanced m icroarchit ect ure includes:

• Support for Intel Architecture with Dynamic Execution
• A high performance, low-power core manufactured using Intel’s advanced process technology with copper interconnect
• On-die, primary 32-KByte instruction cache and 32-KByte write-back data cache
• On-die, second-level cache (up to 2 MByte) with Advanced Transfer Cache Architecture
• Advanced Branch Prediction and Data Prefetch Logic
• Support for MMX technology, Streaming SIMD instructions, and the SSE2 instruction set
• A 400 or 533 MHz, Source-Synchronous Processor System Bus
• Advanced power management using Enhanced Intel SpeedStep® technology

2.1.10

The Intel® Pentium® Processor Extreme Edition (2005-Current)

The Intel Pentium processor Extreme Edition introduced dual-core technology. This technology provides advanced hardware multi-threading support. The processor is based on Intel NetBurst microarchitecture and supports SSE, SSE2, SSE3, Hyper-Threading Technology, and Intel 64 architecture.

See also:

• Section 2.2.2, “Intel NetBurst® Microarchitecture”
• Section 2.2.3, “Intel® Core™ Microarchitecture”
• Section 2.2.4, “SIMD Instructions”
• Section 2.2.5, “Hyper-Threading Technology”
• Section 2.2.6, “Multi-Core Technology”
• Section 2.2.7, “Intel® 64 Architecture”


2.1.11

The Intel® Core™ Duo and Intel® Core™ Solo Processors (2006-Current)

The Intel Core Duo processor offers power-efficient, dual-core performance with a low-power design that extends battery life. This family and the single-core Intel Core Solo processor offer microarchitectural enhancements over the Pentium M processor family.

Its enhanced microarchitecture includes:



• Intel® Smart Cache, which allows for efficient data sharing between two processor cores
• Improved decoding and SIMD execution
• Intel® Dynamic Power Coordination and Enhanced Intel® Deeper Sleep to reduce power consumption
• Intel® Advanced Thermal Manager, which features digital thermal sensor interfaces
• Support for power-optimized 667 MHz bus

The dual-core Intel Xeon processor LV is based on the same microarchitecture as the Intel Core Duo processor, and supports IA-32 architecture.

2.1.12

The Intel® Xeon® Processor 5100 Series and Intel® Core™2 Processor Family (2006-Current)

The Intel Xeon processor 5100 series, Intel Core 2 Extreme processor, and Intel Core 2 Duo processor family support Intel 64 architecture; and they are based on the high-performance, power-efficient Intel® Core microarchitecture. The Intel Core microarchitecture includes the following innovative features:



• Intel® Wide Dynamic Execution to increase performance and execution throughput
• Intel® Intelligent Power Capability to reduce power consumption
• Intel® Advanced Smart Cache, which allows for efficient data sharing between two processor cores
• Intel® Smart Memory Access to increase data bandwidth and hide latency of memory accesses
• Intel® Advanced Digital Media Boost, which improves application performance using multiple generations of Streaming SIMD extensions

2.2

MORE ON SPECIFIC ADVANCES

The following sections provide more information on major innovations.


2.2.1

P6 Family Microarchitecture

The Pentium Pro processor introduced a new microarchitecture commonly referred to as P6 processor microarchitecture. The P6 processor microarchitecture was later enhanced with an on-die, Level 2 cache, called Advanced Transfer Cache.

The microarchitecture is a three-way superscalar, pipelined architecture. Three-way superscalar means that by using parallel processing techniques, the processor is able on average to decode, dispatch, and complete execution of (retire) three instructions per clock cycle. To handle this level of instruction throughput, the P6 processor family uses a decoupled, 12-stage superpipeline that supports out-of-order instruction execution.

Figure 2-1 shows a conceptual view of the P6 processor microarchitecture pipeline with the Advanced Transfer Cache enhancement.

[Figure 2-1 block diagram: System Bus; Bus Unit; 2nd Level Cache (on-die, 8-way); 1st Level Cache (4-way, low latency); Front End with Fetch/Decode, Execution Instruction Cache, and Microcode ROM; Execution Out-of-Order Core; Retirement; BTSs/Branch Prediction with Branch History Update. Paths are marked as frequently used or less frequently used.]

Figure 2-1. The P6 Processor Microarchitecture with Advanced Transfer Cache Enhancement

To ensure a steady supply of instructions and data for the instruction execution pipeline, the P6 processor microarchitecture incorporates two cache levels. The Level 1 cache provides an 8-KByte instruction cache and an 8-KByte data cache, both closely coupled to the pipeline. The Level 2 cache provides 256-KByte, 512-KByte, or 1-MByte static RAM that is coupled to the core processor through a full clock-speed 64-bit cache bus.

The centerpiece of the P6 processor microarchitecture is an out-of-order execution mechanism called dynamic execution. Dynamic execution incorporates three data-processing concepts:



• Deep branch prediction allows the processor to decode instructions beyond branches to keep the instruction pipeline full. The P6 processor family implements highly optimized branch prediction algorithms to predict the direction of the instruction stream.
• Dynamic data flow analysis requires real-time analysis of the flow of data through the processor to determine dependencies and to detect opportunities for out-of-order instruction execution. The out-of-order execution core can monitor many instructions and execute these instructions in the order that best optimizes the use of the processor’s multiple execution units, while maintaining data integrity.
• Speculative execution refers to the processor’s ability to execute instructions that lie beyond a conditional branch that has not yet been resolved, and ultimately to commit the results in the order of the original instruction stream.

To make speculative execution possible, the P6 processor microarchitecture decouples the dispatch and execution of instructions from the commitment of results. The processor’s out-of-order execution core uses data-flow analysis to execute all available instructions in the instruction pool and store the results in temporary registers. The retirement unit then linearly searches the instruction pool for completed instructions that no longer have data dependencies with other instructions or unresolved branch predictions. When completed instructions are found, the retirement unit commits the results of these instructions to memory and/or the IA-32 registers (the processor’s eight general-purpose registers and eight x87 FPU data registers) in the order they were originally issued and retires the instructions from the instruction pool.
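The decoupling of execution from commitment can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not a description of the actual hardware: the function name `retire_in_order`, the list-based instruction pool, and the instruction labels are hypothetical. It shows that instructions may complete in any order, while retirement commits only the oldest completed instructions and so preserves the original program order.

```python
def retire_in_order(program, completion_order):
    """Simulate in-order retirement of instructions that complete out of order.

    program: instruction labels in original program order (hypothetical).
    completion_order: the same labels in the order execution finishes.
    Returns a list of (completed_instruction, [instructions_retired_now]).
    """
    completed = set()
    next_to_retire = 0  # index of the oldest instruction not yet retired
    steps = []
    for instr in completion_order:
        completed.add(instr)
        retired_now = []
        # Retire only from the head of the pool: an instruction commits
        # once it and every older instruction have completed.
        while next_to_retire < len(program) and program[next_to_retire] in completed:
            retired_now.append(program[next_to_retire])
            next_to_retire += 1
        steps.append((instr, retired_now))
    return steps


program = ["i0", "i1", "i2", "i3"]
steps = retire_in_order(program, ["i2", "i0", "i1", "i3"])
retired = [name for _, batch in steps for name in batch]
print(retired)  # ['i0', 'i1', 'i2', 'i3']
```

Here `i2` finishes first but is held until `i0` and `i1` have completed, at which point `i1` and `i2` retire together.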

2.2.2

Intel NetBurst® Microarchitecture

The Intel NetBurst microarchitecture provides:



• The Rapid Execution Engine
  - Arithmetic Logic Units (ALUs) run at twice the processor frequency
  - Basic integer operations can dispatch in 1/2 processor clock tick
• Hyper-Pipelined Technology
  - Deep pipeline to enable industry-leading clock rates for desktop PCs and servers
  - Frequency headroom and scalability to continue leadership into the future
• Advanced Dynamic Execution
  - Deep, out-of-order, speculative execution engine
    • Up to 126 instructions in flight
    • Up to 48 loads and 24 stores in pipeline¹
  - Enhanced branch prediction capability
    • Reduces the misprediction penalty associated with deeper pipelines
    • Advanced branch prediction algorithm
    • 4K-entry branch target array
• New cache subsystem
  - First level caches
    • Advanced Execution Trace Cache stores decoded instructions
    • Execution Trace Cache removes decoder latency from main execution loops
    • Execution Trace Cache integrates path of program execution flow into a single line
    • Low latency data cache
  - Second level cache
    • Full-speed, unified 8-way Level 2 on-die Advance Transfer Cache
    • Bandwidth and performance increases with processor frequency
• High-performance, quad-pumped bus interface to the Intel NetBurst microarchitecture system bus
  - Supports quad-pumped, scalable bus clock to achieve up to 4X effective speed
  - Capable of delivering up to 8.5 GBytes of bandwidth per second
• Superscalar issue to enable parallelism
• Expanded hardware registers with renaming to avoid register name space limitations
• 64-byte cache line size (transfers data up to two lines per sector)
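The bandwidth figure for a quad-pumped bus follows from simple arithmetic: base clock times transfers per clock times bytes per transfer. The sketch below is illustrative only. The 64-bit data bus width and the 266.67 MHz base clock are assumptions that do not appear in the list above; they are chosen because they reproduce the quoted peak of roughly 8.5 GBytes per second.

```python
def bus_bandwidth_bytes_per_sec(base_clock_hz, transfers_per_clock, bus_width_bits):
    """Peak bandwidth = base clock x transfers per clock x bytes per transfer."""
    return base_clock_hz * transfers_per_clock * (bus_width_bits // 8)


# Assumed values, not taken from the text: a 64-bit data bus and a
# 266.67 MHz base clock, quad-pumped (4 transfers per clock).
BASE_CLOCK_HZ = 800e6 / 3
peak = bus_bandwidth_bytes_per_sec(BASE_CLOCK_HZ, 4, 64)
print(f"{peak / 1e9:.1f} GB/s")  # 8.5 GB/s
```

Because the data rate is four times the base clock, the bus reaches this bandwidth without raising the clock on the bus itself, which is what "up to 4X effective speed" refers to.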

Figure 2-2 is an overview of the Intel NetBurst microarchitecture. This microarchitecture pipeline is made up of three sections: (1) the front end pipeline, (2) the out-of-order execution core, and (3) the retirement unit.

1. Intel 64 and IA-32 processors based on the Intel NetBurst microarchitecture at 90 nm process can handle more than 24 stores in flight.


[Figure 2-2 block diagram: System Bus; Bus Unit; 3rd Level Cache (optional); 2nd Level Cache (8-way); 1st Level Cache (4-way); Front End with Fetch/Decode, Trace Cache, and Microcode ROM; Execution Out-Of-Order Core; Retirement; BTBs/Branch Prediction with Branch History Update. Paths are marked as frequently used or less frequently used.]

Figure 2-2. The Intel NetBurst Microarchitecture

2.2.2.1

The Front End Pipeline

The front end supplies instructions in program order to the out-of-order execution core. It performs a number of functions:

• Prefetches instructions that are likely to be executed
• Fetches instructions that have not already been prefetched
• Decodes instructions into micro-operations
• Generates microcode for complex instructions and special-purpose code
• Delivers decoded instructions from the execution trace cache
• Predicts branches using a highly advanced algorithm

The pipeline is designed to address common problems in high-speed, pipelined microprocessors. Two of these problems contribute to major sources of delays:

• time to decode instructions fetched from the target
• wasted decode bandwidth due to branches or branch target in the middle of cache lines

The operation of the pipeline’s trace cache addresses these issues. Instructions are constantly being fetched and decoded by the translation engine (part of the fetch/decode logic) and built into sequences of µops called traces. At any time, multiple traces (representing prefetched branches) are being stored in the trace cache. The trace cache is searched for the instruction that follows the active branch. If the instruction also appears as the first instruction in a pre-fetched branch, the fetch and decode of instructions from the memory hierarchy ceases and the prefetched branch becomes the new source of instructions (see Figure 2-2).

The trace cache and the translation engine have cooperating branch prediction hardware. Branch targets are predicted based on their linear addresses using branch target buffers (BTBs) and fetched as soon as possible.

2.2.2.2

Out-Of-Order Execution Core

The out-of-order execution core’s ability to execute instructions out of order is a key factor in enabling parallelism. This feature enables the processor to reorder instructions so that if one µop is delayed, other µops may proceed around it. The processor employs several buffers to smooth the flow of µops.

The core is designed to facilitate parallel execution. It can dispatch up to six µops per cycle (this exceeds trace cache and retirement µop bandwidth). Most pipelines can start executing a new µop every cycle, so several instructions can be in flight at a time for each pipeline. A number of arithmetic logical unit (ALU) instructions can start at two per cycle; many floating-point instructions can start once every two cycles.

2.2.2.3

Retirement Unit

The retirement unit receives the results of the executed µops from the out-of-order execution core and processes the results so that the architectural state updates according to the original program order.

When a µop completes and writes its result, it is retired. Up to three µops may be retired per cycle. The Reorder Buffer (ROB) is the unit in the processor which buffers completed µops, updates the architectural state in order, and manages the ordering of exceptions. The retirement section also keeps track of branches and sends updated branch target information to the BTB. The BTB then purges pre-fetched traces that are no longer needed.


2.2.3

Intel® Core™ Microarchitecture

Intel Core microarchitecture introduces the following features that enable high performance and power efficiency for single-threaded as well as multi-threaded workloads:



• Intel® Wide Dynamic Execution enables each processor core to fetch, dispatch, and execute at high bandwidth, supporting retirement of up to four instructions per cycle.
  - Fourteen-stage efficient pipeline
  - Three arithmetic logical units
  - Four decoders to decode up to five instructions per cycle
  - Macro-fusion and micro-fusion to improve front-end throughput
  - Peak issue rate of dispatching up to six micro-ops per cycle
  - Peak retirement bandwidth of up to 4 micro-ops per cycle
  - Advanced branch prediction
  - Stack pointer tracker to improve efficiency of executing function/procedure entries and exits
• Intel® Advanced Smart Cache delivers higher bandwidth from the second level cache to the core, and optimal performance and flexibility for single-threaded and multi-threaded applications.
  - Large second level cache up to 4 MB and 16-way associativity
  - Optimized for multicore and single-threaded execution environments
  - 256 bit internal data path to improve bandwidth from L2 to first-level data cache
• Intel® Smart Memory Access prefetches data from memory in response to data access patterns and reduces cache-miss exposure of out-of-order execution.
  - Hardware prefetchers to reduce effective latency of second-level cache misses
  - Hardware prefetchers to reduce effective latency of first-level data cache misses
  - Memory disambiguation to improve efficiency of the speculative execution engine
• Intel® Advanced Digital Media Boost improves the throughput of most 128-bit SIMD instructions to a single cycle and improves floating-point operations.
  - Single-cycle throughput of most 128-bit SIMD instructions
  - Up to eight floating-point operations per cycle
  - Three issue ports available for dispatching SIMD instructions for execution


Intel Core 2 Extreme, Intel Core 2 Duo processors and Intel Xeon processor 5100 series implement two processor cores based on the Intel Core microarchitecture. The functionality of the subsystems in each core is depicted in Figure 2-3.

[Figure 2-3 block diagram: Instruction Fetch and PreDecode; Instruction Queue; Microcode ROM; Decode; Rename/Alloc; Retirement Unit (Re-Order Buffer); Scheduler; execution ports for ALU/Branch/MMX/SSE/FP Move, ALU/FAdd/MMX/SSE, ALU/FMul/MMX/SSE, Load, and Store; L1D Cache and DTLB; Shared L2 Cache; FSB at up to 10.7 GB/s.]

Figure 2-3. The Intel Core Microarchitecture Pipeline Functionality

2.2.3.1

The Front End

The front end of Intel Core microarchitecture provides several enhancements to feed the Intel Wide Dynamic Execution engine:

• Instruction fetch unit prefetches instructions into an instruction queue to maintain a steady supply of instructions to the decode units.
• Four-wide decode unit can decode 4 instructions per cycle or 5 instructions per cycle with Macrofusion.
• Macrofusion fuses a common sequence of two instructions into one decoded instruction (micro-op) to increase decoding throughput.
• Microfusion fuses a common sequence of two micro-ops into one micro-op to improve retirement throughput.
• Instruction queue provides caching of short loops to improve efficiency.
• Stack pointer tracker improves efficiency of executing procedure/function entries and exits.
• Branch prediction unit employs dedicated hardware to handle different types of branches for improved branch prediction.
• Advanced branch prediction algorithm directs the instruction fetch unit to fetch instructions likely in the architectural code path for decoding.
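The "4 instructions per cycle, or 5 with Macrofusion" figure can be illustrated with a toy decode model. This is a rough sketch under loudly stated assumptions, not a model of the real decoders: it assumes that a `cmp` or `test` followed directly by a conditional jump is a fusible pair, that a fused pair occupies one decode slot, and that at most one fusion occurs per cycle. The function name `decode_cycles` and the mnemonic lists are hypothetical.

```python
def decode_cycles(instructions, width=4, fusible_first=("cmp", "test")):
    """Count decode cycles for a list of mnemonics under a simplified model.

    Assumptions (illustrative only): each cycle has `width` decode slots,
    a compare/test followed directly by a conditional jump ("j..." other
    than "jmp") fuses into a single slot, and at most one fusion happens
    per cycle, which yields a peak of five instructions per cycle.
    """
    cycles = 0
    i = 0
    n = len(instructions)
    while i < n:
        cycles += 1
        slots = width
        fused_this_cycle = False
        while slots > 0 and i < n:
            nxt = instructions[i + 1] if i + 1 < n else ""
            is_pair = (
                instructions[i] in fusible_first
                and nxt.startswith("j")
                and nxt != "jmp"
            )
            if is_pair and not fused_this_cycle:
                i += 2  # two instructions consume one decode slot
                fused_this_cycle = True
            else:
                i += 1
            slots -= 1
    return cycles


print(decode_cycles(["mov", "add", "cmp", "jne", "sub"]))  # 1
print(decode_cycles(["mov", "add", "sub", "xor", "and"]))  # 2
```

In the first stream the compare-and-branch pair shares a slot, so all five instructions fit in one cycle; the second stream has no fusible pair and needs a second cycle for the fifth instruction.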

2.2.3.2

Execution Core

The execution core of the Intel Core microarchitecture is superscalar and can process instructions out of order to increase the overall rate of instructions executed per cycle (IPC). The execution core employs the following features to improve execution throughput and efficiency:

• Up to six micro-ops can be dispatched to execute per cycle
• Up to four instructions can be retired per cycle
• Three full arithmetic logical units
• SIMD instructions can be dispatched through three issue ports
• Most SIMD instructions have 1-cycle throughput (including 128-bit SIMD instructions)
• Up to eight floating-point operations per cycle
• Many long-latency computation operations are pipelined in hardware to increase overall throughput
• Reduced exposure to data access delays using Intel Smart Memory Access

2.2.4

SIMD Instructions

Beginning with the Pentium II and Pentium with Intel MMX technology processor families, five extensions have been introduced into the Intel 64 and IA-32 architectures to perform single-instruction multiple-data (SIMD) operations. These extensions include the MMX technology, SSE extensions, SSE2 extensions, SSE3 extensions, and Supplemental Streaming SIMD Extensions 3 (SSSE3). Each of these extensions provides a group of instructions that perform SIMD operations on packed integer and/or packed floating-point data elements.

SIMD integer operations can use the 64-bit MMX or the 128-bit XMM registers. SIMD floating-point operations use 128-bit XMM registers. Figure 2-4 shows a summary of the various SIMD extensions (MMX technology, SSE, SSE2, SSE3, and SSSE3), the data types they operate on, and how the data types are packed into MMX and XMM registers.

The Intel MMX technology was introduced in the Pentium II and Pentium with MMX technology processor families. MMX instructions perform SIMD operations on packed byte, word, or doubleword integers located in MMX registers. These instructions are useful in applications that operate on integer arrays and streams of integer data that lend themselves to SIMD processing.

SSE extensions were introduced in the Pentium III processor family. SSE instructions operate on packed single-precision floating-point values contained in XMM registers and on packed integers contained in MMX registers. Several SSE instructions provide state management, cache control, and memory ordering operations. Other SSE instructions are targeted at applications that operate on arrays of single-precision floating-point data elements (3-D geometry, 3-D rendering, and video encoding and decoding applications).

SSE2 extensions were introduced in Pentium 4 and Intel Xeon processors. SSE2 instructions operate on packed double-precision floating-point values contained in XMM registers and on packed integers contained in MMX and XMM registers. SSE2 integer instructions extend IA-32 SIMD operations by adding new 128-bit SIMD integer operations and by expanding existing 64-bit SIMD integer operations to 128-bit XMM capability. SSE2 instructions also provide new cache control and memory ordering operations.

SSE3 extensions were introduced with the Pentium 4 processor supporting Hyper-Threading Technology (built on 90 nm process technology). SSE3 offers 13 instructions that accelerate performance of Streaming SIMD Extensions technology, Streaming SIMD Extensions 2 technology, and x87-FP math capabilities.

SSSE3 extensions were introduced with the Intel Xeon processor 5100 series and Intel Core 2 processor family. SSSE3 offers 32 instructions to accelerate processing of SIMD integer data.

Intel 64 architecture allows four generations of 128-bit SIMD extensions to access up to 16 XMM registers. IA-32 architecture provides 8 XMM registers.

See also:



• Section 5.4, “MMX™ Instructions,” and Chapter 9, “Programming with Intel® MMX™ Technology”
• Section 5.5, “SSE Instructions,” and Chapter 10, “Programming with Streaming SIMD Extensions (SSE)”
• Section 5.6, “SSE2 Instructions,” and Chapter 11, “Programming with Streaming SIMD Extensions 2 (SSE2)”
• Section 5.7, “SSE3 Instructions,” and Chapter 12, “Programming with SSE3 and Supplemental SSE3”


[Figure 2-4 lists, for each SIMD extension, the register layout and data types:
MMX Technology, MMX registers: 8 packed byte integers, 4 packed word integers, 2 packed doubleword integers, quadword.
SSE, MMX registers: 8 packed byte integers, 4 packed word integers, 2 packed doubleword integers, quadword. SSE, XMM registers: 4 packed single-precision floating-point values.
SSE2/SSE3/SSSE3, MMX registers: 2 packed doubleword integers, quadword. SSE2/SSE3/SSSE3, XMM registers: 2 packed double-precision floating-point values, 16 packed byte integers, 8 packed word integers, 4 packed doubleword integers, 2 quadword integers, double quadword.]

Figure 2-4. SIMD Extensions, Register Layouts, and Data Types
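The element counts in Figure 2-4 follow directly from dividing the register width by the element width. The short sketch below reproduces that arithmetic; the dictionary and function names are hypothetical and exist only for illustration.

```python
# Register and element widths in bits, as described in this section.
REGISTER_BITS = {"MMX": 64, "XMM": 128}
ELEMENT_BITS = {
    "byte": 8,
    "word": 16,
    "doubleword": 32,
    "quadword": 64,
    "single-precision float": 32,
    "double-precision float": 64,
}


def packed_elements(register, element):
    """Number of elements of the given type that fit in one register."""
    return REGISTER_BITS[register] // ELEMENT_BITS[element]


print(packed_elements("MMX", "byte"))                    # 8
print(packed_elements("XMM", "single-precision float"))  # 4
print(packed_elements("XMM", "byte"))                    # 16
```

A single SIMD instruction operates on all packed elements at once, so moving a byte-oriented integer loop from MMX to XMM registers doubles the number of elements handled per instruction.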


2.2.5

Hyper-Threading Technology

Hyper-Threading (HT) Technology was developed to improve the performance of IA-32 processors when executing multi-threaded operating system and application code or single-threaded applications under multi-tasking environments. The technology enables a single physical processor to execute two or more separate code streams (threads) concurrently using shared execution resources.

HT Technology is one form of hardware multi-threading capability in IA-32 processor families. It differs from multi-processor capability using separate physically distinct packages with each physical processor package mated with a physical socket. HT Technology provides hardware multi-threading capability with a single physical package by using shared execution resources in a processor core.

Architecturally, an IA-32 processor that supports HT Technology consists of two or more logical processors, each of which has its own IA-32 architectural state. Each logical processor consists of a full set of IA-32 data registers, segment registers, control registers, debug registers, and most of the MSRs. Each also has its own advanced programmable interrupt controller (APIC).

Figure 2-5 shows a comparison of a processor that supports HT Technology (implemented with two logical processors) and a traditional dual processor system.

[Figure 2-5 block diagram. Left: an IA-32 processor supporting Hyper-Threading Technology, with two architectural states (AS) on one processor core; two logical processors that share a single core. Right: a traditional multiple processor (MP) system with two IA-32 processors, each with its own architectural state and processor core; each processor is a separate physical package. AS = IA-32 Architectural State.]

Figure 2-5. Comparison of an IA-32 Processor Supporting Hyper-Threading Technology and a Traditional Dual Processor System

Unlike a traditional MP system configuration that uses two or more separate physical IA-32 processors, the logical processors in an IA-32 processor supporting HT Technology share the core resources of the physical processor. This includes the execution engine and the system bus interface. After power up and initialization, each logical processor can be independently directed to execute a specified thread, interrupted, or halted.

HT Technology leverages the process and thread-level parallelism found in contemporary operating systems and high-performance applications by providing two or more logical processors on a single chip. This configuration allows two or more threads¹ to be executed simultaneously on each physical processor. Each logical processor executes instructions from an application thread using the resources in the processor core. The core executes these threads concurrently, using out-of-order instruction scheduling to maximize the use of execution units during each clock cycle.

2.2.5.1

Some Implementation Notes

All HT Technology configurations require:

• A processor that supports HT Technology
• A chipset and BIOS that utilize the technology
• Operating system optimizations

See http://www.intel.com/products/ht/hyperthreading_more.htm for information.

At the firmware (BIOS) level, the basic procedures to initialize the logical processors in a processor supporting HT Technology are the same as those for a traditional DP or MP platform. The mechanisms that are described in the Multiprocessor Specification, Version 1.4 to power-up and initialize physical processors in an MP system also apply to logical processors in a processor that supports HT Technology.

An operating system designed to run on a traditional DP or MP platform may use CPUID to determine the presence of the hardware multi-threading support feature and the number of logical processors provided.

Although existing operating system and application code should run correctly on a processor that supports HT Technology, some code modifications are recommended to get the optimum benefit. These modifications are discussed in Chapter 7, “Multiple-Processor Management,” Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
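As an illustration of the CPUID check mentioned above, the sketch below decodes register values of the kind returned by CPUID leaf 1. It does not execute CPUID itself; it only interprets values passed in. The bit positions used (EDX bit 28 for the hardware multi-threading flag, EBX bits 23:16 for the number of addressable logical processor IDs in a package) come from the CPUID instruction reference rather than from this section, and the sample register values and the function name `decode_multithreading` are hypothetical.

```python
HTT_FLAG_BIT = 28          # CPUID leaf 1, EDX bit 28
LOGICAL_COUNT_SHIFT = 16   # CPUID leaf 1, EBX bits 23:16
LOGICAL_COUNT_MASK = 0xFF


def decode_multithreading(ebx, edx):
    """Return (flag_set, logical_ids_per_package) from CPUID leaf 1 values.

    If the flag is clear, the count field is not meaningful and the
    package is treated as exposing a single logical processor.
    """
    htt = bool((edx >> HTT_FLAG_BIT) & 1)
    count = (ebx >> LOGICAL_COUNT_SHIFT) & LOGICAL_COUNT_MASK if htt else 1
    return htt, count


# Hypothetical sample values, for illustration only.
print(decode_multithreading(ebx=0x00020800, edx=0xBFEBFBFF))  # (True, 2)
print(decode_multithreading(ebx=0x00010800, edx=0x0FEBFBFF))  # (False, 1)
```

Note that the flag indicates that the package can expose more than one logical processor; by itself it does not distinguish HT Technology from multiple cores, so real detection code follows the fuller procedure in Volume 3A.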

2.2.6

Multi-Core Technology

Multi-core technology is another form of hardware multi-threading capability in IA-32 processor families. Multi-core technology enhances hardware multi-threading capability by providing two or more execution cores in a physical package.

The Intel Pentium processor Extreme Edition is the first member in the IA-32 processor family to introduce multi-core technology. The processor provides hardware multi-threading support with both two processor cores and Hyper-Threading Technology. This means that the Intel Pentium processor Extreme Edition provides four logical processors in a physical package (two logical processors for each processor core). The Dual-Core Intel Xeon processor features multi-core technology and Hyper-Threading Technology, and supports multi-processor platforms.

The Intel Pentium D processor also features multi-core technology. This processor provides hardware multi-threading support with two processor cores but does not offer Hyper-Threading Technology. This means that the Intel Pentium D processor provides two logical processors in a physical package, with each logical processor owning the complete execution resources of a processor core.

The Intel Core 2 processor family, Intel Xeon processor 5100 series, and Intel Core Duo processor offer power-efficient multi-core technology. These processors contain two cores that share a smart second level cache. The Level 2 cache enables efficient data sharing between two cores to reduce memory traffic to the system bus.

1. In the remainder of this document, the term “thread” will be used as a general term for the terms “process” and “thread.”

[Figure 2-6 block diagrams for three processors: Intel Core Duo Processor, Pentium D Processor, and Pentium Processor Extreme Edition. Each diagram shows architectural state, execution engines, local APICs, second level cache, and bus interface, connected to the system bus.]

Figure 2-6. IA-32 Processors that Support Dual-Core
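The logical processor counts quoted in this section are the product of cores per package and hardware threads per core. A minimal sketch of that arithmetic follows; the function name and the dictionary layout are hypothetical, and the (cores, threads per core) pairs restate what the text above says about each processor.

```python
def logical_processors(cores_per_package, threads_per_core):
    """Logical processors exposed by one physical package."""
    return cores_per_package * threads_per_core


# (cores, threads per core) as described in the text above.
packages = {
    "Pentium processor Extreme Edition": (2, 2),  # dual core with HT Technology
    "Pentium D": (2, 1),                          # dual core, no HT Technology
}

for name, (cores, threads) in packages.items():
    print(name, logical_processors(cores, threads))
# Pentium processor Extreme Edition 4
# Pentium D 2
```

A single-core processor with HT Technology, as in Section 2.2.5, corresponds to `logical_processors(1, 2)`, which also yields two logical processors, but those two share one execution core.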

2.2.7

Intel® 64 Architecture

Intel 64 architecture increases the linear address space for software to 64 bits and supports physical address space up to 40 bits. The technology also introduces a new operating mode referred to as IA-32e mode.

IA-32e mode operates in one of two sub-modes: (1) compatibility mode enables a 64-bit operating system to run most legacy 32-bit software unmodified, (2) 64-bit mode enables a 64-bit operating system to run applications written to access 64-bit address space.

In the 64-bit mode, applications may access:

• 64-bit flat linear addressing
• 8 additional general-purpose registers (GPRs)
• 8 additional registers for streaming SIMD extensions (SSE, SSE2, SSE3 and SSSE3)
• 64-bit-wide GPRs and instruction pointers
• uniform byte-register addressing
• fast interrupt-prioritization mechanism
• a new instruction-pointer relative-addressing mode

An Intel 64 architecture processor supports existing IA-32 software because it is able to run all non-64-bit legacy modes supported by IA-32 architecture. Most existing IA-32 applications also run in compatibility mode.
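The address-space and register figures above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the function and constant names are hypothetical.

```python
def address_space_bytes(address_bits):
    """Number of distinct byte addresses reachable with the given width."""
    return 1 << address_bits


physical = address_space_bytes(40)   # physical address space, up to 40 bits
linear = address_space_bytes(64)     # linear address space in 64-bit mode

print(physical // 2**40, "TiB")  # 1 TiB
print(linear // 2**60, "EiB")    # 16 EiB

# Register counts: 8 legacy registers plus the 8 additional ones
# that become available in 64-bit mode.
LEGACY_GPRS, LEGACY_XMM = 8, 8
print(LEGACY_GPRS + 8, LEGACY_XMM + 8)  # 16 16
```

For comparison, a 32-bit linear address reaches `address_space_bytes(32)`, or 4 GiB, so the 64-bit linear address space is 2^32 times larger.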

2.2.8

Intel® Virtualization Technology (Intel® VT)

Intel® Virtualization Technology for Intel 64 and IA-32 architectures provides extensions that support virtualization. The extensions are referred to as Virtual Machine Extensions (VMX). An Intel 64 or IA-32 platform with VMX can function as multiple virtual systems (or virtual machines). Each virtual machine can run operating systems and applications in separate partitions.

VMX also provides a programming interface for a new layer of system software (called the Virtual Machine Monitor (VMM)) used to manage the operation of virtual machines. Information on VMX and on the programming of VMMs is in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B. Chapter 5, “VMX Instruction Reference,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B, provides information on VMX instructions.

2.3

INTEL® 64 AND IA-32 PROCESSOR GENERATIONS

In the mid-1960s, Intel cofounder and Chairman Emeritus Gordon Moore made this observation: “the number of transistors that would be incorporated on a silicon die would double every 18 months for the next several years.” Over the past three and a half decades, this prediction, known as “Moore's Law,” has continued to hold true.

The computing power and the complexity (or roughly, the number of transistors per processor) of Intel architecture processors have grown in close relation to Moore's Law. By taking advantage of new process technology and new microarchitecture designs, each new generation of IA-32 processors has demonstrated frequency-scaling headroom and new performance levels over the previous generation of processors.

The key features of the Intel Pentium 4 processor, Intel Xeon processor, Intel Xeon processor MP, Pentium III processor, and Pentium III Xeon processor with advanced transfer cache are shown in Table 2-1. Older generation IA-32 processors, which do not employ on-die Level 2 cache, are shown in Table 2-3.


Table 2-1. Key Features of Most Recent IA-32 Processors

| Intel Processor | Date Introduced | Microarchitecture | Top-Bin Clock Frequency at Introduction | Transistors | Register Sizes [1] | System Bus Bandwidth | Max. Extern. Addr. Space | On-Die Caches [2] |
|---|---|---|---|---|---|---|---|---|
| Intel Pentium M Processor 755 [3] | 2004 | Intel Pentium M Processor | 2.00 GHz | 140 M | GP: 32; FPU: 80; MMX: 64; XMM: 128 | 3.2 GB/s | 4 GB | L1: 64 KB; L2: 2 MB |
| Intel Core Duo Processor T2600 [3] | 2006 | Improved Intel Pentium M Processor Microarchitecture; Dual Core; Intel Smart Cache, Advanced Thermal Manager | 2.16 GHz | 152 M | GP: 32; FPU: 80; MMX: 64; XMM: 128 | 5.3 GB/s | 4 GB | L1: 64 KB; L2: 2 MB (2 MB Total) |

NOTES:
1. The register size and external data bus size are given in bits.
2. First level cache is denoted using the abbreviation L1, 2nd level cache is denoted as L2. The size of L1 includes the first-level data cache and the instruction cache where applicable, but does not include the trace cache.
3. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Table 2-2. Key Features of Most Recent Intel 64 Processors

| Intel Processor | Date Introduced | Microarchitecture | Top-Bin Clock Frequency at Introduction | Transistors | Register Sizes | System Bus Bandwidth | Max. Extern. Addr. Space | On-Die Caches |
|---|---|---|---|---|---|---|---|---|
| 64-bit Intel Xeon Processor with 800 MHz System Bus | 2004 | Intel NetBurst Microarchitecture; Hyper-Threading Technology; Intel 64 Architecture | 3.60 GHz | 125 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 6.4 GB/s | 64 GB | 12K µop Execution Trace Cache; 16 KB L1; 1 MB L2 |
| 64-bit Intel Xeon Processor MP with 8MB L3 | 2005 | Intel NetBurst Microarchitecture; Hyper-Threading Technology; Intel 64 Architecture | 3.33 GHz | 675 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 5.3 GB/s [1] | 1024 GB (1 TB) | 12K µop Execution Trace Cache; 16 KB L1; 1 MB L2; 8 MB L3 |
| Intel Pentium 4 Processor Extreme Edition Supporting Hyper-Threading Technology | 2005 | Intel NetBurst Microarchitecture; Hyper-Threading Technology; Intel 64 Architecture | 3.73 GHz | 164 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 8.5 GB/s | 64 GB | 12K µop Execution Trace Cache; 16 KB L1; 2 MB L2 |
| Intel Pentium Processor Extreme Edition 840 | 2005 | Intel NetBurst Microarchitecture; Hyper-Threading Technology; Intel 64 Architecture; Dual-core [2] | 3.20 GHz | 230 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 6.4 GB/s | 64 GB | 12K µop Execution Trace Cache; 16 KB L1; 1 MB L2 (2 MB Total) |
| Dual-Core Intel Xeon Processor 7041 | 2005 | Intel NetBurst Microarchitecture; Hyper-Threading Technology; Intel 64 Architecture; Dual-core [3] | 3.00 GHz | 321 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 6.4 GB/s | 64 GB | 12K µop Execution Trace Cache; 16 KB L1; 2 MB L2 (4 MB Total) |
| Intel Pentium 4 Processor 672 | 2005 | Intel NetBurst Microarchitecture; Hyper-Threading Technology; Intel 64 Architecture; Intel Virtualization Technology | 3.80 GHz | 164 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 6.4 GB/s | 64 GB | 12K µop Execution Trace Cache; 16 KB L1; 2 MB L2 |
| Intel Pentium Processor Extreme Edition 955 | 2006 | Intel NetBurst Microarchitecture; Intel 64 Architecture; Dual Core; Intel Virtualization Technology | 3.46 GHz | 376 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 8.5 GB/s | 64 GB | 12K µop Execution Trace Cache; 16 KB L1; 2 MB L2 (4 MB Total) |
| Intel Core 2 Extreme Processor | 2006 | Intel Core Microarchitecture; Dual Core; Intel 64 Architecture; Intel Virtualization Technology | 2.93 GHz | 291 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 8.5 GB/s | 64 GB | L1: 64 KB; L2: 4 MB (4 MB Total) |
| Intel Xeon Processor 5160 [3] | 2006 | Intel Core Microarchitecture; Dual Core; Intel 64 Architecture; Intel Virtualization Technology | 3.00 GHz | 291 M | GP: 32, 64; FPU: 80; MMX: 64; XMM: 128 | 10.6 GB/s | 64 GB | L1: 64 KB; L2: 4 MB (4 MB Total) |

NOTES:
1. The 64-bit Intel Xeon Processor MP with an 8-MByte L3 supports a multi-processor platform with a dual system bus; this creates a platform bandwidth of 10.6 GBytes.
2. In the Intel Pentium Processor Extreme Edition 840, the size of each on-die cache is listed for each core. The total size of L2 in the physical package is 2 MBytes.
3. In the Dual-Core Intel Xeon Processor 7041, the size of each on-die cache is listed for each core. The total size of L2 in the physical package is 4 MBytes.


Table 2-3. Key Features of Previous Generations of IA-32 Processors

| Intel Processor | Date Introduced | Max. Clock Frequency/Technology at Introduction | Transistors | Register Sizes [1] | Ext. Data Bus Size [2] | Max. Extern. Addr. Space | Caches |
|---|---|---|---|---|---|---|---|
| 8086 | 1978 | 8 MHz | 29 K | 16 GP | 16 | 1 MB | None |
| Intel 286 | 1982 | 12.5 MHz | 134 K | 16 GP | 16 | 16 MB | Note 3 |
| Intel386 DX Processor | 1985 | 20 MHz | 275 K | 32 GP | 32 | 4 GB | Note 3 |
| Intel486 DX Processor | 1989 | 25 MHz | 1.2 M | 32 GP; 80 FPU | 32 | 4 GB | L1: 8 KB |
| Pentium Processor | 1993 | 60 MHz | 3.1 M | 32 GP; 80 FPU | 64 | 4 GB | L1: 16 KB |
| Pentium Pro Processor | 1995 | 200 MHz | 5.5 M | 32 GP; 80 FPU | 64 | 64 GB | L1: 16 KB; L2: 256 KB or 512 KB |
| Pentium II Processor | 1997 | 266 MHz | 7 M | 32 GP; 80 FPU; 64 MMX | 64 | 64 GB | L1: 32 KB; L2: 256 KB or 512 KB |
| Pentium III Processor | 1999 | 500 MHz | 8.2 M | 32 GP; 80 FPU; 64 MMX; 128 XMM | 64 | 64 GB | L1: 32 KB; L2: 512 KB |
| Pentium III and Pentium III Xeon Processors | 1999 | 700 MHz | 28 M | 32 GP; 80 FPU; 64 MMX; 128 XMM | 64 | 64 GB | L1: 32 KB; L2: 256 KB |
| Pentium 4 Processor | 2000 | 1.50 GHz, Intel NetBurst Microarchitecture | 42 M | 32 GP; 80 FPU; 64 MMX; 128 XMM | 64 | 64 GB | 12K µop Execution Trace Cache; L1: 8 KB; L2: 256 KB |
| Intel Xeon Processor | 2001 | 1.70 GHz, Intel NetBurst Microarchitecture | 42 M | 32 GP; 80 FPU; 64 MMX; 128 XMM | 64 | 64 GB | 12K µop Execution Trace Cache; L1: 8 KB; L2: 512 KB |
| Intel Xeon Processor | 2002 | 2.20 GHz, Intel NetBurst Microarchitecture, Hyper-Threading Technology | 55 M | 32 GP; 80 FPU; 64 MMX; 128 XMM | 64 | 64 GB | 12K µop Execution Trace Cache; L1: 8 KB; L2: 512 KB |
| Pentium M Processor | 2003 | 1.60 GHz, Intel Pentium M Processor Microarchitecture | 77 M | 32 GP; 80 FPU; 64 MMX; 128 XMM | 64 | 4 GB | L1: 64 KB; L2: 1 MB |
| Intel Pentium 4 Processor Supporting Hyper-Threading Technology at 90 nm process | 2004 | 3.40 GHz, Intel NetBurst Microarchitecture, Hyper-Threading Technology | 125 M | 32 GP; 80 FPU; 64 MMX; 128 XMM | 64 | 64 GB | 12K µop Execution Trace Cache; L1: 16 KB; L2: 1 MB |

NOTES:
1. The register size and external data bus size are given in bits. Note also that each 32-bit general-purpose (GP) register can be addressed as an 8- or a 16-bit data register in all of the processors.
2. Internal data paths are 2 to 4 times wider than the external data bus for each processor.
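Note 1 above says that each 32-bit general-purpose register can also be addressed as an 8-bit or a 16-bit data register. The following is a minimal sketch of that idea in Python, treating a register as a plain integer. The function name is hypothetical, and the view names AX, AL, and AH (the sub-registers of EAX) are used here as an assumption for illustration; they are not defined in this part of the text.

```python
def sub_registers(eax: int) -> dict:
    """Return the 16-bit and 8-bit views of a 32-bit register value."""
    eax &= 0xFFFFFFFF                # keep only the low 32 bits
    return {
        "AX": eax & 0xFFFF,          # bits 15:0
        "AL": eax & 0xFF,            # bits 7:0
        "AH": (eax >> 8) & 0xFF,     # bits 15:8
    }


views = sub_registers(0x12345678)
print({name: hex(value) for name, value in views.items()})
```

The sketch only models how the narrower views overlap the wide register; it does not model how writes to a narrow view affect the remaining bits.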


CHAPTER 3

BASIC EXECUTION ENVIRONMENT

This chapter describes the basic execution environment of an Intel 64 or IA-32 processor as seen by assembly-language programmers. It describes how the processor executes instructions and how it stores and manipulates data. The execution environment described here includes memory (the address space), general-purpose data registers, segment registers, the flag register, and the instruction pointer register.

3.1

MODES OF OPERATION

The IA-32 architecture supports three basic operating modes: protected mode, real-address mode, and system management mode. The operating mode determines which instructions and architectural features are accessible:



• Protected mode: This mode is the native state of the processor. Among the capabilities of protected mode is the ability to directly execute “real-address mode” 8086 software in a protected, multi-tasking environment. This feature is called virtual-8086 mode, although it is not actually a processor mode. Virtual-8086 mode is actually a protected mode attribute that can be enabled for any task.

• Real-address mode: This mode implements the programming environment of the Intel 8086 processor with extensions (such as the ability to switch to protected or system management mode). The processor is placed in real-address mode following power-up or a reset.

• System management mode (SMM): This mode provides an operating system or executive with a transparent mechanism for implementing platform-specific functions such as power management and system security. The processor enters SMM when the external SMM interrupt pin (SMI#) is activated or an SMI is received from the advanced programmable interrupt controller (APIC).
  In SMM, the processor switches to a separate address space while saving the basic context of the currently running program or task. SMM-specific code may then be executed transparently. Upon returning from SMM, the processor is placed back into its state prior to the system management interrupt. SMM was introduced with the Intel386™ SL and Intel486™ SL processors and became a standard IA-32 feature with the Pentium processor family.


3.1.1

Intel® 64 Architecture

Intel 64 architecture adds IA-32e mode. IA-32e mode has two sub-modes. These are:



• Compatibility mode (sub-mode of IA-32e mode): Compatibility mode permits most legacy 16-bit and 32-bit applications to run without re-compilation under a 64-bit operating system. For brevity, the compatibility sub-mode is referred to as compatibility mode in IA-32 architecture. The execution environment of compatibility mode is the same as described in Section 3.2. Compatibility mode also supports all of the privilege levels that are supported in 64-bit and protected modes. Legacy applications that run in virtual-8086 mode or use hardware task management will not work in this mode.
  Compatibility mode is enabled by the operating system (OS) on a code segment basis. This means that a single 64-bit OS can support 64-bit applications running in 64-bit mode and support legacy 32-bit applications (not recompiled for 64 bits) running in compatibility mode.
  Compatibility mode is similar to 32-bit protected mode. Applications access only the first 4 GBytes of linear-address space. Compatibility mode uses 16-bit and 32-bit address and operand sizes. Like protected mode, this mode allows applications to access physical memory greater than 4 GBytes using PAE (Physical Address Extensions).

• 64-bit mode (sub-mode of IA-32e mode): This mode enables a 64-bit operating system to run applications written to access 64-bit linear address space. For brevity, the 64-bit sub-mode is referred to as 64-bit mode in IA-32 architecture.
  64-bit mode extends the number of general-purpose registers and SIMD extension registers from 8 to 16. General-purpose registers are widened to 64 bits. The mode also introduces a new opcode prefix (REX) to access the register extensions. See Section 3.2.1 for a detailed description.
  64-bit mode is enabled by the operating system on a code-segment basis. Its default address size is 64 bits and its default operand size is 32 bits. The default operand size can be overridden on an instruction-by-instruction basis using a REX opcode prefix in conjunction with an operand size override prefix.
  REX prefixes allow a 64-bit operand to be specified when operating in 64-bit mode. By using this mechanism, many existing instructions have been promoted to allow the use of 64-bit registers and 64-bit addresses.
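The operand-size rules for 64-bit mode described above can be sketched as a small decoder. This is an illustrative model only, not an instruction decoder. The encodings it relies on are assumptions taken from the instruction format reference rather than from this section: a REX prefix is a byte in the range 40H through 4FH whose low four bits are named W, R, X, and B, the operand-size override prefix is 66H, and REX.W = 1 selects a 64-bit operand even when 66H is also present. The function names are hypothetical.

```python
OPERAND_SIZE_OVERRIDE = 0x66  # assumed encoding of the operand-size prefix


def decode_rex(byte: int):
    """Split a REX prefix byte (40H-4FH) into its W, R, X, B bits.

    Returns None when the byte is not a REX prefix.
    """
    if byte & 0xF0 != 0x40:
        return None
    return {
        "W": (byte >> 3) & 1,  # 1 = 64-bit operand size
        "R": (byte >> 2) & 1,  # register-field extension
        "X": (byte >> 1) & 1,  # index-field extension
        "B": byte & 1,         # base-field extension
    }


def operand_size_64bit_mode(prefixes: bytes) -> int:
    """Effective operand size in bits, starting from the 32-bit default."""
    size = 32
    rex_w = 0
    for prefix in prefixes:
        if prefix == OPERAND_SIZE_OVERRIDE:
            size = 16
        rex = decode_rex(prefix)
        if rex is not None:
            rex_w = rex["W"]
    # REX.W takes precedence over the 66H override.
    return 64 if rex_w else size


for sample in (b"", b"\x66", b"\x48", b"\x66\x48"):
    print(sample.hex() or "(none)", operand_size_64bit_mode(sample))
```

The model covers only instructions whose default operand size in 64-bit mode is 32 bits; instructions with a different default are outside its scope.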


3.2

OVERVIEW OF THE BASIC EXECUTION ENVIRONMENT

Any program or task running on an IA-32 processor is given a set of resources for executing instructions and for storing code, data, and state information. These resources (described briefly in the following paragraphs and shown in Figure 3-1) make up the basic execution environment for an IA-32 processor.

An Intel 64 processor supports the basic execution environment of an IA-32 processor, and a similar environment under IA-32e mode that can execute 64-bit programs (64-bit sub-mode) and 32-bit programs (compatibility sub-mode).

The basic execution environment is used jointly by the application programs and the operating system or executive running on the processor.



• Address space: Any task or program running on an IA-32 processor can address a linear address space of up to 4 GBytes (2^32 bytes) and a physical address space of up to 64 GBytes (2^36 bytes). See Section 3.3.6, “Extended Physical Addressing in Protected Mode,” for more information about addressing an address space greater than 4 GBytes.

• Basic program execution registers: The eight general-purpose registers, the six segment registers, the EFLAGS register, and the EIP (instruction pointer) register comprise a basic execution environment in which to execute a set of general-purpose instructions. These instructions perform basic integer arithmetic on byte, word, and doubleword integers, handle program flow control, operate on bit and byte strings, and address memory. See Section 3.4, “Basic Program Execution Registers,” for more information about these registers.

• x87 FPU registers: The eight x87 FPU data registers, the x87 FPU control register, the status register, the x87 FPU instruction pointer register, the x87 FPU operand (data) pointer register, the x87 FPU tag register, and the x87 FPU opcode register provide an execution environment for operating on single-precision, double-precision, and double extended-precision floating-point values, word integers, doubleword integers, quadword integers, and binary coded decimal (BCD) values. See Section 8.1, “x87 FPU Execution Environment,” for more information about these registers.

• MMX registers: The eight MMX registers support execution of single-instruction, multiple-data (SIMD) operations on 64-bit packed byte, word, and doubleword integers. See Section 9.2, “The MMX Technology Programming Environment,” for more information about these registers.

• XMM registers: The eight XMM data registers and the MXCSR register support execution of SIMD operations on 128-bit packed single-precision and double-precision floating-point values and on 128-bit packed byte, word, doubleword, and quadword integers. See Section 10.2, “SSE Programming Environment,” for more information about these registers.


[Figure 3-1. IA-32 Basic Execution Environment for Non-64-bit Modes. The figure shows:
Basic program execution registers: eight 32-bit general-purpose registers; six 16-bit segment registers; the 32-bit EFLAGS register; the 32-bit EIP (instruction pointer) register.
Address space*: 0 to 2^32 - 1.
FPU registers: eight 80-bit floating-point data registers; 16-bit control, status, and tag registers; the 11-bit opcode register; the 48-bit FPU instruction pointer register; the 48-bit FPU data (operand) pointer register.
MMX registers: eight 64-bit registers.
XMM registers: eight 128-bit registers; the 32-bit MXCSR register.
* The address space can be flat or segmented. Using the physical address extension mechanism, a physical address space of 2^36 - 1 can be addressed.]




• Stack: To support procedure or subroutine calls and the passing of parameters between procedures or subroutines, a stack and stack management resources are included in the execution environment. The stack (not shown in Figure 3-1) is located in memory. See Section 6.2, “Stacks,” for more information about stack structure.

In addition to the resources provided in the basic execution environment, the IA-32 architecture provides the following resources as part of its system-level architecture. They provide extensive support for operating-system and system-development software. Except for the I/O ports, the system resources are described in detail in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A & 3B.



• I/O ports: The IA-32 architecture supports the transfer of data to and from input/output (I/O) ports. See Chapter 13, “Input/Output,” in this volume.

• Control registers: The five control registers (CR0 through CR4) determine the operating mode of the processor and the characteristics of the currently executing task. See Chapter 2, “System Architecture Overview,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

• Memory management registers: The GDTR, IDTR, task register, and LDTR specify the locations of data structures used in protected mode memory management. See Chapter 2, “System Architecture Overview,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

• Debug registers: The debug registers (DR0 through DR7) control and allow monitoring of the processor’s debugging operations. See Chapter 18, “Debugging and Performance Monitoring,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.

• Memory type range registers (MTRRs): The MTRRs are used to assign memory types to regions of memory. See the sections on MTRRs in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.

• Machine specific registers (MSRs): The processor provides a variety of machine specific registers that are used to control and report on processor performance. Virtually all MSRs handle system-related functions and are not accessible to an application program. One exception to this rule is the time-stamp counter. The MSRs are described in Appendix B, “Model-Specific Registers (MSRs),” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.

• Machine check registers: The machine check registers consist of a set of control, status, and error-reporting MSRs that are used to detect and report on hardware (machine) errors. See Chapter 14, “Machine-Check Architecture,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

• Performance monitoring counters: The performance monitoring counters allow processor performance events to be monitored. See Chapter 18, “Debugging and Performance Monitoring,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.

The remainder of this chapter describes the organization of memory and the address space, the basic program execution registers, and addressing modes. Refer to the following chapters in this volume for descriptions of the other program execution resources shown in Figure 3-1:

• x87 FPU registers: See Chapter 8, “Programming with the x87 FPU.”

• MMX registers: See Chapter 9, “Programming with Intel® MMX™ Technology.”

• XMM registers: See Chapter 10, “Programming with Streaming SIMD Extensions (SSE),” Chapter 11, “Programming with Streaming SIMD Extensions 2 (SSE2),” and Chapter 12, “Programming with SSE3 and Supplemental SSE3.”

• Stack implementation and procedure calls: See Chapter 6, “Procedure Calls, Interrupts, and Exceptions.”

3.2.1

64-Bit Mode Execution Environment

The execution environment for 64-bit mode is similar to that described in Section 3.2. The following paragraphs describe the differences that apply.



• Address space: A task or program running in 64-bit mode on a processor supporting Intel 64 architecture can address a linear address space of up to 2^64 bytes (subject to the canonical addressing requirement described in Section 3.3.7.1) and a physical address space of up to 2^40 bytes. Software can query CPUID for the physical address size supported by a processor.

• Basic program execution registers: The number of general-purpose registers (GPRs) available is 16. GPRs are 64 bits wide and they support operations on byte, word, doubleword, and quadword integers. Accessing byte registers is done uniformly to the lowest 8 bits. The instruction pointer register becomes 64 bits. The EFLAGS register is extended to 64 bits wide, and is referred to as the RFLAGS register. The upper 32 bits of RFLAGS are reserved. The lower 32 bits of RFLAGS are the same as EFLAGS. See Figure 3-2.

• XMM registers: There are 16 XMM data registers for SIMD operations. See Section 10.2, “SSE Programming Environment,” for more information about these registers.

• Stack: The stack pointer size is 64 bits. Stack size is not controlled by a bit in the SS descriptor (as it is in non-64-bit modes), nor can the pointer size be overridden by an instruction prefix.



• Control registers: Control registers expand to 64 bits. A new control register (the task priority register: CR8 or TPR) has been added. See Chapter 2, “System Architecture Overview,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.



• Debug registers: Debug registers expand to 64 bits. See Chapter 18, “Debugging and Performance Monitoring,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.



• Descriptor table registers: The global descriptor table register (GDTR) and interrupt descriptor table register (IDTR) expand to 10 bytes so that they can hold a full 64-bit base address. The local descriptor table register (LDTR) and the task register (TR) also expand to hold a full 64-bit base address.
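The 10-byte size quoted for the GDTR and IDTR can be checked with a short sketch. It assumes, beyond what this section states, that the 10 bytes are laid out as a 2-byte table limit followed by an 8-byte base address in little-endian order. The function names and the sample values are hypothetical.

```python
import struct

# "<" = little-endian with no padding, "H" = 2-byte limit, "Q" = 8-byte base.
DESCRIPTOR_TABLE_REGISTER_FORMAT = "<HQ"


def pack_descriptor_table_register(limit: int, base: int) -> bytes:
    """Build a 10-byte image: 16-bit limit followed by 64-bit base."""
    return struct.pack(DESCRIPTOR_TABLE_REGISTER_FORMAT, limit, base)


def unpack_descriptor_table_register(image: bytes):
    """Recover (limit, base) from a 10-byte image."""
    return struct.unpack(DESCRIPTOR_TABLE_REGISTER_FORMAT, image)


image = pack_descriptor_table_register(0x0FFF, 0xFFFF800000001000)
print(len(image), image.hex())
```

With a 4-byte base the same layout would occupy 6 bytes, which is why widening the base to 64 bits brings the total to 10.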

[Figure 3-2. 64-Bit Mode Execution Environment. The figure shows:
Basic program execution registers: sixteen 64-bit general-purpose registers; six 16-bit segment registers; the 64-bit RFLAGS register; the 64-bit RIP (instruction pointer) register.
Address space: 0 to 2^64 - 1.
FPU registers: eight 80-bit floating-point data registers; 16-bit control, status, and tag registers; the 11-bit opcode register; the 64-bit FPU instruction pointer register; the 64-bit FPU data (operand) pointer register.
MMX registers: eight 64-bit registers.
XMM registers: sixteen 128-bit registers; the 32-bit MXCSR register.]


3.3

MEMORY ORGANIZATION

The memory that the processor addresses on its bus is called physical memory. Physical memory is organized as a sequence of 8-bit bytes. Each byte is assigned a unique address, called a physical address. The physical address space ranges from zero to a maximum of 2^36 − 1 (64 GBytes) if the processor does not support Intel 64 architecture. Intel 64 architecture introduces changes in physical and linear address space; these are described in Section 3.3.3, Section 3.3.4, and Section 3.3.7.

Virtually any operating system or executive designed to work with an IA-32 or Intel 64 processor will use the processor’s memory management facilities to access memory. These facilities provide features such as segmentation and paging, which allow memory to be managed efficiently and reliably. Memory management is described in detail in Chapter 3, “Protected-Mode Memory Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A. The following paragraphs describe the basic methods of addressing memory when memory management is used.
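As a quick arithmetic check of the address-space sizes quoted in this chapter, the following sketch converts an address width in bits into a byte range. The widths 32, 36, and 40 come from the text; the function names are hypothetical.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 1 << address_bits


def describe(address_bits: int) -> str:
    """Format the address range and its size in GBytes (2**30 bytes)."""
    size = address_space_bytes(address_bits)
    gbytes = size // (1 << 30)
    return f"{address_bits}-bit: 0 to {size - 1:X}H ({gbytes} GBytes)"


for bits in (32, 36, 40):
    print(describe(bits))
```

The 36-bit case reproduces the 64-GByte physical limit stated above, and the 40-bit case gives 1024 GBytes (1 TByte).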

3.3.1

IA-32 Memory Models

When employing the processor’s memory management facilities, programs do not directly address physical memory. Instead, they access memory using one of three memory models: flat, segmented, or real-address mode:



• Flat memory model: Memory appears to a program as a single, continuous address space (Figure 3-3). This space is called a linear address space. Code, data, and stacks are all contained in this address space. Linear address space is byte addressable, with addresses running contiguously from 0 to 2^32 - 1 (if not in 64-bit mode). An address for any byte in linear address space is called a linear address.



• Segmented memory model: Memory appears to a program as a group of independent address spaces called segments. Code, data, and stacks are typically contained in separate segments. To address a byte in a segment, a program issues a logical address. This consists of a segment selector and an offset (logical addresses are often referred to as far pointers). The segment selector identifies the segment to be accessed and the offset identifies a byte in the address space of the segment. Programs running on an IA-32 processor can address up to 16,383 segments of different sizes and types, and each segment can be as large as 2^32 bytes.
  Internally, all the segments that are defined for a system are mapped into the processor’s linear address space. To access a memory location, the processor thus translates each logical address into a linear address. This translation is transparent to the application program.
  The primary reason for using segmented memory is to increase the reliability of programs and systems. For example, placing a program’s stack in a separate segment prevents the stack from growing into the code or data space and overwriting instructions or data, respectively.



• Real-address mode memory model: This is the memory model for the Intel 8086 processor. It is supported to provide compatibility with existing programs written to run on the Intel 8086 processor. The real-address mode uses a specific implementation of segmented memory in which the linear address space for the program and the operating system/executive consists of an array of segments of up to 64 KBytes in size each. The maximum size of the linear address space in real-address mode is 2^20 bytes.
  See also: Chapter 15, “8086 Emulation,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
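The real-address mode model above can be illustrated with a small calculation. The sketch assumes the 8086 translation rule, which this section does not spell out: the 16-bit segment selector is shifted left by 4 bits and added to the 16-bit offset. It also assumes 8086-style wraparound at 2^20 bytes, which is a simplification; later processors can behave differently at the top of the range. The function name is hypothetical.

```python
REAL_MODE_ADDRESS_MASK = 0xFFFFF  # 2**20 - 1, the limit stated in the text


def real_mode_linear_address(segment: int, offset: int) -> int:
    """Linear address for a 16-bit segment:offset pair, wrapped to 20 bits."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    return ((segment << 4) + offset) & REAL_MODE_ADDRESS_MASK


for seg, off in ((0x1234, 0x5678), (0xF000, 0xFFF0), (0xFFFF, 0x0010)):
    print(f"{seg:04X}:{off:04X} -> {real_mode_linear_address(seg, off):05X}H")
```

Because each segment value starts a new 64-KByte window every 16 bytes, many different segment:offset pairs name the same linear address.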

[Figure 3-3. Three Memory Management Models. Panels: in the flat model, a linear address selects a byte in the linear address space*; in the segmented model, a logical address (a segment selector and an offset, or effective address) selects a byte within one of several segments in the linear address space*; in the real-address mode model, a logical address (a segment selector and an offset) selects a byte in a linear address space divided into equal-sized segments.
* The linear address space can be paged when using the flat or segmented model.]


3.3.2

Paging and Virtual Memory

With the flat or the segmented memory model, linear address space is mapped into the processor’s physical address space either directly or through paging. When using direct mapping (paging disabled), each linear address has a one-to-one correspondence with a physical address. Linear addresses are sent out on the processor’s address lines without translation.

When using the IA-32 architecture’s paging mechanism (paging enabled), linear address space is divided into pages which are mapped to virtual memory. The pages of virtual memory are then mapped as needed into physical memory. When an operating system or executive uses paging, the paging mechanism is transparent to an application program. All that the application sees is linear address space.

In addition, IA-32 architecture’s paging mechanism includes extensions that support:



• Physical Address Extensions (PAE) to address physical address space greater than 4 GBytes.

• Page Size Extensions (PSE) to map linear addresses to physical addresses in 4-MByte pages.

See also: Chapter 3, “Protected-Mode Memory Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
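To make the division of linear address space into pages concrete, the sketch below splits a 32-bit linear address into its paging fields. The field boundaries are assumptions taken from the protected-mode memory management chapter referenced above, not from this section: with 4-KByte pages and PAE disabled, bits 31:22 index the page directory, bits 21:12 index a page table, and bits 11:0 are the byte offset; with a 4-MByte page (PSE), bits 21:0 are the offset. The function name is hypothetical.

```python
def split_linear_address(linear: int, large_page: bool = False) -> dict:
    """Split a 32-bit linear address into paging fields (PAE disabled)."""
    linear &= 0xFFFFFFFF
    directory = linear >> 22                      # bits 31:22
    if large_page:
        # 4-MByte page: the directory entry maps the page directly.
        return {"directory": directory, "offset": linear & 0x3FFFFF}
    return {
        "directory": directory,
        "table": (linear >> 12) & 0x3FF,          # bits 21:12
        "offset": linear & 0xFFF,                 # bits 11:0
    }


print(split_linear_address(0x12345678))
print(split_linear_address(0x12345678, large_page=True))
```

Only the offset passes through translation unchanged; the directory and table fields select entries in structures maintained by the operating system.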

3.3.3

Memory Organization in 64-Bit Mode

Intel 64 architecture supports physical address space greater than 64 GBytes; the actual physical address size of IA-32 processors is implementation specific. In 64-bit mode, there is architectural support for 64-bit linear address space. However, processors supporting Intel 64 architecture may implement less than 64 bits (see Section 3.3.7.1). The linear address space is mapped into the processor physical address space through the PAE paging mechanism.
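Because the implemented address widths vary by processor, software typically reads them at run time and checks addresses against them. The sketch below is a model of that check under several assumptions that are not stated in this section: the widths are reported by CPUID leaf 80000008H with the physical width in EAX bits 7:0 and the linear width in EAX bits 15:8; a linear address is canonical when every bit above the implemented width equals the most significant implemented bit; and the default width of 48 bits is only an example. The sample EAX value is hypothetical, and Python cannot execute CPUID itself, so the value is supplied as a plain integer.

```python
def decode_address_sizes(eax: int) -> dict:
    """Decode the address widths from a CPUID.80000008H EAX value."""
    return {
        "physical_bits": eax & 0xFF,        # EAX[7:0]
        "linear_bits": (eax >> 8) & 0xFF,   # EAX[15:8]
    }


def is_canonical(address: int, linear_bits: int = 48) -> bool:
    """True if bits 63 down to (linear_bits - 1) are all 0 or all 1."""
    address &= 0xFFFFFFFFFFFFFFFF
    upper = address >> (linear_bits - 1)
    all_ones = (1 << (64 - linear_bits + 1)) - 1
    return upper == 0 or upper == all_ones


sizes = decode_address_sizes(0x3028)  # hypothetical sample: 40 physical, 48 linear
print(sizes)
print(is_canonical(0x00007FFFFFFFFFFF, sizes["linear_bits"]))
print(is_canonical(0x0000800000000000, sizes["linear_bits"]))
```

The canonical form itself is defined in Section 3.3.7.1; this sketch only mirrors the rule as summarized in the lead-in.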

3.3.4

Modes of Operation vs. Memory Model

When writing code for an IA-32 or Intel 64 processor, a programmer needs to know the operating mode the processor is going to be in when executing the code and the memory model being used. The relationship between operating modes and memory models is as follows:



• Protected mode: When in protected mode, the processor can use any of the memory models described in this section. (The real-address mode memory model is ordinarily used only when the processor is in virtual-8086 mode.) The memory model used depends on the design of the operating system or executive. When multitasking is implemented, individual tasks can use different memory models.




• Real-address mode: When in real-address mode, the processor only supports the real-address mode memory model.

• System management mode: When in SMM, the processor switches to a separate address space, called the system management RAM (SMRAM). The memory model used to address bytes in this address space is similar to the real-address mode model. See Chapter 24, “System Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, for more information on the memory model used in SMM.

• Compatibility mode: Software that needs to run in compatibility mode should observe the same memory model as software targeted to run in 32-bit protected mode. The effect of segmentation is the same as in 32-bit protected mode.

• 64-bit mode: Segmentation is generally (but not completely) disabled, creating a flat 64-bit linear-address space. Specifically, the processor treats the segment base of CS, DS, ES, and SS as zero in 64-bit mode (this makes a linear address equal to an effective address). Segmented and real-address modes are not available in 64-bit mode.

3.3.5 32-Bit and 16-Bit Address and Operand Sizes

IA-32 processors in protected mode can be configured for 32-bit or 16-bit address and operand sizes. With 32-bit address and operand sizes, the maximum linear address or segment offset is FFFFFFFFH (2^32 - 1); operand sizes are typically 8 bits or 32 bits. With 16-bit address and operand sizes, the maximum linear address or segment offset is FFFFH (2^16 - 1); operand sizes are typically 8 bits or 16 bits.

When using 32-bit addressing, a logical address (or far pointer) consists of a 16-bit segment selector and a 32-bit offset; when using 16-bit addressing, an address consists of a 16-bit segment selector and a 16-bit offset.

Instruction prefixes allow temporary overrides of the default address and/or operand sizes from within a program.

When operating in protected mode, the segment descriptor for the currently executing code segment defines the default address and operand size. A segment descriptor is a system data structure not normally visible to application code. Assembler directives allow the default addressing and operand size to be chosen for a program. The assembler and other tools then set up the segment descriptor for the code segment appropriately.

When operating in real-address mode, the default addressing and operand size is 16 bits. An address-size override can be used in real-address mode to enable 32-bit addressing. However, the maximum allowable 32-bit linear address is still 000FFFFFH (2^20 - 1).


3.3.6 Extended Physical Addressing in Protected Mode

Beginning with P6 family processors, the IA-32 architecture supports addressing of up to 64 GBytes (2^36 bytes) of physical memory. A program or task cannot address locations in this address space directly. Instead, it addresses individual linear address spaces of up to 4 GBytes that are mapped to the 64-GByte physical address space through a virtual memory management mechanism. Using this mechanism, an operating system can enable a program to switch between 4-GByte linear address spaces within the 64-GByte physical address space.

The use of extended physical addressing requires the processor to operate in protected mode and the operating system to provide a virtual memory management system. See “36-Bit Physical Addressing Using the PAE Paging Mechanism” in Chapter 3, “Protected-Mode Memory Management,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

3.3.7 Address Calculations in 64-Bit Mode

In most cases, 64-bit mode uses flat address space for code, data, and stacks. In 64-bit mode (if there is no address-size override), the size of effective address calculations is 64 bits. An effective-address calculation uses 64-bit base and index registers and sign-extends displacements to 64 bits.

In the flat address space of 64-bit mode, linear addresses are equal to effective addresses because the base address is zero. In the event that FS or GS segments are used with a non-zero base, this rule does not hold. In 64-bit mode, the effective address components are added and the effective address is truncated (see, for example, the instruction LEA) before adding the full 64-bit segment base. The base is never truncated, regardless of addressing mode in 64-bit mode.

The instruction pointer is extended to 64 bits to support 64-bit code offsets. The 64-bit instruction pointer is called the RIP. Table 3-1 shows the relationship between RIP, EIP, and IP.

Table 3-1. Instruction Pointer Sizes

                             Bits 63:32        Bits 31:16        Bits 15:0
16-bit instruction pointer   Not Modified      Not Modified      IP
32-bit instruction pointer   Zero Extension    EIP (bits 31:0)
64-bit instruction pointer   RIP (bits 63:0)

Generally, displacements and immediates in 64-bit mode are not extended to 64 bits. They are still limited to 32 bits and sign-extended during effective-address calculations. In 64-bit mode, however, support is provided for 64-bit displacement and immediate forms of the MOV instruction.

All 16-bit and 32-bit address calculations are zero-extended in IA-32e mode to form 64-bit addresses. Address calculations are first truncated to the effective address size of the current mode (64-bit mode or compatibility mode), as overridden by any address-size prefix. The result is then zero-extended to the full 64-bit address width. Because of this, 16-bit and 32-bit applications running in compatibility mode can access only the low 4 GBytes of the 64-bit mode effective addresses. Likewise, a 32-bit address generated in 64-bit mode can access only the low 4 GBytes of the 64-bit mode effective addresses.
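The truncate-then-zero-extend rule can be illustrated with a short Python model. This is only a sketch under stated assumptions, not processor pseudocode: the names `sign_extend` and `linear_address` are hypothetical, the displacement is assumed to be a 32-bit value, and the final sum is assumed to wrap at 64 bits.

```python
MASK64 = (1 << 64) - 1

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def linear_address(base, index, scale, disp32, addr_size=64, seg_base=0):
    """Hypothetical model of a 64-bit mode address calculation."""
    # Add the effective-address components; the displacement is sign-extended.
    ea = base + index * scale + sign_extend(disp32, 32)
    # Truncate to the effective address size (64, or 32 with an address-size
    # override). Keeping only the low bits leaves the upper bits zero, which
    # models the zero-extension to 64 bits.
    ea &= (1 << addr_size) - 1
    # Add the full, untruncated segment base (non-zero only for FS/GS).
    # Wrapping the sum at 64 bits is an assumption of this sketch.
    return (seg_base + ea) & MASK64
```

With `addr_size=32`, a base register holding 100000000H contributes nothing above bit 31, so the resulting address stays within the low 4 GBytes unless a non-zero FS or GS base is added afterwards.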

3.3.7.1 Canonical Addressing

In 64-bit mode, an address is considered to be in canonical form if address bits 63 through the most-significant bit implemented by the microarchitecture are set to either all ones or all zeros.

Intel 64 architecture defines a 64-bit linear address. Implementations can support fewer bits. The first implementation of IA-32 processors with Intel 64 architecture supports a 48-bit linear address. This means a canonical address must have bits 63 through 48 set to zeros or ones (depending on whether bit 47 is a zero or one).

Although implementations may not use all 64 bits of the linear address, they should check bits 63 through the most-significant implemented bit to see if the address is in canonical form. If a linear-memory reference is not in canonical form, the implementation should generate an exception. In most cases, a general-protection exception (#GP) is generated. However, in the case of explicit or implied stack references, a stack fault (#SS) is generated.

Instructions that have implied stack references, by default, use the SS segment register. These include PUSH/POP-related instructions and instructions using RSP/RBP as base registers. In these cases, the canonical fault is #SS.

If an instruction uses base registers RSP/RBP and uses a segment override prefix to specify a non-SS segment, a canonical fault generates a #GP (instead of an #SS). In 64-bit mode, only FS and GS segment overrides are applicable in this situation. Other segment override prefixes (CS, DS, ES, and SS) are ignored. Note that this also means that an SS segment override applied to a “non-stack” register reference is ignored. Such a sequence still produces a #GP for a canonical fault (and not an #SS).
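The canonical-form test can be written as a short check. The sketch below is illustrative only; the function name `is_canonical` and the `implemented_bits` parameter are hypothetical, with 48 used as the default because the text names a 48-bit linear address for the first implementation.

```python
def is_canonical(addr, implemented_bits=48):
    """Return True if bits 63 down to (implemented_bits - 1) are all equal."""
    addr &= (1 << 64) - 1
    # Keep the most-significant implemented bit and everything above it
    # (bits 63..47 when 48 bits are implemented).
    upper = addr >> (implemented_bits - 1)
    all_ones = (1 << (64 - implemented_bits + 1)) - 1
    return upper == 0 or upper == all_ones
```

Under the 48-bit assumption, 00007FFFFFFFFFFFH and FFFF800000000000H pass the check, while 0000800000000000H does not, because bit 47 is one but bits 63 through 48 are zero.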

3.4 BASIC PROGRAM EXECUTION REGISTERS

IA-32 architecture provides 16 basic program execution registers for use in general system and application programming (see Figure 3-4). These registers can be grouped as follows:



• General-purpose registers. These eight registers are available for storing operands and pointers.

• Segment registers. These registers hold up to six segment selectors.

• EFLAGS (program status and control) register. The EFLAGS register reports on the status of the program being executed and allows limited (application-program level) control of the processor.

• EIP (instruction pointer) register. The EIP register contains a 32-bit pointer to the next instruction to be executed.

3.4.1 General-Purpose Registers

The 32-bit general-purpose registers EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP are provided for holding the following items:

• Operands for logical and arithmetic operations
• Operands for address calculations
• Memory pointers

Although all of these registers are available for general storage of operands, results, and pointers, caution should be used when referencing the ESP register. The ESP register holds the stack pointer and as a general rule should not be used for another purpose.

Many instructions assign specific registers to hold operands. For example, string instructions use the contents of the ECX, ESI, and EDI registers as operands. When using a segmented memory model, some instructions assume that pointers in certain registers are relative to specific segments. For instance, some instructions assume that a pointer in the EBX register points to a memory location in the DS segment.


[Figure: register layout. General-purpose registers (bits 31:0): EAX, EBX, ECX, EDX, ESI, EDI, EBP, ESP. Segment registers (bits 15:0): CS, DS, SS, ES, FS, GS. Program status and control register (bits 31:0): EFLAGS. Instruction pointer (bits 31:0): EIP.]

Figure 3-4. General System and Application Programming Registers

The special uses of general-purpose registers by instructions are described in Chapter 5, “Instruction Set Summary,” in this volume. See also: Chapter 3 and Chapter 4 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B. The following is a summary of special uses:

• EAX: Accumulator for operands and results data
• EBX: Pointer to data in the DS segment
• ECX: Counter for string and loop operations
• EDX: I/O pointer
• ESI: Pointer to data in the segment pointed to by the DS register; source pointer for string operations
• EDI: Pointer to data (or destination) in the segment pointed to by the ES register; destination pointer for string operations
• ESP: Stack pointer (in the SS segment)
• EBP: Pointer to data on the stack (in the SS segment)

As shown in Figure 3-5, the lower 16 bits of the general-purpose registers map directly to the register set found in the 8086 and Intel 286 processors and can be referenced with the names AX, BX, CX, DX, BP, SI, DI, and SP. Each of the lower two bytes of the EAX, EBX, ECX, and EDX registers can be referenced by the names AH, BH, CH, and DH (high bytes) and AL, BL, CL, and DL (low bytes).

General-Purpose Registers

32-bit (bits 31:0)   16-bit (bits 15:0)   Bits 15:8   Bits 7:0
EAX                  AX                   AH          AL
EBX                  BX                   BH          BL
ECX                  CX                   CH          CL
EDX                  DX                   DH          DL
EBP                  BP
ESI                  SI
EDI                  DI
ESP                  SP

Figure 3-5. Alternate General-Purpose Register Names
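The alternate names are views onto the same 32 bits rather than separate storage. As a rough illustration, the Python sketch below extracts the AX, AH, and AL views from a value held in EAX; the function name `sub_registers` is hypothetical and the same masks apply to EBX, ECX, and EDX.

```python
def sub_registers(eax):
    """Return the EAX, AX, AH, and AL views of one 32-bit register value."""
    eax &= 0xFFFFFFFF          # EAX: bits 31:0
    ax = eax & 0xFFFF          # AX:  bits 15:0
    return {
        "EAX": eax,
        "AX": ax,
        "AH": (ax >> 8) & 0xFF,  # AH: bits 15:8
        "AL": ax & 0xFF,         # AL: bits 7:0
    }
```

For example, if EAX holds 12345678H, this model gives AX = 5678H, AH = 56H, and AL = 78H.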

3.4.1.1 General-Purpose Registers in 64-Bit Mode

In 64-bit mode, there are 16 general-purpose registers and the default operand size is 32 bits. However, general-purpose registers are able to work with either 32-bit or 64-bit operands. If a 32-bit operand size is specified: EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP, R8D - R15D are available. If a 64-bit operand size is specified: RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP, R8 - R15 are available. R8D - R15D/R8 - R15 represent eight new general-purpose registers. All of these registers can be accessed at the byte, word, dword, and qword level. REX prefixes are used to generate 64-bit operand sizes or to reference registers R8 - R15.


Table 3-2. Addressable General Purpose Registers

Register Type          Without REX                              With REX
Byte Registers         AL, BL, CL, DL, AH, BH, CH, DH           AL, BL, CL, DL, DIL, SIL, BPL, SPL, R8L - R15L
Word Registers         AX, BX, CX, DX, DI, SI, BP, SP           AX, BX, CX, DX, DI, SI, BP, SP, R8W - R15W
Doubleword Registers   EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP   EAX, EBX, ECX, EDX, EDI, ESI, EBP, ESP, R8D - R15D
Quadword Registers     N.A.                                     RAX, RBX, RCX, RDX, RDI, RSI, RBP, RSP, R8 - R15

In 64-bit mode, there are limitations on accessing byte registers. An instruction cannot reference legacy high-bytes (for example: AH, BH, CH, DH) and one of the new byte registers at the same time (for example: the low byte of the RAX register). However, instructions may reference legacy low-bytes (for example: AL, BL, CL, or DL) and new byte registers at the same time (for example: the low byte of the R8 register, or RBP). The architecture enforces this limitation by changing high-byte references (AH, BH, CH, DH) to low-byte references (BPL, SPL, DIL, SIL: the low 8 bits for RBP, RSP, RDI, and RSI) for instructions using a REX prefix.

When in 64-bit mode, operand size determines the number of valid bits in the destination general-purpose register:

• 64-bit operands generate a 64-bit result in the destination general-purpose register.

• 32-bit operands generate a 32-bit result, zero-extended to a 64-bit result in the destination general-purpose register.

• 8-bit and 16-bit operands generate an 8-bit or 16-bit result. The upper 56 bits or 48 bits (respectively) of the destination general-purpose register are not modified by the operation. If the result of an 8-bit or 16-bit operation is intended for 64-bit address calculation, explicitly sign-extend the register to the full 64 bits.
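The three destination-size rules can be modeled in a few lines. The sketch below is a hedged illustration, not a description of hardware internals: `write_gpr` is a hypothetical helper, and its 8-bit case models only low-byte destinations such as AL or R8L, not the legacy high-byte registers (AH, BH, CH, DH).

```python
MASK64 = (1 << 64) - 1

def write_gpr(old, value, operand_bits):
    """Return the new 64-bit register contents after a write of the given size."""
    if operand_bits == 64:
        return value & MASK64
    if operand_bits == 32:
        # A 32-bit result is zero-extended: the upper 32 bits become zero.
        return value & 0xFFFFFFFF
    if operand_bits in (8, 16):
        # Only the low 8 or 16 bits change; the upper bits are preserved.
        mask = (1 << operand_bits) - 1
        return (old & ~mask & MASK64) | (value & mask)
    raise ValueError("operand size must be 8, 16, 32, or 64 bits")
```

Starting from 1122334455667788H, a 32-bit write of AABBCCDDH leaves 00000000AABBCCDDH in this model, while a 16-bit write of AABBH leaves 112233445566AABBH.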

Because the upper 32 bits of 64-bit general-purpose registers are undefined in 32-bit modes, the upper 32 bits of any general-purpose register are not preserved when switching from 64-bit mode to a 32-bit mode (to protected mode or compatibility mode). Software must not depend on these bits to maintain a value after a 64-bit to 32-bit mode switch.

3.4.2 Segment Registers

The segment registers (CS, DS, SS, ES, FS, and GS) hold 16-bit segment selectors. A segment selector is a special pointer that identifies a segment in memory. To access a particular segment in memory, the segment selector for that segment must be present in the appropriate segment register.


When writing application code, programmers generally create segment selectors with assembler directives and symbols. The assembler and other tools then create the actual segment selector values associated with these directives and symbols. If writing system code, programmers may need to create segment selectors directly. See Chapter 3, “Protected-Mode Memory Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

How segment registers are used depends on the type of memory management model that the operating system or executive is using. When using the flat (unsegmented) memory model, segment registers are loaded with segment selectors that point to overlapping segments, each of which begins at address 0 of the linear address space (see Figure 3-6). These overlapping segments then comprise the linear address space for the program. Typically, two overlapping segments are defined: one for code and another for data and stacks. The CS segment register points to the code segment and all the other segment registers point to the data and stack segment.

When using the segmented memory model, each segment register is ordinarily loaded with a different segment selector so that each segment register points to a different segment within the linear address space (see Figure 3-7). At any time, a program can thus access up to six segments in the linear address space. To access a segment not pointed to by one of the segment registers, a program must first load the segment selector for the segment to be accessed into a segment register.

[Figure: the segment selector in each segment register (CS, DS, SS, ES, FS, GS) points to an overlapping segment in the linear address space for the program. The overlapping segments are up to 4 GBytes each and begin at address 0.]

Figure 3-6. Use of Segment Registers for Flat Memory Model


[Figure: the segment registers point to separate segments: CS to the code segment, DS to a data segment, SS to the stack segment, and ES, FS, and GS to further data segments. All segments are mapped to the same linear-address space.]

Figure 3-7. Use of Segment Registers in Segmented Memory Model

Each of the segment registers is associated with one of three types of storage: code, data, or stack. For example, the CS register contains the segment selector for the code segment, where the instructions being executed are stored. The processor fetches instructions from the code segment, using a logical address that consists of the segment selector in the CS register and the contents of the EIP register. The EIP register contains the offset within the code segment of the next instruction to be executed. The CS register cannot be loaded explicitly by an application program. Instead, it is loaded implicitly by instructions or internal processor operations that change program control (such as procedure calls, interrupt handling, or task switching).

The DS, ES, FS, and GS registers point to four data segments. The availability of four data segments permits efficient and secure access to different types of data structures. For example, four separate data segments might be created: one for the data structures of the current module, another for the data exported from a higher-level module, a third for a dynamically created data structure, and a fourth for data shared with another program. To access additional data segments, the application program must load segment selectors for these segments into the DS, ES, FS, and GS registers, as needed.

The SS register contains the segment selector for the stack segment, where the procedure stack is stored for the program, task, or handler currently being executed. All stack operations use the SS register to find the stack segment. Unlike the CS register, the SS register can be loaded explicitly, which permits application programs to set up multiple stacks and switch among them.


See Section 3.3, “Memory Organization,” for an overview of how the segment registers are used in real-address mode.

The four segment registers CS, DS, SS, and ES are the same as the segment registers found in the Intel 8086 and Intel 286 processors, and the FS and GS registers were introduced into the IA-32 Architecture with the Intel386™ family of processors.

3.4.2.1 Segment Registers in 64-Bit Mode

In 64-bit mode: CS, DS, ES, SS are treated as if each segment base is 0, regardless of the value of the associated segment descriptor base. This creates a flat address space for code, data, and stack. FS and GS are exceptions. Both segment registers may be used as additional base registers in linear address calculations (in the addressing of local data and certain operating system data structures).

Even though segmentation is generally disabled, segment register loads may cause the processor to perform segment access assists. During these activities, enabled processors will still perform most of the legacy checks on loaded values (even if the checks are not applicable in 64-bit mode). Such checks are needed because a segment register loaded in 64-bit mode may be used by an application running in compatibility mode.

Limit checks for CS, DS, ES, SS, FS, and GS are disabled in 64-bit mode.

3.4.3 EFLAGS Register

The 32-bit EFLAGS register contains a group of status flags, a control flag, and a group of system flags. Figure 3-8 defines the flags within this register. Following initialization of the processor (either by asserting the RESET pin or the INIT pin), the state of the EFLAGS register is 00000002H. Bits 1, 3, 5, 15, and 22 through 31 of this register are reserved. Software should not use or depend on the states of any of these bits.

Some of the flags in the EFLAGS register can be modified directly, using special-purpose instructions (described in the following sections). There are no instructions that allow the whole register to be examined or modified directly.

The following instructions can be used to move groups of flags to and from the procedure stack or the EAX register: LAHF, SAHF, PUSHF, PUSHFD, POPF, and POPFD. After the contents of the EFLAGS register have been transferred to the procedure stack or EAX register, the flags can be examined and modified using the processor’s bit manipulation instructions (BT, BTS, BTR, and BTC).

When suspending a task (using the processor’s multitasking facilities), the processor automatically saves the state of the EFLAGS register in the task state segment (TSS) for the task being suspended. When binding itself to a new task, the processor loads the EFLAGS register with data from the new task’s TSS.

When a call is made to an interrupt or exception handler procedure, the processor automatically saves the state of the EFLAGS register on the procedure stack. When an interrupt or exception is handled with a task switch, the state of the EFLAGS register is saved in the TSS for the task being suspended.

[Figure: EFLAGS bit layout, given here as a list. S indicates a status flag, C indicates a control flag, X indicates a system flag.]

Bit(s)   Flag                              Type
21       ID Flag (ID)                      X
20       Virtual Interrupt Pending (VIP)   X
19       Virtual Interrupt Flag (VIF)      X
18       Alignment Check (AC)              X
17       Virtual-8086 Mode (VM)            X
16       Resume Flag (RF)                  X
14       Nested Task (NT)                  X
13:12    I/O Privilege Level (IOPL)        X
11       Overflow Flag (OF)                S
10       Direction Flag (DF)               C
9        Interrupt Enable Flag (IF)        X
8        Trap Flag (TF)                    X
7        Sign Flag (SF)                    S
6        Zero Flag (ZF)                    S
4        Auxiliary Carry Flag (AF)         S
2        Parity Flag (PF)                  S
0        Carry Flag (CF)                   S

Bits 31 through 22, 15, 5, and 3 are shown as 0 and bit 1 is shown as 1. These are reserved bit positions. DO NOT USE. Always set to values previously read.

Figure 3-8. EFLAGS Register

As the IA-32 Architecture has evolved, flags have been added to the EFLAGS register, but the function and placement of existing flags have remained the same from one family of the IA-32 processors to the next. As a result, code that accesses or modifies these flags for one family of IA-32 processors works as expected when run on later families of processors.
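Because the flag positions are fixed, a saved EFLAGS image (for example, one obtained through PUSHFD) can be decoded with simple shifts and masks. The following Python sketch is illustrative only; `FLAG_BITS` and `decode_eflags` are hypothetical names, and the bit positions are the ones this section gives for each flag.

```python
# Single-bit flags and their bit positions in EFLAGS.
FLAG_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9,
    "DF": 10, "OF": 11, "NT": 14, "RF": 16, "VM": 17, "AC": 18,
    "VIF": 19, "VIP": 20, "ID": 21,
}

def decode_eflags(value):
    """Split a 32-bit EFLAGS image into named fields (reserved bits ignored)."""
    flags = {name: (value >> bit) & 1 for name, bit in FLAG_BITS.items()}
    # IOPL is a two-bit field occupying bits 13:12.
    flags["IOPL"] = (value >> 12) & 0b11
    return flags
```

Decoding the reset value 00000002H with this sketch yields zero for every named field, since only reserved bit 1 is set.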

3.4.3.1 Status Flags

The status flags (bits 0, 2, 4, 6, 7, and 11) of the EFLAGS register indicate the results of arithmetic instructions, such as the ADD, SUB, MUL, and DIV instructions. The status flag functions are:

CF (bit 0)    Carry flag: Set if an arithmetic operation generates a carry or a borrow out of the most-significant bit of the result; cleared otherwise. This flag indicates an overflow condition for unsigned-integer arithmetic. It is also used in multiple-precision arithmetic.

PF (bit 2)    Parity flag: Set if the least-significant byte of the result contains an even number of 1 bits; cleared otherwise.

AF (bit 4)    Adjust flag: Set if an arithmetic operation generates a carry or a borrow out of bit 3 of the result; cleared otherwise. This flag is used in binary-coded decimal (BCD) arithmetic.

ZF (bit 6)    Zero flag: Set if the result is zero; cleared otherwise.

SF (bit 7)    Sign flag: Set equal to the most-significant bit of the result, which is the sign bit of a signed integer. (0 indicates a positive value and 1 indicates a negative value.)

OF (bit 11)   Overflow flag: Set if the integer result is too large a positive number or too small a negative number (excluding the sign bit) to fit in the destination operand; cleared otherwise. This flag indicates an overflow condition for signed-integer (two’s complement) arithmetic.

Of these status flags, only the CF flag can be modified directly, using the STC, CLC, and CMC instructions. Also, the bit instructions (BT, BTS, BTR, and BTC) copy a specified bit into the CF flag.

The status flags allow a single arithmetic operation to produce results for three different data types: unsigned integers, signed integers, and BCD integers. If the result of an arithmetic operation is treated as an unsigned integer, the CF flag indicates an out-of-range condition (carry or a borrow); if treated as a signed integer (two’s complement number), the OF flag indicates a carry or borrow; and if treated as a BCD digit, the AF flag indicates a carry or borrow. The SF flag indicates the sign of a signed integer. The ZF flag indicates either a signed- or an unsigned-integer zero.

When performing multiple-precision arithmetic on integers, the CF flag is used in conjunction with the add with carry (ADC) and subtract with borrow (SBB) instructions to propagate a carry or borrow from one computation to the next.

The condition instructions Jcc (jump on condition code cc), SETcc (byte set on condition code cc), LOOPcc, and CMOVcc (conditional move) use one or more of the status flags as condition codes and test them for branch, set-byte, or end-loop conditions.
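To make the point about one operation serving three data types concrete, the sketch below computes the six status flags for an 8-bit addition. It is a simplified model written for illustration, following the flag definitions in this section; the function name `add8_flags` is hypothetical, and the OF expression is the usual two's-complement test (both operands have the same sign and the result's sign differs).

```python
def add8_flags(a, b):
    """Model an 8-bit ADD and return the result with the status flags."""
    a &= 0xFF
    b &= 0xFF
    total = a + b
    result = total & 0xFF
    return {
        "result": result,
        "CF": int(total > 0xFF),                       # carry out of bit 7
        "PF": int(bin(result).count("1") % 2 == 0),    # even number of 1 bits
        "AF": int((a & 0xF) + (b & 0xF) > 0xF),        # carry out of bit 3
        "ZF": int(result == 0),
        "SF": result >> 7,                             # copy of the sign bit
        "OF": int(((a ^ result) & (b ^ result) & 0x80) != 0),
    }
```

Adding 7FH and 01H gives 80H with OF set and CF clear: the sum is in range as an unsigned value (128) but out of range as a signed value. Adding FFH and 01H gives 00H with CF and ZF set and OF clear, the opposite situation.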

3.4.3.2 DF Flag

The direction flag (DF, located in bit 10 of the EFLAGS register) controls string instructions (MOVS, CMPS, SCAS, LODS, and STOS). Setting the DF flag causes the string instructions to auto-decrement (to process strings from high addresses to low addresses). Clearing the DF flag causes the string instructions to auto-increment (process strings from low addresses to high addresses).

The STD and CLD instructions set and clear the DF flag, respectively.
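The effect of DF on a repeated byte move can be sketched as follows. This is a deliberately simplified model, not a definition of the instruction: `rep_movsb` is a hypothetical helper, memory is a flat Python `bytearray`, segments are ignored, and ECX is assumed to hold the repeat count.

```python
def rep_movsb(memory, esi, edi, ecx, df):
    """Copy ecx bytes from memory[esi] to memory[edi], stepping by DF."""
    step = -1 if df else 1   # DF set: auto-decrement; DF clear: auto-increment
    while ecx:
        memory[edi] = memory[esi]
        esi += step
        edi += step
        ecx -= 1
    return esi, edi, ecx
```

With DF clear, the pointers are passed as the lowest addresses of the source and destination; with DF set, they are passed as the highest addresses, and the copy proceeds downward.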


3.4.3.3 System Flags and IOPL Field

The system flags and IOPL field in the EFLAGS register control operating-system or executive operations. They should not be modified by application programs. The functions of the system flags are as follows:

TF (bit 8)    Trap flag: Set to enable single-step mode for debugging; clear to disable single-step mode.

IF (bit 9)    Interrupt enable flag: Controls the response of the processor to maskable interrupt requests. Set to respond to maskable interrupts; cleared to inhibit maskable interrupts.

IOPL (bits 12 and 13)    I/O privilege level field: Indicates the I/O privilege level of the currently running program or task. The current privilege level (CPL) of the currently running program or task must be less than or equal to the I/O privilege level to access the I/O address space. This field can only be modified by the POPF and IRET instructions when operating at a CPL of 0.

NT (bit 14)   Nested task flag: Controls the chaining of interrupted and called tasks. Set when the current task is linked to the previously executed task; cleared when the current task is not linked to another task.

RF (bit 16)   Resume flag: Controls the processor’s response to debug exceptions.

VM (bit 17)   Virtual-8086 mode flag: Set to enable virtual-8086 mode; clear to return to protected mode without virtual-8086 mode semantics.

AC (bit 18)   Alignment check flag: Set this flag and the AM bit in the CR0 register to enable alignment checking of memory references; clear the AC flag and/or the AM bit to disable alignment checking.

VIF (bit 19)  Virtual interrupt flag: Virtual image of the IF flag. Used in conjunction with the VIP flag. (To use this flag and the VIP flag, the virtual mode extensions are enabled by setting the VME flag in control register CR4.)

VIP (bit 20)  Virtual interrupt pending flag: Set to indicate that an interrupt is pending; clear when no interrupt is pending. (Software sets and clears this flag; the processor only reads it.) Used in conjunction with the VIF flag.

ID (bit 21)   Identification flag: The ability of a program to set or clear this flag indicates support for the CPUID instruction.

For a detailed description of these flags, see Chapter 3, “Protected-Mode Memory Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.


3.4.3.4 RFLAGS Register in 64-Bit Mode

In 64-bit mode, EFLAGS is extended to 64 bits and called RFLAGS. The upper 32 bits of the RFLAGS register are reserved. The lower 32 bits of RFLAGS are the same as EFLAGS.

3.5 INSTRUCTION POINTER

The instruction pointer (EIP) register contains the offset in the current code segment for the next instruction to be executed. It is advanced from one instruction boundary to the next in straight-line code, or it is moved ahead or backwards by a number of instructions when executing JMP, Jcc, CALL, RET, and IRET instructions.

The EIP register cannot be accessed directly by software; it is controlled implicitly by control-transfer instructions (such as JMP, Jcc, CALL, and RET), interrupts, and exceptions. The only way to read the EIP register is to execute a CALL instruction and then read the value of the return instruction pointer from the procedure stack. The EIP register can be loaded indirectly by modifying the value of a return instruction pointer on the procedure stack and executing a return instruction (RET or IRET). See Section 6.2.4.2, “Return Instruction Pointer.”

All IA-32 processors prefetch instructions. Because of instruction prefetching, an instruction address read from the bus during an instruction load does not match the value in the EIP register. Even though different processor generations use different prefetching mechanisms, the function of the EIP register to direct program flow remains fully compatible with all software written to run on IA-32 processors.

3.5.1 Instruction Pointer in 64-Bit Mode

In 64-bit mode, the RIP register becomes the instruction pointer. This register holds the 64-bit offset of the next instruction to be executed. 64-bit mode also supports a technique called RIP-relative addressing. Using this technique, the effective address is determined by adding a displacement to the RIP of the next instruction.
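RIP-relative addressing reduces to one addition once the address of the next instruction is known. The sketch below is illustrative only: `rip_relative` and its parameters are hypothetical names, the displacement is assumed to be a signed 32-bit value, and the sum is assumed to wrap at 64 bits.

```python
MASK64 = (1 << 64) - 1

def rip_relative(instr_addr, instr_len, disp32):
    """Effective address = RIP of the next instruction + signed displacement."""
    # RIP already points past the current instruction when the address forms.
    next_rip = (instr_addr + instr_len) & MASK64
    # Sign-extend the 32-bit displacement.
    disp = disp32 & 0xFFFFFFFF
    if disp & 0x80000000:
        disp -= 1 << 32
    return (next_rip + disp) & MASK64
```

For a hypothetical 7-byte instruction at 401000H with displacement 10H, the model gives 401017H; a displacement of FFFFFFF9H (that is, -7) refers back to the start of the instruction itself.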

3.6 OPERAND-SIZE AND ADDRESS-SIZE ATTRIBUTES

When the processor is executing in protected mode, every code segment has a default operand-size attribute and address-size attribute. These attributes are selected with the D (default size) flag in the segment descriptor for the code segment (see Chapter 3, “Protected-Mode Memory Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). When the D flag is set, the 32-bit operand-size and address-size attributes are selected; when the flag is clear, the 16-bit size attributes are selected. When the processor is executing in real-address mode, virtual-8086 mode, or SMM, the default operand-size and address-size attributes are always 16 bits.


BASIC EXECUTION ENVIRONMENT

The operand-size attribute selects the size of operands. When the 16-bit operand-size attribute is in force, operands can generally be either 8 bits or 16 bits, and when the 32-bit operand-size attribute is in force, operands can generally be 8 bits or 32 bits.

The address-size attribute selects the sizes of addresses used to address memory: 16 bits or 32 bits. When the 16-bit address-size attribute is in force, segment offsets and displacements are 16 bits. This restriction limits the size of a segment to 64 KBytes. When the 32-bit address-size attribute is in force, segment offsets and displacements are 32 bits, allowing up to 4 GBytes to be addressed.

The default operand-size attribute and/or address-size attribute can be overridden for a particular instruction by adding an operand-size and/or address-size prefix to an instruction. See Chapter 2, “Instruction Format,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A. The effect of this prefix applies only to the targeted instruction.

Table 3-3 shows effective operand size and address size (when executing in protected mode or compatibility mode) depending on the settings of the D flag and the operand-size and address-size prefixes.

Table 3-3. Effective Operand- and Address-Size Attributes

D Flag in Code Segment Descriptor   0    0    0    0    1    1    1    1
Operand-Size Prefix 66H             N    N    Y    Y    N    N    Y    Y
Address-Size Prefix 67H             N    Y    N    Y    N    Y    N    Y
Effective Operand Size              16   16   32   32   32   32   16   16
Effective Address Size              16   32   16   32   32   16   32   16

NOTES:
Y: Yes - this instruction prefix is present.
N: No - this instruction prefix is not present.
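The rule behind Table 3-3 can be sketched in a few lines of Python. This is an illustrative model only, not processor behavior in full: the function name `effective_sizes` and its boolean parameters are hypothetical, and the sketch assumes each prefix simply toggles between the 16-bit and 32-bit attribute.

```python
def effective_sizes(d_flag, prefix_66h, prefix_67h):
    """Return (operand_size, address_size) in bits for one instruction.

    Hypothetical helper: d_flag is the D flag of the code segment
    descriptor; prefix_66h / prefix_67h say whether the operand-size
    and address-size prefixes are present on the instruction.
    """
    default = 32 if d_flag else 16      # size selected by the D flag
    toggled = 16 if d_flag else 32      # size selected when a prefix is present
    operand_size = toggled if prefix_66h else default
    address_size = toggled if prefix_67h else default
    return operand_size, address_size


# Reproduce the eight columns of the table, left to right.
for d in (0, 1):
    for p66 in (False, True):
        for p67 in (False, True):
            print(d, p66, p67, effective_sizes(d, p66, p67))
```

Treating the two prefixes independently mirrors the table: the operand-size row depends only on the D flag and 66H, and the address-size row only on the D flag and 67H.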

3.6.1

Operand Size and Address Size in 64-Bit Mode

In 64-bit mode, the default address size is 64 bits and the default operand size is 32 bits. Defaults can be overridden using prefixes. Address-size and operand-size prefixes allow mixing of 32/64-bit data and 32/64-bit addresses on an instruction-by-instruction basis. Table 3-4 shows valid combinations of the 66H instruction prefix and the REX.W prefix that may be used to specify operand-size overrides in 64-bit mode. Note that 16-bit addresses are not supported in 64-bit mode.

REX prefixes consist of 4-bit fields that form 16 different values. The W-bit field in the REX prefixes is referred to as REX.W. If the REX.W field is properly set, the prefix specifies an operand size override to 64 bits. Note that software can still use the operand-size 66H prefix to toggle to a 16-bit operand size. However, setting REX.W takes precedence over the operand-size prefix (66H) when both are used.

In the case of SSE/SSE2/SSE3/SSSE3 SIMD instructions, the 66H, F2H, and F3H prefixes are mandatory for opcode extensions. In such a case, there is no interaction between a valid REX.W prefix and a 66H opcode extension prefix. See Chapter 2, “Instruction Format,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.

Table 3-4. Effective Operand- and Address-Size Attributes in 64-Bit Mode

L Flag in Code Segment Descriptor   1    1    1    1    1    1    1    1
REX.W Prefix                        0    0    0    0    1    1    1    1
Operand-Size Prefix 66H             N    N    Y    Y    N    N    Y    Y
Address-Size Prefix 67H             N    Y    N    Y    N    Y    N    Y
Effective Operand Size              32   32   16   16   64   64   64   64
Effective Address Size              64   32   64   32   64   32   64   32

NOTES:
Y: Yes - this instruction prefix is present.
N: No - this instruction prefix is not present.
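The 64-bit mode rules in Table 3-4, including the precedence of REX.W over 66H, can be modeled the same way. The sketch below is illustrative; `effective_sizes_64` and its parameter names are hypothetical and cover only the general-purpose case, not the SIMD opcode-extension use of 66H.

```python
def effective_sizes_64(rex_w, prefix_66h, prefix_67h):
    """Return (operand_size, address_size) in bits for 64-bit mode.

    Hypothetical helper: rex_w is the REX.W bit (0 or 1); the other
    two flags say whether the 66H and 67H prefixes are present.
    """
    if rex_w:
        operand_size = 64               # REX.W takes precedence over 66H
    elif prefix_66h:
        operand_size = 16
    else:
        operand_size = 32               # default operand size
    address_size = 32 if prefix_67h else 64   # 16-bit addresses are not available
    return operand_size, address_size


print(effective_sizes_64(rex_w=1, prefix_66h=True, prefix_67h=False))
```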

3.7

OPERAND ADDRESSING

IA-32 machine instructions act on zero or more operands. Some operands are specified explicitly and others are implicit. The data for a source operand can be located in:

• the instruction itself (an immediate operand)
• a register
• a memory location
• an I/O port

When an instruction returns data to a destination operand, it can be returned to:

• a register
• a memory location
• an I/O port

3.7.1

Immediate Operands

Some instructions use data encoded in the instruction itself as a source operand. These operands are called immediate operands (or simply immediates). For example, the following ADD instruction adds an immediate value of 14 to the contents of the EAX register:

ADD EAX, 14

All arithmetic instructions (except the DIV and IDIV instructions) allow the source operand to be an immediate value. The maximum value allowed for an immediate operand varies among instructions, but can never be greater than the maximum value of an unsigned doubleword integer (2^32 – 1).

3.7.2

Register Operands

Source and destination operands can be any of the following registers, depending on the instruction being executed:

• 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, or EBP)
• 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, or BP)
• 8-bit general-purpose registers (AH, BH, CH, DH, AL, BL, CL, or DL)
• segment registers (CS, DS, SS, ES, FS, and GS)
• EFLAGS register
• x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)
• MMX registers (MM0 through MM7)
• XMM registers (XMM0 through XMM7) and the MXCSR register
• control registers (CR0, CR2, CR3, and CR4) and system table pointer registers (GDTR, LDTR, IDTR, and task register)
• debug registers (DR0, DR1, DR2, DR3, DR6, and DR7)
• MSR registers

Some instructions (such as the DIV and MUL instructions) use quadword operands contained in a pair of 32-bit registers. Register pairs are represented with a colon separating them. For example, in the register pair EDX:EAX, EDX contains the high order bits and EAX contains the low order bits of a quadword operand.

Several instructions (such as the PUSHFD and POPFD instructions) are provided to load and store the contents of the EFLAGS register or to set or clear individual flags in this register. Other instructions (such as the Jcc instructions) use the state of the status flags in the EFLAGS register as condition codes for branching or other decision making operations.

The processor contains a selection of system registers that are used to control memory management, interrupt and exception handling, task management, processor management, and debugging activities. Some of these system registers are accessible by an application program, the operating system, or the executive through a set of system instructions. When accessing a system register with a system instruction, the register is generally an implied operand of the instruction.
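As a small illustration of the EDX:EAX register-pair notation described earlier in this section, the following Python sketch joins and splits a quadword. The helper names `join_pair` and `split_pair` are hypothetical and only model the bit layout, with EDX holding bits 63 through 32 and EAX holding bits 31 through 0.

```python
MASK32 = 0xFFFFFFFF

def join_pair(edx, eax):
    """Combine two 32-bit register values into one 64-bit quadword (EDX:EAX)."""
    return ((edx & MASK32) << 32) | (eax & MASK32)

def split_pair(quadword):
    """Split a 64-bit quadword into (EDX, EAX) halves."""
    return (quadword >> 32) & MASK32, quadword & MASK32


value = join_pair(0x00000001, 0x00000002)
print(hex(value))            # 0x100000002
print(split_pair(value))     # (1, 2)
```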


3.7.2.1

Register Operands in 64-Bit Mode

Register operands in 64-bit mode can be any of the following:

• 64-bit general-purpose registers (RAX, RBX, RCX, RDX, RSI, RDI, RSP, RBP, or R8-R15)
• 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, EBP, or R8D-R15D)
• 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, BP, or R8W-R15W)
• 8-bit general-purpose registers: AL, BL, CL, DL, SIL, DIL, SPL, BPL, and R8L-R15L are available using REX prefixes; AL, BL, CL, DL, AH, BH, CH, DH are available without using REX prefixes.
• Segment registers (CS, DS, SS, ES, FS, and GS)
• RFLAGS register
• x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)
• MMX registers (MM0 through MM7)
• XMM registers (XMM0 through XMM15) and the MXCSR register
• Control registers (CR0, CR2, CR3, CR4, and CR8) and system table pointer registers (GDTR, LDTR, IDTR, and task register)
• Debug registers (DR0, DR1, DR2, DR3, DR6, and DR7)
• MSR registers
• RDX:RAX register pair representing a 128-bit operand

3.7.3

Memory Operands

Source and destination operands in memory are referenced by means of a segment selector and an offset (see Figure 3-9). Segment selectors specify the segment containing the operand. Offsets specify the linear or effective address of the operand. Offsets can be 32 bits (represented by the notation m16:32) or 16 bits (represented by the notation m16:16).

[Figure: Segment Selector (bits 15-0); Offset (or Linear Address) (bits 31-0)]

Figure 3-9. Memory Operand Address

3.7.3.1

Memory Operands in 64-Bit Mode

In 64-bit mode, a memory operand can be referenced by a segment selector and an offset. The offset can be 16 bits, 32 bits or 64 bits (see Figure 3-10).

[Figure: Segment Selector (bits 15-0); Offset (or Linear Address) (bits 63-0)]

Figure 3-10. Memory Operand Address in 64-Bit Mode

3.7.4

Specifying a Segment Selector

The segment selector can be specified either implicitly or explicitly. The most common method of specifying a segment selector is to load it in a segment register and then allow the processor to select the register implicitly, depending on the type of operation being performed. The processor automatically chooses a segment according to the rules given in Table 3-5.

When storing data in memory or loading data from memory, the DS segment default can be overridden to allow other segments to be accessed. Within an assembler, the segment override is generally handled with a colon “:” operator. For example, the following MOV instruction moves a value from register EAX into the segment pointed to by the ES register. The offset into the segment is contained in the EBX register:

MOV ES:[EBX], EAX;

Table 3-5. Default Segment Selection Rules

Reference Type       Register Used  Segment Used             Default Selection Rule
Instructions         CS             Code Segment             All instruction fetches.
Stack                SS             Stack Segment            All stack pushes and pops. Any memory reference
                                                             which uses the ESP or EBP register as a base register.
Local Data           DS             Data Segment             All data references, except when relative to stack
                                                             or string destination.
Destination Strings  ES             Data Segment pointed to  Destination of string instructions.
                                    with the ES register


At the machine level, a segment override is specified with a segment-override prefix, which is a byte placed at the beginning of an instruction. The following default segment selections cannot be overridden:

• Instruction fetches must be made from the code segment.
• Destination strings in string instructions must be stored in the data segment pointed to by the ES register.
• Push and pop operations must always reference the SS segment.

Some instructions require a segment selector to be specified explicitly. In these cases, the 16-bit segment selector can be located in a memory location or in a 16-bit register. For example, the following MOV instruction moves a segment selector located in register BX into segment register DS:

MOV DS, BX

Segment selectors can also be specified explicitly as part of a 48-bit far pointer in memory. Here, the first doubleword in memory contains the offset and the next word contains the segment selector.
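The 48-bit far pointer layout just described (offset doubleword first, then the selector word) can be decoded with Python's standard `struct` module. This is a minimal sketch; the function name `unpack_far_pointer` and the sample bytes are made up for illustration.

```python
import struct

def unpack_far_pointer(raw):
    """Decode a 6-byte far pointer stored in memory.

    Layout assumed from the text: a little-endian 32-bit offset
    followed by a little-endian 16-bit segment selector.
    """
    if len(raw) != 6:
        raise ValueError("a 48-bit far pointer occupies exactly 6 bytes")
    offset, selector = struct.unpack("<IH", raw)
    return selector, offset


# Hypothetical memory image: offset 12345678H, selector 0010H.
raw = bytes([0x78, 0x56, 0x34, 0x12, 0x10, 0x00])
selector, offset = unpack_far_pointer(raw)
print(f"{selector:04X}:{offset:08X}")   # 0010:12345678
```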

3.7.4.1

Segmentation in 64-Bit Mode

In IA-32e mode, the effects of segmentation depend on whether the processor is running in compatibility mode or 64-bit mode. In compatibility mode, segmentation functions just as it does in legacy IA-32 mode, using the 16-bit or 32-bit protected mode semantics described above.

In 64-bit mode, segmentation is generally (but not completely) disabled, creating a flat 64-bit linear-address space. The processor treats the segment base of CS, DS, ES, SS as zero, creating a linear address that is equal to the effective address. The exceptions are the FS and GS segments, whose segment registers (which hold the segment base) can be used as additional base registers in some linear address calculations.

3.7.5

Specifying an Offset

The offset part of a memory address can be specified directly as a static value (called a displacement) or through an address computation made up of one or more of the following components:

• Displacement: An 8-, 16-, or 32-bit value.
• Base: The value in a general-purpose register.
• Index: The value in a general-purpose register.
• Scale factor: A value of 2, 4, or 8 that is multiplied by the index value.

The offset which results from adding these components is called an effective address. Each of these components can have either a positive or negative (2s complement) value, with the exception of the scaling factor. Figure 3-11 shows all the possible ways that these components can be combined to create an effective address in the selected segment.

[Figure: Offset = Base + (Index * Scale) + Displacement. Base: EAX, EBX, ECX, EDX, ESP, EBP, ESI, or EDI. Index: EAX, EBX, ECX, EDX, EBP, ESI, or EDI. Scale: 1, 2, 4, or 8. Displacement: none, 8-bit, 16-bit, or 32-bit.]

Figure 3-11. Offset (or Effective Address) Computation

The uses of general-purpose registers as base or index components are restricted in the following manner:

• The ESP register cannot be used as an index register.
• When the ESP or EBP register is used as the base, the SS segment is the default segment. In all other cases, the DS segment is the default segment.

The base, index, and displacement components can be used in any combination, and any of these components can be NULL. A scale factor may be used only when an index also is used. Each possible combination is useful for data structures commonly used by programmers in high-level languages and assembly language. The following addressing modes suggest uses for common combinations of address components.

• Displacement: A displacement alone represents a direct (uncomputed) offset to the operand. Because the displacement is encoded in the instruction, this form of an address is sometimes called an absolute or static address. It is commonly used to access a statically allocated scalar operand.

• Base: A base alone represents an indirect offset to the operand. Since the value in the base register can change, it can be used for dynamic storage of variables and data structures.

• Base + Displacement: A base register and a displacement can be used together for two distinct purposes:

  - As an index into an array when the element size is not 2, 4, or 8 bytes: the displacement component encodes the static offset to the beginning of the array. The base register holds the results of a calculation to determine the offset to a specific element within the array.
  - To access a field of a record: the base register holds the address of the beginning of the record, while the displacement is a static offset to the field.

  An important special case of this combination is access to parameters in a procedure activation record. A procedure activation record is the stack frame created when a procedure is entered. Here, the EBP register is the best choice for the base register, because it automatically selects the stack segment. This is a compact encoding for this common function.

• (Index ∗ Scale) + Displacement: This address mode offers an efficient way to index into a static array when the element size is 2, 4, or 8 bytes. The displacement locates the beginning of the array, the index register holds the subscript of the desired array element, and the processor automatically converts the subscript into an index by applying the scaling factor.

• Base + Index + Displacement: Using two registers together supports either a two-dimensional array (the displacement holds the address of the beginning of the array) or one of several instances of an array of records (the displacement is an offset to a field within the record).

• Base + (Index ∗ Scale) + Displacement: Using all the addressing components together allows efficient indexing of a two-dimensional array when the elements of the array are 2, 4, or 8 bytes in size.
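The offset computation Offset = Base + (Index * Scale) + Displacement can be sketched as a small Python function. The name `effective_address`, its keyword parameters, and the sample values are hypothetical; the sketch assumes the result is truncated to the address size (32 bits by default) and does not model segment selection.

```python
def effective_address(base=0, index=0, scale=1, displacement=0, address_bits=32):
    """Compute Base + (Index * Scale) + Displacement, truncated to address_bits.

    Any component may be omitted (left at its neutral default), which
    corresponds to a NULL component in the text.
    """
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    mask = (1 << address_bits) - 1
    return (base + index * scale + displacement) & mask


# (Index * Scale) + Displacement: element 3 of a doubleword array at 1000H.
print(hex(effective_address(index=3, scale=4, displacement=0x1000)))   # 0x100c

# Base + Displacement: a field 8 bytes into a record whose address is in the base.
print(hex(effective_address(base=0x2000, displacement=8)))            # 0x2008

# A negative displacement, as used for locals below a frame pointer.
print(hex(effective_address(base=0x1000, displacement=-4)))           # 0xffc
```

Passing negative Python integers for the displacement stands in for the two's complement encoding; masking the sum gives the same result as wrapping arithmetic at the chosen address size.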

3.7.5.1

Specifying an Offset in 64-Bit Mode

The offset part of a memory address in 64-bit mode can be specified directly as a static value or through an address computation made up of one or more of the following components:

• Displacement: An 8-bit, 16-bit, or 32-bit value.
• Base: The value in a 32-bit (or 64-bit if REX.W is set) general-purpose register.
• Index: The value in a 32-bit (or 64-bit if REX.W is set) general-purpose register.
• Scale factor: A value of 2, 4, or 8 that is multiplied by the index value.

The base and index value can be specified in one of sixteen available general-purpose registers in most cases. See Chapter 2, “Instruction Format,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.

The following unique combination of address components is also available.

• RIP + Displacement: In 64-bit mode, RIP-relative addressing uses a signed 32-bit displacement to calculate the effective address. The 32-bit value is sign-extended and added to the 64-bit value in RIP, which holds the address of the next instruction.
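RIP-relative address formation can be illustrated with a short Python sketch. The helper names `sign_extend_32` and `rip_relative_address` and the sample addresses are hypothetical; the only behavior modeled is sign extension of the 32-bit displacement and a 64-bit wrapping add to the RIP of the next instruction.

```python
def sign_extend_32(value):
    """Interpret the low 32 bits of value as a signed two's complement integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def rip_relative_address(next_rip, disp32):
    """Effective address = RIP of the next instruction + sign-extended disp32."""
    return (next_rip + sign_extend_32(disp32)) & 0xFFFFFFFFFFFFFFFF


# A displacement of FFFFFFF0H is -16, so the target lies 16 bytes before next_rip.
print(hex(rip_relative_address(0x401000, 0xFFFFFFF0)))   # 0x400ff0
print(hex(rip_relative_address(0x401000, 0x00000010)))   # 0x401010
```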

3.7.6

Assembler and Compiler Addressing Modes

At the machine-code level, the selected combination of displacement, base register, index register, and scale factor is encoded in an instruction. All assemblers permit a programmer to use any of the allowable combinations of these addressing components to address operands. High-level language compilers will select an appropriate combination of these components based on the language construct a programmer defines.

3.7.7

I/O Port Addressing

The processor supports an I/O address space that contains up to 65,536 8-bit I/O ports. Ports that are 16-bit and 32-bit may also be defined in the I/O address space. An I/O port can be addressed with either an immediate operand or a value in the DX register. See Chapter 13, “Input/Output,” for more information about I/O port addressing.


CHAPTER 4 DATA TYPES

This chapter introduces data types defined for the Intel 64 and IA-32 architectures. A section at the end of this chapter describes the real-number and floating-point concepts used in x87 FPU, SSE, SSE2, SSE3 and SSSE3 extensions.

4.1

FUNDAMENTAL DATA TYPES

The fundamental data types are bytes, words, doublewords, quadwords, and double quadwords (see Figure 4-1). A byte is eight bits, a word is 2 bytes (16 bits), a doubleword is 4 bytes (32 bits), a quadword is 8 bytes (64 bits), and a double quadword is 16 bytes (128 bits). A subset of the IA-32 architecture instructions operates on these fundamental data types without any additional operand typing.

[Figure: byte (bits 7-0, at address N); word (bits 15-0; low byte at N, high byte at N+1); doubleword (bits 31-0; low word at N, high word at N+2); quadword (bits 63-0; low doubleword at N, high doubleword at N+4); double quadword (bits 127-0; low quadword at N, high quadword at N+8)]

Figure 4-1. Fundamental Data Types

The quadword data type was introduced into the IA-32 architecture in the Intel486 processor; the double quadword data type was introduced in the Pentium III processor with the SSE extensions.


Figure 4-2 shows the byte order of each of the fundamental data types when referenced as operands in memory. The low byte (bits 0 through 7) of each data type occupies the lowest address in memory and that address is also the address of the operand.

Address  Contents
EH       12H
DH       7AH
CH       FEH
BH       06H
AH       36H
9H       1FH
8H       A4H
7H       23H
6H       0BH
5H       45H
4H       67H
3H       74H
2H       CBH
1H       31H
0H       12H

• Byte at Address 9H Contains 1FH
• Word at Address BH Contains FE06H
• Word at Address 6H Contains 230BH
• Word at Address 2H Contains 74CBH
• Word at Address 1H Contains CB31H
• Doubleword at Address AH Contains 7AFE0636H
• Quadword at Address 6H Contains 7AFE06361FA4230BH
• Double quadword at Address 0H Contains 127AFE06361FA4230B456774CB3112H

Figure 4-2. Bytes, Words, Doublewords, Quadwords, and Double Quadwords in Memory
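The byte order shown in Figure 4-2 can be checked with a short Python sketch that stores the figure's bytes (addresses 0H through EH) in a `bytes` object and reads operands back in little-endian order. The names `MEMORY` and `read_operand` are hypothetical helpers for this illustration.

```python
# Contents of addresses 0H..EH from Figure 4-2, lowest address first.
MEMORY = bytes([0x12, 0x31, 0xCB, 0x74, 0x67, 0x45, 0x0B, 0x23,
                0xA4, 0x1F, 0x36, 0x06, 0xFE, 0x7A, 0x12])

def read_operand(address, size):
    """Read a size-byte operand whose low byte sits at the given address."""
    return int.from_bytes(MEMORY[address:address + size], "little")


print(f"{read_operand(0x9, 1):02X}")    # 1F
print(f"{read_operand(0xB, 2):04X}")    # FE06
print(f"{read_operand(0xA, 4):08X}")    # 7AFE0636
print(f"{read_operand(0x6, 8):016X}")   # 7AFE06361FA4230B
```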

4.1.1

Alignment of Words, Doublewords, Quadwords, and Double Quadwords

Words, doublewords, and quadwords do not need to be aligned in memory on natural boundaries. The natural boundaries for words, doublewords, and quadwords are even-numbered addresses, addresses evenly divisible by four, and addresses evenly divisible by eight, respectively. However, to improve the performance of programs, data structures (especially stacks) should be aligned on natural boundaries whenever possible. The reason for this is that the processor requires two memory accesses to make an unaligned memory access; aligned accesses require only one memory access. A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.
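Both alignment tests in this paragraph reduce to simple integer arithmetic, sketched below in Python. The function names `is_naturally_aligned` and `crosses_boundary` are hypothetical, and the sketch only classifies addresses; it does not model bus behavior.

```python
def is_naturally_aligned(address, size):
    """True if an operand of the given size (2, 4, 8, or 16 bytes) starts on its natural boundary."""
    return address % size == 0

def crosses_boundary(address, size, boundary):
    """True if the operand's first and last bytes fall in different boundary-sized blocks."""
    return address // boundary != (address + size - 1) // boundary


print(is_naturally_aligned(0x1002, 2))   # True: even address
print(crosses_boundary(0x1001, 2, 4))    # False: misaligned word inside one 4-byte block
print(crosses_boundary(0x1003, 2, 4))    # True: word straddles a 4-byte boundary
print(crosses_boundary(0x1004, 8, 8))    # True: quadword straddles an 8-byte boundary
```

The second and third calls show the distinction the text draws: a word at an odd address is not on its natural boundary, but it counts as unaligned for bus purposes only when it crosses a 4-byte boundary.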


Some instructions that operate on double quadwords require memory operands to be aligned on a natural boundary. These instructions generate a general-protection exception (#GP) if an unaligned operand is specified. A natural boundary for a double quadword is any address evenly divisible by 16. Other instructions that operate on double quadwords permit unaligned access (without generating a general-protection exception). However, additional memory bus cycles are required to access unaligned data from memory.

4.2

NUMERIC DATA TYPES

Although bytes, words, and doublewords are fundamental data types, some instructions support additional interpretations of these data types to allow operations to be performed on numeric data types (signed and unsigned integers, and floating-point numbers). See Figure 4-3.

[Figure: unsigned integers: byte (bits 7-0), word (bits 15-0), doubleword (bits 31-0), quadword (bits 63-0). Signed integers: byte (sign bit 7), word (sign bit 15), doubleword (sign bit 31), quadword (sign bit 63). Single precision floating point: sign bit 31, exponent bits 30-23, fraction bits 22-0. Double precision floating point: sign bit 63, exponent bits 62-52, fraction bits 51-0. Double extended precision floating point: sign bit 79, exponent bits 78-64, integer bit 63, fraction bits 62-0.]

Figure 4-3. Numeric Data Types

4.2.1

Integers

The Intel 64 and IA-32 architectures define two types of integers: unsigned and signed. Unsigned integers are ordinary binary values ranging from 0 to the maximum positive number that can be encoded in the selected operand size. Signed integers are two’s complement binary values that can be used to represent both positive and negative integer values.

Some integer instructions (such as the ADD, SUB, PADDB, and PSUBB instructions) operate on either unsigned or signed integer operands. Other integer instructions (such as IMUL, MUL, IDIV, DIV, FIADD, and FISUB) operate on only one integer type.

The following sections describe the encodings and ranges of the two types of integers.

4.2.1.1

Unsigned Integers

Unsigned integers are unsigned binary numbers contained in a byte, word, doubleword, or quadword. Their values range from 0 to 255 for an unsigned byte integer, from 0 to 65,535 for an unsigned word integer, from 0 to 2^32 – 1 for an unsigned doubleword integer, and from 0 to 2^64 – 1 for an unsigned quadword integer. Unsigned integers are sometimes referred to as ordinals.

4.2.1.2

Signed Integers

Signed integers are signed binary numbers held in a byte, word, doubleword, or quadword. All operations on signed integers assume a two's complement representation. The sign bit is located in bit 7 in a byte integer, bit 15 in a word integer, bit 31 in a doubleword integer, and bit 63 in a quadword integer (see the signed integer encodings in Table 4-1).


Table 4-1. Signed Integer Encodings

Class                  Two’s Complement Encoding
                       Sign    Remaining Bits
Positive   Largest     0       11..11
           .           .       .
           .           .       .
           Smallest    0       00..01
Zero                   0       00..00
Negative   Smallest    1       11..11
           .           .       .
           .           .       .
           Largest     1       00..00
Integer indefinite     1       00..00

Width of the field after the sign bit:
Signed Byte Integer:        7 bits
Signed Word Integer:        15 bits
Signed Doubleword Integer:  31 bits
Signed Quadword Integer:    63 bits

The sign bit is set for negative integers and cleared for positive integers and zero. Integer values range from –128 to +127 for a byte integer, from –32,768 to +32,767 for a word integer, from –2^31 to +2^31 – 1 for a doubleword integer, and from –2^63 to +2^63 – 1 for a quadword integer.

When storing integer values in memory, word integers are stored in 2 consecutive bytes; doubleword integers are stored in 4 consecutive bytes; and quadword integers are stored in 8 consecutive bytes.

The integer indefinite is a special value that is sometimes returned by the x87 FPU when operating on integer values. For more information, see Section 8.2.1, “Indefinites.”
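The ranges quoted above follow directly from the two's complement encoding, as the following Python sketch shows. The helper names `signed_range` and `to_signed` are hypothetical and exist only for this illustration.

```python
def signed_range(bits):
    """Return (minimum, maximum) for a two's complement integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

def to_signed(raw, bits):
    """Interpret the low `bits` bits of raw as a two's complement value."""
    raw &= (1 << bits) - 1
    sign_bit_set = raw >> (bits - 1)
    return raw - (1 << bits) if sign_bit_set else raw


for width in (8, 16, 32, 64):
    print(width, signed_range(width))

print(to_signed(0xFF, 8))     # -1: all bits set
print(to_signed(0x80, 8))     # -128: sign bit set, remaining bits clear
print(to_signed(0x7FFF, 16))  # 32767: largest positive word integer
```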

4.2.2

Floating-Point Data Types

The IA-32 architecture defines and operates on three floating-point data types: single-precision floating-point, double-precision floating-point, and double-extended precision floating-point (see Figure 4-3). The data formats for these data types correspond directly to formats specified in the IEEE Standard 754 for Binary Floating-Point Arithmetic.

Table 4-2 gives the length, precision, and approximate normalized range that can be represented by each of these data types. Denormal values are also supported in each of these types.


Table 4-2. Length, Precision, and Range of Floating-Point Data Types

Data Type                  Length  Precision  Approximate Normalized Range
                                   (Bits)     Binary                 Decimal
Single Precision           32      24         2^–126 to 2^127        1.18 × 10^–38 to 3.40 × 10^38
Double Precision           64      53         2^–1022 to 2^1023      2.23 × 10^–308 to 1.79 × 10^308
Double Extended Precision  80      64         2^–16382 to 2^16383    3.37 × 10^–4932 to 1.18 × 10^4932

NOTE
Section 4.8, “Real Numbers and Floating-Point Formats,” gives an overview of the IEEE Standard 754 floating-point formats and defines the terms integer bit, QNaN, SNaN, and denormal value.

Table 4-3 shows the floating-point encodings for zeros, denormalized finite numbers, normalized finite numbers, infinities, and NaNs for each of the three floating-point data types. It also gives the format for the QNaN floating-point indefinite value. (See Section 4.8.3.7, “QNaN Floating-Point Indefinite,” for a discussion of the use of the QNaN floating-point indefinite value.)

For the single-precision and double-precision formats, only the fraction part of the significand is encoded. The integer is assumed to be 1 for all numbers except 0 and denormalized finite numbers. For the double extended-precision format, the integer is contained in bit 63, and the most-significant fraction bit is bit 62. Here, the integer is explicitly set to 1 for normalized numbers, infinities, and NaNs, and to 0 for zero and denormalized numbers.


Table 4-3. Floating-Point Number and NaN Encodings

Class                                     Sign    Biased Exponent   Significand
                                                                    Integer (note 1)  Fraction
Positive  +∞                              0       11..11            1                 00..00
          +Normals                        0       11..10            1                 11..11
                                          .       .                 .                 .
                                          0       00..01            1                 00..00
          +Denormals                      0       00..00            0                 11..11
                                          .       .                 .                 .
                                          0       00..00            0                 00..01
          +Zero                           0       00..00            0                 00..00
Negative  −Zero                           1       00..00            0                 00..00
          −Denormals                      1       00..00            0                 00..01
                                          .       .                 .                 .
                                          1       00..00            0                 11..11
          −Normals                        1       00..01            1                 00..00
                                          .       .                 .                 .
                                          1       11..10            1                 11..11
          −∞                              1       11..11            1                 00..00
NaNs      SNaN                            X       11..11            1                 0X..XX (note 2)
          QNaN                            X       11..11            1                 1X..XX
          QNaN Floating-Point Indefinite  1       11..11            1                 10..00

Field widths                Biased Exponent   Fraction
Single-Precision:           8 bits            23 bits
Double-Precision:           11 bits           52 bits
Double Extended-Precision:  15 bits           63 bits

NOTES:
1. Integer bit is implied and not stored for single-precision and double-precision formats.
2. The fraction for SNaN encodings must be non-zero with the most-significant bit 0.

The exponent of each floating-point data type is encoded in biased format; see Section 4.8.2.2, “Biased Exponent.” The biasing constant is 127 for the single-precision format, 1023 for the double-precision format, and 16,383 for the double extended-precision format.
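The single-precision field layout and the bias of 127 can be observed from Python with the standard `struct` module. This is an illustrative sketch; the function name `decode_single` is hypothetical, and it relies on `struct` packing a Python float into the IEEE 754 single-precision format.

```python
import struct

SINGLE_BIAS = 127   # biasing constant for the single-precision format

def decode_single(value):
    """Return (sign, biased_exponent, fraction) of value encoded as single precision."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    sign = bits >> 31                       # 1 bit
    biased_exponent = (bits >> 23) & 0xFF   # 8 bits
    fraction = bits & 0x7FFFFF              # 23 bits; the integer bit is implied
    return sign, biased_exponent, fraction


sign, biased, fraction = decode_single(-2.0)
print(sign, biased - SINGLE_BIAS, fraction)   # 1 1 0, that is -1.0 x 2^1
print(decode_single(float("inf")))            # (0, 255, 0): exponent all ones, fraction zero
```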


When storing floating-point values in memory, single-precision values are stored in 4 consecutive bytes in memory; double-precision values are stored in 8 consecutive bytes; and double extended-precision values are stored in 10 consecutive bytes.

The single-precision and double-precision floating-point data types are operated on by x87 FPU, and SSE/SSE2/SSE3 instructions. The double-extended-precision floating-point format is only operated on by the x87 FPU. See Section 11.6.8, “Compatibility of SIMD and x87 FPU Floating-Point Data Types,” for a discussion of the compatibility of single-precision and double-precision floating-point data types between the x87 FPU and SSE/SSE2/SSE3 extensions.

4.3

POINTER DATA TYPES

Pointers are addresses of locations in memory. In non-64-bit modes, the architecture defines two types of pointers: a near pointer and a far pointer. A near pointer is a 32-bit (or 16-bit) offset (also called an effective address) within a segment. Near pointers are used for all memory references in a flat memory model or for references in a segmented model where the identity of the segment being accessed is implied.

A far pointer is a logical address, consisting of a 16-bit segment selector and a 32-bit (or 16-bit) offset. Far pointers are used for memory references in a segmented memory model where the identity of a segment being accessed must be specified explicitly. Near and far pointers with 32-bit offsets are shown in Figure 4-4.

[Figure: Near Pointer: Offset (bits 31-0). Far Pointer or Logical Address: Segment Selector (bits 47-32), Offset (bits 31-0).]

Figure 4-4. Pointer Data Types

4.3.1

Pointer Data Types in 64-Bit Mode

In 64-bit mode (a sub-mode of IA-32e mode), a near pointer is 64 bits. This equates to an effective address. Far pointers in 64-bit mode can be one of three forms:

• 16-bit segment selector, 16-bit offset if the operand size is 16 bits
• 16-bit segment selector, 32-bit offset if the operand size is 32 bits
• 16-bit segment selector, 64-bit offset if the operand size is 64 bits

See Figure 4-5.

DATA TYPES

[Figure: a near pointer is a 64-bit offset; a far pointer with 64-bit operand size is a 16-bit segment selector followed by a 64-bit offset; a far pointer with 32-bit operand size is a 16-bit segment selector followed by a 32-bit offset.]

Figure 4-5. Pointers in 64-Bit Mode
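The far-pointer layouts described in this section can be modeled in a few lines of Python. This is a minimal sketch for illustration only: the function names `make_far_pointer` and `split_far_pointer` are hypothetical, and the sketch treats a far pointer as one integer with the segment selector in the 16 bits directly above the offset, as the pointer figures draw it.

```python
def make_far_pointer(selector: int, offset: int, offset_bits: int = 32) -> int:
    """Combine a 16-bit segment selector and an offset into one integer.

    offset_bits is 16, 32, or 64; the selector occupies the 16 bits
    immediately above the offset.
    """
    if offset_bits not in (16, 32, 64):
        raise ValueError("offset must be 16, 32, or 64 bits wide")
    if not 0 <= selector < (1 << 16):
        raise ValueError("selector must fit in 16 bits")
    if not 0 <= offset < (1 << offset_bits):
        raise ValueError("offset does not fit in the requested width")
    return (selector << offset_bits) | offset


def split_far_pointer(pointer: int, offset_bits: int = 32) -> tuple:
    """Return (selector, offset) for a value built by make_far_pointer."""
    offset = pointer & ((1 << offset_bits) - 1)
    selector = (pointer >> offset_bits) & 0xFFFF
    return selector, offset
```

With a 32-bit offset the combined value is 48 bits wide; with a 64-bit offset it is 80 bits wide. A near pointer is simply the offset on its own.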

4.4 BIT FIELD DATA TYPE

A bit field (see Figure 4-6) is a contiguous sequence of bits. It can begin at any bit position of any byte in memory and can contain up to 32 bits.

[Figure: a bit field, identified by the position of its least significant bit and by its field length.]

Figure 4-6. Bit Field Data Type
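As an illustration of the bit field definition, the sketch below extracts a field of up to 32 bits that starts at an arbitrary bit position in a byte buffer. The helper name `extract_bit_field` is hypothetical, and the sketch assumes little-endian byte order with bit 0 as the least significant bit of byte 0.

```python
def extract_bit_field(memory: bytes, bit_offset: int, length: int) -> int:
    """Return the `length`-bit field whose least significant bit is at
    `bit_offset`, counting from bit 0 of memory[0]."""
    if not 1 <= length <= 32:
        raise ValueError("a bit field contains 1 to 32 bits")
    first_byte = bit_offset // 8
    last_byte = (bit_offset + length - 1) // 8
    if bit_offset < 0 or last_byte >= len(memory):
        raise ValueError("bit field lies outside the buffer")
    # Gather every byte the field touches (at most 5) as one integer.
    chunk = int.from_bytes(memory[first_byte:last_byte + 1], "little")
    return (chunk >> (bit_offset % 8)) & ((1 << length) - 1)
```

A field that starts at bit 6 with length 4 spans two bytes: its two low bits come from the top of the first byte and its two high bits from the bottom of the next.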

4.5 STRING DATA TYPES

Strings are continuous sequences of bits, bytes, words, or doublewords. A bit string can begin at any bit position of any byte and can contain up to 2^32 − 1 bits. A byte string can contain bytes, words, or doublewords and can range from zero to 2^32 − 1 bytes (4 GBytes).

4.6 PACKED SIMD DATA TYPES

Intel 64 and IA-32 architectures define and operate on a set of 64-bit and 128-bit packed data types for use in SIMD operations. These data types consist of fundamental data types (packed bytes, words, doublewords, and quadwords) and numeric interpretations of fundamental types for use in packed integer and packed floating-point operations.

4.6.1 64-Bit SIMD Packed Data Types

The 64-bit packed SIMD data types were introduced into the IA-32 architecture in the Intel MMX technology. They are operated on in MMX registers. The fundamental 64-bit packed data types are packed bytes, packed words, and packed doublewords (see Figure 4-7). When numeric SIMD operations are performed on these data types, they are interpreted as containing byte, word, or doubleword integer values.

[Figure: the fundamental 64-bit packed SIMD data types (packed bytes, packed words, packed doublewords) and the 64-bit packed integer data types (packed byte integers, packed word integers, packed doubleword integers), each occupying bits 63:0.]

Figure 4-7. 64-Bit Packed SIMD Data Types

4.6.2 128-Bit Packed SIMD Data Types

The 128-bit packed SIMD data types were introduced into the IA-32 architecture in the SSE extensions and are used with the SSE2, SSE3, and SSSE3 extensions. They are operated on primarily in the 128-bit XMM registers and memory. The fundamental 128-bit packed data types are packed bytes, packed words, packed doublewords, and packed quadwords (see Figure 4-8). When SIMD operations are performed on these fundamental data types in XMM registers, they are interpreted as containing packed or scalar single-precision floating-point or double-precision floating-point values, or as containing packed byte, word, doubleword, or quadword integer values.

[Figure: the fundamental 128-bit packed SIMD data types (packed bytes, packed words, packed doublewords, packed quadwords) and the 128-bit packed floating-point and integer data types (packed single-precision floating-point, packed double-precision floating-point, packed byte integers, packed word integers, packed doubleword integers, packed quadword integers), each occupying bits 127:0.]

Figure 4-8. 128-Bit Packed SIMD Data Types

4.7 BCD AND PACKED BCD INTEGERS

Binary-coded decimal integers (BCD integers) are unsigned 4-bit integers with valid values ranging from 0 to 9. IA-32 architecture defines operations on BCD integers located in one or more general-purpose registers or in one or more x87 FPU registers (see Figure 4-9).

[Figure: a BCD integer occupies bits 3:0 of a byte, with bits 7:4 marked X (don't care); a packed BCD integer holds two BCD digits in one byte (bits 7:4 and bits 3:0); an 80-bit packed BCD decimal integer holds the sign in bit 79, don't-care bits (X) in bits 78:72, and the 18 BCD digits D17 through D0 in bits 71:0, 4 bits per digit.]

Figure 4-9. BCD Data Types

When operating on BCD integers in general-purpose registers, the BCD values can be unpacked (one BCD digit per byte) or packed (two BCD digits per byte). The value of an unpacked BCD integer is the binary value of the low half-byte (bits 0 through 3). The high half-byte (bits 4 through 7) can be any value during addition and subtraction, but must be zero during multiplication and division. Packed BCD integers allow two BCD digits to be contained in one byte. Here, the digit in the high half-byte is more significant than the digit in the low half-byte.

When operating on BCD integers in x87 FPU data registers, BCD values are packed in an 80-bit format and referred to as decimal integers. In this format, the first 9 bytes hold 18 BCD digits, 2 digits per byte. Numbering the bytes from 0, the least-significant digit is contained in the lower half-byte of byte 0 and the most-significant digit is contained in the upper half-byte of byte 8. The most significant bit of byte 9 contains the sign bit (0 = positive and 1 = negative; bits 0 through 6 of byte 9 are don’t care bits). Negative decimal integers are not stored in two's complement form; they are distinguished from positive decimal integers only by the sign bit. The range of decimal integers that can be encoded in this format is −10^18 + 1 to 10^18 − 1.

The decimal integer format exists in memory only. When a decimal integer is loaded in an x87 FPU data register, it is automatically converted to the double extended-precision floating-point format. All decimal integers are exactly representable in double extended-precision format.

Table 4-4 gives the possible encodings of values in the decimal integer data type.
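The 80-bit packed BCD layout can be sketched as a pair of conversion routines. This is an illustrative model, not a description of how FBLD or FBSTP are implemented: the names `encode_packed_bcd` and `decode_packed_bcd` are hypothetical, bytes are numbered from 0 with the least-significant byte first, and the don't-care bits next to the sign are written as zero.

```python
def encode_packed_bcd(value: int) -> bytes:
    """Encode an integer as a 10-byte packed BCD decimal integer."""
    if abs(value) >= 10**18:
        raise ValueError("magnitude needs more than 18 decimal digits")
    digits = f"{abs(value):018d}"          # most-significant digit first
    out = bytearray(10)
    for i in range(9):                     # bytes 0..8 hold two digits each
        low = int(digits[17 - 2 * i])
        high = int(digits[16 - 2 * i])
        out[i] = (high << 4) | low
    out[9] = 0x80 if value < 0 else 0x00   # sign bit is bit 7 of the last byte
    return bytes(out)


def decode_packed_bcd(data: bytes) -> int:
    """Decode a 10-byte packed BCD decimal integer; reject non-BCD digits."""
    if len(data) != 10:
        raise ValueError("expected exactly 10 bytes")
    magnitude = 0
    for byte in reversed(data[:9]):        # start with the most-significant byte
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("half-byte is not a valid BCD digit")
        magnitude = magnitude * 100 + high * 10 + low
    return -magnitude if data[9] & 0x80 else magnitude
```

Because the sign is a separate bit rather than a two's complement encoding, negating a value changes only the final byte.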

Table 4-4. Packed Decimal Integer Encodings

Class | Sign | Bits 78:72 | Magnitude (18 BCD digits, most significant first)
Positive, largest | 0 | 0000000 | 1001 1001 1001 1001 ... 1001
. . .
Positive, smallest | 0 | 0000000 | 0000 0000 0000 0000 ... 0001
Positive zero | 0 | 0000000 | 0000 0000 0000 0000 ... 0000
Negative zero | 1 | 0000000 | 0000 0000 0000 0000 ... 0000
Negative, smallest | 1 | 0000000 | 0000 0000 0000 0000 ... 0001
. . .
Negative, largest | 1 | 0000000 | 1001 1001 1001 1001 ... 1001
Packed BCD integer indefinite | 1 | 1111111 | 1111 1111 1100 0000 ... 0000

The sign and bits 78:72 together occupy 1 byte; the magnitude occupies 9 bytes.

The packed BCD integer indefinite encoding (FFFFC000000000000000H) is stored by the FBSTP instruction in response to a masked floating-point invalid-operation exception. Attempting to load this value with the FBLD instruction produces an undefined result.

4.8 REAL NUMBERS AND FLOATING-POINT FORMATS

This section describes how real numbers are represented in floating-point format in x87 FPU and SSE/SSE2/SSE3 floating-point instructions. It also introduces terms such as normalized numbers, denormalized numbers, biased exponents, signed zeros, and NaNs. Readers who are already familiar with floating-point processing techniques and the IEEE Standard 754 for Binary Floating-Point Arithmetic may wish to skip this section.

4.8.1 Real Number System

As shown in Figure 4-10, the real-number system comprises the continuum of real numbers from minus infinity (−∞) to plus infinity (+∞). Because the size and number of registers that any computer can have is limited, only a subset of the real-number continuum can be used in real-number (floating-point) calculations. As shown at the bottom of Figure 4-10, the subset of real numbers that the IA-32 architecture supports represents an approximation of the real number system. The range and precision of this real-number subset is determined by the IEEE Standard 754 floating-point formats.

4.8.2 Floating-Point Format

To increase the speed and efficiency of real-number computations, computers and microprocessors typically represent real numbers in a binary floating-point format. In this format, a real number has three parts: a sign, a significand, and an exponent (see Figure 4-11). The sign is a binary value that indicates whether the number is positive (0) or negative (1). The significand has two parts: a 1-bit binary integer (also referred to as the J-bit) and a binary fraction. The integer-bit is often not represented, but instead is an implied value. The exponent is a binary integer that represents the base-2 power by which the significand is multiplied.

Table 4-5 shows how the real number 178.125 (in ordinary decimal format) is stored in IEEE Standard 754 floating-point format. The table lists a progression of real number notations that leads to the single-precision, 32-bit floating-point format. In this format, the significand is normalized (see Section 4.8.2.1, “Normalized Numbers”) and the exponent is biased (see Section 4.8.2.2, “Biased Exponent”). For the single-precision floating-point format, the biasing constant is +127.
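The 178.125 example can be checked with a short sketch that splits a single-precision encoding into its three fields. The helper name `decompose_single` is hypothetical; the sketch relies on Python's `struct` module, whose standard `"<f"` format is the IEEE 754 32-bit layout.

```python
import struct


def decompose_single(value: float) -> tuple:
    """Return (sign, biased_exponent, fraction) of the single-precision
    encoding of `value`."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits
    fraction = bits & 0x7FFFFF             # 23 bits; the integer bit is implied
    return sign, biased_exponent, fraction


sign, biased_exponent, fraction = decompose_single(178.125)
print(sign, f"{biased_exponent:08b}", f"{fraction:023b}")
```

Since 178.125 is 1.0110010001 (binary) times 2 to the power 7, the biased exponent is 7 + 127 = 134 and the fraction field holds the ten bits after the binary point, padded with zeros to 23 bits.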

[Figure: the binary real number system drawn as a number line, and below it the subset of binary real numbers that can be represented with the IEEE single-precision (32-bit) floating-point format. An inset shows that the format provides 24 binary digits of precision, so numbers that fall between adjacent representable values (for example, between 1.11111111111111111111111 and 10.0000000000000000000000) cannot be represented.]

Figure 4-10. Binary Real Number System

[Figure: a floating-point value consists of a sign, an exponent, and a significand; the significand consists of an integer bit (J-bit) and a fraction.]

Figure 4-11. Binary Floating-Point Format

Table 4-5. Real and Floating-Point Number Notation

Notation | Value
Ordinary Decimal | 178.125
Scientific Decimal | 1.78125E₁₀ 2
Scientific Binary | 1.0110010001E₂ 111
Scientific Binary (Biased Exponent) | 1.0110010001E₂ 10000110
IEEE Single-Precision Format | Sign: 0; Biased Exponent: 10000110; Normalized Significand: 01100100010000000000000 (the leading "1." is implied)

4.8.2.1 Normalized Numbers

In most cases, floating-point numbers are encoded in normalized form. This means that except for zero, the significand is always made up of an integer of 1 and the following fraction:

1.fff...ff

For values less than 1, leading zeros are eliminated. (For each leading zero eliminated, the exponent is decremented by one.) Representing numbers in normalized form maximizes the number of significant digits that can be accommodated in a significand of a given width. To summarize, a normalized real number consists of a normalized significand that represents a real number between 1 and 2 and an exponent that specifies the number’s binary point.

4.8.2.2 Biased Exponent

In the IA-32 architecture, the exponents of floating-point numbers are encoded in a biased form. This means that a constant is added to the actual exponent so that the biased exponent is always a positive number. The value of the biasing constant depends on the number of bits available for representing exponents in the floating-point format being used. The biasing constant is chosen so that the smallest normalized number can be reciprocated without overflow.

See Section 4.2.2, “Floating-Point Data Types,” for a list of the biasing constants that the IA-32 architecture uses for the various sizes of floating-point data types.

4.8.3 Real Number and Non-number Encodings

A variety of real numbers and special values can be encoded in the IEEE Standard 754 floating-point format. These numbers and values are generally divided into the following classes:

• Signed zeros
• Denormalized finite numbers
• Normalized finite numbers
• Signed infinities
• NaNs
• Indefinite numbers

(The term NaN stands for “Not a Number.”) Figure 4-12 shows how the encodings for these numbers and non-numbers fit into the real number continuum. The encodings shown here are for the IEEE single-precision floating-point format. The term “S” indicates the sign bit, “E” the biased exponent, and “Sig” the significand. The exponent values are given in decimal. The integer bit is shown for the significands, even though the integer bit is implied in single-precision floating-point format.

[Figure: the real number line runs from −∞ through the negative normalized finite numbers, the negative denormalized finite numbers, −0, +0, the positive denormalized finite numbers, and the positive normalized finite numbers to +∞; the NaN encodings lie beyond both ends. The figure lists the following real number and NaN encodings for the 32-bit floating-point format.]

Class | S | E | Sig (note 1)
−0 | 1 | 0 | 0.000...
+0 | 0 | 0 | 0.000...
−Denormalized finite | 1 | 0 | 0.XXX... (note 2)
+Denormalized finite | 0 | 0 | 0.XXX... (note 2)
−Normalized finite | 1 | 1...254 | 1.XXX...
+Normalized finite | 0 | 1...254 | 1.XXX...
−∞ | 1 | 255 | 1.000...
+∞ | 0 | 255 | 1.000...
SNaN | X (note 3) | 255 | 1.0XX... (note 2)
QNaN | X (note 3) | 255 | 1.1XX...

NOTES:
1. Integer bit of fraction implied for single-precision floating-point format.
2. Fraction must be non-zero.
3. Sign bit ignored.

Figure 4-12. Real Numbers and NaNs

An IA-32 processor can operate on and/or return any of these values, depending on the type of computation being performed. The following sections describe these number and non-number classes.
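The single-precision encodings of Figure 4-12 can be turned into a small classifier. This is a sketch for illustration: the function name `classify_single` and the class labels it returns are hypothetical, and the input is the raw 32-bit encoding rather than a floating-point value.

```python
def classify_single(bits: int) -> str:
    """Name the class of a 32-bit single-precision encoding."""
    sign = "-" if (bits >> 31) & 1 else "+"
    biased_exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if biased_exponent == 0:
        # Integer bit is 0: a zero, or a denormal if the fraction is non-zero.
        return f"{sign}zero" if fraction == 0 else f"{sign}denormal"
    if biased_exponent == 255:
        if fraction == 0:
            return f"{sign}infinity"
        # NaN: the most significant fraction bit separates QNaN from SNaN;
        # the sign bit is ignored.
        return "QNaN" if fraction & 0x400000 else "SNaN"
    return f"{sign}normal"                 # biased exponent 1 through 254
```

The order of the tests mirrors the figure: the biased exponent selects the broad class, and the fraction field then distinguishes zero from denormal and infinity from NaN.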

4.8.3.1 Signed Zeros

Zero can be represented as a +0 or a −0 depending on the sign bit. Both encodings are equal in value. The sign of a zero result depends on the operation being performed and the rounding mode being used. Signed zeros have been provided to aid in implementing interval arithmetic. The sign of a zero may indicate the direction from which underflow occurred, or it may indicate the sign of an ∞ that has been reciprocated.

4.8.3.2 Normalized and Denormalized Finite Numbers

Non-zero, finite numbers are divided into two classes: normalized and denormalized. The normalized finite numbers comprise all the non-zero finite values that can be encoded in a normalized real number format between zero and ∞. In the single-precision floating-point format shown in Figure 4-12, this group of numbers includes all the numbers with biased exponents ranging from 1 to 254₁₀ (unbiased, the exponent range is from −126₁₀ to +127₁₀).

When floating-point numbers become very close to zero, the normalized-number format can no longer be used to represent the numbers. This is because the range of the exponent is not large enough to compensate for shifting the binary point to the right to eliminate leading zeros.

When the biased exponent is zero, smaller numbers can only be represented by making the integer bit (and perhaps other leading bits) of the significand zero. The numbers in this range are called denormalized (or tiny) numbers. The use of leading zeros with denormalized numbers allows smaller numbers to be represented. However, this denormalization causes a loss of precision (the number of significant bits in the fraction is reduced by the leading zeros).

When performing normalized floating-point computations, an IA-32 processor normally operates on normalized numbers and produces normalized numbers as results. Denormalized numbers represent an underflow condition. The exact conditions are specified in Section 4.9.1.5, “Numeric Underflow Exception (#U).”

A denormalized number is computed through a technique called gradual underflow. Table 4-6 gives an example of gradual underflow in the denormalization process. Here the single-precision format is being used, so the minimum exponent (unbiased) is −126₁₀. The true result in this example requires an exponent of −129₁₀ in order to have a normalized number. Since −129₁₀ is beyond the allowable exponent range, the result is denormalized by inserting leading zeros until the minimum exponent of −126₁₀ is reached.

Table 4-6. Denormalization Process

Operation | Sign | Exponent* | Significand
True Result | 0 | −129 | 1.01011100000...00
Denormalize | 0 | −128 | 0.10101110000...00
Denormalize | 0 | −127 | 0.01010111000...00
Denormalize | 0 | −126 | 0.00101011100...00
Denormal Result | 0 | −126 | 0.00101011100...00

* Expressed as an unbiased, decimal number.

In the extreme case, all the significant bits are shifted out to the right by leading zeros, creating a zero result.

The Intel 64 and IA-32 architectures deal with denormal values in the following ways:

• They avoid creating denormals by normalizing numbers whenever possible.
• They provide the floating-point denormal-operand exception to permit procedures or programs to detect when denormals are being used as source operands for computations.
• They provide the floating-point underflow exception to permit programmers to detect cases when denormals are created.
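The gradual underflow example of Table 4-6 can be reproduced by storing the true result in single-precision format and inspecting the encoding. This is a minimal sketch: the variable names are illustrative, and the conversion is done by Python's `struct` module (standard `"<f"` format), not by an explicit shift loop.

```python
import struct

# True result from Table 4-6: 1.010111 (binary) x 2^-129.
# 1.010111 (binary) is 87/64, so the value equals 87 x 2^-135.
true_result = 87 * 2.0 ** -135

# Store it as a single-precision value and read back the raw encoding.
bits = struct.unpack("<I", struct.pack("<f", true_result))[0]
biased_exponent = (bits >> 23) & 0xFF
fraction = bits & 0x7FFFFF

print(biased_exponent, f"0.{fraction:023b}")
```

The biased exponent field comes back as zero, which marks a denormal, and the fraction field carries the three leading zeros that were inserted to bring the exponent from −129 up to −126.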

4.8.3.3 Signed Infinities

The two infinities, +∞ and −∞, represent the maximum positive and negative real numbers, respectively, that can be represented in the floating-point format. Infinity is always represented by a significand of 1.00...00 (the integer bit may be implied) and the maximum biased exponent allowed in the specified format (for example, 255₁₀ for the single-precision format).

The signs of infinities are observed, and comparisons are possible. Infinities are always interpreted in the affine sense; that is, −∞ is less than any finite number and +∞ is greater than any finite number. Arithmetic on infinities is always exact. Exceptions are generated only when the use of an infinity as a source operand constitutes an invalid operation.

Whereas denormalized numbers may represent an underflow condition, the two ∞ numbers may represent the result of an overflow condition. Here, the normalized result of a computation has a biased exponent greater than the largest allowable exponent for the selected result format.

4.8.3.4 NaNs

Since NaNs are non-numbers, they are not part of the real number line. In Figure 4-12, the encoding space for NaNs in the floating-point formats is shown above the ends of the real number line. This space includes any value with the maximum allowable biased exponent and a non-zero fraction (the sign bit is ignored for NaNs).

The IA-32 architecture defines two classes of NaNs: quiet NaNs (QNaNs) and signaling NaNs (SNaNs). A QNaN is a NaN with the most significant fraction bit set; an SNaN is a NaN with the most significant fraction bit clear. QNaNs are allowed to propagate through most arithmetic operations without signaling an exception. SNaNs generally signal a floating-point invalid-operation exception whenever they appear as operands in arithmetic operations.

SNaNs are typically used to trap or invoke an exception handler. They must be inserted by software; that is, the processor never generates an SNaN as a result of a floating-point operation.

4.8.3.5 Operating on SNaNs and QNaNs

When a floating-point operation is performed on an SNaN and/or a QNaN, the result of the operation is either a QNaN delivered to the destination operand or the generation of a floating-point invalid-operation exception, depending on the following rules:

• If one of the source operands is an SNaN and the floating-point invalid-operation exception is not masked (see Section 4.9.1.1, “Invalid Operation Exception (#I)”), a floating-point invalid-operation exception is signaled and no result is stored in the destination operand.

• If either or both of the source operands are NaNs and the floating-point invalid-operation exception is masked, the result is as shown in Table 4-7. When an SNaN is converted to a QNaN, the conversion is handled by setting the most-significant fraction bit of the SNaN to 1. Also, when one of the source operands is an SNaN, the floating-point invalid-operation exception flag is set. Note that for some combinations of source operands, the result is different for x87 FPU operations and for SSE/SSE2/SSE3 operations.

• When neither of the source operands is a NaN, but the operation generates a floating-point invalid-operation exception (see Tables 8-10 and 11-1), the result is commonly an SNaN source operand converted to a QNaN or the QNaN floating-point indefinite value.

Any exceptions to the behavior described in Table 4-7 are described in Section 8.5.1.2, “Invalid Arithmetic Operand Exception (#IA),” and Section 11.5.2.1, “Invalid Operation Exception (#I).”

Table 4-7. Rules for Handling NaNs

Source Operands | Result (note 1)
SNaN and QNaN | x87 FPU: QNaN source operand. SSE/SSE2/SSE3: First operand (if this operand is an SNaN, it is converted to a QNaN).
Two SNaNs | x87 FPU: SNaN source operand with the larger significand, converted into a QNaN. SSE/SSE2/SSE3: First operand converted to a QNaN.
Two QNaNs | x87 FPU: QNaN source operand with the larger significand. SSE/SSE2/SSE3: First operand.
SNaN and a floating-point value | SNaN source operand, converted into a QNaN.
QNaN and a floating-point value | QNaN source operand.
SNaN (for instructions that take only one operand) | SNaN source operand, converted into a QNaN.
QNaN (for instructions that take only one operand) | QNaN source operand.

NOTE:
1. For SSE/SSE2/SSE3 instructions, the first operand is generally a source operand that becomes the destination operand. Within the Result column, the x87 FPU notation also applies to the FISTTP instruction in SSE3; the SSE3 notation applies to the SIMD floating-point instructions.
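The two-operand rules of Table 4-7 for the masked case can be written as a selection function. This is a sketch under stated assumptions, not a model of the hardware: the names (`masked_nan_result`, `quiet`, and so on) are hypothetical, operands are raw single-precision encodings even for the x87 case (the x87 FPU itself works in double extended-precision format), "larger significand" is taken to mean the larger fraction field, and a tie keeps the first operand.

```python
QUIET_BIT = 0x00400000        # most significant fraction bit
FRACTION_MASK = 0x007FFFFF
EXPONENT_MASK = 0x7F800000


def is_nan(bits: int) -> bool:
    return (bits & EXPONENT_MASK) == EXPONENT_MASK and (bits & FRACTION_MASK) != 0


def is_snan(bits: int) -> bool:
    return is_nan(bits) and not (bits & QUIET_BIT)


def quiet(bits: int) -> int:
    """Convert an SNaN to a QNaN by setting the most significant fraction bit."""
    return bits | QUIET_BIT


def masked_nan_result(first: int, second: int, unit: str = "sse") -> int:
    """Result encoding when at least one source operand is a NaN and the
    invalid-operation exception is masked. `unit` is "x87" or "sse"."""
    if unit not in ("x87", "sse"):
        raise ValueError("unit must be 'x87' or 'sse'")
    first_nan, second_nan = is_nan(first), is_nan(second)
    if not (first_nan or second_nan):
        raise ValueError("neither operand is a NaN")
    if not (first_nan and second_nan):
        # One NaN and one floating-point value: the NaN, quieted if needed.
        return quiet(first if first_nan else second)
    if unit == "sse":
        return quiet(first)                    # always the first operand
    if is_snan(first) != is_snan(second):
        return second if is_snan(first) else first   # the QNaN source operand
    # Two SNaNs or two QNaNs: the one with the larger significand.
    larger = max(first, second, key=lambda b: b & FRACTION_MASK)
    return quiet(larger)
```

The sketch returns only the result encoding; per the rules in the text, an SNaN source operand also sets the invalid-operation exception flag, which is not modeled here.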

4.8.3.6 Using SNaNs and QNaNs in Applications

Except for the rules given at the beginning of Section 4.8.3.4, “NaNs,” for encoding SNaNs and QNaNs, software is free to use the bits in the significand of a NaN for any purpose. Both SNaNs and QNaNs can be encoded to carry and store data, such as diagnostic information.

By unmasking the invalid operation exception, the programmer can use signaling NaNs to trap to the exception handler. The generality of this approach and the large number of NaN values that are available provide the sophisticated programmer with a tool that can be applied to a variety of special situations.

For example, a compiler can use signaling NaNs as references to uninitialized (real) array elements. The compiler can preinitialize each array element with a signaling NaN whose significand contains the index (relative position) of the element. Then, if an application program attempts to access an element that it has not initialized, it uses the NaN placed there by the compiler. If the invalid operation exception is unmasked, an interrupt will occur, and the exception handler will be invoked. The exception handler can determine which element has been accessed, since the operand address field of the exception pointer will point to the NaN, and the NaN will contain the index number of the array element.

Quiet NaNs are often used to speed up debugging. In its early testing phase, a program often contains multiple errors. An exception handler can be written to save diagnostic information in memory whenever it is invoked. After storing the diagnostic data, it can supply a quiet NaN as the result of the erroneous instruction, and that NaN can point to its associated diagnostic area in memory. The program will then continue, creating a different NaN for each error. When the program ends, the NaN results can be used to access the diagnostic data saved at the time the errors occurred. Many errors can thus be diagnosed and corrected in one test run.

In embedded applications that use computed results in further computations, an undetected QNaN can invalidate all subsequent results. Such applications should therefore periodically check for QNaNs and provide a recovery mechanism to be used if a QNaN result is detected.
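The array-index technique described above can be sketched for the single-precision format. The scheme below is one possible encoding chosen for illustration, not one prescribed by the architecture: the names `snan_with_index` and `index_from_nan` are hypothetical, and the index is stored plus one in the 22 fraction bits below the most significant fraction bit, so that the fraction of the SNaN is never zero.

```python
SNAN_BASE = 0x7F800000        # maximum biased exponent, quiet bit clear
PAYLOAD_MASK = 0x003FFFFF     # the 22 fraction bits below the quiet bit


def snan_with_index(index: int) -> int:
    """Single-precision SNaN encoding that carries an array index."""
    payload = index + 1       # keep the fraction non-zero (assumed scheme)
    if not 1 <= payload <= PAYLOAD_MASK:
        raise ValueError("index does not fit in the NaN payload")
    return SNAN_BASE | payload


def index_from_nan(bits: int) -> int:
    """Recover the index; also works after the SNaN has been quieted."""
    return (bits & PAYLOAD_MASK) - 1
```

Because quieting an SNaN only sets the most significant fraction bit, the payload survives the conversion, so a handler or a later check can still read the index from the resulting QNaN.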

4.8.3.7 QNaN Floating-Point Indefinite

For the floating-point data type encodings (single-precision, double-precision, and double extended-precision), one unique encoding (a QNaN) is reserved for representing the special value QNaN floating-point indefinite. The x87 FPU and the SSE/SSE2/SSE3 extensions return these indefinite values as responses to some masked floating-point exceptions. Table 4-3 shows the encoding used for the QNaN floating-point indefinite.

4.8.4 Rounding

When performing floating-point operations, the processor produces an infinitely precise floating-point result in the destination format (single-precision, double-precision, or double extended-precision floating-point) whenever possible. However, because only a subset of the numbers in the real number continuum can be represented in IEEE Standard 754 floating-point formats, it is often the case that an infinitely precise result cannot be encoded exactly in the format of the destination operand.

For example, the following value (a) has a 24-bit fraction. The least-significant bit of this fraction (the rightmost bit) cannot be encoded exactly in the single-precision format (which has only a 23-bit fraction):

(a) 1.0001 0000 1000 0011 1001 0111E₂ 101

To round this result (a), the processor first selects two representable fractions b and c that most closely bracket a in value (b < a < c).

(b) 1.0001 0000 1000 0011 1001 011E₂ 101
(c) 1.0001 0000 1000 0011 1001 100E₂ 101

The processor then sets the result to b or to c according to the selected rounding mode. Rounding introduces an error in a result that is less than one unit in the last place (the least significant bit position of the floating-point value) to which the result is rounded.

The IEEE Standard 754 defines four rounding modes (see Table 4-8): round to nearest, round up, round down, and round toward zero. The default rounding mode (for the Intel 64 and IA-32 architectures) is round to nearest. This mode provides the most accurate and statistically unbiased estimate of the true result and is suitable for most applications.

Table 4-8. Rounding Modes and Encoding of Rounding Control (RC) Field

Rounding Mode | RC Field Setting | Description
Round to nearest (even) | 00B | Rounded result is the closest to the infinitely precise result. If two values are equally close, the result is the even value (that is, the one with the least-significant bit of zero). Default.
Round down (toward −∞) | 01B | Rounded result is closest to but no greater than the infinitely precise result.
Round up (toward +∞) | 10B | Rounded result is closest to but no less than the infinitely precise result.
Round toward zero (Truncate) | 11B | Rounded result is closest to but no greater in absolute value than the infinitely precise result.

The round up and round down modes are termed directed rounding and can be used to implement interval arithmetic. Interval arithmetic is used to determine upper and lower bounds for the true result of a multistep computation, when the intermediate results of the computation are subject to rounding.

The round toward zero mode (sometimes called the “chop” mode) is commonly used when performing integer arithmetic with the x87 FPU.

The rounded result is called the inexact result. When the processor produces an inexact result, the floating-point precision (inexact) flag (PE) is set (see Section 4.9.1.6, “Inexact-Result (Precision) Exception (#P)”).

The rounding modes have no effect on comparison operations, operations that produce exact results, or operations that produce NaN results.
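The four rounding modes can be illustrated on the example values a, b, and c by treating significands as integers. This is a sketch of the selection rule only: the function name `round_significand` and the mode strings are hypothetical, and a carry out of the most significant bit (which would require renormalization) is not handled.

```python
def round_significand(sig: int, extra_bits: int, mode: str,
                      negative: bool = False) -> int:
    """Drop the low `extra_bits` bits of the magnitude `sig`, rounding per `mode`.

    mode is "nearest", "down", "up", or "zero".
    """
    if mode not in ("nearest", "down", "up", "zero"):
        raise ValueError("unknown rounding mode")
    kept = sig >> extra_bits
    remainder = sig & ((1 << extra_bits) - 1)
    if remainder == 0:
        return kept                              # exact: mode has no effect
    half = 1 << (extra_bits - 1)
    if mode == "nearest":
        # Ties go to the value whose least-significant bit is zero.
        increase = remainder > half or (remainder == half and kept & 1 == 1)
    elif mode == "down":                         # toward minus infinity
        increase = negative
    elif mode == "up":                           # toward plus infinity
        increase = not negative
    else:                                        # toward zero (truncate)
        increase = False
    return kept + 1 if increase else kept


# Significands of (a), (b), and (c), with the integer bit included.
a = int("1" + "000100001000001110010111", 2)     # 24-bit fraction
b = int("1" + "00010000100000111001011", 2)      # 23-bit fraction
c = int("1" + "00010000100000111001100", 2)      # 23-bit fraction
```

For this particular a, the discarded bit is exactly half a unit in the last place, so round to nearest falls back on the even rule and selects c, whose least-significant bit is zero.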

4.8.4.1 Rounding Control (RC) Fields

In the Intel 64 and IA-32 architectures, the rounding mode is controlled by a 2-bit rounding-control (RC) field (Table 4-8 shows the encoding of this field). The RC field is implemented in two different locations:

• x87 FPU control register (bits 10 and 11)
• The MXCSR register (bits 13 and 14)

Although these two RC fields perform the same function, they control rounding for different execution environments within the processor. The RC field in the x87 FPU control register controls rounding for computations performed with the x87 FPU instructions; the RC field in the MXCSR register controls rounding for SIMD floating-point computations performed with the SSE/SSE2 instructions.
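The two RC field locations can be illustrated with simple bit manipulation on register images. This is a sketch that operates on plain integers and does not read or write the actual registers; the function names are hypothetical, and the values 037FH and 1F80H used in the usage note are the commonly documented initial values of the x87 FPU control word and the MXCSR register, which are not stated in this section.

```python
RC_MODES = {
    0b00: "round to nearest (even)",
    0b01: "round down (toward -infinity)",
    0b10: "round up (toward +infinity)",
    0b11: "round toward zero (truncate)",
}


def x87_rounding_control(control_word: int) -> int:
    """RC field of an x87 FPU control word image (bits 10 and 11)."""
    return (control_word >> 10) & 0b11


def mxcsr_rounding_control(mxcsr: int) -> int:
    """RC field of an MXCSR image (bits 13 and 14)."""
    return (mxcsr >> 13) & 0b11


def with_mxcsr_rounding_control(mxcsr: int, rc: int) -> int:
    """Return a copy of the MXCSR image with its RC field replaced."""
    if rc not in RC_MODES:
        raise ValueError("RC is a 2-bit field")
    return (mxcsr & ~(0b11 << 13)) | (rc << 13)
```

For example, `RC_MODES[x87_rounding_control(0x037F)]` and `RC_MODES[mxcsr_rounding_control(0x1F80)]` both select round to nearest, and `with_mxcsr_rounding_control(0x1F80, 0b11)` yields an image that selects round toward zero.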

4.8.4.2 Truncation with SSE and SSE2 Conversion Instructions

The following SSE/SSE2 instructions automatically truncate the results of conversions from floating-point values to integers when the result is inexact: CVTTPD2DQ, CVTTPS2DQ, CVTTPD2PI, CVTTPS2PI, CVTTSD2SI, CVTTSS2SI. Here, truncation means the round toward zero mode described in Table 4-8.

4.9 OVERVIEW OF FLOATING-POINT EXCEPTIONS

The following section provides an overview of floating-point exceptions and their handling in the IA-32 architecture. For information specific to the x87 FPU and to the SSE/SSE2/SSE3 extensions, refer to the following sections:

• Section 8.4, “x87 FPU Floating-Point Exception Handling”
• Section 11.5, “SSE, SSE2, and SSE3 Exceptions”

When operating on floating-point operands, the IA-32 architecture recognizes and detects six classes of exception conditions:

• Invalid operation (#I)
• Divide-by-zero (#Z)
• Denormalized operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (precision) (#P)

The nomenclature of “#” symbol followed by one or two letters (for example, #P) is used in this manual to indicate exception conditions. It is merely a short-hand form and is not related to assembler mnemonics.

NOTE All of t he except ions list ed above except t he denorm al- operand except ion ( # D) are defined in I EEE St andard 754. The invalid- operat ion, divide- by- zero and denorm al- operand except ions are precom put at ion except ions ( t hat is, t hey are det ect ed before any arit hm et ic operat ion

Vol. 1 4-25

DATA TYPES

occurs) . The num eric- underflow, num eric- overflow and precision except ions are post- com put at ion except ions. Each of t he six except ion classes has a corresponding flag bit ( I E, ZE, OE, UE, DE, or PE) and m ask bit ( I M, ZM, OM, UM, DM, or PM) . When one or m ore float ing- point except ion condit ions are det ect ed, t he processor set s t he appropriat e flag bit s, t hen t akes one of t wo possible courses of act ion, depending on t he set t ings of t he corresponding m ask bit s:



Mask bit set . Handles t he except ion aut om at ically, producing a predefined ( and oft en t im es usable) result , while allowing program execut ion t o cont inue undist urbed.



Mask bit clear. I nvokes a soft ware except ion handler t o handle t he except ion.

The m asked ( default ) responses t o except ions have been chosen t o deliver a reasonable result for each except ion condit ion and are generally sat isfact ory for m ost float ing- point applicat ions. By m asking or unm asking specific float ing- point except ions, program m ers can delegat e responsibilit y for m ost except ions t o t he processor and reserve t he m ost severe except ion condit ions for soft ware except ion handlers. Because t he except ion flags are “ st icky,” t hey provide a cum ulat ive record of t he except ions t hat have occurred since t hey were last cleared. A program m er can t hus m ask all except ions, run a calculat ion, and t hen inspect t he except ion flags t o see if any except ions were det ect ed during t he calculat ion. I n t he I A- 32 archit ect ure, float ing- point except ion flag and m ask bit s are im plem ent ed in t wo different locat ions:



• x87 FPU status word and control word. The flag bits are located at bits 0 through 5 of the x87 FPU status word and the mask bits are located at bits 0 through 5 of the x87 FPU control word (see Figures 8-4 and 8-6).

• MXCSR register. The flag bits are located at bits 0 through 5 of the MXCSR register and the mask bits are located at bits 7 through 12 of the register (see Figure 10-3).
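
The two bit layouts above can be decoded with simple shifts. This is an illustrative sketch only: the function names are hypothetical, and the ordering of the individual bits within each field (IE, DE, ZE, OE, UE, PE from bit 0 upward, with the mask bits in the same order) is taken from the referenced figures rather than from the text of this section.

```python
FLAG_NAMES = ("IE", "DE", "ZE", "OE", "UE", "PE")  # flag bits 0 through 5
MASK_NAMES = ("IM", "DM", "ZM", "OM", "UM", "PM")  # same order as the flags


def decode_mxcsr(value):
    """Split an MXCSR value into its set flag bits and set mask bits."""
    flags = [n for i, n in enumerate(FLAG_NAMES) if (value >> i) & 1]
    masks = [n for i, n in enumerate(MASK_NAMES) if (value >> (7 + i)) & 1]
    return {"flags": flags, "masks": masks}


def decode_x87(status_word, control_word):
    """Flags come from the status word, masks from the control word."""
    flags = [n for i, n in enumerate(FLAG_NAMES) if (status_word >> i) & 1]
    masks = [n for i, n in enumerate(MASK_NAMES) if (control_word >> i) & 1]
    return {"flags": flags, "masks": masks}
```

For example, `decode_mxcsr(0x1F80)` reports no flags and all six masks set, which corresponds to the state in which every SIMD floating-point exception is masked.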

Although these two sets of flag and mask bits perform the same function, they report on and control exceptions for different execution environments within the processor. The flag and mask bits in the x87 FPU status and control words control exception reporting and masking for computations performed with the x87 FPU instructions; the companion bits in the MXCSR register control exception reporting and masking for SIMD floating-point computations performed with the SSE/SSE2/SSE3 instructions.

Note that when exceptions are masked, the processor may detect multiple exceptions in a single instruction, because it continues executing the instruction after performing its masked response. For example, the processor can detect a denormalized operand, perform its masked response to this exception, and then detect numeric underflow.

See Section 4.9.2, “Floating-Point Exception Priority,” for a description of the rules for exception precedence when more than one floating-point exception condition is detected for an instruction.


4.9.1

Floating-Point Exception Conditions

The following sections describe the various conditions that cause a floating-point exception to be generated and the masked response of the processor when these conditions are detected. The Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A & 3B, list the floating-point exceptions that can be signaled for each floating-point instruction.

4.9.1.1

Invalid Operation Exception (#I)

The processor reports an invalid operation exception in response to one or more invalid arithmetic operands. If the invalid operation exception is masked, the processor sets the IE flag and returns an indefinite value or a QNaN. This value overwrites the destination register specified by the instruction. If the invalid operation exception is not masked, the IE flag is set, a software exception handler is invoked, and the operands remain unaltered.

See Section 4.8.3.6, “Using SNaNs and QNaNs in Applications,” for information about the result returned when an exception is caused by an SNaN.

The processor can detect a variety of invalid arithmetic operations that can be coded in a program. These operations generally indicate a programming error, such as dividing ∞ by ∞. See the following sections for information regarding the invalid-operation exception when detected while executing x87 FPU or SSE/SSE2/SSE3 instructions:

• x87 FPU; Section 8.5.1, “Invalid Operation Exception”
• SIMD floating-point exceptions; Section 11.5.2.1, “Invalid Operation Exception (#I)”

4.9.1.2

Denormal Operand Exception (#D)

The processor reports the denormal-operand exception if an arithmetic instruction attempts to operate on a denormal operand (see Section 4.8.3.2, “Normalized and Denormalized Finite Numbers”). When the exception is masked, the processor sets the DE flag and proceeds with the instruction. Operating on denormal numbers will produce results at least as good as, and often better than, what can be obtained when denormal numbers are flushed to zero. Programmers can mask this exception so that a computation may proceed, then analyze any loss of accuracy when the final result is delivered.

When a denormal-operand exception is not masked, the DE flag is set, a software exception handler is invoked, and the operands remain unaltered. When denormal operands have reduced significance due to loss of low-order bits, it may be advisable to not operate on them. Precluding denormal operands from computations can be accomplished by an exception handler that responds to unmasked denormal-operand exceptions.

See the following sections for information regarding the denormal-operand exception when detected while executing x87 FPU or SSE/SSE2/SSE3 instructions:

• x87 FPU; Section 8.5.2, “Denormal Operand Exception (#D)”
• SIMD floating-point exceptions; Section 11.5.2.2, “Denormal-Operand Exception (#D)”

4.9.1.3

Divide-By-Zero Exception (#Z)

The processor reports the floating-point divide-by-zero exception whenever an instruction attempts to divide a finite non-zero operand by 0. The masked response for the divide-by-zero exception is to set the ZE flag and return an infinity signed with the exclusive OR of the sign of the operands. If the divide-by-zero exception is not masked, the ZE flag is set, a software exception handler is invoked, and the operands remain unaltered.

See the following sections for information regarding the divide-by-zero exception when detected while executing x87 FPU or SSE/SSE2 instructions:

• x87 FPU; Section 8.5.3, “Divide-By-Zero Exception (#Z)”
• SIMD floating-point exceptions; Section 11.5.2.3, “Divide-By-Zero Exception (#Z)”
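
The masked divide-by-zero response (an infinity whose sign is the exclusive OR of the operand signs) can be sketched as follows. The function name `masked_divide_by_zero` is hypothetical, and the sketch models only the returned value, not the setting of the ZE flag.

```python
import math


def masked_divide_by_zero(dividend, divisor):
    """Return the masked response for (finite non-zero) / (+0 or -0)."""
    if divisor != 0 or dividend == 0 or not math.isfinite(dividend):
        raise ValueError("not a divide-by-zero condition")
    # copysign exposes the sign bit even when the operand is -0.0.
    dividend_negative = math.copysign(1.0, dividend) < 0
    divisor_negative = math.copysign(1.0, divisor) < 0
    result_negative = dividend_negative ^ divisor_negative
    return -math.inf if result_negative else math.inf
```

The sign of the zero divisor matters here: under this rule, dividing 1.0 by -0.0 yields negative infinity, while dividing -2.0 by -0.0 yields positive infinity.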

4.9.1.4

Numeric Overflow Exception (#O)

The processor reports a floating-point numeric overflow exception whenever the rounded result of an instruction exceeds the largest allowable finite value that will fit into the destination operand. Table 4-9 shows the threshold range for numeric overflow for each of the floating-point formats; overflow occurs when a rounded result falls at or outside this threshold range.


Table 4-9. Numeric Overflow Thresholds

Floating-Point Format        Overflow Thresholds
Single Precision             | x | ≥ 1.0 ∗ 2^128
Double Precision             | x | ≥ 1.0 ∗ 2^1024
Double Extended Precision    | x | ≥ 1.0 ∗ 2^16384

When a numeric-overflow exception occurs and the exception is masked, the processor sets the OE flag and returns one of the values shown in Table 4-10, according to the current rounding mode. See Section 4.8.4, “Rounding.”

When numeric overflow occurs and the numeric-overflow exception is not masked, the OE flag is set, a software exception handler is invoked, and the source and destination operands either remain unchanged or a biased result is stored in the destination operand (depending on whether the overflow exception was generated during an SSE/SSE2/SSE3 floating-point operation or an x87 FPU operation).

Table 4-10. Masked Responses to Numeric Overflow

Rounding Mode    Sign of True Result    Result
To nearest       +                      +∞
                 –                      –∞
Toward –∞        +                      Largest finite positive number
                 –                      –∞
Toward +∞        +                      +∞
                 –                      Largest finite negative number
Toward zero      +                      Largest finite positive number
                 –                      Largest finite negative number
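
The masked overflow responses in Table 4-10 amount to a lookup on rounding mode and sign. The sketch below is illustrative only: the function name and the rounding-mode strings (`"nearest"`, `"down"`, `"up"`, `"zero"`) are hypothetical labels, and the largest finite value defaults to that of the double-precision format.

```python
import math
import sys

LARGEST_DOUBLE = sys.float_info.max  # largest finite double-precision value


def masked_overflow_result(negative, rounding_mode, largest=LARGEST_DOUBLE):
    """Look up the masked overflow response for a sign and rounding mode."""
    table = {
        ("nearest", False): math.inf,
        ("nearest", True): -math.inf,
        ("down", False): largest,      # toward -infinity
        ("down", True): -math.inf,
        ("up", False): math.inf,       # toward +infinity
        ("up", True): -largest,
        ("zero", False): largest,
        ("zero", True): -largest,
    }
    return table[(rounding_mode, negative)]
```

The pattern in the table is that an overflowed result becomes an infinity only when the rounding direction points away from zero for that sign; otherwise the largest finite value of that sign is returned.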

See the following sections for information regarding the numeric overflow exception when detected while executing x87 FPU instructions or while executing SSE/SSE2/SSE3 instructions:

• x87 FPU; Section 8.5.4, “Numeric Overflow Exception (#O)”
• SIMD floating-point exceptions; Section 11.5.2.4, “Numeric Overflow Exception (#O)”

4.9.1.5

Numeric Underflow Exception (#U)

The processor detects a floating-point numeric underflow condition whenever the result of rounding with unbounded exponent (taking into account precision control for x87) is tiny; that is, less than the smallest possible normalized, finite value that will fit into the destination operand. Table 4-11 shows the threshold range for numeric underflow for each of the floating-point formats (assuming normalized results); underflow occurs when a rounded result falls strictly within the threshold range. The ability to detect and handle underflow is provided to prevent a very small result from propagating through a computation and causing another exception (such as overflow during division) to be generated at a later time.

Table 4-11. Numeric Underflow (Normalized) Thresholds

Floating-Point Format        Underflow Thresholds*
Single Precision             | x | < 1.0 ∗ 2^−126
Double Precision             | x | < 1.0 ∗ 2^−1022
Double Extended Precision    | x | < 1.0 ∗ 2^−16382

* Where ‘x’ is the result rounded to destination precision with an unbounded exponent range.

How the processor handles an underflow condition depends on two related conditions:

• creation of a tiny result
• creation of an inexact result; that is, a result that cannot be represented exactly in the destination format

Which of these events causes an underflow exception to be reported and how the processor responds to the exception condition depends on whether the underflow exception is masked:

• Underflow exception masked. The underflow exception is reported (the UE flag is set) only when the result is both tiny and inexact. The processor returns a denormalized result to the destination operand, regardless of inexactness.

• Underflow exception not masked. The underflow exception is reported when the result is tiny, regardless of inexactness. The processor leaves the source and destination operands unaltered or stores a biased result in the destination operand (depending on whether the underflow exception was generated during an SSE/SSE2/SSE3 floating-point operation or an x87 FPU operation) and invokes a software exception handler.
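
The two cases above reduce to a small decision function. This is a sketch of the reporting rule only, under the assumption that the caller has already determined whether the result is tiny and whether it is inexact; the function name `underflow_response` is hypothetical.

```python
def underflow_response(tiny, inexact, masked):
    """Return (ue_flag_set, handler_invoked) for an underflow condition."""
    if not tiny:
        return (False, False)      # no underflow condition at all
    if masked:
        # Masked: reported only when the result is both tiny and inexact.
        return (inexact, False)
    # Unmasked: reported whenever the result is tiny; handler is invoked.
    return (True, True)
```

The asymmetry is visible in the case of a tiny but exact result: `underflow_response(True, False, True)` returns `(False, False)`, while `underflow_response(True, False, False)` returns `(True, True)`.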

See the following sections for information regarding the numeric underflow exception when detected while executing x87 FPU instructions or while executing SSE/SSE2/SSE3 instructions:

• x87 FPU; Section 8.5.5, “Numeric Underflow Exception (#U)”
• SIMD floating-point exceptions; Section 11.5.2.5, “Numeric Underflow Exception (#U)”

4.9.1.6

Inexact-Result (Precision) Exception (#P)

The inexact-result exception (also called the precision exception) occurs if the result of an operation is not exactly representable in the destination format. For example, the fraction 1/3 cannot be precisely represented in binary floating-point form. This exception occurs frequently and indicates that some (normally acceptable) accuracy will be lost due to rounding. The exception is supported for applications that need to perform exact arithmetic only. Because the rounded result is generally satisfactory for most applications, this exception is commonly masked.

If the inexact-result exception is masked when an inexact-result condition occurs and a numeric overflow or underflow condition has not occurred, the processor sets the PE flag and stores the rounded result in the destination operand. The current rounding mode determines the method used to round the result. See Section 4.8.4, “Rounding.”

If the inexact-result exception is not masked when an inexact result occurs and numeric overflow or underflow has not occurred, the PE flag is set, the rounded result is stored in the destination operand, and a software exception handler is invoked.

If an inexact result occurs in conjunction with numeric overflow or underflow, one of the following operations is carried out:



• If an inexact result occurs along with masked overflow or underflow, the OE flag or UE flag and the PE flag are set and the result is stored as described for the overflow or underflow exceptions; see Section 4.9.1.4, “Numeric Overflow Exception (#O),” or Section 4.9.1.5, “Numeric Underflow Exception (#U).” If the inexact result exception is unmasked, the processor also invokes a software exception handler.

• If an inexact result occurs along with unmasked overflow or underflow and the destination operand is a register, the OE or UE flag and the PE flag are set, the result is stored as described for the overflow or underflow exceptions, and a software exception handler is invoked.

If an unmasked numeric overflow or underflow exception occurs and the destination operand is a memory location (which can happen only for a floating-point store), the inexact-result condition is not reported and the C1 flag is cleared.

See the following sections for information regarding the inexact-result exception when detected while executing x87 FPU or SSE/SSE2/SSE3 instructions:

• x87 FPU; Section 8.5.6, “Inexact-Result (Precision) Exception (#P)”
• SIMD floating-point exceptions; Section 11.5.2.6, “Inexact-Result (Precision) Exception (#P)”
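
The 1/3 example given in this section can be checked with exact rational arithmetic. The sketch below is an illustration in Python rather than a processor facility: it uses Python's `float` as a stand-in for the double-precision format, and the helper name `is_exact_in_double` is hypothetical.

```python
from fractions import Fraction


def is_exact_in_double(exact):
    """True if the exact rational value survives rounding to double precision."""
    rounded = float(exact)            # rounds to the nearest double
    return Fraction(rounded) == exact  # Fraction(float) converts exactly
```

Values such as 1/4 pass this check because they are sums of powers of two; 1/3 and 1/10 do not, so an operation producing them would be an inexact-result condition in the sense described above.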

4.9.2

Floating-Point Exception Priority

The processor handles exceptions according to a predetermined precedence. When an instruction generates two or more exception conditions, the exception precedence sometimes results in the higher-priority exception being handled and the lower-priority exceptions being ignored. For example, dividing an SNaN by zero can potentially signal an invalid-operation exception (due to the SNaN operand) and a divide-by-zero exception. Here, if both exceptions are masked, the processor handles the higher-priority exception only (the invalid-operation exception), returning a QNaN to the destination. Alternately, a denormal-operand or inexact-result exception can accompany a numeric underflow or overflow exception with both exceptions being handled.

The precedence for floating-point exceptions is as follows:

1. Invalid-operation exception, subdivided as follows:
   a. stack underflow (occurs with x87 FPU only)
   b. stack overflow (occurs with x87 FPU only)
   c. operand of unsupported format (occurs with x87 FPU only when using the double extended-precision floating-point format)
   d. SNaN operand
2. QNaN operand. Though this is not an exception, the handling of a QNaN operand has precedence over lower-priority exceptions. For example, a QNaN divided by zero results in a QNaN, not a zero-divide exception.
3. Any other invalid-operation exception not mentioned above or a divide-by-zero exception.
4. Denormal-operand exception. If masked, then instruction execution continues and a lower-priority exception can occur as well.
5. Numeric overflow and underflow exceptions; possibly in conjunction with the inexact-result exception.
6. Inexact-result exception.

Invalid operation, zero divide, and denormal operand exceptions are detected before a floating-point operation begins. Overflow, underflow, and precision exceptions are not detected until a true result has been computed. When an unmasked pre-operation exception is detected, the destination operand has not yet been updated, and appears as if the offending instruction has not been executed. When an unmasked post-operation exception is detected, the destination operand may be updated with a result, depending on the nature of the exception (except for SSE/SSE2/SSE3 instructions, which do not update their destination operands in such cases).
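
The precedence list above can be approximated by a short selection routine. This is a deliberately simplified sketch, not a description of the hardware: the condition names are hypothetical string labels, only the denormal-operand mask is modeled, and ties within level 3 are returned together because the list does not order them.

```python
# Level 1 conditions in sub-priority order a through d.
LEVEL1 = ["stack_underflow", "stack_overflow", "unsupported_format", "snan_operand"]
LEVEL3 = {"other_invalid", "divide_by_zero"}
POST_OPERATION = {"overflow", "underflow", "inexact"}


def handled_conditions(detected, denormal_masked=True):
    """Return the subset of detected conditions that this model treats as handled."""
    detected = set(detected)
    for name in LEVEL1:                       # level 1: highest one wins
        if name in detected:
            return {name}
    if "qnan_operand" in detected:            # level 2: not an exception,
        return {"qnan_operand"}               # but it pre-empts lower levels
    if detected & LEVEL3:                     # level 3
        return detected & LEVEL3
    handled = set()
    if "denormal_operand" in detected:        # level 4
        handled.add("denormal_operand")
        if not denormal_masked:
            return handled                    # handler runs before the operation
    handled |= detected & POST_OPERATION      # levels 5 and 6
    return handled
```

Using the example from the text, `handled_conditions({"snan_operand", "divide_by_zero"})` keeps only the SNaN condition, whereas a masked denormal operand followed by underflow and an inexact result leaves all three conditions handled.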

4.9.3

Typical Actions of a Floating-Point Exception Handler

After the floating-point exception handler is invoked, the processor handles the exception in the same manner that it handles non-floating-point exceptions. The floating-point exception handler is normally part of the operating system or executive software, and it usually invokes a user-registered floating-point exception handler.

A typical action of the exception handler is to store state information in memory. Other typical exception handler actions include:

• Examining the stored state information to determine the nature of the error
• Taking actions to correct the condition that caused the error
• Clearing the exception flags
• Returning to the interrupted program and resuming normal execution

In lieu of writing recovery procedures, the exception handler can do the following:

• Increment in software an exception counter for later display or printing
• Print or display diagnostic information (such as the state information)
• Halt further program execution


CHAPTER 5
INSTRUCTION SET SUMMARY

This chapter provides an abridged overview of Intel 64 and IA-32 instructions. Instructions are divided into the following groups:

• General purpose
• x87 FPU
• x87 FPU and SIMD state management
• Intel MMX technology
• SSE extensions
• SSE2 extensions
• SSE3 extensions
• SSSE3 extensions
• System instructions
• IA-32e mode: 64-bit mode instructions
• VMX instructions

Table 5-1 lists the groups and IA-32 processors that support each group. Within these groups, most instructions are collected into functional subgroups.

Table 5-1. Instruction Groups and IA-32 Processors

Instruction Set Architecture

Intel 64 and IA-32 Processor Support

General Purpose

All Intel 64 and IA-32 processors

x87 FPU

Intel486, Pentium, Pentium with MMX Technology, Celeron, Pentium Pro, Pentium II, Pentium II Xeon, Pentium III, Pentium III Xeon, Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors

x87 FPU and SIMD State Management

Pentium II, Pentium II Xeon, Pentium III, Pentium III Xeon, Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors

MMX Technology

Pentium with MMX Technology, Celeron, Pentium II, Pentium II Xeon, Pentium III, Pentium III Xeon, Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors

SSE Extensions

Pentium III, Pentium III Xeon, Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors

SSE2 Extensions

Pentium 4, Intel Xeon processors, Pentium M, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors


SSE3 Extensions

Pentium 4 supporting HT Technology (built on 90nm process technology), Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors

SSSE3 Extensions

Intel Xeon processor 5100 series, Intel Core Solo, Intel Core Duo, Intel Core 2 Duo processors

IA-32e mode: 64-bit mode instructions

All Intel 64 processors

System Instructions

All Intel 64 and IA-32 processors

VMX Instructions

All Intel 64 and IA-32 processors supporting Intel Virtualization Technology

The following sections list instructions in each major group and subgroup. Each instruction is listed with its mnemonic and descriptive name. When two or more mnemonics are given (for example, CMOVA/CMOVNBE), they represent different mnemonics for the same instruction opcode. Assemblers support redundant mnemonics for some instructions to make it easier to read code listings. For instance, CMOVA (Conditional move if above) and CMOVNBE (Conditional move if not below or equal) represent the same condition.

For detailed information about specific instructions, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 3A & 3B.
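
The idea of redundant mnemonics sharing one opcode can be illustrated with a small lookup. This sketch is not taken from this chapter: the condition-code numbering and the `0F 40+cc` form of the CMOVcc opcode come from the instruction set reference, and the names `CONDITION_CODES` and `cmov_opcode` are hypothetical.

```python
# Condition-code number (low 4 bits of the opcode) for each mnemonic suffix.
CONDITION_CODES = {
    "O": 0x0, "NO": 0x1,
    "B": 0x2, "NAE": 0x2, "C": 0x2,
    "AE": 0x3, "NB": 0x3, "NC": 0x3,
    "E": 0x4, "Z": 0x4,
    "NE": 0x5, "NZ": 0x5,
    "BE": 0x6, "NA": 0x6,
    "A": 0x7, "NBE": 0x7,
    "S": 0x8, "NS": 0x9,
    "P": 0xA, "PE": 0xA,
    "NP": 0xB, "PO": 0xB,
    "L": 0xC, "NGE": 0xC,
    "GE": 0xD, "NL": 0xD,
    "LE": 0xE, "NG": 0xE,
    "G": 0xF, "NLE": 0xF,
}


def cmov_opcode(mnemonic):
    """Return the two opcode bytes (0F 40+cc) for a CMOVcc mnemonic."""
    name = mnemonic.upper()
    if not name.startswith("CMOV"):
        raise ValueError("not a CMOVcc mnemonic: " + mnemonic)
    cc = CONDITION_CODES[name[4:]]
    return bytes([0x0F, 0x40 + cc])
```

Under this mapping `cmov_opcode("CMOVA")` and `cmov_opcode("CMOVNBE")` return the same two bytes, which is what it means for the two mnemonics to name one instruction.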

5.1

GENERAL-PURPOSE INSTRUCTIONS

The general-purpose instructions perform basic data movement, arithmetic, logic, program flow, and string operations that programmers commonly use to write application and system software to run on Intel 64 and IA-32 processors. They operate on data contained in memory, in the general-purpose registers (EAX, EBX, ECX, EDX, EDI, ESI, EBP, and ESP) and in the EFLAGS register. They also operate on address information contained in memory, the general-purpose registers, and the segment registers (CS, DS, SS, ES, FS, and GS).

This group of instructions includes the data transfer, binary integer arithmetic, decimal arithmetic, logic operations, shift and rotate, bit and byte operations, program control, string, flag control, segment register operations, and miscellaneous subgroups. The sections that follow introduce each subgroup.

For more detailed information on general-purpose instructions, see Chapter 7, “Programming With General-Purpose Instructions.”


5.1.1

Data Transfer Instructions

The data transfer instructions move data between memory and the general-purpose and segment registers. They also perform specific operations such as conditional moves, stack access, and data conversion.

MOV                 Move data between general-purpose registers; move data between memory and general-purpose or segment registers; move immediates to general-purpose registers
CMOVE/CMOVZ         Conditional move if equal/Conditional move if zero
CMOVNE/CMOVNZ       Conditional move if not equal/Conditional move if not zero
CMOVA/CMOVNBE       Conditional move if above/Conditional move if not below or equal
CMOVAE/CMOVNB       Conditional move if above or equal/Conditional move if not below
CMOVB/CMOVNAE       Conditional move if below/Conditional move if not above or equal
CMOVBE/CMOVNA       Conditional move if below or equal/Conditional move if not above
CMOVG/CMOVNLE       Conditional move if greater/Conditional move if not less or equal
CMOVGE/CMOVNL       Conditional move if greater or equal/Conditional move if not less
CMOVL/CMOVNGE       Conditional move if less/Conditional move if not greater or equal
CMOVLE/CMOVNG       Conditional move if less or equal/Conditional move if not greater
CMOVC               Conditional move if carry
CMOVNC              Conditional move if not carry
CMOVO               Conditional move if overflow
CMOVNO              Conditional move if not overflow
CMOVS               Conditional move if sign (negative)
CMOVNS              Conditional move if not sign (non-negative)
CMOVP/CMOVPE        Conditional move if parity/Conditional move if parity even
CMOVNP/CMOVPO       Conditional move if not parity/Conditional move if parity odd
XCHG                Exchange
BSWAP               Byte swap
XADD                Exchange and add
CMPXCHG             Compare and exchange
CMPXCHG8B           Compare and exchange 8 bytes
PUSH                Push onto stack
POP                 Pop off of stack
PUSHA/PUSHAD        Push general-purpose registers onto stack
POPA/POPAD          Pop general-purpose registers from stack
CWD/CDQ             Convert word to doubleword/Convert doubleword to quadword
CBW/CWDE            Convert byte to word/Convert word to doubleword in EAX register
MOVSX               Move and sign extend
MOVZX               Move and zero extend
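
The difference between the last two entries, MOVSX and MOVZX, can be shown on plain integers. This is a sketch of the value transformation only, assuming a 32-bit destination by default; the Python function names simply mirror the mnemonics and are not an API of any library.

```python
def movzx(value, src_bits):
    """Zero-extend: keep the low src_bits, upper destination bits become 0."""
    return value & ((1 << src_bits) - 1)


def movsx(value, src_bits, dst_bits=32):
    """Sign-extend: copy the source sign bit into the upper destination bits."""
    value &= (1 << src_bits) - 1
    if value >> (src_bits - 1):  # source sign bit is set
        upper_ones = ((1 << dst_bits) - 1) ^ ((1 << src_bits) - 1)
        value |= upper_ones
    return value
```

For the byte value 0x80, zero extension gives 0x00000080 while sign extension gives 0xFFFFFF80, because the source is interpreted as the signed value -128 in the second case.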

5.1.2

Binary Arithmetic Instructions

The binary arithmetic instructions perform basic binary integer computations on byte, word, and doubleword integers located in memory and/or the general-purpose registers.

ADD                 Integer add
ADC                 Add with carry
SUB                 Subtract
SBB                 Subtract with borrow
IMUL                Signed multiply
MUL                 Unsigned multiply
IDIV                Signed divide
DIV                 Unsigned divide
INC                 Increment
DEC                 Decrement
NEG                 Negate
CMP                 Compare


5.1.3

Decimal Arithmetic Instructions

The decimal arithmetic instructions perform decimal arithmetic on binary coded decimal (BCD) data.

DAA                 Decimal adjust after addition
DAS                 Decimal adjust after subtraction
AAA                 ASCII adjust after addition
AAS                 ASCII adjust after subtraction
AAM                 ASCII adjust after multiplication
AAD                 ASCII adjust before division

5.1.4

Logical Instructions

The logical instructions perform basic AND, OR, XOR, and NOT logical operations on byte, word, and doubleword values.

AND                 Perform bitwise logical AND
OR                  Perform bitwise logical OR
XOR                 Perform bitwise logical exclusive OR
NOT                 Perform bitwise logical NOT

5.1.5

Shift and Rotate Instructions

The shift and rotate instructions shift and rotate the bits in word and doubleword operands.

SAR                 Shift arithmetic right
SHR                 Shift logical right
SAL/SHL             Shift arithmetic left/Shift logical left
SHRD                Shift right double
SHLD                Shift left double
ROR                 Rotate right
ROL                 Rotate left
RCR                 Rotate through carry right
RCL                 Rotate through carry left
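
The distinction between the arithmetic and logical right shifts, and between shifting and rotating, can be sketched on 32-bit values. The functions below are illustrative models with hypothetical names; they assume a count between 0 and 31 and do not model the effect on the EFLAGS register.

```python
MASK32 = 0xFFFFFFFF


def shr32(x, n):
    """Logical right shift: vacated high bits are filled with zeros."""
    return (x & MASK32) >> n


def sar32(x, n):
    """Arithmetic right shift: vacated high bits copy the sign bit."""
    x &= MASK32
    if x & 0x80000000:
        x -= 1 << 32           # reinterpret as a negative signed value
    return (x >> n) & MASK32   # Python's >> on negatives is arithmetic


def rol32(x, n):
    """Rotate left: bits shifted out at the top re-enter at the bottom."""
    x &= MASK32
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror32(x, n):
    """Rotate right: bits shifted out at the bottom re-enter at the top."""
    x &= MASK32
    n %= 32
    return ((x >> n) | (x << (32 - n))) & MASK32
```

Shifting 0x80000000 right by four positions gives 0x08000000 with the logical form and 0xF8000000 with the arithmetic form; a rotate loses no bits, so `ror32(rol32(x, n), n)` returns the original value.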


5.1.6

Bit and Byte Instructions

Bit instructions test and modify individual bits in word and doubleword operands. Byte instructions set the value of a byte operand to indicate the status of flags in the EFLAGS register.

BT                  Bit test
BTS                 Bit test and set
BTR                 Bit test and reset
BTC                 Bit test and complement
BSF                 Bit scan forward
BSR                 Bit scan reverse
SETE/SETZ           Set byte if equal/Set byte if zero
SETNE/SETNZ         Set byte if not equal/Set byte if not zero
SETA/SETNBE         Set byte if above/Set byte if not below or equal
SETAE/SETNB/SETNC   Set byte if above or equal/Set byte if not below/Set byte if not carry
SETB/SETNAE/SETC    Set byte if below/Set byte if not above or equal/Set byte if carry
SETBE/SETNA         Set byte if below or equal/Set byte if not above
SETG/SETNLE         Set byte if greater/Set byte if not less or equal
SETGE/SETNL         Set byte if greater or equal/Set byte if not less
SETL/SETNGE         Set byte if less/Set byte if not greater or equal
SETLE/SETNG         Set byte if less or equal/Set byte if not greater
SETS                Set byte if sign (negative)
SETNS               Set byte if not sign (non-negative)
SETO                Set byte if overflow
SETNO               Set byte if not overflow
SETPE/SETP          Set byte if parity even/Set byte if parity
SETPO/SETNP         Set byte if parity odd/Set byte if not parity
TEST                Logical compare
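
The bit test and bit scan entries in this list can be modeled on Python integers. This is a rough sketch with hypothetical function names: the "test" variants return the new value together with the old bit (the value the processor would place in the carry flag), and the scan functions return `None` for a zero source, where the hardware instead leaves the destination undefined.

```python
def bt(value, index):
    """Bit test: return the selected bit (0 or 1)."""
    return (value >> index) & 1


def bts(value, index):
    """Bit test and set: return (new value, old bit)."""
    return value | (1 << index), bt(value, index)


def btr(value, index):
    """Bit test and reset: return (new value, old bit)."""
    return value & ~(1 << index), bt(value, index)


def btc(value, index):
    """Bit test and complement: return (new value, old bit)."""
    return value ^ (1 << index), bt(value, index)


def bsf(value):
    """Bit scan forward: index of the least significant set bit."""
    if value == 0:
        return None
    return (value & -value).bit_length() - 1


def bsr(value):
    """Bit scan reverse: index of the most significant set bit."""
    if value == 0:
        return None
    return value.bit_length() - 1
```

For the value 0b101000, the forward scan returns 3 and the reverse scan returns 5.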


5.1.7

Control Transfer Instructions

The control transfer instructions provide jump, conditional jump, loop, and call and return operations to control program flow.

JMP                 Jump
JE/JZ               Jump if equal/Jump if zero
JNE/JNZ             Jump if not equal/Jump if not zero
JA/JNBE             Jump if above/Jump if not below or equal
JAE/JNB             Jump if above or equal/Jump if not below
JB/JNAE             Jump if below/Jump if not above or equal
JBE/JNA             Jump if below or equal/Jump if not above
JG/JNLE             Jump if greater/Jump if not less or equal
JGE/JNL             Jump if greater or equal/Jump if not less
JL/JNGE             Jump if less/Jump if not greater or equal
JLE/JNG             Jump if less or equal/Jump if not greater
JC                  Jump if carry
JNC                 Jump if not carry
JO                  Jump if overflow
JNO                 Jump if not overflow
JS                  Jump if sign (negative)
JNS                 Jump if not sign (non-negative)
JPO/JNP             Jump if parity odd/Jump if not parity
JPE/JP              Jump if parity even/Jump if parity
JCXZ/JECXZ          Jump register CX zero/Jump register ECX zero
LOOP                Loop with ECX counter
LOOPZ/LOOPE         Loop with ECX and zero/Loop with ECX and equal
LOOPNZ/LOOPNE       Loop with ECX and not zero/Loop with ECX and not equal
CALL                Call procedure
RET                 Return
IRET                Return from interrupt
INT                 Software interrupt
INTO                Interrupt on overflow
BOUND               Detect value out of range
ENTER               High-level procedure entry
LEAVE               High-level procedure exit

5.1.8

String Instructions

The string instructions operate on strings of bytes, allowing them to be moved to and from memory.

MOVS/MOVSB          Move string/Move byte string
MOVS/MOVSW          Move string/Move word string
MOVS/MOVSD          Move string/Move doubleword string
CMPS/CMPSB          Compare string/Compare byte string
CMPS/CMPSW          Compare string/Compare word string
CMPS/CMPSD          Compare string/Compare doubleword string
SCAS/SCASB          Scan string/Scan byte string
SCAS/SCASW          Scan string/Scan word string
SCAS/SCASD          Scan string/Scan doubleword string
LODS/LODSB          Load string/Load byte string
LODS/LODSW          Load string/Load word string
LODS/LODSD          Load string/Load doubleword string
STOS/STOSB          Store string/Store byte string
STOS/STOSW          Store string/Store word string
STOS/STOSD          Store string/Store doubleword string
REP                 Repeat while ECX not zero
REPE/REPZ           Repeat while equal/Repeat while zero
REPNE/REPNZ         Repeat while not equal/Repeat while not zero

5.1.9

I/O Instructions

These instructions move data between the processor’s I/O ports and a register or memory.

IN                  Read from a port
OUT                 Write to a port
INS/INSB            Input string from port/Input byte string from port
INS/INSW            Input string from port/Input word string from port
INS/INSD            Input string from port/Input doubleword string from port
OUTS/OUTSB          Output string to port/Output byte string to port
OUTS/OUTSW          Output string to port/Output word string to port
OUTS/OUTSD          Output string to port/Output doubleword string to port

5.1.10

Enter and Leave Instructions

These instructions provide machine-language support for procedure calls in block-structured languages.

ENTER               High-level procedure entry
LEAVE               High-level procedure exit

5.1.11

Flag Control (EFLAG) Instructions

The flag control instructions operate on the flags in the EFLAGS register.

STC                 Set carry flag
CLC                 Clear the carry flag
CMC                 Complement the carry flag
CLD                 Clear the direction flag
STD                 Set direction flag
LAHF                Load flags into AH register
SAHF                Store AH register into flags
PUSHF/PUSHFD        Push EFLAGS onto stack
POPF/POPFD          Pop EFLAGS from stack
STI                 Set interrupt flag
CLI                 Clear the interrupt flag

5.1.12

Segment Register Instructions

The segment register instructions allow far pointers (segment addresses) to be loaded into the segment registers.

LDS                 Load far pointer using DS
LES                 Load far pointer using ES
LFS                 Load far pointer using FS
LGS                 Load far pointer using GS
LSS                 Load far pointer using SS


5.1.13

Miscellaneous Instructions

The miscellaneous instructions provide such functions as loading an effective address, executing a “no-operation,” and retrieving processor identification information.

LEA                 Load effective address
NOP                 No operation
UD2                 Undefined instruction
XLAT/XLATB          Table lookup translation
CPUID               Processor Identification

5.2

X87 FPU INSTRUCTIONS

The x87 FPU instructions are executed by the processor’s x87 FPU. These instructions operate on floating-point, integer, and binary-coded decimal (BCD) operands. For more detail on x87 FPU instructions, see Chapter 8, “Programming with the x87 FPU.”

These instructions are divided into the following subgroups: data transfer, load constants, and FPU control instructions. The sections that follow introduce each subgroup.

5.2.1

x87 FPU Data Transfer Instructions

The data transfer instructions move floating-point, integer, and BCD values between memory and the x87 FPU registers. They also perform conditional move operations on floating-point operands.

FLD                 Load floating-point value
FST                 Store floating-point value
FSTP                Store floating-point value and pop
FILD                Load integer
FIST                Store integer
FISTP¹              Store integer and pop
FBLD                Load BCD
FBSTP               Store BCD and pop
FXCH                Exchange registers
FCMOVE              Floating-point conditional move if equal
FCMOVNE             Floating-point conditional move if not equal
FCMOVB              Floating-point conditional move if below
FCMOVBE             Floating-point conditional move if below or equal
FCMOVNB             Floating-point conditional move if not below
FCMOVNBE            Floating-point conditional move if not below or equal
FCMOVU              Floating-point conditional move if unordered
FCMOVNU             Floating-point conditional move if not unordered

1. SSE3 provides an instruction FISTTP for integer conversion.

5.2.2

x87 FPU Basic Arithmetic Instructions

The basic arithmetic instructions perform basic arithmetic operations on floating-point and integer operands.

FADD

Add floating-point

FADDP

Add floating-point and pop

FIADD

Add integer

FSUB

Subtract floating-point

FSUBP

Subtract floating-point and pop

FISUB

Subtract integer

FSUBR

Subtract floating-point reverse

FSUBRP

Subtract floating-point reverse and pop

FISUBR

Subtract integer reverse

FMUL

Multiply floating-point

FMULP

Multiply floating-point and pop

FIMUL

Multiply integer

FDIV

Divide floating-point

FDIVP

Divide floating-point and pop

FIDIV

Divide integer

FDIVR

Divide floating-point reverse

FDIVRP

Divide floating-point reverse and pop

FIDIVR

Divide integer reverse

FPREM

Partial remainder

FPREM1

IEEE Partial remainder

FABS

Absolute value

FCHS

Change sign

FRNDINT

Round to integer

FSCALE

Scale by power of two

FSQRT

Square root

FXTRACT

Extract exponent and significand
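The difference between FPREM and FPREM1 lies in how the quotient is rounded: FPREM truncates it toward zero, while FPREM1 rounds it to the nearest integer as the IEEE standard specifies. The Python sketch below approximates the final results with standard library functions; the function names are hypothetical, Python floats are double precision rather than the x87 extended format, and the "partial" aspect (the instructions may need several iterations for widely separated exponents) is not modeled.

```python
import math

def fprem(st0: float, st1: float) -> float:
    """Remainder with the quotient truncated toward zero (sign of dividend)."""
    return math.fmod(st0, st1)

def fprem1(st0: float, st1: float) -> float:
    """IEEE remainder: the quotient is rounded to the nearest integer."""
    return math.remainder(st0, st1)

def fscale(st0: float, st1: float) -> float:
    """Scale ST(0) by 2 raised to ST(1) truncated toward zero."""
    return math.ldexp(st0, math.trunc(st1))
```

For example, 5.0 divided by 3.0 has a quotient of about 1.67; truncation gives 1 and a remainder of 2.0, whereas rounding to nearest gives 2 and a remainder of -1.0.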

5.2.3

x87 FPU Comparison Instructions

The compare instructions examine or compare floating-point or integer operands.

FCOM

Compare floating-point

FCOMP

Compare floating-point and pop

FCOMPP

Compare floating-point and pop twice

FUCOM

Unordered compare floating-point

FUCOMP

Unordered compare floating-point and pop

FUCOMPP

Unordered compare floating-point and pop twice

FICOM

Compare integer

FICOMP

Compare integer and pop

FCOMI

Compare floating-point and set EFLAGS

FUCOMI

Unordered compare floating-point and set EFLAGS

FCOMIP

Compare floating-point, set EFLAGS, and pop

FUCOMIP

Unordered compare floating-point, set EFLAGS, and pop

FTST

Test floating-point (compare with 0.0)

FXAM

Examine floating-point
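For the instructions that report their result in EFLAGS (FCOMI, FUCOMI, and their popping forms), the outcome is encoded in ZF, PF, and CF. The following Python sketch models that encoding; the function name is an assumption for this example, and the difference between the ordered and unordered forms (whether a QNaN operand raises the invalid-operation exception) is not modeled.

```python
import math

def fcomi_flags(st0: float, src: float) -> tuple:
    """Return (ZF, PF, CF) as set by comparing ST(0) with the source operand."""
    if math.isnan(st0) or math.isnan(src):
        return (1, 1, 1)  # unordered
    if st0 > src:
        return (0, 0, 0)
    if st0 < src:
        return (0, 0, 1)
    return (1, 0, 0)      # equal
```

This is the same flag pattern an unsigned integer compare produces, which is why the "above" and "below" conditional jumps and FCMOVcc conditions can be used directly after these instructions.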

5.2.4

x87 FPU Transcendental Instructions

The transcendental instructions perform basic trigonometric and logarithmic operations on floating-point operands.

FSIN

Sine

FCOS

Cosine

FSINCOS

Sine and cosine

FPTAN

Partial tangent

FPATAN

Partial arctangent

F2XM1

2^x − 1

FYL2X

y ∗ log_2(x)

FYL2XP1

y ∗ log_2(x + 1)
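The three exponential and logarithmic functions listed above can be approximated in software as shown in this Python sketch. The function names mirror the mnemonics but are otherwise hypothetical; results are double precision rather than x87 extended precision, and raising an exception for an out-of-range F2XM1 operand is a choice made for this model (the instruction itself leaves the result undefined).

```python
import math

def f2xm1(x: float) -> float:
    """2^x - 1; the instruction expects -1.0 <= x <= +1.0."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("operand outside the range supported by F2XM1")
    return math.expm1(x * math.log(2.0))

def fyl2x(x: float, y: float) -> float:
    """y * log2(x), with x taken from ST(0) and y from ST(1)."""
    return y * math.log2(x)

def fyl2xp1(x: float, y: float) -> float:
    """y * log2(x + 1), computed so that small x keeps its precision."""
    return y * math.log1p(x) / math.log(2.0)
```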

5.2.5

x87 FPU Load Constants Instructions

The load constants instructions load common constants, such as π, into the x87 floating-point registers.

FLD1

Load +1.0

FLDZ

Load +0.0

FLDPI

Load π

FLDL2E

Load log_2 e

FLDLN2

Load log_e 2

FLDL2T

Load log_2 10

FLDLG2

Load log_10 2

5.2.6

x87 FPU Control Instructions

The x87 FPU control instructions operate on the x87 FPU register stack and save and restore the x87 FPU state.

FINCSTP

Increment FPU register stack pointer

FDECSTP

Decrement FPU register stack pointer

FFREE

Free floating-point register

FINIT

Initialize FPU after checking error conditions

FNINIT

Initialize FPU without checking error conditions

FCLEX

Clear floating-point exception flags after checking for error conditions

FNCLEX

Clear floating-point exception flags without checking for error conditions

FSTCW

Store FPU control word after checking error conditions

FNSTCW

Store FPU control word without checking error conditions

FLDCW

Load FPU control word

FSTENV

Store FPU environment after checking error conditions

FNSTENV

Store FPU environment without checking error conditions

FLDENV

Load FPU environment

FSAVE

Save FPU state after checking error conditions

FNSAVE

Save FPU state without checking error conditions

FRSTOR

Restore FPU state

FSTSW

Store FPU status word after checking error conditions

FNSTSW

Store FPU status word without checking error conditions

WAIT/FWAIT

Wait for FPU

FNOP

FPU no operation

5.3

X87 FPU AND SIMD STATE MANAGEMENT INSTRUCTIONS

Two state management instructions were introduced into the IA-32 architecture with the Pentium II processor family:

FXSAVE

Save x87 FPU and SIMD state

FXRSTOR

Restore x87 FPU and SIMD state

Initially, these instructions operated only on the x87 FPU (and MMX) registers to perform a fast save and restore, respectively, of the x87 FPU and MMX state. With the introduction of SSE extensions in the Pentium III processor family, these instructions were expanded to also save and restore the state of the XMM and MXCSR registers. Intel 64 architecture also supports these instructions. See Section 10.5, “FXSAVE and FXRSTOR Instructions,” for more detail.

5.4

MMX™ INSTRUCTIONS

Four extensions have been introduced into the IA-32 architecture to permit IA-32 processors to perform single-instruction multiple-data (SIMD) operations. These extensions include the MMX technology, SSE extensions, SSE2 extensions, and SSE3 extensions. For a discussion that puts SIMD instructions in their historical context, see Section 2.2.4, “SIMD Instructions.”

MMX instructions operate on packed byte, word, doubleword, or quadword integer operands contained in memory, in MMX registers, and/or in general-purpose registers. For more detail on these instructions, see Chapter 9, “Programming with Intel® MMX™ Technology.”

MMX instructions can only be executed on Intel 64 and IA-32 processors that support the MMX technology. Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.

MMX instructions are divided into the following subgroups: data transfer, conversion, packed arithmetic, comparison, logical, shift and rotate, and state management instructions. The sections that follow introduce each subgroup.
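As an illustration of CPUID-based detection, the following Python sketch decodes the feature flags that CPUID returns in EDX and ECX when executed with EAX = 1. The function name `simd_support` and the dictionary layout are assumptions made for this example; the bit positions are those documented for the CPUID instruction. Python cannot execute CPUID directly, so the register values are passed in as arguments.

```python
# Feature-flag bit positions reported by CPUID with EAX = 1.
EDX_BITS = {"MMX": 23, "FXSR": 24, "SSE": 25, "SSE2": 26}
ECX_BITS = {"SSE3": 0, "MONITOR": 3, "SSSE3": 9}

def simd_support(edx: int, ecx: int) -> dict:
    """Map each extension name to True if its CPUID feature bit is set."""
    flags = {name: bool((edx >> bit) & 1) for name, bit in EDX_BITS.items()}
    flags.update({name: bool((ecx >> bit) & 1) for name, bit in ECX_BITS.items()})
    return flags
```

Software would typically perform this check once at startup and select a code path accordingly, rather than testing before every SIMD instruction.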

5.4.1

MMX Data Transfer Instructions

The data transfer instructions move doubleword and quadword operands between MMX registers and between MMX registers and memory.

MOVD

Move doubleword

MOVQ

Move quadword

5.4.2

MMX Conversion Instructions

The conversion instructions pack and unpack bytes, words, and doublewords.

PACKSSWB

Pack words into bytes with signed saturation

PACKSSDW

Pack doublewords into words with signed saturation

PACKUSWB

Pack words into bytes with unsigned saturation

PUNPCKHBW

Unpack high-order bytes

PUNPCKHWD

Unpack high-order words

PUNPCKHDQ

Unpack high-order doublewords

PUNPCKLBW

Unpack low-order bytes

PUNPCKLWD

Unpack low-order words

PUNPCKLDQ

Unpack low-order doublewords
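Packing narrows each element and clamps it to the range of the smaller type, while unpacking interleaves elements from two operands. The Python sketch below models PACKSSWB, PACKUSWB, and PUNPCKLBW on Python lists in which element 0 is the least-significant lane; the list representation and function names are assumptions made for illustration.

```python
def _saturate(value: int, low: int, high: int) -> int:
    """Clamp value to the inclusive range [low, high]."""
    return max(low, min(high, value))

def packsswb(dst_words: list, src_words: list) -> list:
    """Pack 4+4 signed words into 8 signed bytes with signed saturation."""
    return [_saturate(w, -128, 127) for w in dst_words + src_words]

def packuswb(dst_words: list, src_words: list) -> list:
    """Pack 4+4 signed words into 8 unsigned bytes with unsigned saturation."""
    return [_saturate(w, 0, 255) for w in dst_words + src_words]

def punpcklbw(dst_bytes: list, src_bytes: list) -> list:
    """Interleave the four low-order bytes of each operand, destination first."""
    out = []
    for d, s in zip(dst_bytes[:4], src_bytes[:4]):
        out += [d, s]
    return out
```

Unpacking against an all-zero source is a common way to zero-extend bytes to words before arithmetic that would otherwise overflow.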

5.4.3

MMX Packed Arithmetic Instructions

The packed arithmetic instructions perform packed integer arithmetic on packed byte, word, and doubleword integers.

PADDB

Add packed byte integers

PADDW

Add packed word integers

PADDD

Add packed doubleword integers

PADDSB

Add packed signed byte integers with signed saturation

PADDSW

Add packed signed word integers with signed saturation

PADDUSB

Add packed unsigned byte integers with unsigned saturation

PADDUSW

Add packed unsigned word integers with unsigned saturation

PSUBB

Subtract packed byte integers

PSUBW

Subtract packed word integers

PSUBD

Subtract packed doubleword integers

PSUBSB

Subtract packed signed byte integers with signed saturation

PSUBSW

Subtract packed signed word integers with signed saturation

PSUBUSB

Subtract packed unsigned byte integers with unsigned saturation

PSUBUSW

Subtract packed unsigned word integers with unsigned saturation

PMULHW

Multiply packed signed word integers and store high result

PMULLW

Multiply packed signed word integers and store low result

PMADDWD

Multiply and add packed word integers
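The three addition modes (wraparound, signed saturation, unsigned saturation) differ only in what happens when a lane overflows. The following Python sketch contrasts them for byte lanes and also models PMADDWD; lanes are represented as list elements, function names follow the mnemonics, and the single PMADDWD corner case in which the 32-bit sum wraps (all four words equal to 8000H) is handled by masking. These representation choices are assumptions made for illustration.

```python
def paddb(a: list, b: list) -> list:
    """Wraparound add of unsigned byte lanes (results modulo 256)."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]

def paddusb(a: list, b: list) -> list:
    """Add unsigned byte lanes, clamping at 255."""
    return [min(x + y, 255) for x, y in zip(a, b)]

def paddsb(a: list, b: list) -> list:
    """Add signed byte lanes, clamping to [-128, 127]."""
    return [max(-128, min(127, x + y)) for x, y in zip(a, b)]

def pmaddwd(a: list, b: list) -> list:
    """Multiply 4 signed word pairs and add adjacent products into 2 doublewords."""
    sums = [a[0] * b[0] + a[1] * b[1], a[2] * b[2] + a[3] * b[3]]
    # Wrap to a signed 32-bit result, as the hardware does.
    return [((s + 2**31) & 0xFFFFFFFF) - 2**31 for s in sums]
```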

5.4.4

MMX Comparison Instructions

The compare instructions compare packed bytes, words, or doublewords.

PCMPEQB

Compare packed bytes for equal

PCMPEQW

Compare packed words for equal

PCMPEQD

Compare packed doublewords for equal

PCMPGTB

Compare packed signed byte integers for greater than

PCMPGTW

Compare packed signed word integers for greater than

PCMPGTD

Compare packed signed doubleword integers for greater than

5.4.5

MMX Logical Instructions

The logical instructions perform AND, AND NOT, OR, and XOR operations on quadword operands.

PAND

Bitwise logical AND

PANDN

Bitwise logical AND NOT

POR

Bitwise logical OR

PXOR

Bitwise logical exclusive OR
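The packed compare instructions produce a mask (all ones or all zeros per element) rather than setting flags, and the logical instructions are then used to select between operands without branching. The Python sketch below shows this idiom for word lanes; lanes hold raw 16-bit values, and the helper names, including `packed_max`, are hypothetical.

```python
MASK16 = 0xFFFF

def _signed16(v: int) -> int:
    """Interpret a raw 16-bit lane value as a signed integer."""
    return v - 0x10000 if v & 0x8000 else v

def pcmpgtw(a: list, b: list) -> list:
    """Per-lane signed greater-than: all ones where a > b, otherwise zero."""
    return [MASK16 if _signed16(x) > _signed16(y) else 0 for x, y in zip(a, b)]

def pand(a: list, b: list) -> list:
    return [x & y for x, y in zip(a, b)]

def pandn(a: list, b: list) -> list:
    """AND NOT: inverts the first (destination) operand, then ANDs with the second."""
    return [(~x & MASK16) & y for x, y in zip(a, b)]

def por(a: list, b: list) -> list:
    return [x | y for x, y in zip(a, b)]

def packed_max(a: list, b: list) -> list:
    """Branch-free signed maximum built from compare, AND, AND NOT, and OR."""
    mask = pcmpgtw(a, b)
    return por(pand(mask, a), pandn(mask, b))
```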

5.4.6

MMX Shift and Rotate Instructions

The shift and rotate instructions shift and rotate packed bytes, words, doublewords, or quadwords in 64-bit operands.

PSLLW

Shift packed words left logical

PSLLD

Shift packed doublewords left logical

PSLLQ

Shift packed quadword left logical

PSRLW

Shift packed words right logical

PSRLD

Shift packed doublewords right logical

PSRLQ

Shift packed quadword right logical

PSRAW

Shift packed words right arithmetic

PSRAD

Shift packed doublewords right arithmetic
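Logical and arithmetic right shifts differ in what is shifted in at the top of each element: zeros for the logical forms, copies of the sign bit for the arithmetic forms. The Python sketch below models PSRLW and PSRAW on lists of raw 16-bit lane values, including the behavior for shift counts larger than 15; the list representation is an assumption made for illustration.

```python
def psrlw(words: list, count: int) -> list:
    """Logical right shift of each word; counts above 15 clear the lane."""
    if count > 15:
        return [0 for _ in words]
    return [(w & 0xFFFF) >> count for w in words]

def psraw(words: list, count: int) -> list:
    """Arithmetic right shift of each word; counts above 15 fill with the sign bit."""
    count = min(count, 15)
    out = []
    for w in words:
        signed = w - 0x10000 if w & 0x8000 else w  # reinterpret as signed
        out.append((signed >> count) & 0xFFFF)
    return out
```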

5.4.7

MMX State Management Instructions

The EMMS instruction clears the MMX state from the MMX registers.

EMMS

Empty MMX state

5.5

SSE INSTRUCTIONS

SSE instructions represent an extension of the SIMD execution model introduced with the MMX technology. For more detail on these instructions, see Chapter 10, “Programming with Streaming SIMD Extensions (SSE).”

SSE instructions can only be executed on Intel 64 and IA-32 processors that support SSE extensions. Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.

SSE instructions are divided into four subgroups (note that the first subgroup has subordinate subgroups of its own):

• SIMD single-precision floating-point instructions that operate on the XMM registers
• MXCSR state management instructions
• 64-bit SIMD integer instructions that operate on the MMX registers
• Cacheability control, prefetch, and instruction ordering instructions

The following sections provide an overview of these groups.

5.5.1

SSE SIMD Single-Precision Floating-Point Instructions

These instructions operate on packed and scalar single-precision floating-point values located in XMM registers and/or memory. This subgroup is further divided into the following subordinate subgroups: data transfer, packed arithmetic, comparison, logical, shuffle and unpack, and conversion instructions.

5.5.1.1

SSE Data Transfer Instructions

SSE data transfer instructions move packed and scalar single-precision floating-point operands between XMM registers and between XMM registers and memory.

MOVAPS

Move four aligned packed single-precision floating-point values between XMM registers or between an XMM register and memory

MOVUPS

Move four unaligned packed single-precision floating-point values between XMM registers or between an XMM register and memory

MOVHPS

Move two packed single-precision floating-point values between the high quadword of an XMM register and memory

MOVHLPS

Move two packed single-precision floating-point values from the high quadword of an XMM register to the low quadword of another XMM register

MOVLPS

Move two packed single-precision floating-point values between the low quadword of an XMM register and memory

MOVLHPS

Move two packed single-precision floating-point values from the low quadword of an XMM register to the high quadword of another XMM register

MOVMSKPS

Extract sign mask from four packed single-precision floating-point values

MOVSS

Move scalar single-precision floating-point value between XMM registers or between an XMM register and memory
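MOVMSKPS gathers the sign bit of each of the four single-precision elements into the low four bits of a general-purpose register. The following Python sketch models this using the `struct` module to obtain the IEEE single-precision encoding of each value; the function name and the list representation (element 0 is the lowest element) are assumptions made for illustration.

```python
import struct

def movmskps(values: list) -> int:
    """Return a 4-bit mask whose bit i is the sign bit of element i."""
    mask = 0
    for i, v in enumerate(values[:4]):
        # Reinterpret the single-precision encoding as a 32-bit integer.
        bits = struct.unpack("<I", struct.pack("<f", v))[0]
        mask |= (bits >> 31) << i
    return mask
```

Because the sign bit is read from the encoding, negative zero contributes a 1 even though it compares equal to positive zero.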

5.5.1.2

SSE Packed Arithmetic Instructions

SSE packed arithmetic instructions perform packed and scalar arithmetic operations on packed and scalar single-precision floating-point operands.

ADDPS

Add packed single-precision floating-point values

ADDSS

Add scalar single-precision floating-point values

SUBPS

Subtract packed single-precision floating-point values

SUBSS

Subtract scalar single-precision floating-point values

MULPS

Multiply packed single-precision floating-point values

MULSS

Multiply scalar single-precision floating-point values

DIVPS

Divide packed single-precision floating-point values

DIVSS

Divide scalar single-precision floating-point values

RCPPS

Compute reciprocals of packed single-precision floating-point values

RCPSS

Compute reciprocal of scalar single-precision floating-point values

SQRTPS

Compute square roots of packed single-precision floating-point values

SQRTSS

Compute square root of scalar single-precision floating-point values

RSQRTPS

Compute reciprocals of square roots of packed single-precision floating-point values

RSQRTSS

Compute reciprocal of square root of scalar single-precision floating-point values

MAXPS

Return maximum packed single-precision floating-point values

MAXSS

Return maximum scalar single-precision floating-point values

MINPS

Return minimum packed single-precision floating-point values

MINSS

Return minimum scalar single-precision floating-point values

5.5.1.3

SSE Comparison Instructions

SSE compare instructions compare packed and scalar single-precision floating-point operands.

CMPPS

Compare packed single-precision floating-point values

CMPSS

Compare scalar single-precision floating-point values

COMISS

Perform ordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register

UCOMISS

Perform unordered comparison of scalar single-precision floating-point values and set flags in EFLAGS register

5.5.1.4

SSE Logical Instructions

SSE logical instructions perform bitwise AND, AND NOT, OR, and XOR operations on packed single-precision floating-point operands.

ANDPS

Perform bitwise logical AND of packed single-precision floating-point values

ANDNPS

Perform bitwise logical AND NOT of packed single-precision floating-point values

ORPS

Perform bitwise logical OR of packed single-precision floating-point values

XORPS

Perform bitwise logical XOR of packed single-precision floating-point values

5.5.1.5

SSE Shuffle and Unpack Instructions

SSE shuffle and unpack instructions shuffle or interleave single-precision floating-point values in packed single-precision floating-point operands.

SHUFPS

Shuffles values in packed single-precision floating-point operands

UNPCKHPS

Unpacks and interleaves the two high-order values from two single-precision floating-point operands

UNPCKLPS

Unpacks and interleaves the two low-order values from two single-precision floating-point operands
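The element selection performed by SHUFPS is controlled by an 8-bit immediate: the two low result elements are chosen from the destination operand and the two high result elements from the source operand, with two immediate bits per selection. The Python sketch below models this together with the two unpack instructions; operands are lists with element 0 as the lowest element, which is a representation chosen for this example.

```python
def shufps(dst: list, src: list, imm8: int) -> list:
    """Select two elements from dst (low half) and two from src (high half)."""
    return [
        dst[imm8 & 3],
        dst[(imm8 >> 2) & 3],
        src[(imm8 >> 4) & 3],
        src[(imm8 >> 6) & 3],
    ]

def unpcklps(dst: list, src: list) -> list:
    """Interleave the two low-order elements of dst and src."""
    return [dst[0], src[0], dst[1], src[1]]

def unpckhps(dst: list, src: list) -> list:
    """Interleave the two high-order elements of dst and src."""
    return [dst[2], src[2], dst[3], src[3]]
```

Passing the same register as both operands turns SHUFPS into an arbitrary permutation of a single vector; for instance, an immediate of 0 broadcasts element 0 to all four positions.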

5.5.1.6

SSE Conversion Instructions

SSE conversion instructions convert packed and individual doubleword integers into packed and scalar single-precision floating-point values and vice versa.

CVTPI2PS

Convert packed doubleword integers to packed single-precision floating-point values

CVTSI2SS

Convert doubleword integer to scalar single-precision floating-point value

CVTPS2PI

Convert packed single-precision floating-point values to packed doubleword integers

CVTTPS2PI

Convert with truncation packed single-precision floating-point values to packed doubleword integers

CVTSS2SI

Convert a scalar single-precision floating-point value to a doubleword integer

CVTTSS2SI

Convert with truncation a scalar single-precision floating-point value to a scalar doubleword integer
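The conversion instructions come in two forms: the plain forms round according to the rounding mode in MXCSR, and the "with truncation" forms always round toward zero. The Python sketch below contrasts the two for the scalar case. It assumes the default MXCSR rounding mode (round to nearest, ties to even), which matches the behavior of Python's built-in `round`, and it returns the integer indefinite value 80000000H for NaN, infinite, or out-of-range inputs; the function names are hypothetical.

```python
import math

INT32_INDEFINITE = -2**31  # encoded as 80000000H

def _to_int32(value: float, rounder) -> int:
    """Apply rounder and return the result, or the indefinite value if it does not fit."""
    if math.isnan(value) or math.isinf(value):
        return INT32_INDEFINITE
    result = rounder(value)
    if -2**31 <= result <= 2**31 - 1:
        return result
    return INT32_INDEFINITE

def cvtss2si(value: float) -> int:
    """Convert using round-to-nearest-even (the default MXCSR mode)."""
    return _to_int32(value, round)

def cvttss2si(value: float) -> int:
    """Convert with truncation toward zero, regardless of rounding mode."""
    return _to_int32(value, math.trunc)
```

The truncating forms match the semantics of a C-style float-to-integer cast, which avoids changing the rounding mode around each conversion.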

5.5.2

SSE MXCSR State Management Instructions

MXCSR state management instructions allow saving and restoring the state of the MXCSR control and status register.

LDMXCSR

Load MXCSR register

STMXCSR

Save MXCSR register state

5.5.3

SSE 64-Bit SIMD Integer Instructions

These SSE 64-bit SIMD integer instructions perform additional operations on packed bytes, words, or doublewords contained in MMX registers. They represent enhancements to the MMX instruction set described in Section 5.4, “MMX™ Instructions.”

PAVGB

Compute average of packed unsigned byte integers

PAVGW

Compute average of packed unsigned word integers

PEXTRW

Extract word

PINSRW

Insert word

PMAXUB

Maximum of packed unsigned byte integers

PMAXSW

Maximum of packed signed word integers

PMINUB

Minimum of packed unsigned byte integers

PMINSW

Minimum of packed signed word integers

PMOVMSKB

Move byte mask

PMULHUW

Multiply packed unsigned integers and store high result

PSADBW

Compute sum of absolute differences

PSHUFW

Shuffle packed integer word in MMX register
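Two of these instructions are easiest to understand from their arithmetic. PAVGB computes a rounded average, (a + b + 1) >> 1, in each byte lane, and PSADBW sums the absolute differences of the eight unsigned byte pairs into a single word. The Python sketch below models both on lists of unsigned byte values; the list representation is an assumption made for illustration.

```python
def pavgb(a: list, b: list) -> list:
    """Rounded average of unsigned byte lanes; the intermediate sum cannot overflow."""
    return [(x + y + 1) >> 1 for x, y in zip(a, b)]

def psadbw(a: list, b: list) -> int:
    """Sum of absolute differences of eight unsigned byte pairs."""
    return sum(abs(x - y) for x, y in zip(a[:8], b[:8]))
```

The sum of absolute differences is a common block-matching metric in video motion estimation, where it is evaluated for many candidate positions.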

5.5.4

SSE Cacheability Control, Prefetch, and Instruction Ordering Instructions

The cacheability control instructions provide control over the caching of non-temporal data when storing data from the MMX and XMM registers to memory. The PREFETCHh instruction allows data to be prefetched to a selected cache level. The SFENCE instruction controls instruction ordering on store operations.

MASKMOVQ

Non-temporal store of selected bytes from an MMX register into memory

MOVNTQ

Non-temporal store of quadword from an MMX register into memory

MOVNTPS

Non-temporal store of four packed single-precision floating-point values from an XMM register into memory

PREFETCHh

Load 32 or more bytes from memory to a selected level of the processor’s cache hierarchy

SFENCE

Serializes store operations

5.6

SSE2 INSTRUCTIONS

SSE2 extensions represent an extension of the SIMD execution model introduced with MMX technology and the SSE extensions. SSE2 instructions operate on packed double-precision floating-point operands and on packed byte, word, doubleword, and quadword operands located in the XMM registers. For more detail on these instructions, see Chapter 11, “Programming with Streaming SIMD Extensions 2 (SSE2).”

SSE2 instructions can only be executed on Intel 64 and IA-32 processors that support the SSE2 extensions. Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.

These instructions are divided into four subgroups (note that the first subgroup is further divided into subordinate subgroups):

• Packed and scalar double-precision floating-point instructions
• Packed single-precision floating-point conversion instructions
• 128-bit SIMD integer instructions
• Cacheability-control and instruction ordering instructions

The following sections give an overview of each subgroup.

5.6.1

SSE2 Packed and Scalar Double-Precision Floating-Point Instructions

SSE2 packed and scalar double-precision floating-point instructions are divided into the following subordinate subgroups: data movement, arithmetic, comparison, conversion, logical, and shuffle operations on double-precision floating-point operands. These are introduced in the sections that follow.

5.6.1.1

SSE2 Data Movement Instructions

SSE2 data movement instructions move double-precision floating-point data between XMM registers and between XMM registers and memory.

MOVAPD

Move two aligned packed double-precision floating-point values between XMM registers or between an XMM register and memory

MOVUPD

Move two unaligned packed double-precision floating-point values between XMM registers or between an XMM register and memory

MOVHPD

Move high packed double-precision floating-point value between the high quadword of an XMM register and memory

MOVLPD

Move low packed double-precision floating-point value between the low quadword of an XMM register and memory

MOVMSKPD

Extract sign mask from two packed double-precision floating-point values

MOVSD

Move scalar double-precision floating-point value between XMM registers or between an XMM register and memory

5.6.1.2

SSE2 Packed Arithmetic Instructions

The arithmetic instructions perform addition, subtraction, multiply, divide, square root, and maximum/minimum operations on packed and scalar double-precision floating-point operands.

ADDPD

Add packed double-precision floating-point values

ADDSD

Add scalar double-precision floating-point values

SUBPD

Subtract packed double-precision floating-point values

SUBSD

Subtract scalar double-precision floating-point values

MULPD

Multiply packed double-precision floating-point values

MULSD

Multiply scalar double-precision floating-point values

DIVPD

Divide packed double-precision floating-point values

DIVSD

Divide scalar double-precision floating-point values

SQRTPD

Compute packed square roots of packed double-precision floating-point values

SQRTSD

Compute scalar square root of scalar double-precision floating-point values

MAXPD

Return maximum packed double-precision floating-point values

MAXSD

Return maximum scalar double-precision floating-point values

MINPD

Return minimum packed double-precision floating-point values

MINSD

Return minimum scalar double-precision floating-point values

5.6.1.3

SSE2 Logical Instructions

SSE2 logical instructions perform AND, AND NOT, OR, and XOR operations on packed double-precision floating-point values.

ANDPD

Perform bitwise logical AND of packed double-precision floating-point values

ANDNPD

Perform bitwise logical AND NOT of packed double-precision floating-point values

ORPD

Perform bitwise logical OR of packed double-precision floating-point values

XORPD

Perform bitwise logical XOR of packed double-precision floating-point values

5.6.1.4

SSE2 Compare Instructions

SSE2 compare instructions compare packed and scalar double-precision floating-point values and return the results of the comparison either to the destination operand or to the EFLAGS register.

CMPPD

Compare packed double-precision floating-point values

CMPSD

Compare scalar double-precision floating-point values

COMISD

Perform ordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register

UCOMISD

Perform unordered comparison of scalar double-precision floating-point values and set flags in EFLAGS register

5.6.1.5

SSE2 Shuffle and Unpack Instructions

SSE2 shuffle and unpack instructions shuffle or interleave double-precision floating-point values in packed double-precision floating-point operands.

SHUFPD

Shuffles values in packed double-precision floating-point operands

UNPCKHPD

Unpacks and interleaves the high values from two packed double-precision floating-point operands

UNPCKLPD

Unpacks and interleaves the low values from two packed double-precision floating-point operands

5.6.1.6

SSE2 Conversion Instructions

SSE2 conversion instructions convert packed and individual doubleword integers into packed and scalar double-precision floating-point values and vice versa. They also convert between packed and scalar single-precision and double-precision floating-point values.

CVTPD2PI

Convert packed double-precision floating-point values to packed doubleword integers

CVTTPD2PI

Convert with truncation packed double-precision floating-point values to packed doubleword integers

CVTPI2PD

Convert packed doubleword integers to packed double-precision floating-point values

CVTPD2DQ

Convert packed double-precision floating-point values to packed doubleword integers

CVTTPD2DQ

Convert with truncation packed double-precision floating-point values to packed doubleword integers

CVTDQ2PD

Convert packed doubleword integers to packed double-precision floating-point values

CVTPS2PD

Convert packed single-precision floating-point values to packed double-precision floating-point values

CVTPD2PS

Convert packed double-precision floating-point values to packed single-precision floating-point values

CVTSS2SD

Convert scalar single-precision floating-point values to scalar double-precision floating-point values

CVTSD2SS

Convert scalar double-precision floating-point values to scalar single-precision floating-point values

CVTSD2SI

Convert scalar double-precision floating-point values to a doubleword integer

CVTTSD2SI

Convert with truncation scalar double-precision floating-point values to scalar doubleword integers

CVTSI2SD

Convert doubleword integer to scalar double-precision floating-point value

5.6.2

SSE2 Packed Single-Precision Floating-Point Instructions

SSE2 packed single-precision floating-point instructions perform conversion operations on single-precision floating-point and integer operands. These instructions represent enhancements to the SSE single-precision floating-point instructions.

CVTDQ2PS

Convert packed doubleword integers to packed single-precision floating-point values

CVTPS2DQ

Convert packed single-precision floating-point values to packed doubleword integers

CVTTPS2DQ

Convert with truncation packed single-precision floating-point values to packed doubleword integers

5.6.3

SSE2 128-Bit SIMD Integer Instructions

SSE2 SIMD integer instructions perform additional operations on packed words, doublewords, and quadwords contained in XMM and MMX registers.

MOVDQA

Move aligned double quadword

MOVDQU

Move unaligned double quadword

MOVQ2DQ

Move quadword integer from MMX to XMM registers

MOVDQ2Q

Move quadword integer from XMM to MMX registers

PMULUDQ

Multiply packed unsigned doubleword integers

PADDQ

Add packed quadword integers

PSUBQ

Subtract packed quadword integers

PSHUFLW

Shuffle packed low words

PSHUFHW

Shuffle packed high words

PSHUFD

Shuffle packed doublewords

PSLLDQ

Shift double quadword left logical

PSRLDQ

Shift double quadword right logical

PUNPCKHQDQ

Unpack high quadwords

PUNPCKLQDQ

Unpack low quadwords
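As an example of the shuffle instructions in this group, PSHUFD builds each of the four result doublewords by selecting any doubleword of the source, using two bits of an 8-bit immediate per result element. The following Python sketch models that selection; the source operand is a list with element 0 as the lowest doubleword, a representation chosen for this example.

```python
def pshufd(src: list, imm8: int) -> list:
    """Result element i is the source element chosen by bits [2i+1:2i] of imm8."""
    return [src[(imm8 >> (2 * i)) & 3] for i in range(4)]
```

An immediate of 1BH reverses the four doublewords, E4H leaves them in place, and 00H broadcasts the lowest doubleword to all positions.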

5.6.4

SSE2 Cacheability Control and Ordering Instructions

SSE2 cacheability control instructions provide additional operations for caching of non-temporal data when storing data from XMM registers to memory. LFENCE and MFENCE provide additional control of instruction ordering on load and store operations.

CLFLUSH

Flushes and invalidates a memory operand and its associated cache line from all levels of the processor’s cache hierarchy

LFENCE

Serializes load operations

MFENCE

Serializes load and store operations

PAUSE

Improves the performance of “spin-wait loops”

MASKMOVDQU

Non-temporal store of selected bytes from an XMM register into memory

MOVNTPD

Non-temporal store of two packed double-precision floating-point values from an XMM register into memory

MOVNTDQ

Non-temporal store of double quadword from an XMM register into memory

MOVNTI

Non-temporal store of a doubleword from a general-purpose register into memory

5.7

SSE3 INSTRUCTIONS

The SSE3 extensions offer 13 instructions that accelerate performance of Streaming SIMD Extensions technology, Streaming SIMD Extensions 2 technology, and x87-FP math capabilities. These instructions can be grouped into the following categories:

• One x87 FPU instruction used in integer conversion
• One SIMD integer instruction that addresses unaligned data loads
• Two SIMD floating-point packed ADD/SUB instructions
• Four SIMD floating-point horizontal ADD/SUB instructions
• Three SIMD floating-point LOAD/MOVE/DUPLICATE instructions
• Two thread synchronization instructions

SSE3 instructions can only be executed on Intel 64 and IA-32 processors that support SSE3 extensions. Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.

The sections that follow describe each subgroup.

5.7.1

SSE3 x87-FP Integer Conversion Instruction

FISTTP

Behaves like the FISTP instruction but uses truncation, irrespective of the rounding mode specified in the floating-point control word (FCW)

5.7.2

SSE3 Specialized 128-bit Unaligned Data Load Instruction

LDDQU

Special 128-bit unaligned load designed to avoid cache line splits

5.7.3

SSE3 SIMD Floating-Point Packed ADD/SUB Instructions

ADDSUBPS

Performs single-precision addition on the second and fourth pairs of 32-bit data elements within the operands; single-precision subtraction on the first and third pairs

ADDSUBPD

Performs double-precision addition on the second pair of quadwords, and double-precision subtraction on the first pair
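The alternating pattern of ADDSUBPS and ADDSUBPD can be sketched as follows. In this Python model the operands are lists with element 0 as the lowest element, so the "first" pair is index 0; the single function name `addsub` covering both element counts is an assumption made for this example.

```python
def addsub(dst: list, src: list) -> list:
    """Subtract in even-indexed lanes (first, third) and add in odd-indexed lanes."""
    return [d - s if i % 2 == 0 else d + s
            for i, (d, s) in enumerate(zip(dst, src))]
```

This pattern is useful for complex multiplication, where the real part needs a difference of products and the imaginary part needs a sum.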

5.7.4

SSE3 SIMD Floating-Point Horizontal ADD/SUB Instructions

HADDPS

Performs a single-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the third and fourth elements of the first operand; the third by adding the first and second elements of the second operand; and the fourth by adding the third and fourth elements of the second operand.

HSUBPS

Performs a single-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the fourth element of the first operand from the third element of the first operand; the third by subtracting the second element of the second operand from the first element of the second operand; and the fourth by subtracting the fourth element of the second operand from the third element of the second operand.

HADDPD

Performs a double-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the first and second elements of the second operand.

HSUBPD

Performs a double-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the second element of the second operand from the first element of the second operand.
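The element ordering of the horizontal instructions is easy to misread, so the following sketch spells it out. It is an illustrative model only: the helper names `hadd` and `hsub` are hypothetical, operands are lists with element 0 first, and floating-point precision is not modeled.

```python
def _pairwise(values, operation):
    """Combine adjacent elements (0,1), (2,3), ... of one operand."""
    return [operation(values[i], values[i + 1]) for i in range(0, len(values), 2)]


def hadd(first, second):
    """Model HADDPS/HADDPD: results from the first operand come first."""
    return (_pairwise(first, lambda a, b: a + b)
            + _pairwise(second, lambda a, b: a + b))


def hsub(first, second):
    """Model HSUBPS/HSUBPD: the higher element is subtracted from the lower."""
    return (_pairwise(first, lambda a, b: a - b)
            + _pairwise(second, lambda a, b: a - b))


print(hadd([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]))
print(hsub([1.0, 2.0], [3.0, 5.0]))
```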

5.7.5

SSE3 SIMD Floating-Point LOAD/MOVE/DUPLICATE Instructions

MOVSHDUP

Loads/moves 128 bits; duplicating the second and fourth 32-bit data elements

MOVSLDUP

Loads/moves 128 bits; duplicating the first and third 32-bit data elements

MOVDDUP

Loads/moves 64 bits (bits[63:0] if the source is a register) and returns the same 64 bits in both the lower and upper halves of the 128-bit result register; duplicates the 64 bits from the source
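As a quick illustration of which elements each duplicate instruction selects, the sketch below models the three results on Python lists. The function names mirror the mnemonics but are hypothetical helpers, and element 0 is taken to be the least significant element.

```python
def movshdup(source):
    """Duplicate the second and fourth 32-bit elements of a 4-element source."""
    return [source[1], source[1], source[3], source[3]]


def movsldup(source):
    """Duplicate the first and third 32-bit elements of a 4-element source."""
    return [source[0], source[0], source[2], source[2]]


def movddup(low_quadword):
    """Return the same 64-bit value in both halves of the 128-bit result."""
    return [low_quadword, low_quadword]


print(movshdup(["a", "b", "c", "d"]))
print(movsldup(["a", "b", "c", "d"]))
print(movddup(0x1122334455667788))
```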

5.7.6

SSE3 Agent Synchronization Instructions

MONITOR

Sets up an address range used to monitor write-back stores

MWAIT

Enables a logical processor to enter into an optimized state while waiting for a write-back store to the address range set up by the MONITOR instruction


5.8

SUPPLEMENTAL STREAMING SIMD EXTENSIONS 3 (SSSE3) INSTRUCTIONS

SSSE3 provides 32 instructions (represented by 14 mnemonics) to accelerate computations on packed integers. These include:

• Twelve instructions that perform horizontal addition or subtraction operations.
• Six instructions that evaluate absolute values.
• Two instructions that perform multiply and add operations and speed up the evaluation of dot products.
• Two instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
• Two instructions that perform a byte-wise, in-place shuffle according to the second shuffle control operand.
• Six instructions that negate packed integers in the destination operand if the corresponding element in the source operand is less than zero.
• Two instructions that align data from the composite of two operands.

SSSE3 instructions can only be executed on Intel 64 and IA-32 processors that support SSSE3 extensions. Support for these instructions can be detected with the CPUID instruction. See the description of the CPUID instruction in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A. The sections that follow describe each subgroup.

5.8.1

Horizontal Addition/Subtraction

PHADDW

Adds two adjacent, signed 16-bit integers horizontally from the source and destination operands and packs the signed 16-bit results to the destination operand.

PHADDSW

Adds two adjacent, signed 16-bit integers horizontally from the source and destination operands and packs the signed, saturated 16-bit results to the destination operand.

PHADDD

Adds two adjacent, signed 32-bit integers horizontally from the source and destination operands and packs the signed 32-bit results to the destination operand.

PHSUBW

Performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the most significant word from the least significant word of each pair in the source and destination operands. The signed 16-bit results are packed and written to the destination operand.

PHSUBSW

Performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the most significant word from the least significant word of each pair in the source and destination operands. The signed, saturated 16-bit results are packed and written to the destination operand.

PHSUBD

Performs horizontal subtraction on each adjacent pair of 32-bit signed integers by subtracting the most significant doubleword from the least significant doubleword of each pair in the source and destination operands. The signed 32-bit results are packed and written to the destination operand.

5.8.2

Packed Absolute Values

PABSB

Computes the absolute value of each signed byte data element.

PABSW

Computes the absolute value of each signed 16-bit data element.

PABSD

Computes the absolute value of each signed 32-bit data element.

5.8.3

Multiply and Add Packed Signed and Unsigned Bytes

PMADDUBSW

Multiplies each unsigned byte value with the corresponding signed byte value to produce an intermediate, 16-bit signed integer. Each adjacent pair of 16-bit signed values is added horizontally. The signed, saturated 16-bit results are packed to the destination operand.

5.8.4

Packed Multiply High with Round and Scale

PMULHRSW

Multiplies vertically each signed 16-bit integer from the destination operand with the corresponding signed 16-bit integer of the source operand, producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and is packed to the destination operand.

5.8.5

Packed Shuffle Bytes

PSHUFB

Permutes each byte in place, according to a shuffle control mask. The least significant three or four bits of each shuffle control byte of the control mask form the shuffle index. The shuffle mask is unaffected. If the most significant bit (bit 7) of a shuffle control byte is set, the constant zero is written in the result byte.
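The PMULHRSW rounding sequence and the PSHUFB index selection described above can be sketched as follows. This is a hedged model on Python integers and lists: the helper names are hypothetical, `pmulhrsw_lane` handles a single 16-bit lane, and `pshufb` assumes an operand of 8 bytes (three index bits) or 16 bytes (four index bits).

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def pmulhrsw_lane(a, b):
    """Model one lane of PMULHRSW; a and b are signed 16-bit integers."""
    product = a * b            # signed 32-bit intermediate
    top18 = product >> 14      # keep the 18 most significant bits
    rounded = top18 + 1        # add 1 to the least significant bit
    return to_signed16(rounded >> 1)  # 16 bits right of the top bit


def pshufb(data, control):
    """Model PSHUFB on lists of byte values (8 or 16 entries each)."""
    size = len(data)
    if size not in (8, 16) or len(control) != size:
        raise ValueError("operands must both hold 8 or 16 bytes")
    index_mask = size - 1      # 0x07 for 64-bit, 0x0F for 128-bit operands
    return [0 if c & 0x80 else data[c & index_mask] for c in control]


# 0x4000 is 0.5 in a Q15 fixed-point reading, so the product is 0.25 (0x2000).
print(hex(pmulhrsw_lane(0x4000, 0x4000)))
print(pshufb(list(range(100, 108)), [7, 6, 5, 4, 0x80, 2, 1, 0]))
```

Reading the 16-bit lanes as Q15 fixed-point values is one common interpretation of the "round and scale" wording; the source text itself does not name a fixed-point format.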


5.8.6

Packed Sign

PSIGNB/W/D

Negates each signed integer element of the destination operand if the corresponding data element in the source operand is less than zero.

5.8.7

Packed Align Right

PALIGNR

The source operand is appended after the destination operand, forming an intermediate value of twice the width of an operand. The result is extracted from the intermediate value into the destination operand by selecting the 128-bit or 64-bit value that is right-aligned to the byte offset specified by the immediate value.

5.9

SYSTEM INSTRUCTIONS

The following system instructions are used to control those functions of the processor that are provided to support operating systems and executives.

LGDT

Load global descriptor table (GDT) register

SGDT

Store global descriptor table (GDT) register

LLDT

Load local descriptor table (LDT) register

SLDT

Store local descriptor table (LDT) register

LTR

Load task register

STR

Store task register

LIDT

Load interrupt descriptor table (IDT) register

SIDT

Store interrupt descriptor table (IDT) register

MOV

Load and store control registers

LMSW

Load machine status word

SMSW

Store machine status word

CLTS

Clear the task-switched flag

ARPL

Adjust requested privilege level

LAR

Load access rights

LSL

Load segment limit

VERR

Verify segment for reading

VERW

Verify segment for writing

MOV

Load and store debug registers

INVD

Invalidate cache, no writeback

WBINVD

Invalidate cache, with writeback

INVLPG

Invalidate TLB Entry

LOCK (prefix)

Lock Bus

HLT

Halt processor

RSM

Return from system management mode (SMM)

RDMSR

Read model-specific register

WRMSR

Write model-specific register

RDPMC

Read performance monitoring counters

RDTSC

Read time stamp counter

SYSENTER

Fast System Call, transfers to a flat protected mode kernel at CPL = 0

SYSEXIT

Fast return from a fast system call, transfers to flat protected mode user code at CPL = 3

5.10

64-BIT MODE INSTRUCTIONS

The following instructions are introduced in 64-bit mode. This mode is a sub-mode of IA-32e mode.

CDQE

Convert doubleword to quadword

CMPSQ

Compare string operands

CMPXCHG16B

Compare RDX:RAX with m128

LODSQ

Load qword at address (R)SI into RAX

MOVSQ

Move qword from address (R)SI to (R)DI

MOVZX (64-bits)

Move doubleword to quadword, zero-extension

STOSQ

Store RAX at address RDI

SWAPGS

Exchanges current GS base register value with value in MSR address C0000102H

SYSCALL

Fast call to privilege level 0 system procedures

SYSRET

Return from fast system call

5.11

VIRTUAL-MACHINE EXTENSIONS

The behavior of the VMCS-maintenance instructions is summarized below:

VMPTRLD

Takes a single 64-bit source operand in memory. It makes the referenced VMCS active and current.

VMPTRST

Takes a single 64-bit destination operand that is in memory. The current-VMCS pointer is stored into the destination operand.

VMCLEAR

Takes a single 64-bit operand in memory. The instruction sets the launch state of the VMCS referenced by the operand to “clear”, renders that VMCS inactive, and ensures that data for the VMCS have been written to the VMCS-data area in the referenced VMCS region.

VMREAD

Reads a component from the VMCS (the encoding of that field is given in a register operand) and stores it into a destination operand.

VMWRITE

Writes a component to the VMCS (the encoding of that field is given in a register operand) from a source operand.

The behavior of the VMX management instructions is summarized below:

VMCALL

Allows a guest in VMX non-root operation to call the VMM for service. A VM exit occurs, transferring control to the VMM.

VMLAUNCH

Launches a virtual machine managed by the VMCS. A VM entry occurs, transferring control to the VM.

VMRESUME

Resumes a virtual machine managed by the VMCS. A VM entry occurs, transferring control to the VM.

VMXOFF

Causes the processor to leave VMX operation.

VMXON

Takes a single 64-bit source operand in memory. It causes a logical processor to enter VMX root operation and to use the memory referenced by the operand to support VMX operation.


CHAPTER 6 PROCEDURE CALLS, INTERRUPTS, AND EXCEPTIONS

This chapter describes the facilities in the Intel 64 and IA-32 architectures for executing calls to procedures or subroutines. It also describes how interrupts and exceptions are handled from the perspective of an application programmer.

6.1

PROCEDURE CALL TYPES

The processor supports procedure calls in the following two different ways:

• CALL and RET instructions.
• ENTER and LEAVE instructions, in conjunction with the CALL and RET instructions.

Both of these procedure call mechanisms use the procedure stack, commonly referred to simply as “the stack,” to save the state of the calling procedure, pass parameters to the called procedure, and store local variables for the currently executing procedure. The processor’s facilities for handling interrupts and exceptions are similar to those used by the CALL and RET instructions.

6.2

STACKS

The stack (see Figure 6-1) is a contiguous array of memory locations. It is contained in a segment and identified by the segment selector in the SS register. When using the flat memory model, the stack can be located anywhere in the linear address space for the program. A stack can be up to 4 GBytes long, the maximum size of a segment.

Items are placed on the stack using the PUSH instruction and removed from the stack using the POP instruction. When an item is pushed onto the stack, the processor decrements the ESP register, then writes the item at the new top of stack. When an item is popped off the stack, the processor reads the item from the top of stack, then increments the ESP register. In this manner, the stack grows down in memory (towards lesser addresses) when items are pushed on the stack and shrinks up (towards greater addresses) when the items are popped from the stack.

A program or operating system/executive can set up many stacks. For example, in multitasking systems, each task can be given its own stack. The number of stacks in a system is limited by the maximum number of segments and the available physical memory.


When a system sets up many stacks, only one stack (the current stack) is available at a time. The current stack is the one contained in the segment referenced by the SS register.

[Figure 6-1. Stack Structure. The figure shows a stack segment running from the bottom of the stack (initial ESP value) to the top of the stack: local variables for the calling procedure, parameters passed to the called procedure, the frame boundary, and the return instruction pointer, with the EBP and ESP registers marked. The stack can be 16 or 32 bits wide. Pushes move the top of stack to lower addresses; pops move it to higher addresses. The EBP register is typically set to point to the return instruction pointer.]

The processor references the SS register automatically for all stack operations. For example, when the ESP register is used as a memory address, it automatically points to an address in the current stack. Also, the CALL, RET, PUSH, POP, ENTER, and LEAVE instructions all perform operations on the current stack.
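The push and pop ordering described in this section (decrement then write, read then increment) can be modeled with a small sketch. The class name `Stack32`, the sparse dictionary used as memory, and the initial ESP value are illustrative assumptions; segmentation and limit checks are not modeled.

```python
class Stack32:
    """Toy model of a 32-bit wide stack addressed through ESP."""

    def __init__(self, initial_esp):
        self.esp = initial_esp   # bottom of stack (initial ESP value)
        self.memory = {}         # sparse model: address -> doubleword

    def push(self, value):
        self.esp -= 4                            # decrement ESP first
        self.memory[self.esp] = value & 0xFFFFFFFF  # then write at new top

    def pop(self):
        value = self.memory[self.esp]            # read from the top of stack
        self.esp += 4                            # then increment ESP
        return value


stack = Stack32(initial_esp=0x1000)
stack.push(0xAAAA)
stack.push(0xBBBB)
print(hex(stack.esp))    # the stack has grown down by two doublewords
print(hex(stack.pop()))  # the most recently pushed item comes off first
```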

6.2.1

Setting Up a Stack

To set up a stack and establish it as the current stack, the program or operating system/executive must do the following:

1. Establish a stack segment.
2. Load the segment selector for the stack segment into the SS register using a MOV, POP, or LSS instruction.
3. Load the stack pointer for the stack into the ESP register using a MOV, POP, or LSS instruction. The LSS instruction can be used to load the SS and ESP registers in one operation.

See “Segment Descriptors” in Chapter 3, “Protected-Mode Memory Management,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for information on how to set up a segment descriptor and segment limits for a stack segment.

6.2.2

Stack Alignment

The stack pointer for a stack segment should be aligned on 16-bit (word) or 32-bit (double-word) boundaries, depending on the width of the stack segment. The D flag in the segment descriptor for the current code segment sets the stack-segment width (see “Segment Descriptors” in Chapter 3, “Protected-Mode Memory Management,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A). The PUSH and POP instructions use the D flag to determine how much to decrement or increment the stack pointer on a push or pop operation, respectively. When the stack width is 16 bits, the stack pointer is incremented or decremented in 16-bit increments; when the width is 32 bits, the stack pointer is incremented or decremented in 32-bit increments.

Pushing a 16-bit value onto a 32-bit wide stack can result in a misaligned stack (that is, the stack pointer is not aligned on a doubleword boundary). One exception to this rule is when the contents of a segment register (a 16-bit segment selector) are pushed onto a 32-bit wide stack. Here, the processor automatically aligns the stack pointer to the next 32-bit boundary.

The processor does not check stack pointer alignment. It is the responsibility of the programs, tasks, and system procedures running on the processor to maintain proper alignment of stack pointers. Misaligning a stack pointer can cause serious performance degradation and in some instances program failures.
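Because the processor does not check alignment, software that wants to verify it has to do the arithmetic itself. The sketch below shows that check under simple assumptions: the helper `is_aligned` and the starting ESP value are hypothetical, and the 16-bit push stands for a push performed with a 16-bit operand size on a 32-bit wide stack.

```python
def is_aligned(stack_pointer, stack_width_bits):
    """Return True if the pointer sits on a boundary of the stack width."""
    return stack_pointer % (stack_width_bits // 8) == 0


esp = 0x1000
esp -= 4                                    # 32-bit push keeps alignment
aligned_after_dword = is_aligned(esp, 32)
esp -= 2                                    # 16-bit push breaks it
aligned_after_word = is_aligned(esp, 32)

print(aligned_after_dword, aligned_after_word)
```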

6.2.3

Address-Size Attributes for Stack Accesses

Instructions that use the stack implicitly (such as the PUSH and POP instructions) have two address-size attributes, each of either 16 or 32 bits. This is because they always have the implicit address of the top of the stack, and they may also have an explicit memory address (for example, PUSH Array1[EBX]). The attribute of the explicit address is determined by the D flag of the current code segment and the presence or absence of the 67H address-size prefix.

The address-size attribute of the top of the stack determines whether SP or ESP is used for the stack access. Stack operations with an address-size attribute of 16 use the 16-bit SP stack pointer register and can use a maximum stack address of FFFFH; stack operations with an address-size attribute of 32 bits use the 32-bit ESP register and can use a maximum address of FFFFFFFFH. The default address-size attribute for data segments used as stacks is controlled by the B flag of the segment’s descriptor. When this flag is clear, the default address-size attribute is 16; when the flag is set, the address-size attribute is 32.
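One way to picture the effect of the B flag is as the mask applied to stack pointer arithmetic. The sketch below is a simplification: the function name is hypothetical, and wrapping the pointer at the maximum address is a modeling assumption that the text above does not spell out.

```python
def adjust_stack_pointer(pointer, delta, b_flag):
    """Apply delta to SP (B flag clear) or ESP (B flag set).

    The result is masked to 16 or 32 bits, so the pointer never exceeds
    FFFFH or FFFFFFFFH respectively.
    """
    mask = 0xFFFFFFFF if b_flag else 0xFFFF
    return (pointer + delta) & mask


print(hex(adjust_stack_pointer(0x0000, -2, b_flag=False)))  # 16-bit SP
print(hex(adjust_stack_pointer(0x0000, -4, b_flag=True)))   # 32-bit ESP
```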


6.2.4

Procedure Linking Information

The processor provides two pointers for linking of procedures: the stack-frame base pointer and the return instruction pointer. When used in conjunction with a standard software procedure-call technique, these pointers permit reliable and coherent linking of procedures.

6.2.4.1

Stack-Frame Base Pointer

The stack is typically divided into frames. Each stack frame can then contain local variables, parameters to be passed to another procedure, and procedure linking information. The stack-frame base pointer (contained in the EBP register) identifies a fixed reference point within the stack frame for the called procedure. To use the stack-frame base pointer, the called procedure typically copies the contents of the ESP register into the EBP register prior to pushing any local variables on the stack. The stack-frame base pointer then permits easy access to data structures passed on the stack, to the return instruction pointer, and to local variables added to the stack by the called procedure.

Like the ESP register, the EBP register automatically points to an address in the current stack segment (that is, the segment specified by the current contents of the SS register).

6.2.4.2

Return Instruction Pointer

Prior to branching to the first instruction of the called procedure, the CALL instruction pushes the address in the EIP register onto the current stack. This address is then called the return-instruction pointer and it points to the instruction where execution of the calling procedure should resume following a return from the called procedure. Upon returning from a called procedure, the RET instruction pops the return-instruction pointer from the stack back into the EIP register. Execution of the calling procedure then resumes.

The processor does not keep track of the location of the return-instruction pointer. It is thus up to the programmer to ensure that the stack pointer is pointing to the return-instruction pointer on the stack prior to issuing a RET instruction. A common way to reset the stack pointer so that it points to the return-instruction pointer is to move the contents of the EBP register into the ESP register. If the EBP register is loaded with the stack pointer immediately following a procedure call, it should point to the return-instruction pointer on the stack.

The processor does not require that the return-instruction pointer point back to the calling procedure. Prior to executing the RET instruction, the return-instruction pointer can be manipulated in software to point to any address in the current code segment (near return) or another code segment (far return). Performing such an operation, however, should be undertaken very cautiously, using only well-defined code entry points.
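The interplay of EBP, ESP, and the return-instruction pointer can be traced with a short simulation. This is a sketch under stated assumptions: the register dictionary, the addresses, and the local values are all hypothetical, and the called procedure loads EBP from ESP immediately after the call, as the text describes.

```python
regs = {"esp": 0x2000, "ebp": 0, "eip": 0x401005}  # EIP already past the CALL
memory = {}


def push(value):
    regs["esp"] -= 4
    memory[regs["esp"]] = value


def pop():
    value = memory[regs["esp"]]
    regs["esp"] += 4
    return value


RETURN_ADDRESS = regs["eip"]   # hypothetical address of the next instruction

push(RETURN_ADDRESS)           # CALL pushes the return-instruction pointer
regs["eip"] = 0x402000         # hypothetical entry point of the procedure
regs["ebp"] = regs["esp"]      # callee: copy ESP into EBP before any locals
push(111)                      # local variable
push(222)                      # local variable

regs["esp"] = regs["ebp"]      # reset ESP to the return-instruction pointer
regs["eip"] = pop()            # RET pops it back into EIP

print(hex(regs["eip"]), hex(regs["esp"]))
```

If the procedure skipped the step that copies EBP back into ESP, the final pop would load a local variable into EIP instead of the return address, which is the failure mode the text warns about.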


6.2.5

Stack Behavior in 64-Bit Mode

In 64-bit mode, address calculations that reference SS segments are treated as if the segment base is zero. Fields (base, limit, and attribute) in segment descriptor registers are ignored. SS DPL is modified such that it is always equal to CPL. This will be true even if it is the only field in the SS descriptor that is modified.

Registers E(SP), E(IP) and E(BP) are promoted to 64 bits and are renamed RSP, RIP, and RBP respectively. Some forms of segment load instructions are invalid (for example, LDS, POP ES).

PUSH/POP instructions decrement/increment the stack pointer using a 64-bit width. When the contents of a segment register are pushed onto a 64-bit stack, the pointer is automatically aligned to 64 bits (as with a stack that has a 32-bit width).

6.3

CALLING PROCEDURES USING CALL AND RET

The CALL instruction allows control transfers to procedures within the current code segment (near call) and in a different code segment (far call). Near calls usually provide access to local procedures within the currently running program or task. Far calls are usually used to access operating system procedures or procedures in a different task. See “CALL—Call Procedure” in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for a detailed description of the CALL instruction.

The RET instruction also allows near and far returns to match the near and far versions of the CALL instruction. In addition, the RET instruction allows a program to increment the stack pointer on a return to release parameters from the stack. The number of bytes released from the stack is determined by an optional argument (n) to the RET instruction. See “RET—Return from Procedure” in Chapter 4, “Instruction Set Reference, N-Z,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B, for a detailed description of the RET instruction.

6.3.1

Near CALL and RET Operation

When executing a near call, the processor does the following (see Figure 6-2):

1. Pushes the current value of the EIP register on the stack.
2. Loads the offset of the called procedure in the EIP register.
3. Begins execution of the called procedure.

When executing a near return, the processor performs these actions:

1. Pops the top-of-stack value (the return instruction pointer) into the EIP register.
2. If the RET instruction has an optional n argument, increments the stack pointer by the number of bytes specified with the n operand to release parameters from the stack.
3. Resumes execution of the calling procedure.

6.3.2

Far CALL and RET Operation

When executing a far call, the processor performs these actions (see Figure 6-2):

1. Pushes the current value of the CS register on the stack.
2. Pushes the current value of the EIP register on the stack.
3. Loads the segment selector of the segment that contains the called procedure in the CS register.
4. Loads the offset of the called procedure in the EIP register.
5. Begins execution of the called procedure.

When executing a far return, the processor does the following:

1. Pops the top-of-stack value (the return instruction pointer) into the EIP register.
2. Pops the top-of-stack value (the segment selector for the code segment being returned to) into the CS register.
3. If the RET instruction has an optional n argument, increments the stack pointer by the number of bytes specified with the n operand to release parameters from the stack.
4. Resumes execution of the calling procedure.
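The near and far sequences listed in Sections 6.3.1 and 6.3.2 can be followed step by step in a small model. The class `Cpu`, its method names, and all selector and offset values are hypothetical. The model assumes a 32-bit wide stack (so CS occupies a full doubleword slot), assumes EIP already holds the address of the instruction after the CALL, and performs no privilege or limit checks.

```python
class Cpu:
    """Toy model of CS:EIP and a 32-bit stack for same-privilege calls."""

    def __init__(self, cs, eip, esp):
        self.cs, self.eip, self.esp = cs, eip, esp
        self.memory = {}

    def push(self, value):
        self.esp -= 4
        self.memory[self.esp] = value

    def pop(self):
        value = self.memory[self.esp]
        self.esp += 4
        return value

    def near_call(self, offset):
        self.push(self.eip)        # 1. push the current EIP
        self.eip = offset          # 2. load the procedure offset

    def near_ret(self, n=0):
        self.eip = self.pop()      # 1. pop the return instruction pointer
        self.esp += n              # 2. release n bytes of parameters

    def far_call(self, selector, offset):
        self.push(self.cs)         # 1. push the current CS
        self.push(self.eip)        # 2. push the current EIP
        self.cs = selector         # 3. load the new segment selector
        self.eip = offset          # 4. load the procedure offset

    def far_ret(self, n=0):
        self.eip = self.pop()      # 1. pop the return instruction pointer
        self.cs = self.pop()       # 2. pop the code segment selector
        self.esp += n              # 3. release n bytes of parameters


cpu = Cpu(cs=0x08, eip=0x1005, esp=0x3000)
for parameter in (3, 2, 1):        # three doubleword parameters
    cpu.push(parameter)
cpu.far_call(selector=0x18, offset=0x5000)
cpu.far_ret(n=12)                  # RET 12 releases the three parameters
print(hex(cpu.cs), hex(cpu.eip), hex(cpu.esp))
```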


[Figure 6-2. Stack on Near and Far Calls. Four panels show the stack during a near call, a near return, a far call, and a far return. Each panel shows the stack frame before and after the call with Param 1, Param 2, and Param 3, followed by the calling EIP (near) or the calling CS and calling EIP (far), and marks ESP before and after the call and before and after the return. Note: On a near or far return, parameters are released from the stack based on the optional n operand in the RET n instruction.]

6.3.3

Parameter Passing

Parameters can be passed between procedures in any of three ways: through general-purpose registers, in an argument list, or on the stack.

6.3.3.1

Passing Parameters Through the General-Purpose Registers

The processor does not save the state of the general-purpose registers on procedure calls. A calling procedure can thus pass up to six parameters to the called procedure by copying the parameters into any of these registers (except the ESP and EBP registers) prior to executing the CALL instruction. The called procedure can likewise pass parameters back to the calling procedure through general-purpose registers.

6.3.3.2

Passing Parameters on the Stack

To pass a large number of parameters to the called procedure, the parameters can be placed on the stack, in the stack frame for the calling procedure. Here, it is useful to use the stack-frame base pointer (in the EBP register) to make a frame boundary for easy access to the parameters.

The stack can also be used to pass parameters back from the called procedure to the calling procedure.

6.3.3.3

Passing Parameters in an Argument List

An alternate method of passing a larger number of parameters (or a data structure) to the called procedure is to place the parameters in an argument list in one of the data segments in memory. A pointer to the argument list can then be passed to the called procedure through a general-purpose register or the stack. Parameters can also be passed back to the calling procedure in this same manner.

6.3.4

Saving Procedure State Information

The processor does not save the contents of the general-purpose registers, segment registers, or the EFLAGS register on a procedure call. A calling procedure should explicitly save the values in any of the general-purpose registers that it will need when it resumes execution after a return. These values can be saved on the stack or in memory in one of the data segments.

The PUSHA and POPA instructions facilitate saving and restoring the contents of the general-purpose registers. PUSHA pushes the values in all the general-purpose registers on the stack in the following order: EAX, ECX, EDX, EBX, ESP (the value prior to executing the PUSHA instruction), EBP, ESI, and EDI. The POPA instruction pops all the register values saved with a PUSHA instruction (except the ESP value) from the stack to their respective registers.

If a called procedure changes the state of any of the segment registers explicitly, it should restore them to their former values before executing a return to the calling procedure.

If a calling procedure needs to maintain the state of the EFLAGS register, it can save and restore all or part of the register using the PUSHF/PUSHFD and POPF/POPFD instructions. The PUSHF instruction pushes the lower word of the EFLAGS register on the stack, while the PUSHFD instruction pushes the entire register. The POPF instruction pops a word from the stack into the lower word of the EFLAGS register, while the POPFD instruction pops a doubleword from the stack into the register.
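The PUSHA ordering, including the detail that the pushed ESP is the value from before the instruction and that POPA discards it, can be sketched as follows. The function names, the register dictionary, and the sample values are hypothetical; the stack is modeled as a dictionary keyed by address with 32-bit slots.

```python
PUSHA_ORDER = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def pusha(regs, stack):
    """Push all general-purpose registers in the PUSHA order."""
    original_esp = regs["esp"]          # value prior to executing PUSHA
    for name in PUSHA_ORDER:
        value = original_esp if name == "esp" else regs[name]
        regs["esp"] -= 4
        stack[regs["esp"]] = value


def popa(regs, stack):
    """Pop the saved registers in reverse order, skipping the saved ESP."""
    for name in reversed(PUSHA_ORDER):
        value = stack[regs["esp"]]
        regs["esp"] += 4
        if name != "esp":               # the saved ESP value is discarded
            regs[name] = value


registers = {"eax": 1, "ecx": 2, "edx": 3, "ebx": 4,
             "esp": 0x4000, "ebp": 5, "esi": 6, "edi": 7}
saved = dict(registers)
stack_memory = {}

pusha(registers, stack_memory)
registers.update(eax=99, esi=98)        # the called code clobbers registers
popa(registers, stack_memory)
print(registers == saved)
```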

6.3.5

Calls to Other Privilege Levels

The IA-32 architecture’s protection mechanism recognizes four privilege levels, numbered from 0 to 3, where a greater number means less privilege. The reason to use privilege levels is to improve the reliability of operating systems. For example, Figure 6-3 shows how privilege levels can be interpreted as rings of protection.


[Figure 6-3. Protection Rings. Concentric rings labeled Level 0 through Level 3, with privilege running from highest (0) to lowest (3): the operating system kernel at level 0 in the center, operating system services (device drivers, etc.) in the intermediate rings, and applications at level 3 on the outside.]

In this example, the highest privilege level 0 (at the center of the diagram) is used for segments that contain the most critical code modules in the system, usually the kernel of an operating system. The outer rings (with progressively lower privileges) are used for segments that contain code modules for less critical software.

Code modules in lower privilege segments can only access modules operating at higher privilege segments by means of a tightly controlled and protected interface called a gate. Attempts to access higher privilege segments without going through a protection gate and without having sufficient access rights cause a general-protection exception (#GP) to be generated.

If an operating system or executive uses this multilevel protection mechanism, a call to a procedure that is in a more privileged protection level than the calling procedure is handled in a similar manner as a far call (see Section 6.3.2, “Far CALL and RET Operation”). The differences are as follows:



• The segment selector provided in the CALL instruction references a special data structure called a call gate descriptor. Among other things, the call gate descriptor provides the following:
  - access rights information
  - the segment selector for the code segment of the called procedure
  - an offset into the code segment (that is, the instruction pointer for the called procedure)
• The processor switches to a new stack to execute the called procedure. Each privilege level has its own stack. The segment selector and stack pointer for the privilege level 3 stack are stored in the SS and ESP registers, respectively, and are automatically saved when a call to a more privileged level occurs. The segment selectors and stack pointers for the privilege level 2, 1, and 0 stacks are stored in a system segment called the task state segment (TSS).

The use of a call gate and the TSS during a stack switch are transparent to the calling procedure, except when a general-protection exception is raised.

6.3.6

CALL and RET Operation Between Privilege Levels

When making a call to a more privileged protection level, the processor does the following (see Figure 6-4):

1. Performs an access rights check (privilege check).
2. Temporarily saves (internally) the current contents of the SS, ESP, CS, and EIP registers.

[Figure: two stacks. The stack for the calling procedure holds Param 1, Param 2, Param 3 (stack frame before the call). The stack for the called procedure holds, from the top of the diagram down, Calling SS, Calling ESP, Param 1, Param 2, Param 3, Calling CS, Calling EIP (stack frame after the call). ESP positions are marked before and after the call and before and after the return.]

Note: On a return, parameters are released on both stacks based on the optional n operand in the RET n instruction.

Figure 6-4. Stack Switch on a Call to a Different Privilege Level


3. Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.
4. Pushes the temporarily saved SS and ESP values for the calling procedure’s stack onto the new stack.
5. Copies the parameters from the calling procedure’s stack to the new stack. A value in the call gate descriptor determines how many parameters to copy to the new stack.
6. Pushes the temporarily saved CS and EIP values for the calling procedure to the new stack.
7. Loads the segment selector for the new code segment and the new instruction pointer from the call gate into the CS and EIP registers, respectively.
8. Begins execution of the called procedure at the new privilege level.

When executing a return from the privileged procedure, the processor performs these actions:

1. Performs a privilege check.
2. Restores the CS and EIP registers to their values prior to the call.
3. If the RET instruction has an optional n argument, increments the stack pointer by the number of bytes specified with the n operand to release parameters from the stack. If the call gate descriptor specifies that one or more parameters be copied from one stack to the other, a RET n instruction must be used to release the parameters from both stacks. Here, the n operand specifies the number of bytes occupied on each stack by the parameters. On a return, the processor increments ESP by n for each stack to step over (effectively remove) these parameters from the stacks.
4. Restores the SS and ESP registers to their values prior to the call, which causes a switch back to the stack of the calling procedure.
5. If the RET instruction has an optional n argument, increments the stack pointer by the number of bytes specified with the n operand to release parameters from the stack (see explanation in step 3).
6. Resumes execution of the calling procedure.

See Chapter 4, “Protection,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for detailed information on calls to privileged levels and the call gate descriptor.
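The call and return sequences above can be traced with a small model. The following Python sketch is illustrative only: the class ToyCpu, the dictionary layouts used for the gate and the TSS, and the symbolic selector names ("SS3", "CS0", and so on) are assumptions made for this example, not processor data structures. The privilege checks (step 1 of each sequence) are omitted.

```python
class ToyCpu:
    """Toy register file plus sparse memory keyed by (stack segment, address)."""

    def __init__(self):
        self.ss, self.esp = "SS3", 0x1000   # privilege level 3 stack (assumed values)
        self.cs, self.eip = "CS3", 0x0400   # calling procedure (assumed values)
        self.mem = {}

    def push(self, value):
        self.esp -= 4
        self.mem[(self.ss, self.esp)] = value

    def pop(self):
        value = self.mem[(self.ss, self.esp)]
        self.esp += 4
        return value


def call_through_gate(cpu, gate, tss):
    """Call steps 2 through 7; `gate` and `tss` are hypothetical dictionaries."""
    # Step 2: temporarily save the caller's SS, ESP, CS, and EIP.
    old_ss, old_esp, old_cs, old_eip = cpu.ss, cpu.esp, cpu.cs, cpu.eip
    # Read the parameters from the top of the caller's stack.
    params = [cpu.mem[(old_ss, old_esp + 4 * i)]
              for i in range(gate["param_count"])]
    # Step 3: load SS and ESP for the target privilege level from the TSS.
    cpu.ss, cpu.esp = tss[gate["target_level"]]
    # Step 4: push the caller's SS and ESP onto the new stack.
    cpu.push(old_ss)
    cpu.push(old_esp)
    # Step 5: copy the parameters, keeping their relative layout.
    for value in reversed(params):
        cpu.push(value)
    # Step 6: push the caller's CS and EIP.
    cpu.push(old_cs)
    cpu.push(old_eip)
    # Step 7: load CS and EIP from the call gate.
    cpu.cs, cpu.eip = gate["selector"], gate["offset"]


def far_return(cpu, n=0):
    """Return steps 2 through 5 for a RET n across privilege levels."""
    cpu.eip = cpu.pop()                      # step 2: restore EIP, then CS
    cpu.cs = cpu.pop()
    cpu.esp += n                             # step 3: release parameters (called stack)
    caller_esp = cpu.pop()                   # step 4: restore ESP, then SS
    caller_ss = cpu.pop()
    cpu.ss, cpu.esp = caller_ss, caller_esp
    cpu.esp += n                             # step 5: release parameters (caller's stack)


cpu = ToyCpu()
for param in (3, 2, 1):                      # Param 1 ends up on top of the stack
    cpu.push(param)
gate = {"selector": "CS0", "offset": 0x8000, "param_count": 3, "target_level": 0}
tss = {0: ("SS0", 0x2000)}
call_through_gate(cpu, gate, tss)
far_return(cpu, n=12)                        # three doubleword parameters = 12 bytes
```

In this model the same n is applied once on each stack, which mirrors the statement that a RET n instruction releases the parameters from both stacks.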

6.3.7 Branch Functions in 64-Bit Mode

The 64-bit extensions expand branching mechanisms to accommodate branches in 64-bit linear-address space. These are:

• Near-branch semantics are redefined in 64-bit mode




• In 64-bit mode and compatibility mode, 64-bit call-gate descriptors for far calls are available

In 64-bit mode, the operand size for all near branches (CALL, RET, JCC, JCXZ, JMP, and LOOP) is forced to 64 bits. These instructions update the 64-bit RIP without the need for a REX operand-size prefix. The following aspects of near branches are controlled by the effective operand size:

• Truncation of the size of the instruction pointer
• Size of a stack pop or push, due to a CALL or RET
• Size of a stack-pointer increment or decrement, due to a CALL or RET
• Indirect-branch operand size

In 64-bit mode, all of the above actions are forced to 64 bits regardless of operand size prefixes (operand size prefixes are silently ignored). However, the displacement field for relative branches is still limited to 32 bits and the address size for near branches is not forced in 64-bit mode.

Address sizes affect the size of RCX used for JCXZ and LOOP; they also impact the address calculation for memory indirect branches. Such addresses are 64 bits by default; but they can be overridden to 32 bits by an address size prefix.

Software typically uses far branches to change privilege levels. The legacy IA-32 architecture provides the call-gate mechanism to allow software to branch from one privilege level to another, although call gates can also be used for branches that do not change privilege levels. When call gates are used, the selector portion of the direct or indirect pointer references a gate descriptor (the offset in the instruction is ignored). The offset to the destination’s code segment is taken from the call-gate descriptor.

64-bit mode redefines the type value of a 32-bit call-gate descriptor type to a 64-bit call gate descriptor and expands the size of the 64-bit descriptor to hold a 64-bit offset. The 64-bit mode call-gate descriptor allows far branches that reference any location in the supported linear-address space. These call gates also hold the target code selector (CS), allowing changes to privilege level and default size as a result of the gate transition.

Because immediates are generally specified up to 32 bits, the only way to specify a full 64-bit absolute RIP in 64-bit mode is with an indirect branch. For this reason, direct far branches are eliminated from the instruction set in 64-bit mode.

64-bit mode also expands the semantics of the SYSENTER and SYSEXIT instructions so that the instructions operate within a 64-bit memory space. The mode also introduces two new instructions: SYSCALL and SYSRET (which are valid only in 64-bit mode). For details, see “SYSENTER—Fast System Call” and “SYSEXIT—Fast Return from Fast System Call” in Chapter 4, “Instruction Set Reference, N-Z,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B.
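As a worked illustration of the 32-bit displacement limit on relative near branches described earlier in this section, the following Python sketch computes a branch target by sign-extending the displacement to 64 bits and adding it to the address of the next instruction. The function names are invented for this example, and the sketch assumes a plain 64-bit wraparound without modeling canonical-address checks.

```python
MASK64 = (1 << 64) - 1


def sign_extend32(value):
    """Interpret the low 32 bits of `value` as a signed displacement."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def near_branch_target(next_rip, disp32):
    """Target RIP of a relative near branch: next RIP plus sign-extended disp32."""
    return (next_rip + sign_extend32(disp32)) & MASK64


# A backward branch of 16 bytes, encoded as the displacement 0xFFFFFFF0.
target = near_branch_target(0x00007FFF00001000, 0xFFFFFFF0)
```

Because the displacement is a signed 32-bit value, a relative branch in this model reaches at most about 2 GBytes in either direction from the next instruction; a more distant target needs an indirect branch.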


6.4 INTERRUPTS AND EXCEPTIONS

The processor provides two mechanisms for interrupting program execution, interrupts and exceptions:

• An interrupt is an asynchronous event that is typically triggered by an I/O device.
• An exception is a synchronous event that is generated when the processor detects one or more predefined conditions while executing an instruction. The IA-32 architecture specifies three classes of exceptions: faults, traps, and aborts.

The processor responds to interrupts and exceptions in essentially the same way. When an interrupt or exception is signaled, the processor halts execution of the current program or task and switches to a handler procedure that has been written specifically to handle the interrupt or exception condition. The processor accesses the handler procedure through an entry in the interrupt descriptor table (IDT). When the handler has completed handling the interrupt or exception, program control is returned to the interrupted program or task.

The operating system, executive, and/or device drivers normally handle interrupts and exceptions independently from application programs or tasks. Application programs can, however, access the interrupt and exception handlers incorporated in an operating system or executive through assembly-language calls. The remainder of this section gives a brief overview of the processor’s interrupt and exception handling mechanism. See Chapter 5, “Interrupt and Exception Handling,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, for a description of this mechanism.

The IA-32 Architecture defines 18 predefined interrupts and exceptions and 224 user-defined interrupts, which are associated with entries in the IDT. Each interrupt and exception in the IDT is identified with a number, called a vector. Table 6-1 lists the interrupts and exceptions with entries in the IDT and their respective vector numbers. Vectors 0 through 8, 10 through 14, and 16 through 19 are the predefined interrupts and exceptions, and vectors 32 through 255 are the user-defined interrupts, called maskable interrupts.

Note that the processor defines several additional interrupts that do not point to entries in the IDT; the most notable of these interrupts is the SMI interrupt. See Chapter 5, “Interrupt and Exception Handling,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, for more information about the interrupts and exceptions.

When the processor detects an interrupt or exception, it does one of the following things:

• Executes an implicit call to a handler procedure.
• Executes an implicit call to a handler task.
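The vector ranges given above (18 predefined vectors, 224 user-defined vectors, and the remaining reserved vectors) can be checked with a short Python sketch. The names PREDEFINED and classify_vector are assumptions made for this illustration.

```python
# Vectors 0 through 8, 10 through 14, and 16 through 19 are predefined.
PREDEFINED = set(range(0, 9)) | set(range(10, 15)) | set(range(16, 20))


def classify_vector(vector):
    """Classify an IDT vector number according to the ranges in the text."""
    if not 0 <= vector <= 255:
        raise ValueError("vector numbers range from 0 to 255")
    if vector in PREDEFINED:
        return "predefined"
    if vector >= 32:
        return "user-defined"
    return "reserved"


counts = {}
for v in range(256):
    kind = classify_vector(v)
    counts[kind] = counts.get(kind, 0) + 1
```

Counting the three classes over all 256 vectors is a quick way to confirm that the ranges in the text account for every vector exactly once.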


6.4.1 Call and Return Operation for Interrupt or Exception Handling Procedures

A call to an interrupt or exception handler procedure is similar to a procedure call to another protection level (see Section 6.3.6, “CALL and RET Operation Between Privilege Levels”). Here, the interrupt vector references one of two kinds of gates: an interrupt gate or a trap gate. Interrupt and trap gates are similar to call gates in that they provide the following information:

• Access rights information
• The segment selector for the code segment that contains the handler procedure
• An offset into the code segment to the first instruction of the handler procedure

The difference between an interrupt gate and a trap gate is as follows. If an interrupt or exception handler is called through an interrupt gate, the processor clears the interrupt enable (IF) flag in the EFLAGS register to prevent subsequent interrupts from interfering with the execution of the handler. When a handler is called through a trap gate, the state of the IF flag is not changed.

Table 6-1. Exceptions and Interrupts

Vector No. | Mnemonic | Description | Source
0 | #DE | Divide Error | DIV and IDIV instructions.
1 | #DB | Debug | Any code or data reference.
2 | | NMI Interrupt | Non-maskable external interrupt.
3 | #BP | Breakpoint | INT 3 instruction.
4 | #OF | Overflow | INTO instruction.
5 | #BR | BOUND Range Exceeded | BOUND instruction.
6 | #UD | Invalid Opcode (UnDefined Opcode) | UD2 instruction or reserved opcode. (Note 1)
7 | #NM | Device Not Available (No Math Coprocessor) | Floating-point or WAIT/FWAIT instruction.
8 | #DF | Double Fault | Any instruction that can generate an exception, an NMI, or an INTR.
9 | #MF | CoProcessor Segment Overrun (reserved) | Floating-point instruction. (Note 2)
10 | #TS | Invalid TSS | Task switch or TSS access.
11 | #NP | Segment Not Present | Loading segment registers or accessing system segments.
12 | #SS | Stack Segment Fault | Stack operations and SS register loads.
13 | #GP | General Protection | Any memory reference and other protection checks.
14 | #PF | Page Fault | Any memory reference.
15 | | Reserved |
16 | #MF | Floating-Point Error (Math Fault) | Floating-point or WAIT/FWAIT instruction.
17 | #AC | Alignment Check | Any data reference in memory. (Note 3)
18 | #MC | Machine Check | Error codes (if any) and source are model dependent. (Note 4)
19 | #XF | SIMD Floating-Point Exception | SIMD Floating-Point Instruction (Note 5)
20-31 | | Reserved |
32-255 | | Maskable Interrupts | External interrupt from INTR pin or INT n instruction.

NOTES:
1. The UD2 instruction was introduced in the Pentium Pro processor.
2. IA-32 processors after the Intel386 processor do not generate this exception.
3. This exception was introduced in the Intel486 processor.
4. This exception was introduced in the Pentium processor and enhanced in the P6 family processors.
5. This exception was introduced in the Pentium III processor.

If the code segment for the handler procedure has the same privilege level as the currently executing program or task, the handler procedure uses the current stack; if the handler executes at a more privileged level, the processor switches to the stack for the handler’s privilege level.

If no stack switch occurs, the processor does the following when calling an interrupt or exception handler (see Figure 6-5):

1. Pushes the current contents of the EFLAGS, CS, and EIP registers (in that order) on the stack.
2. Pushes an error code (if appropriate) on the stack.
3. Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate or trap gate) into the CS and EIP registers, respectively.
4. If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.
5. Begins execution of the handler procedure.


[Figure: two panels. “Stack Usage with No Privilege-Level Change” shows the interrupted procedure’s and handler’s shared stack holding EFLAGS, CS, EIP, and Error Code, with ESP marked before and after the transfer to the handler. “Stack Usage with Privilege-Level Change” shows the interrupted procedure’s stack (ESP before transfer) and the handler’s stack holding SS, ESP, EFLAGS, CS, EIP, and Error Code (ESP after transfer).]

Figure 6-5. Stack Usage on Transfers to Interrupt and Exception Handling Routines

If a stack switch does occur, the processor does the following:

1. Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and EIP registers.
2. Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.
3. Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure’s stack onto the new stack.
4. Pushes an error code on the new stack (if appropriate).
5. Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate or trap gate) into the CS and EIP registers, respectively.
6. If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.
7. Begins execution of the handler procedure at the new privilege level.


A return from an interrupt or exception handler is initiated with the IRET instruction. The IRET instruction is similar to the far RET instruction, except that it also restores the contents of the EFLAGS register for the interrupted procedure.

When executing a return from an interrupt or exception handler from the same privilege level as the interrupted procedure, the processor performs these actions:

1. Restores the CS and EIP registers to their values prior to the interrupt or exception.
2. Restores the EFLAGS register.
3. Increments the stack pointer appropriately.
4. Resumes execution of the interrupted procedure.

When executing a return from an interrupt or exception handler from a different privilege level than the interrupted procedure, the processor performs these actions:

1. Performs a privilege check.
2. Restores the CS and EIP registers to their values prior to the interrupt or exception.
3. Restores the EFLAGS register.
4. Restores the SS and ESP registers to their values prior to the interrupt or exception, resulting in a stack switch back to the stack of the interrupted procedure.
5. Resumes execution of the interrupted procedure.
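The transfer and return sequences in this section can be sketched as a toy model. In the Python below, the class ToyCpu, the LEVEL_OF table that maps symbolic selectors to privilege levels, and the dictionary layouts of the gate and the TSS are all assumptions made for illustration; privilege checks are omitted, and the handler is assumed to remove any error code itself before the return.

```python
IF_FLAG = 1 << 9                      # interrupt enable flag bit in EFLAGS
LEVEL_OF = {"CS0": 0, "CS3": 3}       # hypothetical selector-to-level table


class ToyCpu:
    def __init__(self, cs="CS3", ss="SS3", esp=0x1000):
        self.cs, self.eip = cs, 0x0400
        self.ss, self.esp = ss, esp
        self.eflags = IF_FLAG
        self.mem = {}                 # keyed by (stack segment, address)

    def push(self, value):
        self.esp -= 4
        self.mem[(self.ss, self.esp)] = value

    def pop(self):
        value = self.mem[(self.ss, self.esp)]
        self.esp += 4
        return value


def deliver(cpu, gate, tss, error_code=None):
    """Transfer to a handler through an interrupt gate or a trap gate."""
    saved = (cpu.ss, cpu.esp, cpu.eflags, cpu.cs, cpu.eip)
    target_level = LEVEL_OF[gate["selector"]]
    if target_level < LEVEL_OF[cpu.cs]:
        # Handler is more privileged: switch stacks, then push SS and ESP.
        cpu.ss, cpu.esp = tss[target_level]
        cpu.push(saved[0])
        cpu.push(saved[1])
    for value in saved[2:]:           # EFLAGS, CS, EIP, in that order
        cpu.push(value)
    if error_code is not None:
        cpu.push(error_code)
    cpu.cs, cpu.eip = gate["selector"], gate["offset"]
    if gate["type"] == "interrupt":   # a trap gate leaves IF unchanged
        cpu.eflags &= ~IF_FLAG


def iret(cpu):
    """Return from a handler; restores SS and ESP only across privilege levels."""
    handler_level = LEVEL_OF[cpu.cs]
    cpu.eip = cpu.pop()
    cpu.cs = cpu.pop()
    cpu.eflags = cpu.pop()
    if LEVEL_OF[cpu.cs] > handler_level:
        old_esp = cpu.pop()
        old_ss = cpu.pop()
        cpu.ss, cpu.esp = old_ss, old_esp


cpu = ToyCpu()
gate = {"selector": "CS0", "offset": 0x9000, "type": "interrupt"}
deliver(cpu, gate, {0: ("SS0", 0x2000)}, error_code=0x10)
cpu.pop()                             # handler discards the error code
iret(cpu)
```

The model makes the two cases easy to compare: with no privilege change only EFLAGS, CS, and EIP (plus an optional error code) are pushed on the current stack, while a privilege change adds SS and ESP on the handler's stack.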

6.4.2 Calls to Interrupt or Exception Handler Tasks

Interrupt and exception handler routines can also be executed in a separate task. Here, an interrupt or exception causes a task switch to a handler task. The handler task is given its own address space and (optionally) can execute at a higher protection level than application programs or tasks.

The switch to the handler task is accomplished with an implicit task call that references a task gate descriptor. The task gate provides access to the address space for the handler task. As part of the task switch, the processor saves complete state information for the interrupted program or task. Upon returning from the handler task, the state of the interrupted program or task is restored and execution continues.

See Chapter 5, “Interrupt and Exception Handling,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, for more information on handling interrupts and exceptions through handler tasks.

6.4.3 Interrupt and Exception Handling in Real-Address Mode

When operating in real-address mode, the processor responds to an interrupt or exception with an implicit far call to an interrupt or exception handler. The processor uses the interrupt or exception vector number as an index into an interrupt table. The interrupt table contains instruction pointers to the interrupt and exception handler procedures.

The processor saves the state of the EFLAGS register, the EIP register, the CS register, and an optional error code on the stack before switching to the handler procedure. A return from the interrupt or exception handler is carried out with the IRET instruction.

See Chapter 15, “8086 Emulation,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information on handling interrupts and exceptions in real-address mode.
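The table lookup described above can be sketched as follows. This Python example assumes the conventional real-address-mode layout, which is not spelled out in this section: each table entry is 4 bytes, a 16-bit offset followed by a 16-bit segment, and the table starts at the beginning of the memory image passed in. The function names are invented for the illustration.

```python
import struct


def handler_pointer(memory, vector):
    """Return (segment, offset) of the handler for `vector`.

    Assumes 4-byte entries: a little-endian 16-bit offset, then a 16-bit segment.
    """
    offset, segment = struct.unpack_from("<HH", memory, vector * 4)
    return segment, offset


def real_mode_address(segment, offset):
    """Real-address-mode address formation: segment times 16 plus offset."""
    return segment * 16 + offset


# Install a hypothetical handler for vector 4 at F000:1234 in an empty table.
table = bytearray(256 * 4)
struct.pack_into("<HH", table, 4 * 4, 0x1234, 0xF000)
segment, offset = handler_pointer(table, 4)
```

The vector number is used purely as an index, so the entry for vector v in this model sits at byte offset 4 times v in the table.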

6.4.4 INT n, INTO, INT 3, and BOUND Instructions

The INT n, INTO, INT 3, and BOUND instructions allow a program or task to explicitly call an interrupt or exception handler. The INT n instruction uses an interrupt vector as an argument, which allows a program to call any interrupt handler.

The INTO instruction explicitly calls the overflow exception (#OF) handler if the overflow flag (OF) in the EFLAGS register is set. The OF flag indicates overflow on arithmetic instructions, but it does not automatically raise an overflow exception. An overflow exception can only be raised explicitly in either of the following ways:

• Execute the INTO instruction.
• Test the OF flag and execute the INT n instruction with an argument of 4 (the vector number of the overflow exception) if the flag is set.

Both the methods of dealing with overflow conditions allow a program to test for overflow at specific places in the instruction stream.

The INT 3 instruction explicitly calls the breakpoint exception (#BP) handler.

The BOUND instruction explicitly calls the BOUND-range exceeded exception (#BR) handler if an operand is found to be not within predefined boundaries in memory. This instruction is provided for checking references to arrays and other data structures. Like the overflow exception, the BOUND-range exceeded exception can only be raised explicitly with the BOUND instruction or the INT n instruction with an argument of 5 (the vector number of the bounds-check exception). The processor does not implicitly perform bounds checks and raise the BOUND-range exceeded exception.
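The explicit nature of the overflow and bounds checks can be modeled in a few lines. In the Python sketch below, the InterruptRaised exception class and the function names add32, into, and bound are hypothetical stand-ins for the processor behavior; add32 only sets a flag, and nothing is raised until the program asks for the check.

```python
class InterruptRaised(Exception):
    """Stands in for an implicit call to the handler for `vector`."""

    def __init__(self, vector):
        super().__init__(f"vector {vector}")
        self.vector = vector


def add32(a, b):
    """Signed 32-bit addition; returns (result, OF) without raising anything."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a + b) & 0xFFFFFFFF
    # Overflow: operands share a sign and the result's sign differs.
    overflow = bool(~(a ^ b) & (a ^ result) & 0x80000000)
    return result, overflow


def into(overflow_flag):
    """INTO: call the #OF handler (vector 4) only if OF is set."""
    if overflow_flag:
        raise InterruptRaised(4)


def bound(index, lower, upper):
    """BOUND: call the #BR handler (vector 5) if index is outside [lower, upper]."""
    if index < lower or index > upper:
        raise InterruptRaised(5)


result, of = add32(0x7FFFFFFF, 1)     # overflows, but only sets the flag
try:
    into(of)                          # the program chooses where to test
except InterruptRaised as exc:
    raised_vector = exc.vector
```

Placing the into call at a chosen point corresponds to testing for overflow at specific places in the instruction stream.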

6.4.5 Handling Floating-Point Exceptions

When operating on individual or packed floating-point values, the IA-32 architecture supports a set of six floating-point exceptions. These exceptions can be generated during operations performed by the x87 FPU instructions or by SSE/SSE2/SSE3 instructions. When an x87 FPU instruction (including the FISTTP instruction in SSE3) generates one or more of these exceptions, it in turn generates floating-point error exception (#MF); when an SSE/SSE2/SSE3 instruction generates a floating-point exception, it in turn generates SIMD floating-point exception (#XF).

See the following sections for further descriptions of the floating-point exceptions, how they are generated, and how they are handled:

• Section 4.9.1, “Floating-Point Exception Conditions,” and Section 4.9.3, “Typical Actions of a Floating-Point Exception Handler”
• Section 8.4, “x87 FPU Floating-Point Exception Handling,” and Section 8.5, “x87 FPU Floating-Point Exception Conditions”
• Section 11.5.1, “SIMD Floating-Point Exceptions”
• Interrupt Behavior

6.4.6 Interrupt and Exception Behavior in 64-Bit Mode

64-bit extensions expand the legacy IA-32 interrupt-processing and exception-processing mechanism to allow support for 64-bit operating systems and applications. Changes include:

• All interrupt handlers pointed to by the IDT are 64-bit code (does not apply to the SMI handler).
• The size of interrupt-stack pushes is fixed at 64 bits. The processor uses 8-byte, zero extended stores.
• The stack pointer (SS:RSP) is pushed unconditionally on interrupts. In legacy environments, this push is conditional and based on a change in current privilege level (CPL).
• The new SS is set to NULL if there is a change in CPL.
• IRET behavior changes.
• There is a new interrupt stack-switch mechanism.
• The alignment of interrupt stack frame is different.
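A rough picture of the resulting stack frame is sketched below in Python. Two details go beyond what the list above states and are assumptions of this example: the frame is taken to hold SS, RSP, RFLAGS, CS, and RIP in that push order (the 64-bit counterparts of the legacy frame), and the stack pointer is assumed to be aligned down to a 16-byte boundary before the first push. The function name and the returned dictionary are illustrative only, and the new stack-switch mechanism is not modeled.

```python
MASK64 = (1 << 64) - 1


def build_interrupt_frame64(rsp, ss, rflags, cs, rip, error_code=None):
    """Return (new_rsp, frame) where frame maps addresses to 8-byte values."""
    old_rsp = rsp
    rsp &= ~0xF                       # assumed: align to 16 bytes before pushing
    frame = {}
    # SS:RSP is pushed unconditionally; selectors are stored zero-extended.
    for value in (ss & 0xFFFF, old_rsp, rflags, cs & 0xFFFF, rip):
        rsp -= 8                      # every push is 64 bits wide
        frame[rsp] = value & MASK64
    if error_code is not None:
        rsp -= 8
        frame[rsp] = error_code & MASK64
    return rsp, frame


new_rsp, frame = build_interrupt_frame64(
    0x7FFFFFF8, ss=0x2B, rflags=0x202, cs=0x33, rip=0x401000)
```

In this model the frame always has the same five entries whether or not the privilege level changes, which is the practical effect of the unconditional SS:RSP push.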

6.5 PROCEDURE CALLS FOR BLOCK-STRUCTURED LANGUAGES

The IA-32 architecture supports an alternate method of performing procedure calls with the ENTER (enter procedure) and LEAVE (leave procedure) instructions. These instructions automatically create and release, respectively, stack frames for called procedures. The stack frames have predefined spaces for local variables and the necessary pointers to allow coherent returns from called procedures. They also allow scope rules to be implemented so that procedures can access their own local variables and some number of other variables located in other stack frames.


ENTER and LEAVE offer two benefits:

• They provide machine-language support for implementing block-structured languages, such as C and Pascal.
• They simplify procedure entry and exit in compiler-generated code.

6.5.1 ENTER Instruction

The ENTER instruction creates a stack frame compatible with the scope rules typically used in block-structured languages. In block-structured languages, the scope of a procedure is the set of variables to which it has access. The rules for scope vary among languages. They may be based on the nesting of procedures, the division of the program into separately compiled files, or some other modularization scheme.

ENTER has two operands. The first specifies the number of bytes to be reserved on the stack for dynamic storage for the procedure being called. Dynamic storage is the memory allocated for variables created when the procedure is called, also known as automatic variables. The second parameter is the lexical nesting level (from 0 to 31) of the procedure. The nesting level is the depth of a procedure in a hierarchy of procedure calls. The lexical level is unrelated to either the protection privilege level or to the I/O privilege level of the currently running program or task.

ENTER, in the following example, allocates 2 Kbytes of dynamic storage on the stack and sets up pointers to two previous stack frames in the stack frame for this procedure:

    ENTER 2048,3

The lexical nesting level determines the number of stack frame pointers to copy into the new stack frame from the preceding frame. A stack frame pointer is a doubleword used to access the variables of a procedure. The set of stack frame pointers used by a procedure to access the variables of other procedures is called the display. The first doubleword in the display is a pointer to the previous stack frame. This pointer is used by a LEAVE instruction to undo the effect of an ENTER instruction by discarding the current stack frame.

After the ENTER instruction creates the display for a procedure, it allocates the dynamic local variables for the procedure by decrementing the contents of the ESP register by the number of bytes specified in the first parameter. This new value in the ESP register serves as the initial top-of-stack for all PUSH and POP operations within the procedure.

To allow a procedure to address its display, the ENTER instruction leaves the EBP register pointing to the first doubleword in the display. Because stacks grow down, this is actually the doubleword with the highest address in the display. Data manipulation instructions that specify the EBP register as a base register automatically address locations within the stack segment instead of the data segment.

The ENTER instruction can be used in two ways: nested and non-nested. If the lexical level is 0, the non-nested form is used. The non-nested form pushes the contents of the EBP register on the stack, copies the contents of the ESP register into the EBP register, and subtracts the first operand from the contents of the ESP register to allocate dynamic storage. The non-nested form differs from the nested form in that no stack frame pointers are copied.

The nested form of the ENTER instruction occurs when the second parameter (lexical level) is not zero. The following pseudo code shows the formal definition of the ENTER instruction. STORAGE is the number of bytes of dynamic storage to allocate for local variables, and LEVEL is the lexical nesting level.

    PUSH EBP;
    FRAME_PTR ← ESP;
    IF LEVEL > 0
        THEN
            DO (LEVEL − 1) times
                EBP ← EBP − 4;
                PUSH Pointer(EBP); (* doubleword pointed to by EBP *)
            OD;
            PUSH FRAME_PTR;
    FI;
    EBP ← FRAME_PTR;
    ESP ← ESP − STORAGE;

The main procedure (in which all other procedures are nested) operates at the highest lexical level, level 1. The first procedure it calls operates at the next deeper lexical level, level 2. A level 2 procedure can access the variables of the main program, which are at fixed locations specified by the compiler. In the case of level 1, the ENTER instruction allocates only the requested dynamic storage on the stack because there is no previous display to copy.

A procedure that calls another procedure at a lower lexical level gives the called procedure access to the variables of the caller. The ENTER instruction provides this access by placing a pointer to the calling procedure's stack frame in the display.

A procedure that calls another procedure at the same lexical level should not give access to its variables. In this case, the ENTER instruction copies only that part of the display from the calling procedure which refers to previously nested procedures operating at higher lexical levels. The new stack frame does not include the pointer for addressing the calling procedure’s stack frame.

The ENTER instruction treats a re-entrant procedure as a call to a procedure at the same lexical level. In this case, each succeeding iteration of the re-entrant procedure can address only its own variables and the variables of the procedures within which it is nested. A re-entrant procedure always can address its own variables; it does not require pointers to the stack frames of previous iterations.

By copying only the stack frame pointers of procedures at higher lexical levels, the ENTER instruction makes certain that procedures access only those variables of higher lexical levels, not those at parallel lexical levels (see Figure 6-6).
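The formal definition of ENTER given above can be transcribed almost line for line into a small simulation. In the Python sketch below, the ToyStack class, the starting addresses, the storage sizes, and the placeholder return-address values are assumptions for illustration; the enter function follows the pseudo code, and display is a hypothetical helper that reads back the frame pointers ENTER stored.

```python
class ToyStack:
    """Toy stack with doubleword slots; addresses and sizes are arbitrary."""

    def __init__(self, esp=0x1000, ebp=0):
        self.esp, self.ebp, self.mem = esp, ebp, {}

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value


def enter(stack, storage, level):
    """Transcription of the ENTER pseudo code."""
    stack.push(stack.ebp)                    # PUSH EBP
    frame_ptr = stack.esp                    # FRAME_PTR <- ESP
    if level > 0:
        for _ in range(level - 1):
            stack.ebp -= 4                   # EBP <- EBP - 4
            stack.push(stack.mem[stack.ebp]) # PUSH Pointer(EBP)
        stack.push(frame_ptr)                # PUSH FRAME_PTR
    stack.ebp = frame_ptr                    # EBP <- FRAME_PTR
    stack.esp -= storage                     # ESP <- ESP - STORAGE


def display(stack, level):
    """Frame pointers stored below the saved EBP, outermost procedure first."""
    return [stack.mem[stack.ebp - 4 * i] for i in range(1, level + 1)]


s = ToyStack()
enter(s, 12, 1)          # MAIN at lexical level 1
main_ebp = s.ebp
s.push(0x111)            # placeholder return address pushed by a CALL
enter(s, 8, 2)           # procedure A at level 2
a_ebp = s.ebp
s.push(0x222)
enter(s, 8, 3)           # procedure B at level 3
b_ebp = s.ebp
display_b = display(s, 3)
s.push(0x333)
enter(s, 8, 3)           # procedure C at level 3, called by B
c_ebp = s.ebp
display_c = display(s, 3)
```

In this run, procedure C's display holds frame pointers for MAIN, procedure A, and procedure C itself, but not for procedure B, while the saved EBP at the base of C's frame still points back to B's frame for the return.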


[Figure: nested boxes labeled Main (Lexical Level 1), Procedure A (Lexical Level 2), Procedure B (Lexical Level 3), Procedure C (Lexical Level 3), and Procedure D (Lexical Level 4).]

Figure 6-6. Nested Procedures

Block-structured languages can use the lexical levels defined by ENTER to control access to the variables of nested procedures. In Figure 6-6, for example, if procedure A calls procedure B which, in turn, calls procedure C, then procedure C will have access to the variables of the MAIN procedure and procedure A, but not those of procedure B because they are at the same lexical level. The following definition describes the access to variables for the nested procedures in Figure 6-6.

1. MAIN has variables at fixed locations.
2. Procedure A can access only the variables of MAIN.
3. Procedure B can access only the variables of procedure A and MAIN. Procedure B cannot access the variables of procedure C or procedure D.
4. Procedure C can access only the variables of procedure A and MAIN. Procedure C cannot access the variables of procedure B or procedure D.
5. Procedure D can access the variables of procedure C, procedure A, and MAIN. Procedure D cannot access the variables of procedure B.

In Figure 6-7, an ENTER instruction at the beginning of the MAIN procedure creates three doublewords of dynamic storage for MAIN, but copies no pointers from other stack frames. The first doubleword in the display holds a copy of the last value in the EBP register before the ENTER instruction was executed. The second doubleword holds a copy of the contents of the EBP register following the ENTER instruction. After the instruction is executed, the EBP register points to the first doubleword pushed on the stack, and the ESP register points to the last doubleword in the stack frame.

When MAIN calls procedure A, the ENTER instruction creates a new display (see Figure 6-8). The first doubleword is the last value held in MAIN's EBP register. The second doubleword is a pointer to MAIN's stack frame which is copied from the second doubleword in MAIN's display. This happens to be another copy of the last value held in MAIN’s EBP register. Procedure A can access variables in MAIN because MAIN is at level 1. Therefore the base address for the dynamic storage used in MAIN is the current address in the EBP register, plus four bytes to account for the saved contents of MAIN’s EBP register. All dynamic variables for MAIN are at fixed, positive offsets from this value.

[Figure: MAIN's stack frame, from the top of the diagram down: Old EBP (EBP points here), the display entry Main’s EBP, and dynamic storage (ESP points to its last doubleword).]

Figure 6-7. Stack Frame After Entering the MAIN Procedure

[Figure: MAIN's frame (Old EBP, Main’s EBP) followed by procedure A's frame: Main’s EBP (EBP points here), the display entries Main’s EBP and Procedure A’s EBP, and dynamic storage (ESP).]

Figure 6-8. Stack Frame After Entering Procedure A

When procedure A calls procedure B, the ENTER instruction creates a new display (see Figure 6-9). The first doubleword holds a copy of the last value in procedure A’s EBP register. The second and third doublewords are copies of the two stack frame pointers in procedure A’s display. Procedure B can access variables in procedure A and MAIN by using the stack frame pointers in its display.


When procedure B calls procedure C, the ENTER instruction creates a new display for procedure C (see Figure 6-10). The first doubleword holds a copy of the last value in procedure B’s EBP register. This is used by the LEAVE instruction to restore procedure B’s stack frame. The second and third doublewords are copies of the two stack frame pointers in procedure A’s display. If procedure C were at the next deeper lexical level from procedure B, a fourth doubleword would be copied, which would be the stack frame pointer to procedure B’s local variables.

Note that procedure B and procedure C are at the same level, so procedure C is not intended to access procedure B’s variables. This does not mean that procedure C is completely isolated from procedure B; procedure C is called by procedure B, so the pointer to the returning stack frame is a pointer to procedure B’s stack frame. In addition, procedure B can pass parameters to procedure C either on the stack or through variables global to both procedures (that is, variables in the scope of both procedures).

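Figure 6-9. Stack Frame After Entering Procedure B
(Stack contents: MAIN’s frame and procedure A’s frame as in Figure 6-8, followed by procedure B’s frame: procedure A’s EBP, to which EBP points; the display, holding MAIN’s EBP, procedure A’s EBP, and procedure B’s EBP; the dynamic storage, with ESP at its end.)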


Figure 6-10. Stack Frame After Entering Procedure C
(Stack contents: the frames of MAIN, procedure A, and procedure B as in Figure 6-9, followed by procedure C’s frame: procedure B’s EBP, to which EBP points; the display, holding MAIN’s EBP, procedure A’s EBP, and procedure C’s EBP; the dynamic storage, with ESP at its end.)
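The display construction traced in Figures 6-7 through 6-10 can be sketched as a small simulation. The model below is illustrative only: the dictionary-backed memory, the starting ESP value, and the names `enter` and `display` are assumptions made for this sketch, and the return addresses that CALL would push between frames are omitted for brevity.

```python
def enter(state, size, level):
    """Model a 32-bit ENTER size, level on a dict-backed stack (sketch)."""
    mem = state["mem"]
    esp, ebp = state["esp"], state["ebp"]

    esp -= 4
    mem[esp] = ebp              # first doubleword: the caller's EBP
    frame = esp                 # this address becomes the new EBP

    # Copy (level - 1) stack frame pointers from the caller's display.
    for _ in range(1, level):
        ebp -= 4
        esp -= 4
        mem[esp] = mem[ebp]

    if level > 0:
        esp -= 4
        mem[esp] = frame        # last display entry: this procedure's frame

    state["ebp"] = frame
    state["esp"] = esp - size   # reserve the dynamic storage
    return frame


def display(state, level):
    """Return the display entries of the current frame, outermost first."""
    ebp, mem = state["ebp"], state["mem"]
    return [mem[ebp - 4 * i] for i in range(1, level + 1)]


state = {"mem": {}, "esp": 0x1000, "ebp": 0}
main_ebp = enter(state, 12, 1)   # MAIN, lexical level 1, three doublewords
a_ebp = enter(state, 0, 2)       # procedure A, level 2
b_ebp = enter(state, 0, 3)       # procedure B, level 3
c_ebp = enter(state, 0, 3)       # procedure C, level 3, called from B
c_display = display(state, 3)
```

In this sketch `c_display` holds the frame pointers of MAIN, procedure A, and procedure C, while the doubleword at `c_ebp` holds procedure B’s EBP, which matches the layout described for Figure 6-10.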


6.5.2 LEAVE Instruction

The LEAVE instruction, which does not have any operands, reverses the action of the previous ENTER instruction. The LEAVE instruction copies the contents of the EBP register into the ESP register to release all stack space allocated to the procedure. Then it restores the old value of the EBP register from the stack. This simultaneously restores the ESP register to its original value. A subsequent RET instruction then can remove any arguments and the return address pushed on the stack by the calling program for use by the procedure.
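The two steps LEAVE performs can be checked with a short round trip. This is a minimal sketch under stated assumptions: the `state` dictionary, the helper names `push`, `enter0`, and `leave`, and the starting register values are hypothetical, and `enter0` models only the level-0 form of ENTER.

```python
def push(state, value):
    """Decrement ESP by four bytes, then store the value at the new top."""
    state["esp"] -= 4
    state["mem"][state["esp"]] = value


def enter0(state, size):
    """ENTER size, 0: save EBP, point EBP at the frame, reserve storage."""
    push(state, state["ebp"])
    state["ebp"] = state["esp"]
    state["esp"] -= size


def leave(state):
    """LEAVE: copy EBP into ESP, then pop the saved EBP."""
    state["esp"] = state["ebp"]                # release the frame
    state["ebp"] = state["mem"][state["esp"]]  # restore the caller's EBP
    state["esp"] += 4


state = {"mem": {}, "esp": 0x2000, "ebp": 0x2040}
before = (state["esp"], state["ebp"])
enter0(state, 16)
leave(state)
after = (state["esp"], state["ebp"])
```

After the round trip, `after` equals `before`: both ESP and EBP are back at the values they held on entry.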


CHAPTER 7 PROGRAMMING WITH GENERAL-PURPOSE INSTRUCTIONS

General-purpose (GP) instructions are a subset of the IA-32 instructions that represent the fundamental instruction set for the Intel IA-32 processors. These instructions were introduced into the IA-32 architecture with the first IA-32 processors (the Intel 8086 and 8088). Additional instructions were added to the general-purpose instruction set in subsequent families of IA-32 processors (the Intel 286, Intel386, Intel486, Pentium, Pentium Pro, and Pentium II processors).

Intel 64 architecture further extends the capability of most general-purpose instructions so that they are able to handle 64-bit data in 64-bit mode. A small number of general-purpose instructions (still supported in non-64-bit modes) are not supported in 64-bit mode.

General-purpose instructions perform basic data movement, memory addressing, arithmetic and logical, program flow control, input/output, and string operations on a set of integer, pointer, and BCD data types. This chapter provides an overview of the general-purpose instructions. See Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, for detailed descriptions of individual instructions.

7.1 PROGRAMMING ENVIRONMENT FOR GP INSTRUCTIONS

The programming environment for the general-purpose instructions consists of the set of registers and address space. The environment includes the following items:



• General-purpose registers — Eight 32-bit general-purpose registers (see Section 3.4.1, “General-Purpose Registers”) are used in non-64-bit modes to address operands in memory. These registers are referenced by the names EAX, EBX, ECX, EDX, EBP, ESI, EDI, and ESP.

• Segment registers — The six 16-bit segment registers contain segment pointers for use in accessing memory (see Section 3.4.2, “Segment Registers”). These registers are referenced by the names CS, DS, SS, ES, FS, and GS.

• EFLAGS register — This 32-bit register (see Section 3.4.3, “EFLAGS Register”) is used to provide status and control for basic arithmetic, compare, and system operations.

• EIP register — This 32-bit register contains the current instruction pointer (see Section 3.5, “Instruction Pointer”).

General-purpose instructions operate on the following data types. The width of valid data types is dependent on processor mode (see Chapter 4):

• Bytes, words, doublewords
• Signed and unsigned byte, word, doubleword integers
• Near and far pointers
• Bit fields
• BCD integers

7.2 PROGRAMMING ENVIRONMENT FOR GP INSTRUCTIONS IN 64-BIT MODE

The programming environment for the general-purpose instructions in 64-bit mode is similar to that described in Section 7.1.



• General-purpose registers — In 64-bit mode, sixteen general-purpose registers are available. These include the eight GPRs described in Section 7.1 and eight new GPRs (R8D-R15D). R8D-R15D are available by using a REX prefix. All sixteen GPRs can be promoted to 64 bits. The 64-bit registers are referenced as RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP and R8-R15 (see Section 3.4.1.1, “General-Purpose Registers in 64-Bit Mode”). Promotion to a 64-bit operand requires REX prefix encodings.

• Segment registers — In 64-bit mode, segmentation is available but it is set up uniquely (see Section 3.4.2.1, “Segment Registers in 64-Bit Mode”).

• Flags and Status register — When the processor is running in 64-bit mode, EFLAGS becomes the 64-bit RFLAGS register (see Section 3.4.3, “EFLAGS Register”).

• Instruction Pointer register — In 64-bit mode, the EIP register becomes the 64-bit RIP register (see Section 3.5.1, “Instruction Pointer in 64-Bit Mode”).

General-purpose instructions operate on the following data types in 64-bit mode. The width of valid data types is dependent on default operand size, address size, or a prefix that overrides the default size:

• Bytes, words, doublewords, quadwords
• Signed and unsigned byte, word, doubleword, quadword integers
• Near and far pointers
• Bit fields

See also:



• Chapter 3, “Basic Execution Environment,” for more information about IA-32e modes.

• Chapter 2, “Instruction Format,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for more detailed information about REX prefixes.

• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, for a complete listing of all instructions. This information documents the behavior of individual instructions in the 64-bit mode context.


7.3 SUMMARY OF GP INSTRUCTIONS

General-purpose instructions are divided into the following subgroups:

• Data transfer
• Binary arithmetic
• Decimal arithmetic
• Logical
• Shift and rotate
• Bit and byte
• Control transfer
• String
• I/O
• Enter and Leave
• Flag control
• Segment register
• Miscellaneous

Each subgroup of general-purpose instructions is discussed in the context of non-64-bit mode operation first. Changes in 64-bit mode beyond those affected by the use of the REX prefixes are discussed in separate subsections within each subgroup. For a simple list of general-purpose instructions by subgroup, see Chapter 5.

7.3.1 Data Transfer Instructions

The data transfer instructions move bytes, words, doublewords, or quadwords both between memory and the processor’s registers and between registers. For the purpose of this discussion, these instructions are divided into subordinate subgroups that provide for:

• General data movement
• Exchange
• Stack manipulation
• Type conversion

7.3.1.1 General Data Movement Instructions

Move instructions — The MOV (move) and CMOVcc (conditional move) instructions transfer data between memory and registers or between registers.

The MOV instruction performs basic load data and store data operations between memory and the processor’s registers and data movement operations between registers. It handles data transfers along the paths listed in Table 7-1. (See “MOV—Move to/from Control Registers” and “MOV—Move to/from Debug Registers” in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for information on moving data to and from the control and debug registers.)

The MOV instruction cannot move data from one memory location to another or from one segment register to another segment register. Memory-to-memory moves are performed with the MOVS (string move) instruction (see Section 7.3.9, “String Operations”).

Conditional move instructions — The CMOVcc instructions are a group of instructions that check the state of the status flags in the EFLAGS register and perform a move operation if the flags are in a specified state. These instructions can be used to move a 16-bit or 32-bit value from memory to a general-purpose register or from one general-purpose register to another. The flag state being tested is specified with a condition code (cc) associated with the instruction. If the condition is not satisfied, a move is not performed and execution continues with the instruction following the CMOVcc instruction.

Table 7-1. Move Instruction Operations

Type of Data Movement          Source → Destination
-----------------------------  ---------------------------------------------------
From memory to a register      Memory location → General-purpose register
                               Memory location → Segment register
From a register to memory      General-purpose register → Memory location
                               Segment register → Memory location
Between registers              General-purpose register → General-purpose register
                               General-purpose register → Segment register
                               Segment register → General-purpose register
                               General-purpose register → Control register
                               Control register → General-purpose register
                               General-purpose register → Debug register
                               Debug register → General-purpose register
Immediate data to a register   Immediate → General-purpose register
Immediate data to memory       Immediate → Memory location

Table 7-2 shows mnemonics for CMOVcc instructions and the conditions being tested for each instruction. The condition code mnemonics are appended to the letters “CMOV” to form the mnemonics for CMOVcc instructions. The instructions listed in Table 7-2 as pairs (for example, CMOVA/CMOVNBE) are alternate names for the same instruction. The assembler provides these alternate names to make it easier to read program listings.

CMOVcc instructions are useful for optimizing small IF constructions. They also help eliminate branching overhead for IF statements and the possibility of branch mispredictions by the processor.

These conditional move instructions are supported in the P6 family, Pentium 4, and Intel Xeon processors. Software can check if CMOVcc instructions are supported by checking the processor’s feature information with the CPUID instruction.

7.3.1.2 Exchange Instructions

The exchange instructions swap the contents of one or more operands and, in some cases, perform additional operations such as asserting the LOCK signal or modifying flags in the EFLAGS register.

The XCHG (exchange) instruction swaps the contents of two operands. This instruction takes the place of three MOV instructions and does not require a temporary location to save the contents of one operand location while the other is being loaded. When a memory operand is used with the XCHG instruction, the processor’s LOCK signal is automatically asserted. This instruction is thus useful for implementing semaphores or similar data structures for process synchronization. See “Bus Locking” in Chapter 7, “Multiple-Processor Management,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information on bus locking.

The BSWAP (byte swap) instruction reverses the byte order in a 32-bit register operand. Bit positions 0 through 7 are exchanged with 24 through 31, and bit positions 8 through 15 are exchanged with 16 through 23. Executing this instruction twice in a row leaves the register with the same value as before. The BSWAP instruction is useful for converting between “big-endian” and “little-endian” data formats. This instruction also speeds execution of decimal arithmetic. (The XCHG instruction can be used to swap the bytes in a word.)
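The byte reordering described for BSWAP, and the byte swap within a word that XCHG can perform, can be illustrated with a short sketch. The function names `bswap32` and `xchg_bytes16` are hypothetical helpers chosen for this example; they model only the resulting values, not flags or timing.

```python
def bswap32(value):
    """Reverse the byte order of a 32-bit value, as BSWAP does."""
    value &= 0xFFFFFFFF
    return int.from_bytes(value.to_bytes(4, "little"), "big")


def xchg_bytes16(value):
    """Swap the two bytes of a 16-bit word (for example, XCHG AL, AH)."""
    value &= 0xFFFF
    return ((value << 8) | (value >> 8)) & 0xFFFF


swapped = bswap32(0x12345678)        # bits 0-7 trade places with bits 24-31
restored = bswap32(swapped)          # a second BSWAP undoes the first
word_swapped = xchg_bytes16(0x1234)
```

Here `swapped` is 0x78563412, `restored` is the original 0x12345678, and `word_swapped` is 0x3412.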

Table 7-2. Conditional Move Instructions

Instruction Mnemonic   Status Flag States        Condition Description
---------------------  ------------------------  --------------------------
Unsigned Conditional Moves
CMOVA/CMOVNBE          (CF or ZF) = 0            Above/not below or equal
CMOVAE/CMOVNB          CF = 0                    Above or equal/not below
CMOVNC                 CF = 0                    Not carry
CMOVB/CMOVNAE          CF = 1                    Below/not above or equal
CMOVC                  CF = 1                    Carry
CMOVBE/CMOVNA          (CF or ZF) = 1            Below or equal/not above
CMOVE/CMOVZ            ZF = 1                    Equal/zero
CMOVNE/CMOVNZ          ZF = 0                    Not equal/not zero
CMOVP/CMOVPE           PF = 1                    Parity/parity even
CMOVNP/CMOVPO          PF = 0                    Not parity/parity odd
Signed Conditional Moves
CMOVGE/CMOVNL          (SF xor OF) = 0           Greater or equal/not less
CMOVL/CMOVNGE          (SF xor OF) = 1           Less/not greater or equal
CMOVLE/CMOVNG          ((SF xor OF) or ZF) = 1   Less or equal/not greater
CMOVO                  OF = 1                    Overflow
CMOVNO                 OF = 0                    Not overflow
CMOVS                  SF = 1                    Sign (negative)
CMOVNS                 SF = 0                    Not sign (non-negative)
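The flag tests in Table 7-2 can be expressed as simple predicates. The sketch below is illustrative only: the `CONDITIONS` and `ALIASES` dictionaries and the `cmov` function are names invented for this example, flags are passed as a plain dictionary of 0/1 values, and only the conditions listed in the table are modeled.

```python
# One predicate per condition code; each takes a dict of 0/1 flag values.
CONDITIONS = {
    "A":  lambda f: not (f["CF"] or f["ZF"]),
    "AE": lambda f: not f["CF"],
    "B":  lambda f: f["CF"],
    "BE": lambda f: f["CF"] or f["ZF"],
    "E":  lambda f: f["ZF"],
    "NE": lambda f: not f["ZF"],
    "P":  lambda f: f["PF"],
    "NP": lambda f: not f["PF"],
    "GE": lambda f: f["SF"] == f["OF"],
    "L":  lambda f: f["SF"] != f["OF"],
    "LE": lambda f: f["SF"] != f["OF"] or f["ZF"],
    "O":  lambda f: f["OF"],
    "NO": lambda f: not f["OF"],
    "S":  lambda f: f["SF"],
    "NS": lambda f: not f["SF"],
}

# Alternate mnemonics from the table map to the same test.
ALIASES = {
    "NBE": "A", "NB": "AE", "NC": "AE", "NAE": "B", "C": "B", "NA": "BE",
    "Z": "E", "NZ": "NE", "PE": "P", "PO": "NP",
    "NL": "GE", "NGE": "L", "NG": "LE",
}


def cmov(cc, flags, dest, src):
    """Return src if condition cc holds for flags, otherwise keep dest."""
    test = CONDITIONS[ALIASES.get(cc, cc)]
    return src if bool(test(flags)) else dest


flags = {"CF": 0, "ZF": 0, "SF": 1, "OF": 0, "PF": 0}
moved = cmov("A", flags, dest=1, src=2)       # CF = 0 and ZF = 0: move
not_moved = cmov("GE", flags, dest=1, src=2)  # SF differs from OF: no move
```

With these flag values the "above" condition is satisfied, so `moved` is 2, while the "greater or equal" condition is not, so `not_moved` keeps the destination value 1.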

The XADD (exchange and add) instruction swaps two operands and then stores the sum of the two operands in the destination operand. The status flags in the EFLAGS register indicate the result of the addition. This instruction can be combined with the LOCK prefix (see “LOCK—Assert LOCK# Signal Prefix” in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A) in a multiprocessing system to allow multiple processors to execute one DO loop.

The CMPXCHG (compare and exchange) and CMPXCHG8B (compare and exchange 8 bytes) instructions are used to synchronize operations in systems that use multiple processors. The CMPXCHG instruction requires three operands: a source operand in a register, another source operand in the EAX register, and a destination operand. If the values contained in the destination operand and the EAX register are equal, the destination operand is replaced with the value of the other source operand (the value not in the EAX register). Otherwise, the original value of the destination operand is loaded in the EAX register. The status flags in the EFLAGS register reflect the result that would have been obtained by subtracting the destination operand from the value in the EAX register.

The CMPXCHG instruction is commonly used for testing and modifying semaphores. It checks to see if a semaphore is free. If the semaphore is free, it is marked allocated; otherwise it gets the ID of the current owner. This is all done in one uninterruptible operation. In a single-processor system, the CMPXCHG instruction eliminates the need to switch to protection level 0 (to disable interrupts) before executing multiple instructions to test and modify a semaphore.

For multiple processor systems, CMPXCHG can be combined with the LOCK prefix to perform the compare and exchange operation atomically. (See “Locked Atomic Operations” in Chapter 7, “Multiple-Processor Management,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information on atomic operations.)
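The data flow of XADD and CMPXCHG, and the semaphore test described above, can be sketched as pure functions. This is a model of the value semantics only: Python code is not atomic in the way the instruction is, and the names `xadd`, `cmpxchg`, `try_acquire`, and the convention that 0 means "free" are assumptions made for this illustration.

```python
MASK32 = 0xFFFFFFFF
FREE = 0  # assumed encoding of a free semaphore for this sketch


def xadd(dest, src):
    """XADD: return (new_dest, new_src); dest gets the sum, src the old dest."""
    return (dest + src) & MASK32, dest


def cmpxchg(dest, eax, src):
    """CMPXCHG: return (new_dest, new_eax, zf)."""
    if dest == eax:
        return src, eax, 1   # equal: the source operand is stored, ZF = 1
    return dest, dest, 0     # not equal: the destination is loaded into EAX


def try_acquire(semaphore, owner_id):
    """Claim a free semaphore; return (new_semaphore, acquired, owner)."""
    new_sem, eax, zf = cmpxchg(semaphore, FREE, owner_id)
    return new_sem, bool(zf), owner_id if zf else eax


first = try_acquire(FREE, 7)       # semaphore is free: task 7 takes it
second = try_acquire(first[0], 9)  # already owned: task 9 learns the owner
counter, previous = xadd(5, 3)
```

In this sketch `first` is `(7, True, 7)` and `second` is `(7, False, 7)`: the second caller does not change the semaphore and instead receives the ID of the current owner. The XADD call leaves the sum 8 in the destination and the old destination value 5 in the source.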


The CMPXCHG8B instruction also requires three operands: a 64-bit value in EDX:EAX, a 64-bit value in ECX:EBX, and a destination operand in memory. The instruction compares the 64-bit value in the EDX:EAX registers with the destination operand. If they are equal, the 64-bit value in the ECX:EBX register is stored in the destination operand. If the EDX:EAX register and the destination are not equal, the destination is loaded in the EDX:EAX register. The CMPXCHG8B instruction can be combined with the LOCK prefix to perform the operation atomically.

7.3.1.3 Exchange Instructions in 64-Bit Mode

The CMPXCHG16B instruction is available in 64-bit mode only. It is an extension of the functionality provided by CMPXCHG8B that operates on 128 bits of data.

7.3.1.4 Stack Manipulation Instructions

The PUSH, POP, PUSHA (push all registers), and POPA (pop all registers) instructions move data to and from the stack. The PUSH instruction decrements the stack pointer (contained in the ESP register), then copies the source operand to the top of stack (see Figure 7-1). It operates on memory operands, immediate operands, and register operands (including segment registers). The PUSH instruction is commonly used to place parameters on the stack before calling a procedure. It can also be used to reserve space on the stack for temporary variables.

Figure 7-1. Operation of the PUSH Instruction
(Stack shown before and after pushing a doubleword. The stack grows toward lower addresses: ESP moves from address n to n−4, where the doubleword value is stored.)

The PUSHA instruction saves the contents of the eight general-purpose registers on the stack (see Figure 7-2). This instruction simplifies procedure calls by reducing the number of instructions required to save the contents of the general-purpose registers. The registers are pushed on the stack in the following order: EAX, ECX, EDX, EBX, the initial value of ESP before EAX was pushed, EBP, ESI, and EDI.


Figure 7-2. Operation of the PUSHA Instruction
(Stack shown before and after pushing the registers. Starting below the original top of stack at address n, the stack holds EAX, ECX, EDX, EBX, the old ESP, EBP, ESI, and EDI in successive doublewords toward lower addresses; ESP then points to the doubleword holding EDI.)

The POP instruction copies the word or doubleword at the current top of stack (indicated by the ESP register) to the location specified with the destination operand. It then increments the ESP register to point to the new top of stack (see Figure 7-3). The destination operand may specify a general-purpose register, a segment register, or a memory location.

Figure 7-3. Operation of the POP Instruction
(Stack shown before and after popping a doubleword. ESP initially points to the doubleword value and afterward points to the next higher doubleword.)

The POPA instruction reverses the effect of the PUSHA instruction. It pops the top eight words or doublewords from the top of the stack into the general-purpose registers, except for the ESP register (see Figure 7-4). If the operand-size attribute is 32, the doublewords on the stack are transferred to the registers in the following order: EDI, ESI, EBP, ignore doubleword, EBX, EDX, ECX, and EAX. The ESP register is restored by the action of popping the stack. If the operand-size attribute is 16, the words on the stack are transferred to the registers in the following order: DI, SI, BP, ignore word, BX, DX, CX, and AX.


Figure 7-4. Operation of the POPA Instruction
(Stack shown before and after popping the registers. ESP initially points to the doubleword holding EDI; the doublewords for EDI, ESI, EBP, the ignored ESP slot, EBX, EDX, ECX, and EAX are popped in that order, leaving ESP at the original top of stack.)
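The ordering rules for PUSH, POP, PUSHA, and POPA in 32-bit operation can be sketched with a toy stack. The class name `Stack32`, the dictionary used as memory, and the starting ESP value are assumptions made for this illustration; segment registers, 16-bit operand sizes, and faults are not modeled.

```python
class Stack32:
    """Toy model of a 32-bit stack that grows toward lower addresses."""

    def __init__(self, esp=0x1000):
        self.esp = esp
        self.mem = {}

    def push(self, value):
        self.esp -= 4                       # decrement ESP first
        self.mem[self.esp] = value & 0xFFFFFFFF

    def pop(self):
        value = self.mem[self.esp]          # read the current top of stack
        self.esp += 4                       # then increment ESP
        return value


PUSHA_ORDER = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]


def pusha(stack, regs):
    original_esp = stack.esp                # ESP before EAX is pushed
    for name in PUSHA_ORDER:
        stack.push(original_esp if name == "ESP" else regs[name])


def popa(stack, regs):
    for name in reversed(PUSHA_ORDER):
        value = stack.pop()
        if name != "ESP":                   # the saved ESP slot is ignored
            regs[name] = value


stack = Stack32(0x1000)
regs = {"EAX": 1, "ECX": 2, "EDX": 3, "EBX": 4, "EBP": 5, "ESI": 6, "EDI": 7}
saved = dict(regs)
pusha(stack, regs)
esp_after_pusha = stack.esp
regs = dict.fromkeys(regs, 0)               # clobber the registers
popa(stack, regs)
```

After `pusha`, ESP has moved down by 32 bytes and the slot at offset −20 from the original top holds the old ESP value. After `popa`, `regs` equals `saved` and ESP is back at its starting value.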

7.3.1.5 Stack Manipulation Instructions in 64-Bit Mode

In 64-bit mode, the stack pointer size is 64 bits and cannot be overridden by an instruction prefix. In implicit stack references, address-size overrides are ignored. Pushes and pops of 32-bit values on the stack are not possible in 64-bit mode. 16-bit pushes and pops are supported by using the 66H operand-size prefix. PUSHA, PUSHAD, POPA, and POPAD are not supported.

7.3.1.6 Type Conversion Instructions

The type conversion instructions convert bytes into words, words into doublewords, and doublewords into quadwords. These instructions are especially useful for converting integers to larger integer formats, because they perform sign extension (see Figure 7-5). Two kinds of type conversion instructions are provided: simple conversion and move and convert.


Figure 7-5. Sign Extension
(Before sign extension: a 16-bit value with sign bit S in bit 15 and value bits N in bits 14 through 0. After sign extension: a 32-bit value in which S fills bits 31 through 15 and the N bits are unchanged.)

Simple conversion — The CBW (convert byte to word), CWDE (convert word to doubleword extended), CWD (convert word to doubleword), and CDQ (convert doubleword to quadword) instructions perform sign extension to double the size of the source operand.

The CBW instruction copies the sign (bit 7) of the byte in the AL register into every bit position of the upper byte of the AX register. The CWDE instruction copies the sign (bit 15) of the word in the AX register into every bit position of the high word of the EAX register.

The CWD instruction copies the sign (bit 15) of the word in the AX register into every bit position in the DX register. The CDQ instruction copies the sign (bit 31) of the doubleword in the EAX register into every bit position in the EDX register. The CWD instruction can be used to produce a doubleword dividend from a word before a word division, and the CDQ instruction can be used to produce a quadword dividend from a doubleword before doubleword division.

Move with sign or zero extension — The MOVSX (move with sign extension) and MOVZX (move with zero extension) instructions move the source operand into a register and then sign-extend or zero-extend it, respectively.

The MOVSX instruction extends an 8-bit value to a 16-bit value or an 8-bit or 16-bit value to a 32-bit value by sign extending the source operand, as shown in Figure 7-5. The MOVZX instruction extends an 8-bit value to a 16-bit value or an 8-bit or 16-bit value to a 32-bit value by zero extending the source operand.
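The sign and zero extension performed by these instructions can be sketched with integer masks. The helper names below (`sign_extend`, `zero_extend`, `cbw`, `cwde`, `cwd`, `cdq`) are chosen for this example; each function models only the resulting register values, returned as unsigned integers.

```python
def sign_extend(value, from_bits, to_bits):
    """Copy the sign bit of a from_bits value into the added upper bits."""
    value &= (1 << from_bits) - 1
    if value >> (from_bits - 1):                       # sign bit set
        value |= ((1 << to_bits) - 1) & ~((1 << from_bits) - 1)
    return value


def zero_extend(value, from_bits):
    """Keep the low from_bits bits; the upper bits are zero."""
    return value & ((1 << from_bits) - 1)


def cbw(al):
    return sign_extend(al, 8, 16)                      # result in AX


def cwde(ax):
    return sign_extend(ax, 16, 32)                     # result in EAX


def cwd(ax):
    full = sign_extend(ax, 16, 32)
    return full >> 16, full & 0xFFFF                   # (DX, AX)


def cdq(eax):
    full = sign_extend(eax, 32, 64)
    return full >> 32, full & 0xFFFFFFFF               # (EDX, EAX)


negative_byte = cbw(0x80)        # -128 as a byte
positive_byte = cbw(0x7F)        # +127 as a byte
edx_eax = cdq(0x80000000)
```

A negative byte such as 0x80 becomes 0xFF80, a positive byte such as 0x7F becomes 0x007F, and CDQ on 0x80000000 fills EDX with 0xFFFFFFFF.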

7.3.1.7 Type Conversion Instructions in 64-Bit Mode

The MOVSXD instruction operates on 64-bit data. It sign-extends a 32-bit value to 64 bits. This instruction is not encodable in non-64-bit modes.

7.3.2 Binary Arithmetic Instructions

Binary arithmetic instructions operate on 8-, 16-, and 32-bit numeric data encoded as signed or unsigned binary integers. The binary arithmetic instructions may also be used in algorithms that operate on decimal (BCD) values.


For the purpose of this discussion, these instructions are divided into subordinate subgroups of instructions that:

• Add and subtract
• Increment and decrement
• Compare and change signs
• Multiply and divide

7.3.2.1 Addition and Subtraction Instructions

The ADD (add integers), ADC (add integers with carry), SUB (subtract integers), and SBB (subtract integers with borrow) instructions perform addition and subtraction operations on signed or unsigned integer operands.

The ADD instruction computes the sum of two integer operands.

The ADC instruction computes the sum of two integer operands, plus 1 if the CF flag is set. This instruction is used to propagate a carry when adding numbers in stages.

The SUB instruction computes the difference of two integer operands.

The SBB instruction computes the difference of two integer operands, minus 1 if the CF flag is set. This instruction is used to propagate a borrow when subtracting numbers in stages.
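Adding or subtracting "in stages" can be illustrated by building 64-bit operations from 32-bit halves. The sketch below models only the result and the CF flag; the function names `add32`, `sub32`, `add64`, and `sub64` are hypothetical helpers introduced for this example.

```python
MASK32 = 0xFFFFFFFF


def add32(a, b, carry_in=0):
    """ADD (carry_in = 0) or ADC (carry_in = CF): return (result, CF)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def sub32(a, b, borrow_in=0):
    """SUB (borrow_in = 0) or SBB (borrow_in = CF): return (result, CF)."""
    diff = (a & MASK32) - (b & MASK32) - borrow_in
    return diff & MASK32, 1 if diff < 0 else 0


def add64(a_hi, a_lo, b_hi, b_lo):
    lo, cf = add32(a_lo, b_lo)         # ADD the low doublewords
    hi, cf = add32(a_hi, b_hi, cf)     # ADC the high doublewords
    return hi, lo, cf


def sub64(a_hi, a_lo, b_hi, b_lo):
    lo, cf = sub32(a_lo, b_lo)         # SUB the low doublewords
    hi, cf = sub32(a_hi, b_hi, cf)     # SBB the high doublewords
    return hi, lo, cf


total = add64(0x00000000, 0xFFFFFFFF, 0x00000000, 0x00000001)
difference = sub64(0x00000001, 0x00000000, 0x00000000, 0x00000001)
```

The carry out of the low half reaches the high half in `total`, which is `(1, 0, 0)`, and the borrow is propagated in `difference`, which is `(0, 0xFFFFFFFF, 0)`.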

7.3.2.2 Increment and Decrement Instructions

The INC (increment) and DEC (decrement) instructions add 1 to or subtract 1 from an unsigned integer operand, respectively. A primary use of these instructions is for implementing counters.

7.3.2.3 Increment and Decrement Instructions in 64-Bit Mode

The INC and DEC instructions are supported in 64-bit mode. However, some forms of INC and DEC (the register operand being encoded using register extension field in the MOD R/M byte) are not encodable in 64-bit mode because the opcodes are treated as REX prefixes.

7.3.2.4 Comparison and Sign Change Instruction

The CMP (compare) instruction computes the difference between two integer operands and updates the OF, SF, ZF, AF, PF, and CF flags according to the result. The source operands are not modified, nor is the result saved. The CMP instruction is commonly used in conjunction with a Jcc (jump) or SETcc (byte set on condition) instruction, with the latter instructions performing an action based on the result of a CMP instruction.


The NEG (negate) instruction subtracts a signed integer operand from zero. The effect of the NEG instruction is to change the sign of a two’s complement operand while keeping its magnitude.

7.3.2.5 Multiplication and Divide Instructions

The processor provides two multiply instructions, MUL (unsigned multiply) and IMUL (signed multiply), and two divide instructions, DIV (unsigned divide) and IDIV (signed divide).

The MUL instruction multiplies two unsigned integer operands. The result is computed to twice the size of the source operands (for example, if word operands are being multiplied, the result is a doubleword).

The IMUL instruction multiplies two signed integer operands. The result is computed to twice the size of the source operands; however, in some cases the result is truncated to the size of the source operands (see “IMUL—Signed Multiply” in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A).

The DIV instruction divides one unsigned operand by another unsigned operand and returns a quotient and a remainder.

The IDIV instruction is identical to the DIV instruction, except that IDIV performs a signed division.
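The double-width result of MUL and the quotient and remainder returned by DIV can be sketched for word operands. The function names `mul16` and `div16` are hypothetical; the sketch assumes the word forms that use the DX:AX register pair, and it raises a Python exception where the processor would signal a divide error.

```python
def mul16(ax, src):
    """MUL with word operands: return (DX, AX) holding the 32-bit product."""
    product = (ax & 0xFFFF) * (src & 0xFFFF)
    return product >> 16, product & 0xFFFF


def div16(dx, ax, src):
    """DIV with a word divisor: divide DX:AX by src.

    Returns (quotient, remainder). A zero divisor or a quotient that does
    not fit in 16 bits is reported as an error in this sketch.
    """
    dividend = ((dx & 0xFFFF) << 16) | (ax & 0xFFFF)
    src &= 0xFFFF
    if src == 0 or dividend // src > 0xFFFF:
        raise ArithmeticError("divide error")
    return dividend // src, dividend % src


dx, ax = mul16(0xFFFF, 0xFFFF)            # product is 0xFFFE0001
quotient, remainder = div16(dx, ax, 0xFFFF)
small = div16(0, 100, 7)
```

The product of two word operands occupies a doubleword, here split as DX = 0xFFFE and AX = 0x0001; dividing that doubleword by 0xFFFF gives back 0xFFFF with a remainder of 0.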

7.3.3 Decimal Arithmetic Instructions

Decimal arithmetic can be performed by combining the binary arithmetic instructions ADD, SUB, MUL, and DIV (discussed in Section 7.3.2, “Binary Arithmetic Instructions”) with the decimal arithmetic instructions. The decimal arithmetic instructions are provided to carry out the following operations:

• To adjust the results of a previous binary arithmetic operation to produce a valid BCD result.

• To adjust the operands of a subsequent binary arithmetic operation so that the operation will produce a valid BCD result.

These instructions operate on both packed and unpacked BCD values. For the purpose of this discussion, the decimal arithmetic instructions are divided into subordinate subgroups of instructions that provide:

• Packed BCD adjustments
• Unpacked BCD adjustments

7.3.3.1 Packed BCD Adjustment Instructions

The DAA (decimal adjust after addition) and DAS (decimal adjust after subtraction) instructions adjust the results of operations performed on packed BCD integers (see Section 4.7, “BCD and Packed BCD Integers”).

Adding two packed BCD values requires two instructions: an ADD instruction followed by a DAA instruction. The ADD instruction adds (binary addition) the two values and stores the result in the AL register. The DAA instruction then adjusts the value in the AL register to obtain a valid, 2-digit, packed BCD value and sets the CF flag if a decimal carry occurred as the result of the addition.

Likewise, subtracting one packed BCD value from another requires a SUB instruction followed by a DAS instruction. The SUB instruction subtracts (binary subtraction) one BCD value from another and stores the result in the AL register. The DAS instruction then adjusts the value in the AL register to obtain a valid, 2-digit, packed BCD value and sets the CF flag if a decimal borrow occurred as the result of the subtraction.
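The ADD-then-DAA sequence can be traced with a small model. This sketch is an approximation written for illustration: the adjustment rule (add 6 when the low digit exceeds 9 or AF is set, add 60H when the original value exceeds 99H or CF is set) follows the commonly documented DAA operation, and the function names `add_al`, `daa`, and `bcd_add` are assumptions, not part of any library.

```python
def add_al(al, value):
    """Binary ADD into AL: return (AL, CF, AF)."""
    total = (al & 0xFF) + (value & 0xFF)
    af = 1 if (al & 0x0F) + (value & 0x0F) > 0x0F else 0   # carry out of bit 3
    return total & 0xFF, total >> 8, af


def daa(al, cf, af):
    """Adjust AL to a valid packed BCD value: return (AL, CF)."""
    old_al, old_cf = al, cf
    if (al & 0x0F) > 9 or af:
        al = (al + 0x06) & 0xFF          # correct the low decimal digit
    if old_al > 0x99 or old_cf:
        al = (al + 0x60) & 0xFF          # correct the high decimal digit
        cf = 1                           # decimal carry out of two digits
    else:
        cf = 0
    return al, cf


def bcd_add(a, b):
    """Add two 2-digit packed BCD values with ADD followed by DAA."""
    al, cf, af = add_al(a, b)
    return daa(al, cf, af)


with_carry = bcd_add(0x79, 0x35)     # 79 + 35 = 114
without_carry = bcd_add(0x19, 0x28)  # 19 + 28 = 47
```

For 79 + 35 the binary sum AEH is adjusted to 14H with CF set, which represents 114. For 19 + 28 the binary sum 41H is adjusted to 47H with CF clear.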

7.3.3.2

Unpacked BCD Adjustment Instructions

The AAA (ASCII adjust after addition), AAS (ASCII adjust after subtraction), AAM (ASCII adjust after multiplication), and AAD (ASCII adjust before division) instructions adjust the results of arithmetic operations performed on unpacked BCD values (see Section 4.7, “BCD and Packed BCD Integers”). All these instructions assume that the value to be adjusted is stored in the AL register or, in one instance, the AL and AH registers.

The AAA instruction adjusts the contents of the AL register following the addition of two unpacked BCD values. It converts the binary value in the AL register into a decimal value and stores the result in the AL register in unpacked BCD format (the decimal number is stored in the lower 4 bits of the register and the upper 4 bits are cleared). If a decimal carry occurred as a result of the addition, the CF flag is set and the contents of the AH register are incremented by 1.

The AAS instruction adjusts the contents of the AL register following the subtraction of two unpacked BCD values. Here again, a binary value is converted into an unpacked BCD value. If a borrow was required to complete the decimal subtraction, the CF flag is set and the contents of the AH register are decremented by 1.

The AAM instruction adjusts the contents of the AL register following a multiplication of two unpacked BCD values. It converts the binary value in the AL register into a decimal value and stores the least significant digit of the result in the AL register (in unpacked BCD format) and the most significant digit, if there is one, in the AH register (also in unpacked BCD format).

The AAD instruction adjusts a two-digit BCD value so that when the value is divided with the DIV instruction, a valid unpacked BCD result is obtained. The instruction converts the BCD value in registers AH (most significant digit) and AL (least significant digit) into a binary value and stores the result in register AL. When the value in AL is divided by an unpacked BCD value, the quotient and remainder will be automatically encoded in unpacked BCD format.
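The adjustments described above can be modelled in a few lines of Python. This is a simplified sketch, not the processor's definition: the function names and the tuple return values are illustrative, and the `af` argument (the auxiliary carry flag left by the preceding ADD or SUB) is taken from the instruction reference rather than from this section.

```python
def aaa(al, ah, af=0):
    """Model AAA: adjust AL after adding two unpacked BCD digits."""
    if (al & 0x0F) > 9 or af:
        al = (al + 6) & 0xFF    # bring the low nibble back into 0..9
        ah = (ah + 1) & 0xFF    # decimal carry goes into AH
        cf = 1
    else:
        cf = 0
    return al & 0x0F, ah, cf    # upper 4 bits of AL are cleared


def aas(al, ah, af=0):
    """Model AAS: adjust AL after subtracting two unpacked BCD digits."""
    if (al & 0x0F) > 9 or af:
        al = (al - 6) & 0xFF
        ah = (ah - 1) & 0xFF    # decimal borrow comes out of AH
        cf = 1
    else:
        cf = 0
    return al & 0x0F, ah, cf


def aam(al):
    """Model AAM: split a binary product into two unpacked digits (AL, AH)."""
    return al % 10, al // 10


def aad(al, ah):
    """Model AAD: combine two unpacked digits into a binary value (AL, AH)."""
    return (ah * 10 + al) & 0xFF, 0
```

For example, adding the digits 7 and 8 leaves 0FH in AL; `aaa(0x0F, 0)` yields AL = 5, AH = 1, CF = 1, which is the unpacked BCD encoding of 15.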

7.3.4

Decimal Arithmetic Instructions in 64-Bit Mode

Decimal arithmetic instructions are not supported in 64-bit mode. They are either invalid or not encodable.

7.3.5

Logical Instructions

The logical instructions AND, OR, XOR (exclusive or), and NOT perform the standard Boolean operations for which they are named. The AND, OR, and XOR instructions require two operands; the NOT instruction operates on a single operand.

7.3.6

Shift and Rotate Instructions

The shift and rotate instructions rearrange the bits within an operand. For the purpose of this discussion, these instructions are further divided into subordinate subgroups of instructions that:

• Shift bits
• Double-shift bits (move them between operands)
• Rotate bits

7.3.6.1

Shift Instructions

The SAL (shift arithmetic left), SHL (shift logical left), SAR (shift arithmetic right), and SHR (shift logical right) instructions perform an arithmetic or logical shift of the bits in a byte, word, or doubleword.

The SAL and SHL instructions perform the same operation (see Figure 7-6). They shift the source operand left by from 1 to 31 bit positions. Empty bit positions are cleared. The CF flag is loaded with the last bit shifted out of the operand.

Initial State
    CF: X    Operand: 1000 1000 1000 1000 1000 1000 1000 1111

After 1-bit SHL/SAL Instruction (a 0 is shifted in on the right)
    CF: 1    Operand: 0001 0001 0001 0001 0001 0001 0001 1110

After 10-bit SHL/SAL Instruction (0s are shifted in on the right)
    CF: 0    Operand: 0010 0010 0010 0010 0011 1100 0000 0000

Figure 7-6. SHL/SAL Instruction Operation

The SHR instruction shifts the source operand right by from 1 to 31 bit positions (see Figure 7-7). As with the SHL/SAL instruction, the empty bit positions are cleared and the CF flag is loaded with the last bit shifted out of the operand.

Initial State
    Operand: 1000 1000 1000 1000 1000 1000 1000 1111    CF: X

After 1-bit SHR Instruction (a 0 is shifted in on the left)
    Operand: 0100 0100 0100 0100 0100 0100 0100 0111    CF: 1

After 10-bit SHR Instruction (0s are shifted in on the left)
    Operand: 0000 0000 0010 0010 0010 0010 0010 0010    CF: 0

Figure 7-7. SHR Instruction Operation

The SAR instruction shifts the source operand right by from 1 to 31 bit positions (see Figure 7-8). This instruction differs from the SHR instruction in that it preserves the sign of the source operand by clearing empty bit positions if the operand is positive or setting the empty bits if the operand is negative. Again, the CF flag is loaded with the last bit shifted out of the operand.

The SAR and SHR instructions can also be used to perform division by powers of 2 (see “SAL/SAR/SHL/SHR—Shift Instructions” in Chapter 4, “Instruction Set Reference, N-Z,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B).

Initial State (Positive Operand)
    Operand: 0100 0100 0100 0100 0100 0100 0100 0111    CF: X

After 1-bit SAR Instruction
    Operand: 0010 0010 0010 0010 0010 0010 0010 0011    CF: 1

Initial State (Negative Operand)
    Operand: 1100 0100 0100 0100 0100 0100 0100 0111    CF: X

After 1-bit SAR Instruction
    Operand: 1110 0010 0010 0010 0010 0010 0010 0011    CF: 1

Figure 7-8. SAR Instruction Operation
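The three shift behaviours can be checked against the bit patterns in Figures 7-6 through 7-8 with a small Python model. This is a sketch for 32-bit operands only; the function names are illustrative, and returning `None` for CF when the count is zero is this sketch's way of saying "flags unchanged".

```python
MASK32 = 0xFFFFFFFF


def shl(value, count):
    """SHL/SAL: shift left, clear vacated bits; returns (result, CF)."""
    count &= 31
    if count == 0:
        return value, None
    cf = (value >> (32 - count)) & 1          # last bit shifted out on the left
    return (value << count) & MASK32, cf


def shr(value, count):
    """SHR: logical shift right, clear vacated bits; returns (result, CF)."""
    count &= 31
    if count == 0:
        return value, None
    cf = (value >> (count - 1)) & 1           # last bit shifted out on the right
    return value >> count, cf


def sar(value, count):
    """SAR: arithmetic shift right, vacated bits copy the sign bit."""
    count &= 31
    if count == 0:
        return value, None
    cf = (value >> (count - 1)) & 1
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> count) & MASK32, cf
```

With the figure's operand 8888888FH, `shl(0x8888888F, 10)` gives 22223C00H with CF = 0, and `sar(0xC4444447, 1)` gives E2222223H with CF = 1, matching the diagrams.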

7.3.6.2

Double-Shift Instructions

The SHLD (shift left double) and SHRD (shift right double) instructions shift a specified number of bits from one operand to another (see Figure 7-9). They are provided to facilitate operations on unaligned bit strings. They can also be used to implement a variety of bit string move operations.

SHLD Instruction
    CF ← Destination (Memory or Register), bits 31 through 0 ← Source (Register), bits 31 through 0

SHRD Instruction
    Source (Register), bits 31 through 0 → Destination (Memory or Register), bits 31 through 0 → CF

Figure 7-9. SHLD and SHRD Instruction Operations

The SHLD instruction shifts the bits in the destination operand to the left and fills the empty bit positions (in the destination operand) with bits shifted out of the source operand. The destination and source operands must be the same length (either words or doublewords). The shift count can range from 0 to 31 bits. The result of this shift operation is stored in the destination operand, and the source operand is not modified. The CF flag is loaded with the last bit shifted out of the destination operand.

The SHRD instruction operates the same as the SHLD instruction except bits are shifted to the right in the destination operand, with the empty bit positions filled with bits shifted out of the source operand.
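A minimal Python sketch of the double shifts, assuming 32-bit operands; the function names and the `(result, CF)` return convention are illustrative. The source operand is only read, never modified, as the text describes.

```python
MASK32 = 0xFFFFFFFF


def shld(dest, src, count):
    """SHLD: shift dest left, filling from the high-order bits of src."""
    count &= 31
    if count == 0:
        return dest, None                      # nothing shifted, CF unchanged
    cf = (dest >> (32 - count)) & 1            # last bit shifted out of dest
    result = ((dest << count) | (src >> (32 - count))) & MASK32
    return result, cf


def shrd(dest, src, count):
    """SHRD: shift dest right, filling from the low-order bits of src."""
    count &= 31
    if count == 0:
        return dest, None
    cf = (dest >> (count - 1)) & 1
    result = ((dest >> count) | (src << (32 - count))) & MASK32
    return result, cf
```

For instance, `shld(0x12345678, 0x9ABCDEF0, 8)` produces 3456789AH: the top byte of the source slides into the bottom of the destination, which is the basic step in moving an unaligned bit string one doubleword at a time.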

7.3.6.3

Rotate Instructions

The ROL (rotate left), ROR (rotate right), RCL (rotate through carry left) and RCR (rotate through carry right) instructions rotate the bits in the destination operand out of one end and back through the other end (see Figure 7-10). Unlike a shift, no bits are lost during a rotation. The rotate count can range from 0 to 31.

ROL Instruction
    CF ← Destination (Memory or Register), bits 31 through 0; the bit leaving bit 31 re-enters at bit 0

ROR Instruction
    Destination (Memory or Register), bits 31 through 0 → CF; the bit leaving bit 0 re-enters at bit 31

RCL Instruction
    CF ← Destination (Memory or Register), bits 31 through 0 ← CF

RCR Instruction
    CF → Destination (Memory or Register), bits 31 through 0 → CF

Figure 7-10. ROL, ROR, RCL, and RCR Instruction Operations

The ROL instruction rotates the bits in the operand to the left (toward more significant bit locations). The ROR instruction rotates the operand right (toward less significant bit locations).

The RCL instruction rotates the bits in the operand to the left, through the CF flag. This instruction treats the CF flag as a one-bit extension on the upper end of the operand. Each bit that exits from the most significant bit location of the operand moves into the CF flag. At the same time, the bit in the CF flag enters the least significant bit location of the operand.

The RCR instruction rotates the bits in the operand to the right through the CF flag.

For all the rotate instructions, the CF flag always contains the value of the last bit rotated out of the operand, even if the instruction does not use the CF flag as an extension of the operand. The value of this flag can then be tested by a conditional jump instruction (JC or JNC).
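The difference between the plain rotates and the rotates through carry can be sketched in Python by treating RCL and RCR as rotations of a 33-bit quantity made of CF and the operand. This is an illustrative model for 32-bit operands; the function names are assumptions of the sketch.

```python
MASK32 = 0xFFFFFFFF
MASK33 = (1 << 33) - 1


def rol(value, count):
    """ROL: rotate left; CF receives the last bit rotated out (new bit 0)."""
    count &= 31
    if count == 0:
        return value, None
    result = ((value << count) | (value >> (32 - count))) & MASK32
    return result, result & 1


def ror(value, count):
    """ROR: rotate right; CF receives the last bit rotated out (new bit 31)."""
    count &= 31
    if count == 0:
        return value, None
    result = ((value >> count) | (value << (32 - count))) & MASK32
    return result, result >> 31


def rcl(value, cf, count):
    """RCL: rotate the 33-bit quantity CF:operand left."""
    wide = (cf << 32) | value
    for _ in range(count & 31):
        wide = ((wide << 1) | (wide >> 32)) & MASK33
    return wide & MASK32, wide >> 32


def rcr(value, cf, count):
    """RCR: rotate the 33-bit quantity CF:operand right."""
    wide = (cf << 32) | value
    for _ in range(count & 31):
        wide = (wide >> 1) | ((wide & 1) << 32)
    return wide & MASK32, wide >> 32
```

Because no bits are lost, a rotate in one direction followed by the same count in the other direction restores the operand, for example `ror(rol(x, 5)[0], 5)[0] == x`.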

7.3.7

Bit and Byte Instructions

These instructions operate on bit or byte strings. For the purpose of this discussion, they are further divided into subordinate subgroups that:

• Test and modify a single bit
• Scan a bit string
• Set a byte given conditions
• Test operands and report results

7.3.7.1

Bit Test and Modify Instructions

The bit test and modify instructions (see Table 7-3) operate on a single bit, which can be in an operand. The location of the bit is specified as an offset from the least significant bit of the operand. When the processor identifies the bit to be tested and modified, it first loads the CF flag with the current value of the bit. Then it assigns a new value to the selected bit, as determined by the modify operation for the instruction.
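The load-then-modify sequence described above (with the per-instruction effects listed in Table 7-3) can be sketched as a single Python function. The function name and the string argument selecting the operation are illustrative choices, not part of the instruction set.

```python
def bit_test(operand, offset, op="bt"):
    """Return (new_operand, CF) for BT, BTS, BTR, or BTC on one bit."""
    cf = (operand >> offset) & 1        # CF is loaded with the bit first
    if op == "bts":
        operand |= 1 << offset          # selected bit <- 1
    elif op == "btr":
        operand &= ~(1 << offset)       # selected bit <- 0
    elif op == "btc":
        operand ^= 1 << offset          # selected bit <- NOT(selected bit)
    elif op != "bt":
        raise ValueError(op)            # plain BT leaves the operand alone
    return operand, cf
```

In every case CF reports the bit's value before modification, so `bit_test(0b1010, 3, "btr")` returns the cleared operand 0b0010 together with CF = 1.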

Table 7-3. Bit Test and Modify Instructions

Instruction                      Effect on CF Flag         Effect on Selected Bit
BT (Bit Test)                    CF flag ← Selected Bit    No effect
BTS (Bit Test and Set)           CF flag ← Selected Bit    Selected Bit ← 1
BTR (Bit Test and Reset)         CF flag ← Selected Bit    Selected Bit ← 0
BTC (Bit Test and Complement)    CF flag ← Selected Bit    Selected Bit ← NOT (Selected Bit)

7.3.7.2

Bit Scan Instructions

The BSF (bit scan forward) and BSR (bit scan reverse) instructions scan a bit string in a source operand for a set bit and store the bit index of the first set bit found in a destination register. The bit index is the offset from the least significant bit (bit 0) in the bit string to the first set bit. The BSF instruction scans the source operand low-to-high (from bit 0 of the source operand toward the most significant bit); the BSR instruction scans high-to-low (from the most significant bit toward the least significant bit).
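In Python terms, the two scans return the index of the lowest and highest set bit respectively. The sketch below is illustrative; returning `None` for a zero source is this model's stand-in for the case the instruction reference describes, where ZF is set and the destination is undefined.

```python
def bsf(src):
    """Bit scan forward: index of the least significant set bit."""
    if src == 0:
        return None                          # no set bit to find
    return (src & -src).bit_length() - 1     # isolate the lowest set bit


def bsr(src):
    """Bit scan reverse: index of the most significant set bit."""
    if src == 0:
        return None
    return src.bit_length() - 1
```

Both functions report the index as an offset from bit 0, regardless of scan direction: for the source 101000B, `bsf` returns 3 and `bsr` returns 5.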

7.3.7.3

Byte Set on Condition Instructions

The SETcc (set byte on condition) instructions set a destination-operand byte to 0 or 1, depending on the state of selected status flags (CF, OF, SF, ZF, and PF) in the EFLAGS register. The suffix (cc) added to the SET mnemonic determines the condition being tested for. For example, the SETO instruction tests for overflow. If the OF flag is set, the destination byte is set to 1; if OF is clear, the destination byte is cleared to 0. Appendix B, “EFLAGS Condition Codes,” lists the conditions it is possible to test for with this instruction.

7.3.7.4

Test Instruction

The TEST instruction performs a logical AND of two operands and sets the SF, ZF, and PF flags according to the results. The flags can then be tested by the conditional jump or loop instructions or the SETcc instructions. The TEST instruction differs from the AND instruction in that it does not alter either of the operands.

7.3.8

Control Transfer Instructions

The processor provides both conditional and unconditional control transfer instructions to direct the flow of program execution. Conditional transfers are taken only for specified states of the status flags in the EFLAGS register. Unconditional control transfers are always executed. For the purpose of this discussion, these instructions are further divided into subordinate subgroups that process:

• Unconditional transfers
• Conditional transfers
• Software interrupts

7.3.8.1

Unconditional Transfer Instructions

The JMP, CALL, RET, INT, and IRET instructions transfer program control to another location (destination address) in the instruction stream. The destination can be within the same code segment (near transfer) or in a different code segment (far transfer).

Jump instruction — The JMP (jump) instruction unconditionally transfers program control to a destination instruction. The transfer is one-way; that is, a return address is not saved. A destination operand specifies the address (the instruction pointer) of the destination instruction. The address can be a relative address or an absolute address.

A relative address is a displacement (offset) with respect to the address in the EIP register. The destination address (a near pointer) is formed by adding the displacement to the address in the EIP register. The displacement is specified with a signed integer, allowing jumps either forward or backward in the instruction stream.

An absolute address is an offset from address 0 of a segment. It can be specified in either of the following ways:

• An address in a general-purpose register — This address is treated as a near pointer, which is copied into the EIP register. Program execution then continues at the new address within the current code segment.

• An address specified using the standard addressing modes of the processor — Here, the address can be a near pointer or a far pointer. If the address is for a near pointer, the address is translated into an offset and copied into the EIP register. If the address is for a far pointer, the address is translated into a segment selector (which is copied into the CS register) and an offset (which is copied into the EIP register).

In protected mode, the JMP instruction also allows jumps to a call gate, a task gate, and a task-state segment.

Call and return instructions — The CALL (call procedure) and RET (return from procedure) instructions allow a jump from one procedure (or subroutine) to another and a subsequent jump back (return) to the calling procedure.

The CALL instruction transfers program control from the current (or calling) procedure to another procedure (the called procedure). To allow a subsequent return to the calling procedure, the CALL instruction saves the current contents of the EIP register on the stack before jumping to the called procedure. The EIP register (prior to transferring program control) contains the address of the instruction following the CALL instruction. When this address is pushed on the stack, it is referred to as the return instruction pointer or return address.

The address of the called procedure (the address of the first instruction in the procedure being jumped to) is specified in a CALL instruction the same way as it is in a JMP instruction (see “Jump instruction” on page 7-20). The address can be specified as a relative address or an absolute address. If an absolute address is specified, it can be either a near or a far pointer.

The RET instruction transfers program control from the procedure currently being executed (the called procedure) back to the procedure that called it (the calling procedure). Transfer of control is accomplished by copying the return instruction pointer from the stack into the EIP register. Program execution then continues with the instruction pointed to by the EIP register.

The RET instruction has an optional operand, the value of which is added to the contents of the ESP register as part of the return operation. This operand allows the stack pointer to be incremented to remove parameters from the stack that were pushed on the stack by the calling procedure.

See Section 6.3, “Calling Procedures Using CALL and RET,” for more information on the mechanics of making procedure calls with the CALL and RET instructions.

Return from interrupt instruction — When the processor services an interrupt, it performs an implicit call to an interrupt-handling procedure. The IRET (return from interrupt) instruction returns program control from an interrupt handler to the interrupted procedure (that is, the procedure that was executing when the interrupt occurred). The IRET instruction performs a similar operation to the RET instruction (see “Call and return instructions” on page 7-21) except that it also restores the EFLAGS register from the stack. The contents of the EFLAGS register are automatically stored on the stack along with the return instruction pointer when the processor services an interrupt.
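The stack traffic of a near CALL and RET can be illustrated with a toy Python model. It is a sketch under simplifying assumptions: a flat 32-bit stack held in a dictionary, near transfers only, and a hypothetical `Cpu` class whose method names mirror the instructions; segment handling and privilege checks are left out.

```python
class Cpu:
    """Toy model of EIP, ESP, and a 32-bit stack for near CALL/RET."""

    def __init__(self, esp=0x1000):
        self.eip = 0
        self.esp = esp
        self.mem = {}                  # stack memory: address -> doubleword

    def push(self, value):
        self.esp -= 4
        self.mem[self.esp] = value

    def pop(self):
        value = self.mem[self.esp]
        self.esp += 4
        return value

    def call(self, target, next_eip):
        self.push(next_eip)            # save the return instruction pointer
        self.eip = target              # then jump to the called procedure

    def ret(self, imm=0):
        self.eip = self.pop()          # restore the return instruction pointer
        self.esp += imm                # optional operand releases parameters
```

If the caller pushes two doubleword parameters before the call, a `ret(8)` in the called procedure returns to the saved address and leaves ESP where it was before the parameters were pushed.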

7.3.8.2

Conditional Transfer Instructions

The conditional transfer instructions execute jumps or loops that transfer program control to another instruction in the instruction stream if specified conditions are met. The conditions for control transfer are specified with a set of condition codes that define various states of the status flags (CF, ZF, OF, PF, and SF) in the EFLAGS register.

Conditional jump instructions — The Jcc (conditional) jump instructions transfer program control to a destination instruction if the conditions specified with the condition code (cc) associated with the instruction are satisfied (see Table 7-4). If the condition is not satisfied, execution continues with the instruction following the Jcc instruction. As with the JMP instruction, the transfer is one-way; that is, a return address is not saved.

Table 7-4. Conditional Jump Instructions

Instruction Mnemonic    Condition (Flag States)     Description

Unsigned Conditional Jumps
JA/JNBE                 (CF or ZF) = 0              Above/not below or equal
JAE/JNB                 CF = 0                      Above or equal/not below
JB/JNAE                 CF = 1                      Below/not above or equal
JBE/JNA                 (CF or ZF) = 1              Below or equal/not above
JC                      CF = 1                      Carry
JE/JZ                   ZF = 1                      Equal/zero
JNC                     CF = 0                      Not carry
JNE/JNZ                 ZF = 0                      Not equal/not zero
JNP/JPO                 PF = 0                      Not parity/parity odd
JP/JPE                  PF = 1                      Parity/parity even
JCXZ                    CX = 0                      Register CX is zero
JECXZ                   ECX = 0                     Register ECX is zero

Signed Conditional Jumps
JG/JNLE                 ((SF xor OF) or ZF) = 0     Greater/not less or equal
JGE/JNL                 (SF xor OF) = 0             Greater or equal/not less
JL/JNGE                 (SF xor OF) = 1             Less/not greater or equal
JLE/JNG                 ((SF xor OF) or ZF) = 1     Less or equal/not greater
JNO                     OF = 0                      Not overflow
JNS                     SF = 0                      Not sign (non-negative)
JO                      OF = 1                      Overflow
JS                      SF = 1                      Sign (negative)
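The flag conditions in Table 7-4 can be read as Boolean predicates over the status flags. The Python sketch below encodes them that way and pairs them with a model of the flags a 32-bit CMP would produce, so the unsigned and signed groups can be compared on the same operands. The names `cmp_flags`, `JCC`, and `taken` are illustrative, and only one mnemonic of each alias pair is listed.

```python
MASK32 = 0xFFFFFFFF


def cmp_flags(a, b):
    """Status flags left by a 32-bit CMP a, b (computes a - b)."""
    result = (a - b) & MASK32
    return {
        "CF": int(a < b),                                   # unsigned borrow
        "ZF": int(result == 0),
        "SF": result >> 31,
        "OF": (((a ^ b) & (a ^ result)) >> 31) & 1,         # signed overflow
        "PF": int(bin(result & 0xFF).count("1") % 2 == 0),
    }


JCC = {
    # Unsigned conditional jumps
    "ja":  lambda f: (f["CF"] | f["ZF"]) == 0,
    "jae": lambda f: f["CF"] == 0,
    "jb":  lambda f: f["CF"] == 1,
    "jbe": lambda f: (f["CF"] | f["ZF"]) == 1,
    "je":  lambda f: f["ZF"] == 1,
    "jne": lambda f: f["ZF"] == 0,
    "jp":  lambda f: f["PF"] == 1,
    "jnp": lambda f: f["PF"] == 0,
    # Signed conditional jumps
    "jg":  lambda f: ((f["SF"] ^ f["OF"]) | f["ZF"]) == 0,
    "jge": lambda f: (f["SF"] ^ f["OF"]) == 0,
    "jl":  lambda f: (f["SF"] ^ f["OF"]) == 1,
    "jle": lambda f: ((f["SF"] ^ f["OF"]) | f["ZF"]) == 1,
    "jo":  lambda f: f["OF"] == 1,
    "jno": lambda f: f["OF"] == 0,
    "js":  lambda f: f["SF"] == 1,
    "jns": lambda f: f["SF"] == 0,
}


def taken(mnemonic, flags):
    """True if the named conditional jump would be taken."""
    return JCC[mnemonic](flags)
```

Comparing FFFFFFFFH with 1 shows why the two groups exist: as unsigned integers the first operand is above the second, so JA is taken, while as signed integers it is -1 and therefore less, so JL is taken and JG is not.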

The destination operand specifies a relative address (a signed offset with respect to the address in the EIP register) that points to an instruction in the current code segment. The Jcc instructions do not support far transfers; however, far transfers can be accomplished with a combination of a Jcc and a JMP instruction (see “Jcc—Jump if Condition Is Met” in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A).

Table 7-4 shows the mnemonics for the Jcc instructions and the conditions being tested for each instruction. The condition code mnemonics are appended to the letter “J” to form the mnemonic for a Jcc instruction. The instructions are divided into two groups: unsigned and signed conditional jumps. These groups correspond to the results of operations performed on unsigned and signed integers respectively. Those instructions listed as pairs (for example, JA/JNBE) are alternate names for the same instruction. Assemblers provide alternate names to make it easier to read program listings.

The JCXZ and JECXZ instructions test the CX and ECX registers, respectively, instead of one or more status flags. See “Jump if zero instructions” on page 7-24 for more information about these instructions.

Loop instructions — The LOOP, LOOPE (loop while equal), LOOPZ (loop while zero), LOOPNE (loop while not equal), and LOOPNZ (loop while not zero) instructions are conditional jump instructions that use the value of the ECX register as a count for the number of times to execute a loop. All the loop instructions decrement the count in the ECX register each time they are executed and terminate a loop when zero is reached. The LOOPE, LOOPZ, LOOPNE, and LOOPNZ instructions also accept the ZF flag as a condition for terminating the loop before the count reaches zero.

The LOOP instruction decrements the contents of the ECX register (or the CX register, if the address-size attribute is 16), then tests the register for the loop-termination condition. If the count in the ECX register is non-zero, program control is transferred to the instruction address specified by the destination operand. The destination operand is a relative address (that is, an offset relative to the contents of the EIP register), and it generally points to the first instruction in the block of code that is to be executed in the loop. When the count in the ECX register reaches zero, program control is transferred to the instruction immediately following the LOOP instruction, which terminates the loop. If the count in the ECX register is zero when the LOOP instruction is first executed, the register is pre-decremented to FFFFFFFFH, causing the loop to be executed 2^32 times.

The LOOPE and LOOPZ instructions perform the same operation (they are mnemonics for the same instruction). These instructions operate the same as the LOOP instruction, except that they also test the ZF flag. If the count in the ECX register is not zero and the ZF flag is set, program control is transferred to the destination operand. When the count reaches zero or the ZF flag is clear, the loop is terminated by transferring program control to the instruction immediately following the LOOPE/LOOPZ instruction.

The LOOPNE and LOOPNZ instructions (mnemonics for the same instruction) operate the same as the LOOPE/LOOPZ instructions, except that they terminate the loop if the ZF flag is set.

Jump if zero instructions — The JECXZ (jump if ECX zero) instruction jumps to the location specified in the destination operand if the ECX register contains the value zero. This instruction can be used in combination with a loop instruction (LOOP, LOOPE, LOOPZ, LOOPNE, or LOOPNZ) to test the ECX register prior to beginning a loop. As described in “Loop instructions” on page 7-23, the loop instructions decrement the contents of the ECX register before testing for zero. If the value in the ECX register is zero initially, it will be decremented to FFFFFFFFH on the first loop instruction, causing the loop to be executed 2^32 times. To prevent this problem, a JECXZ instruction can be inserted at the beginning of the code block for the loop, causing a jump out of the loop if the ECX register count is initially zero. When used with repeated string scan and compare instructions, the JECXZ instruction can determine whether the loop terminated because the count reached zero or because the scan or compare conditions were satisfied.

The JCXZ (jump if CX is zero) instruction operates the same as the JECXZ instruction when the 16-bit address-size attribute is used. Here, the CX register is tested for zero.
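The pre-decrement behaviour of LOOP, and the effect of guarding the loop with JECXZ, can be worked out in a few lines of Python. The function name and its `guard_with_jecxz` parameter are illustrative; the closed-form expression stands in for actually running the loop, which would be impractical for the 2^32 case.

```python
def loop_iterations(ecx, guard_with_jecxz=False):
    """Number of times a loop body ending in LOOP runs for a 32-bit ECX."""
    if guard_with_jecxz and ecx == 0:
        return 0                       # JECXZ jumps past the loop entirely
    # The body runs once before LOOP executes. LOOP then decrements ECX
    # (wrapping 0 to FFFFFFFFH) and branches back while ECX is non-zero,
    # so the body runs ((ECX - 1) mod 2^32) + 1 times in total.
    return ((ecx - 1) & 0xFFFFFFFF) + 1
```

For a normal count the result is just the count (`loop_iterations(3)` is 3), while an unguarded zero count gives `2**32` iterations and a guarded one gives none.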

7.3.8.3

Control Transfer Instructions in 64-Bit Mode

In 64-bit mode, the operand size for all near branches (CALL, RET, Jcc, JCXZ, JMP, and LOOP) is forced to 64 bits. The listed instructions update the 64-bit RIP without the need for a REX operand-size prefix.

Near branches in the following operations are forced to 64 bits (regardless of operand size prefixes):

• Truncation of the size of the instruction pointer
• Size of a stack pop or push, due to CALL or RET
• Size of a stack-pointer increment or decrement, due to CALL or RET
• Indirect-branch operand size

Note that the displacement field for relative branches is still limited to 32 bits and the address size for near branches is not forced.

Address size determines the register size (CX/ECX/RCX) used for JCXZ and LOOP. It also impacts the address calculation for memory indirect branches. Address size is 64 bits by default, although it can be overridden to 32 bits (using a prefix).

7.3.8.4

Software Interrupt Instructions

The INT n (software interrupt), INTO (interrupt on overflow), and BOUND (detect value out of range) instructions allow a program to explicitly raise a specified interrupt or exception, which in turn causes the handler routine for the interrupt or exception to be called.

The INT n instruction can raise any of the processor’s interrupts or exceptions by encoding the vector number of the interrupt or exception in the instruction. This instruction can be used to support software-generated interrupts or to test the operation of interrupt and exception handlers.

The IRET (return from interrupt) instruction returns program control from an interrupt handler to the interrupted procedure. The IRET instruction performs a similar operation to the RET instruction.

The CALL (call procedure) and RET (return from procedure) instructions allow a jump from one procedure to another and a subsequent return to the calling procedure. EFLAGS register contents are automatically stored on the stack along with the return instruction pointer when the processor services an interrupt.

The INTO instruction raises the overflow exception if the OF flag is set. If the flag is clear, execution continues without raising the exception. This instruction allows software to access the overflow exception handler explicitly to check for overflow conditions.

The BOUND instruction compares a signed value against upper and lower bounds, and raises the “BOUND range exceeded” exception if the value is less than the lower bound or greater than the upper bound. This instruction is useful for operations such as checking an array index to make sure it falls within the range defined for the array.

7.3.8.5

Software Interrupt Instructions in 64-bit Mode and Compatibility Mode

In 64-bit mode, the stack size is 8 bytes wide. IRET must pop 8-byte items off the stack. SS:RSP pops unconditionally. BOUND is not supported. In compatibility mode, SS:RSP is popped only if the CPL changes.

7.3.9

String Operations

The MOVS (Move String), CMPS (Compare string), SCAS (Scan string), LODS (Load string), and STOS (Store string) instructions permit large data structures, such as alphanumeric character strings, to be moved and examined in memory. These instructions operate on individual elements in a string, which can be a byte, word, or doubleword. The string elements to be operated on are identified with the ESI (source string element) and EDI (destination string element) registers. Both of these registers contain absolute addresses (offsets into a segment) that point to a string element.

By default, the ESI register addresses the segment identified with the DS segment register. A segment-override prefix allows the ESI register to be associated with the CS, SS, ES, FS, or GS segment register. The EDI register addresses the segment identified with the ES segment register; no segment override is allowed for the EDI register. The use of two different segment registers in the string instructions permits operations to be performed on strings located in different segments. Or by associating the ESI register with the ES segment register, both the source and destination strings can be located in the same segment. (This latter condition can also be achieved by loading the DS and ES segment registers with the same segment selector and allowing the ESI register to default to the DS register.)

The MOVS instruction moves the string element addressed by the ESI register to the location addressed by the EDI register. The assembler recognizes three “short forms” of this instruction, which specify the size of the string to be moved: MOVSB (move byte string), MOVSW (move word string), and MOVSD (move doubleword string).

The CMPS instruction subtracts the destination string element from the source string element and updates the status flags (CF, ZF, OF, SF, PF, and AF) in the EFLAGS register according to the results. Neither string element is written back to memory. The assembler recognizes three “short forms” of the CMPS instruction: CMPSB (compare byte strings), CMPSW (compare word strings), and CMPSD (compare doubleword strings).

The SCAS instruction subtracts the destination string element from the contents of the EAX, AX, or AL register (depending on operand length) and updates the status flags according to the results. The string element and register contents are not modified. The following “short forms” of the SCAS instruction specify the operand length: SCASB (scan byte string), SCASW (scan word string), and SCASD (scan doubleword string).

The LODS instruction loads the source string element identified by the ESI register into the EAX register (for a doubleword string), the AX register (for a word string), or the AL register (for a byte string). The “short forms” for this instruction are LODSB (load byte string), LODSW (load word string), and LODSD (load doubleword string). This instruction is usually used in a loop, where other instructions process each element of the string after it is loaded into the target register.

The STOS instruction stores the source string element from the EAX (doubleword string), AX (word string), or AL (byte string) register into the memory location identified with the EDI register. The “short forms” for this instruction are STOSB (store byte string), STOSW (store word string), and STOSD (store doubleword string). This instruction is also normally used in a loop. Here a string is commonly loaded into the register with a LODS instruction, operated on by other instructions, and then stored again in memory with a STOS instruction.

The I/O instructions (see Section 7.3.11, “I/O Instructions”) also perform operations on strings in memory.
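The load, process, store pattern described for LODS and STOS can be sketched in Python. The model below makes several simplifying assumptions: memory is a flat `bytearray` (segments are ignored), only the byte forms are shown, and the function names and the uppercase-conversion step are illustrative. The automatic pointer update and the DF flag are covered in Section 7.3.9.1.

```python
def lodsb(mem, esi, df=0):
    """LODSB: load the byte at [ESI] into AL, then step ESI."""
    al = mem[esi]
    return al, esi + (-1 if df else 1)


def stosb(mem, edi, al, df=0):
    """STOSB: store AL at [EDI], then step EDI."""
    mem[edi] = al
    return edi + (-1 if df else 1)


def upper_copy(mem, esi, edi, count):
    """Loop: LODSB, convert ASCII a..z to upper case, STOSB."""
    for _ in range(count):
        al, esi = lodsb(mem, esi)
        if 0x61 <= al <= 0x7A:         # 'a'..'z'
            al -= 0x20
        edi = stosb(mem, edi, al)
    return esi, edi
```

Each pass through the loop handles one string element; the two index registers advance independently, so source and destination can be different buffers.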

7.3.9.1

Repeating String Operations

The string instructions described in Section 7.3.9, “String Operations”, perform one iteration of a string operation. To operate on strings longer than a doubleword, the string instructions can be combined with a repeat prefix (REP) to create a repeating instruction or be placed in a loop.

When used in string instructions, the ESI and EDI registers are automatically incremented or decremented after each iteration of an instruction to point to the next element (byte, word, or doubleword) in the string. String operations can thus begin at higher addresses and work toward lower ones, or they can begin at lower addresses and work toward higher ones. The DF flag in the EFLAGS register controls whether the registers are incremented (DF = 0) or decremented (DF = 1). The STD and CLD instructions set and clear this flag, respectively.

The following repeat prefixes can be used in conjunction with a count in the ECX register to cause a string instruction to repeat:

• REP — Repeat while the ECX register is not zero.
• REPE/REPZ — Repeat while the ECX register is not zero and the ZF flag is set.
• REPNE/REPNZ — Repeat while the ECX register is not zero and the ZF flag is clear.

When a string instruction has a repeat prefix, the operation executes until one of the termination conditions specified by the prefix is satisfied. The REPE/REPZ and REPNE/REPNZ prefixes are used only with the CMPS and SCAS instructions. Also, note that a REP STOS instruction is the fastest way to initialize a large block of memory.
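The termination rules for the repeat prefixes, together with the DF-controlled direction, can be modelled in Python as follows. As before this is a sketch under assumptions: flat `bytearray` memory, byte-sized elements, and illustrative function names. A returned ZF of `None` means no iteration ran, so the flag was left unchanged.

```python
def rep_movsb(mem, esi, edi, ecx, df=0):
    """REP MOVSB: copy ECX bytes, stepping up (DF = 0) or down (DF = 1)."""
    step = -1 if df else 1
    while ecx != 0:
        mem[edi] = mem[esi]
        esi += step
        edi += step
        ecx -= 1
    return esi, edi, ecx


def repne_scasb(mem, edi, ecx, al, df=0):
    """REPNE SCASB: scan while ECX is not zero and the bytes differ."""
    step = -1 if df else 1
    zf = None
    while ecx != 0:
        zf = int(mem[edi] == al)       # SCAS compares AL with [EDI]
        edi += step
        ecx -= 1
        if zf:                         # REPNE stops once ZF is set
            break
    return edi, ecx, zf
```

Setting DF = 1 lets an overlapping block be moved toward higher addresses without overwriting bytes that have not yet been copied. After a REPNE SCASB, the final ZF distinguishes a match (ZF = 1) from an exhausted count (ZF = 0), which is the distinction the text says a following conditional jump can test.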

7.3.10

String Operations in 64-Bit Mode

The behavior of the MOVS (Move String), CMPS (Compare String), SCAS (Scan String), LODS (Load String), and STOS (Store String) instructions in 64-bit mode is similar to their behavior in non-64-bit modes, with the following differences:

• The source operand is specified by RSI or DS:ESI, depending on the address size attribute of the operation.

• The destination operand is specified by RDI or DS:EDI, depending on the address size attribute of the operation.

• Operation on 64-bit data is supported by using the REX.W prefix.

7.3.10.1

Repeating String Operations in 64-bit Mode

When using REP prefixes for string operations in 64-bit mode, the repeat count is specified by RCX or ECX (depending on the address size attribute of the operation). The default address size is 64 bits.

7.3.11

I/O Instructions

The IN (input from port to register), INS (input from port to string), OUT (output from register to port), and OUTS (output string to port) instructions move data between the processor’s I/O ports and either a register or memory.

The register I/O instructions (IN and OUT) move data between an I/O port and the EAX register (32-bit I/O), the AX register (16-bit I/O), or the AL (8-bit I/O) register. The I/O port being read or written to is specified with an immediate operand or an address in the DX register.


The block I/O instructions (INS and OUTS) move blocks of data (strings) between an I/O port and memory. These instructions operate similarly to the string instructions (see Section 7.3.9, “String Operations”). The ESI and EDI registers are used to specify string elements in memory and the repeat prefixes (REP) are used to repeat the instructions to implement block moves. The assembler recognizes the following alternate mnemonics for these instructions: INSB (input byte), INSW (input word), and INSD (input doubleword), and OUTSB (output byte), OUTSW (output word), and OUTSD (output doubleword). The INS and OUTS instructions use an address in the DX register to specify the I/O port to be read or written to.

7.3.12

I/O Instructions in 64-Bit Mode

For I/O instructions to and from memory, the differences in 64-bit mode are:

• The source operand is specified by RSI or DS:ESI, depending on the address size attribute of the operation.

• The destination operand is specified by RDI or DS:EDI, depending on the address size attribute of the operation.

• Operation on 64-bit data is not encodable and REX prefixes are silently ignored.

7.3.13

Enter and Leave Instructions

The ENTER and LEAVE instructions provide machine-language support for procedure calls in block-structured languages, such as C and Pascal. These instructions and the call and return mechanism that they support are described in detail in Section 6.5, “Procedure Calls for Block-Structured Languages”.

7.3.14

Flag Control (EFLAG) Instructions

The Flag Control (EFLAG) instructions allow the state of selected flags in the EFLAGS register to be read or modified. For the purpose of this discussion, these instructions are further divided into subordinate subgroups of instructions that manipulate:

• Carry and direction flags
• The EFLAGS register
• Interrupt flags

7.3.14.1

Carry and Direction Flag Instructions

The STC (set carry flag), CLC (clear carry flag), and CMC (complement carry flag) instructions allow the CF flag in the EFLAGS register to be modified directly. They are typically used to initialize the CF flag to a known state before an instruction that uses the flag in an operation is executed. They are also used in conjunction with the rotate-with-carry instructions (RCL and RCR).

The STD (set direction flag) and CLD (clear direction flag) instructions allow the DF flag in the EFLAGS register to be modified directly. The DF flag determines the direction in which index registers ESI and EDI are stepped when executing string processing instructions. If the DF flag is clear, the index registers are incremented after each iteration of a string instruction; if the DF flag is set, the registers are decremented.

7.3.14.2

EFLAGS Transfer Instructions

The EFLAGS transfer instructions allow groups of flags in the EFLAGS register to be copied to a register or memory or be loaded from a register or memory.

The LAHF (load AH from flags) and SAHF (store AH into flags) instructions operate on five of the EFLAGS status flags (SF, ZF, AF, PF, and CF). The LAHF instruction copies the status flags to bits 7, 6, 4, 2, and 0 of the AH register, respectively. The contents of the remaining bits in the register (bits 5, 3, and 1) are undefined, and the contents of the EFLAGS register remain unchanged. The SAHF instruction copies bits 7, 6, 4, 2, and 0 from the AH register into the SF, ZF, AF, PF, and CF flags, respectively, in the EFLAGS register.

The PUSHF (push flags), PUSHFD (push flags double), POPF (pop flags), and POPFD (pop flags double) instructions copy the flags in the EFLAGS register to and from the stack. The PUSHF instruction pushes the lower word of the EFLAGS register onto the stack (see Figure 7-11). The PUSHFD instruction pushes the entire EFLAGS register onto the stack (with the RF and VM flags read as clear).
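The LAHF and SAHF bit assignments described above can be sketched as a pair of mask operations. This Python model is illustrative only: the function names lahf and sahf and the integer representation of EFLAGS and AH are assumptions, and the model returns 0 in AH bits 5, 3, and 1, which the text describes as undefined.

```python
# Bit positions shared by AH and the low byte of EFLAGS, as listed in the text.
FLAG_BITS = {"SF": 7, "ZF": 6, "AF": 4, "PF": 2, "CF": 0}
STATUS_MASK = sum(1 << bit for bit in FLAG_BITS.values())   # 0xD5


def lahf(eflags):
    """Return the value LAHF would place in AH (undefined bits shown as 0)."""
    return eflags & STATUS_MASK


def sahf(eflags, ah):
    """Return EFLAGS with SF, ZF, AF, PF, and CF replaced by the AH bits."""
    return (eflags & ~STATUS_MASK) | (ah & STATUS_MASK)


ah_value = lahf(0x0247)          # EFLAGS with IF, ZF, PF, and CF set
new_eflags = sahf(0x0202, 0xFF)  # load all five status flags from AH
```

Because both directions use the same mask, flags outside the five status flags (for example IF in bit 9) are neither read by the LAHF model nor altered by the SAHF model.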

PUSHFD/POPFD operate on bits 31 through 0 of the EFLAGS register; PUSHF/POPF operate on bits 15 through 0.

  Bits 31-22  Reserved (0)
  Bit 21      ID
  Bit 20      VIP
  Bit 19      VIF
  Bit 18      AC
  Bit 17      VM
  Bit 16      RF
  Bit 15      Reserved (0)
  Bit 14      NT
  Bits 13-12  IOPL
  Bit 11      OF
  Bit 10      DF
  Bit 9       IF
  Bit 8       TF
  Bit 7       SF
  Bit 6       ZF
  Bit 5       Reserved (0)
  Bit 4       AF
  Bit 3       Reserved (0)
  Bit 2       PF
  Bit 1       Reserved (1)
  Bit 0       CF

Figure 7-11. Flags Affected by the PUSHF, POPF, PUSHFD, and POPFD Instructions

The POPF instruction pops a word from the stack into the EFLAGS register. Only bits 11, 10, 8, 7, 6, 4, 2, and 0 of the EFLAGS register are affected with all uses of this instruction. If the current privilege level (CPL) of the current code segment is 0 (most privileged), the IOPL bits (bits 13 and 12) also are affected. If the I/O privilege level (IOPL) is greater than or equal to the CPL, numerically, the IF flag (bit 9) also is affected.


The POPFD instruction pops a doubleword into the EFLAGS register. This instruction can change the state of the AC bit (bit 18) and the ID bit (bit 21), as well as the bits affected by a POPF instruction. The restrictions for changing the IOPL bits and the IF flag that were given for the POPF instruction also apply to the POPFD instruction.
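The privilege rules for POPF and POPFD amount to choosing a mask of writable bits before the popped value is merged into EFLAGS. The following Python sketch models only the rules stated in this section; the helper names (popf_writable_mask, popf) are hypothetical, and mode-specific behavior such as virtual-8086 mode is deliberately left out.

```python
# Bits 11, 10, 8, 7, 6, 4, 2, 0: OF, DF, TF, SF, ZF, AF, PF, CF.
ALWAYS_WRITABLE = sum(1 << b for b in (11, 10, 8, 7, 6, 4, 2, 0))
IOPL_BITS = (1 << 13) | (1 << 12)
IF_BIT = 1 << 9
AC_BIT = 1 << 18
ID_BIT = 1 << 21


def popf_writable_mask(cpl, iopl, doubleword=False):
    """Return the EFLAGS bits that POPF (or POPFD) may change."""
    mask = ALWAYS_WRITABLE
    if cpl == 0:                 # most privileged: IOPL may be changed
        mask |= IOPL_BITS
    if iopl >= cpl:              # numerically greater or equal: IF may be changed
        mask |= IF_BIT
    if doubleword:               # POPFD can also change AC and ID
        mask |= AC_BIT | ID_BIT
    return mask


def popf(eflags, popped, cpl, doubleword=False):
    """Merge a popped value into EFLAGS, honoring the writable mask."""
    iopl = (eflags >> 12) & 3
    mask = popf_writable_mask(cpl, iopl, doubleword)
    return (eflags & ~mask) | (popped & mask)


user_result = popf(0x0202, 0x0000, cpl=3)     # IOPL = 0 < CPL: IF is preserved
kernel_result = popf(0x0202, 0x0000, cpl=0)   # CPL = 0: IF can be cleared
```

In the first call the popped value requests IF = 0, but with IOPL below CPL the model leaves bit 9 set, matching the restriction described above.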

7.3.14.3

Interrupt Flag Instructions

The STI (set interrupt flag) and CLI (clear interrupt flag) instructions allow the interrupt IF flag in the EFLAGS register to be modified directly. The IF flag controls the servicing of hardware-generated interrupts (those received at the processor’s INTR pin). If the IF flag is set, the processor services hardware interrupts; if the IF flag is clear, hardware interrupts are masked.

The ability to execute these instructions depends on the operating mode of the processor and the current privilege level (CPL) of the program or task attempting to execute these instructions.

7.3.15

Flag Control (RFLAG) Instructions in 64-Bit Mode

In 64-bit mode, the LAHF and SAHF instructions are supported if CPUID.80000001H:ECX.LAHF-SAHF[bit 0] = 1.

PUSHF and POPF behave the same in 64-bit mode as in non-64-bit mode. PUSHFD always pushes 64-bit RFLAGS onto the stack (with the RF and VM flags read as clear). POPFD always pops a 64-bit value from the top of the stack and loads the lower 32 bits into RFLAGS. It then zero extends the upper bits of RFLAGS.

7.3.16

Segment Register Instructions

The processor provides a variety of instructions that address the segment registers of the processor directly. These instructions are only used when an operating system or executive is using the segmented or the real-address mode memory model. For the purpose of this discussion, these instructions are divided into subordinate subgroups of instructions that allow:

• Segment-register load and store
• Far control transfers
• Software interrupt calls
• Handling of far pointers

7.3.16.1

Segment-Register Load and Store Instructions

The MOV instruction (introduced in Section 7.3.1.1, “General Data Movement Instructions”) and the PUSH and POP instructions (introduced in Section 7.3.1.4, “Stack Manipulation Instructions”) can transfer 16-bit segment selectors to and from segment registers (DS, ES, FS, GS, and SS). The transfers are always made to or from a segment register and a general-purpose register or memory. Transfers between segment registers are not supported.

The POP and MOV instructions cannot place a value in the CS register. Only the far control-transfer versions of the JMP, CALL, and RET instructions (see Section 7.3.16.2, “Far Control Transfer Instructions”) affect the CS register directly.

7.3.16.2

Far Control Transfer Instructions

The JMP and CALL instructions (see Section 7.3.8, “Control Transfer Instructions”) both accept a far pointer as a source operand to transfer program control to a segment other than the segment currently being pointed to by the CS register. When a far call is made with the CALL instruction, the current values of the EIP and CS registers are both pushed on the stack.

The RET instruction (see “Call and return instructions” on page 7-21) can be used to execute a far return. Here, program control is transferred from a code segment that contains a called procedure back to the code segment that contained the calling procedure. The RET instruction restores the values of the CS and EIP registers for the calling procedure from the stack.

7.3.16.3

Software Interrupt Instructions

The software interrupt instructions INT, INTO, BOUND, and IRET (see Section 7.3.8.4, “Software Interrupt Instructions”) can also call and return from interrupt and exception handler procedures that are located in a code segment other than the current code segment. With these instructions, however, the switching of code segments is handled transparently from the application program.

7.3.16.4

Load Far Pointer Instructions

The load far pointer instructions LDS (load far pointer using DS), LES (load far pointer using ES), LFS (load far pointer using FS), LGS (load far pointer using GS), and LSS (load far pointer using SS) load a far pointer from memory into a segment register and a general-purpose register. The segment selector part of the far pointer is loaded into the selected segment register and the offset is loaded into the selected general-purpose register.

7.3.17

Miscellaneous Instructions

The following instructions perform operations that are of interest to applications programmers. For the purpose of this discussion, these instructions are further divided into subordinate subgroups of instructions that provide for:

• Address computations
• Table lookup
• Processor identification
• NOP and undefined instruction entry

7.3.17.1

Address Computation Instruction

The LEA (load effective address) instruction computes the effective address in memory (offset within a segment) of a source operand and places it in a general-purpose register. This instruction can interpret any of the processor’s addressing modes and can perform any indexing or scaling that may be needed. It is especially useful for initializing the ESI or EDI registers before the execution of string instructions or for initializing the EBX register before an XLAT instruction.
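The effective-address calculation that LEA performs can be written as a single expression: base + (index × scale) + displacement, truncated to the address size. The Python sketch below is a model under that assumption; the function name lea and its keyword parameters are hypothetical, and the scale factors 1, 2, 4, and 8 are those permitted by the addressing modes.

```python
def lea(base=0, index=0, scale=1, disp=0, bits=32):
    """Return base + index * scale + disp, wrapped to the address size."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & ((1 << bits) - 1)


# Address of element 3 in a table of doublewords that starts 8 bytes
# past a structure located at 1000H.
element_address = lea(base=0x1000, index=3, scale=4, disp=8)
```

No memory is read by this calculation, which is why LEA is also commonly used for plain arithmetic on register values.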

7.3.17.2

Table Lookup Instructions

The XLAT and XLATB (table lookup) instructions replace the contents of the AL register with a byte read from a translation table in memory. The initial value in the AL register is interpreted as an unsigned index into the translation table. This index is added to the contents of the EBX register (which contains the base address of the table) to calculate the address of the table entry. These instructions are used for applications such as converting character codes from one alphabet into another (for example, an ASCII code could be used to look up its EBCDIC equivalent in a table).
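A table lookup of this kind reduces to one indexed byte read. The following Python sketch models it with a flat bytes object as memory; the function name xlat is hypothetical, and the translation table used here (lowercase ASCII to uppercase ASCII) is an illustrative stand-in for the ASCII-to-EBCDIC table mentioned in the text.

```python
def xlat(mem, ebx, al):
    """Return the byte at [EBX + unsigned AL], the new AL value."""
    return mem[ebx + (al & 0xFF)]


# A 256-entry table that maps lowercase ASCII letters to uppercase
# and leaves every other byte value unchanged.
table = bytes((b - 32) if 97 <= b <= 122 else b for b in range(256))

TABLE_BASE = 16                       # assumed location of the table in memory
mem = bytes(TABLE_BASE) + table

translated = bytes(xlat(mem, TABLE_BASE, ch) for ch in b"xlat 1")
```

Because AL is only 8 bits wide, a single table never needs more than 256 entries.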

7.3.17.3

Processor Identification Instruction

The CPUID (processor identification) instruction returns information about the processor on which the instruction is executed.

7.3.17.4

No-Operation and Undefined Instructions

The NOP (no operation) instruction increments the EIP register to point at the next instruction, but affects nothing else.

The UD2 (undefined) instruction generates an invalid opcode exception. Intel reserves the opcode for this instruction for this function. The instruction is provided to allow software to test an invalid opcode exception handler.


CHAPTER 8
PROGRAMMING WITH THE X87 FPU

The x87 Floating-Point Unit (FPU) provides high-performance floating-point processing capabilities for use in graphics processing, scientific, engineering, and business applications. It supports the floating-point, integer, and packed BCD integer data types and the floating-point processing algorithms and exception handling architecture defined in the IEEE Standard 754 for Binary Floating-Point Arithmetic.

This chapter describes the x87 FPU’s execution environment and instruction set. It also provides exception handling information that is specific to the x87 FPU. Refer to the following chapters or sections of chapters for additional information about x87 FPU instructions and floating-point operations:

• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, provide detailed descriptions of x87 FPU instructions.

• Section 4.2.2, “Floating-Point Data Types,” Section 4.2.1.2, “Signed Integers,” and Section 4.7, “BCD and Packed BCD Integers,” describe the floating-point, integer, and BCD data types.

• Section 4.9, “Overview of Floating-Point Exceptions,” Section 4.9.1, “Floating-Point Exception Conditions,” and Section 4.9.2, “Floating-Point Exception Priority,” give an overview of the floating-point exceptions that the x87 FPU can detect and report.

8.1

X87 FPU EXECUTION ENVIRONMENT

The x87 FPU represents a separate execution environment within the IA-32 architecture (see Figure 8-1). This execution environment consists of eight data registers (called the x87 FPU data registers) and the following special-purpose registers:

• Status register
• Control register
• Tag word register
• Last instruction pointer register
• Last data (operand) pointer register
• Opcode register

These registers are described in the following sections.


The x87 FPU executes instructions from the processor’s normal instruction stream. The state of the x87 FPU is independent from the state of the basic execution environment and from the state of SSE/SSE2/SSE3 extensions.

However, the x87 FPU and Intel MMX technology share state because the MMX registers are aliased to the x87 FPU data registers. Therefore, when writing code that uses x87 FPU and MMX instructions, the programmer must explicitly manage the x87 FPU and MMX state (see Section 9.5, “Compatibility with x87 FPU Architecture”).

8.1.1

x87 FPU in 64-Bit Mode and Compatibility Mode

In compatibility mode and 64-bit mode, x87 FPU instructions function as they do in protected mode. Memory operands are specified using the ModR/M and SIB encoding that is described in Section 3.7.5, “Specifying an Offset.”

8.1.2

x87 FPU Data Registers

The x87 FPU data registers (shown in Figure 8-1) consist of eight 80-bit registers. Values are stored in these registers in the double extended-precision floating-point format shown in Figure 4-3. When floating-point, integer, or packed BCD integer values are loaded from memory into any of the x87 FPU data registers, the values are automatically converted into double extended-precision floating-point format (if they are not already in that format). When computation results are subsequently transferred back into memory from any of the x87 FPU registers, the results can be left in the double extended-precision floating-point format or converted back into a shorter floating-point format, an integer format, or the packed BCD integer format. (See Section 8.2, “x87 FPU Data Types,” for a description of the data types operated on by the x87 FPU.)


[Figure 8-1 shows the eight 80-bit data registers R7 through R0, each with a sign (bit 79), an exponent (bits 78 through 64), and a significand (bits 63 through 0); the 16-bit control, status, and tag registers; the 48-bit last instruction pointer and last data (operand) pointer registers; and the 11-bit opcode register.]

Figure 8-1. x87 FPU Execution Environment

The x87 FPU instructions treat the eight x87 FPU data registers as a register stack (see Figure 8-2). All addressing of the data registers is relative to the register on the top of the stack. The register number of the current top-of-stack register is stored in the TOP (stack TOP) field in the x87 FPU status word. Load operations decrement TOP by one and load a value into the new top-of-stack register, and store operations store the value from the current TOP register in memory and then increment TOP by one. (For the x87 FPU, a load operation is equivalent to a push and a store operation is equivalent to a pop.) Note that load and store operations are also available that do not push and pop the stack.


[Figure 8-2 shows the eight data registers, numbered 7 through 0, with the stack growing toward lower-numbered registers. In the example shown, TOP = 011B: register 3 is ST(0), register 4 is ST(1), and register 5 is ST(2).]

Figure 8-2. x87 FPU Data Register Stack

If a load operation is performed when TOP is at 0, register wraparound occurs and the new value of TOP is set to 7. The floating-point stack-overflow exception indicates when wraparound might cause an unsaved value to be overwritten (see Section 8.5.1.1, “Stack Overflow or Underflow Exception (#IS)”).

Many floating-point instructions have several addressing modes that permit the programmer to implicitly operate on the top of the stack, or to explicitly operate on specific registers relative to the TOP. Assemblers support these register addressing modes, using the expression ST(0), or simply ST, to represent the current stack top and ST(i) to specify the ith register from TOP in the stack (0 ≤ i ≤ 7). For example, if TOP contains 011B (register 3 is the top of the stack), the following instruction would add the contents of two registers in the stack (registers 3 and 5):

FADD ST, ST(2);

Figure 8-3 shows an example of how the stack structure of the x87 FPU registers and instructions are typically used to perform a series of computations. Here, a two-dimensional dot product is computed, as follows:

1. The first instruction (FLD value1) decrements the stack register pointer (TOP) and loads the value 5.6 from memory into ST(0). The result of this operation is shown in snap-shot (a).

2. The second instruction multiplies the value in ST(0) by the value 2.4 from memory and stores the result in ST(0), shown in snap-shot (b).

3. The third instruction decrements TOP and loads the value 3.8 in ST(0).

4. The fourth instruction multiplies the value in ST(0) by the value 10.3 from memory and stores the result in ST(0), shown in snap-shot (c).

5. The fifth instruction adds the value in ST(0) and the value in ST(1) and stores the result in ST(0), shown in snap-shot (d).


Computation
Dot Product = (5.6 x 2.4) + (3.8 x 10.3)

Code:
    FLD value1      ;(a) value1 = 5.6
    FMUL value2     ;(b) value2 = 2.4
    FLD value3      ;    value3 = 3.8
    FMUL value4     ;(c) value4 = 10.3
    FADD ST(1)      ;(d)

Register contents at each snap-shot (registers R7 through R0 not listed are unused):
    (a) R4 = 5.6, addressed as ST(0)
    (b) R4 = 13.44, addressed as ST(0)
    (c) R4 = 13.44, addressed as ST(1); R3 = 39.14, addressed as ST(0)
    (d) R4 = 13.44, addressed as ST(1); R3 = 52.58, addressed as ST(0)

Figure 8-3. Example x87 FPU Dot Product Computation

The style of programming demonstrated in this example is supported by the floating-point instruction set. In cases where the stack structure causes computation bottlenecks, the FXCH (exchange x87 FPU register contents) instruction can be used to streamline a computation.
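The dot-product sequence of Figure 8-3 can be traced with a small model of the register stack. The Python class below is a sketch, not a description of the hardware: the class name X87Stack and its method names are hypothetical, Python floats (double precision) stand in for the 80-bit double extended-precision format, and an initial TOP of 5 is assumed so that the first load lands in R4 as in the figure.

```python
class X87Stack:
    """Minimal model of the eight-register x87 data stack."""

    def __init__(self, top=0):
        self.regs = [None] * 8        # physical registers R0 through R7
        self.top = top                # 3-bit TOP field

    def st(self, i):
        """Return the physical register number addressed as ST(i)."""
        return (self.top + i) & 7

    def fld(self, value):
        """Load (push): decrement TOP, then write the new ST(0)."""
        self.top = (self.top - 1) & 7
        self.regs[self.top] = value

    def fmul_mem(self, value):
        """ST(0) = ST(0) * memory operand."""
        self.regs[self.top] *= value

    def fadd_st(self, i):
        """ST(0) = ST(0) + ST(i)."""
        self.regs[self.top] += self.regs[self.st(i)]


fpu = X87Stack(top=5)
fpu.fld(5.6)          # (a) R4 = 5.6
fpu.fmul_mem(2.4)     # (b) R4 = 5.6 * 2.4
fpu.fld(3.8)          #     R3 = 3.8, R4 becomes ST(1)
fpu.fmul_mem(10.3)    # (c) R3 = 3.8 * 10.3
fpu.fadd_st(1)        # (d) R3 = R3 + R4
```

The masking with 7 in fld and st reproduces the wraparound described earlier: a load performed when TOP is 0 leaves TOP at 7.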

8.1.2.1

Parameter Passing With the x87 FPU Register Stack

Like the general-purpose registers, the contents of the x87 FPU data registers are unaffected by procedure calls, or in other words, the values are maintained across procedure boundaries. A calling procedure can thus use the x87 FPU data registers (as well as the procedure stack) for passing parameters between procedures. The called procedure can reference parameters passed through the register stack using the current stack register pointer (TOP) and the ST(0) and ST(i) nomenclature. It is also common practice for a called procedure to leave a return value or result in register ST(0) when returning execution to the calling procedure or program.

When mixing MMX and x87 FPU instructions in the procedures or code sequences, the programmer is responsible for maintaining the integrity of parameters being passed in the x87 FPU data registers. If an MMX instruction is executed before the parameters in the x87 FPU data registers have been passed to another procedure, the parameters may be lost (see Section 9.5, “Compatibility with x87 FPU Architecture”).


8.1.3

x87 FPU Status Register

The 16-bit x87 FPU status register (see Figure 8-4) indicates the current state of the x87 FPU. The flags in the x87 FPU status register include the FPU busy flag, top-of-stack (TOP) pointer, condition code flags, error summary status flag, stack fault flag, and exception flags. The x87 FPU sets the flags in this register to show the results of operations.

  Bit 15      B    FPU busy
  Bit 14      C3   Condition code
  Bits 13-11  TOP  Top of stack pointer
  Bit 10      C2   Condition code
  Bit 9       C1   Condition code
  Bit 8       C0   Condition code
  Bit 7       ES   Error summary status
  Bit 6       SF   Stack fault
  Bit 5       PE   Exception flag: precision
  Bit 4       UE   Exception flag: underflow
  Bit 3       OE   Exception flag: overflow
  Bit 2       ZE   Exception flag: zero divide
  Bit 1       DE   Exception flag: denormalized operand
  Bit 0       IE   Exception flag: invalid operation

Figure 8-4. x87 FPU Status Word

The contents of the x87 FPU status register (referred to as the x87 FPU status word) can be stored in memory using the FSTSW/FNSTSW, FSTENV/FNSTENV, FSAVE/FNSAVE, and FXSAVE instructions. It can also be stored in the AX register of the integer unit, using the FSTSW/FNSTSW instructions.

8.1.3.1

Top of Stack (TOP) Pointer

A pointer to the x87 FPU data register that is currently at the top of the x87 FPU register stack is contained in bits 11 through 13 of the x87 FPU status word. This pointer, which is commonly referred to as TOP (for top-of-stack), is a binary value from 0 to 7. See Section 8.1.2, “x87 FPU Data Registers,” for more information about the TOP pointer.
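Software that has stored the status word (for example, with FSTSW AX) can extract TOP and the other fields with shifts and masks. The following Python sketch shows the field positions of Figure 8-4; the function name decode_status_word and the dictionary layout are assumptions made for illustration.

```python
def decode_status_word(sw):
    """Split a 16-bit x87 FPU status word into its fields."""
    return {
        "B": (sw >> 15) & 1,          # FPU busy
        "C3": (sw >> 14) & 1,
        "TOP": (sw >> 11) & 7,        # bits 13 through 11
        "C2": (sw >> 10) & 1,
        "C1": (sw >> 9) & 1,
        "C0": (sw >> 8) & 1,
        "ES": (sw >> 7) & 1,          # error summary status
        "SF": (sw >> 6) & 1,          # stack fault
        "exception_flags": sw & 0x3F, # PE, UE, OE, ZE, DE, IE in bits 5 through 0
    }


fields = decode_status_word(0x1800)   # bits 12 and 11 set: TOP = 011B
```

With TOP = 011B, register 3 is the current ST(0), matching the example in Section 8.1.2.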

8.1.3.2

Condition Code Flags

The four condition code flags (C0 through C3) indicate the results of floating-point comparison and arithmetic operations. Table 8-1 summarizes the manner in which the floating-point instructions set the condition code flags. These condition code bits are used principally for conditional branching and for storage of information used in exception handling (see Section 8.1.4, “Branching and Conditional Moves on Condition Codes”).

As shown in Table 8-1, the C1 condition code flag is used for a variety of functions. When both the IE and SF flags in the x87 FPU status word are set, indicating a stack overflow or underflow exception (#IS), the C1 flag distinguishes between overflow (C1 = 1) and underflow (C1 = 0). When the PE flag in the status word is set, indicating an inexact (rounded) result, the C1 flag is set to 1 if the last rounding by the instruction was upward. The FXAM instruction sets C1 to the sign of the value being examined.

The C2 condition code flag is used by the FPREM and FPREM1 instructions to indicate an incomplete reduction (or partial remainder). When a successful reduction has been completed, the C0, C3, and C1 condition code flags are set to the three least-significant bits of the quotient (Q2, Q1, and Q0, respectively). See “FPREM1—Partial Remainder” in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for more information on how these instructions use the condition code flags.

The FPTAN, FSIN, FCOS, and FSINCOS instructions set the C2 flag to 1 to indicate that the source operand is beyond the allowable range of ±2^63 and clear the C2 flag if the source operand is within the allowable range.

Where the state of the condition code flags is listed as undefined in Table 8-1, do not rely on any specific value in these flags.

8.1.3.3

x87 FPU Floating-Point Exception Flags

The six x87 FPU floating-point exception flags (bits 0 through 5) of the x87 FPU status word indicate that one or more floating-point exceptions have been detected since the bits were last cleared. The individual exception flags (IE, DE, ZE, OE, UE, and PE) are described in detail in Section 8.4, “x87 FPU Floating-Point Exception Handling.” Each of the exception flags can be masked by an exception mask bit in the x87 FPU control word (see Section 8.1.5, “x87 FPU Control Word”). The exception summary status flag (ES, bit 7) is set when any of the unmasked exception flags are set. When the ES flag is set, the x87 FPU exception handler is invoked, using one of the techniques described in Section 8.7, “Handling x87 FPU Exceptions in Software.” (Note that if an exception flag is masked, the x87 FPU will still set the appropriate flag if the associated exception occurs, but it will not set the ES flag.)

The exception flags are “sticky” bits (once set, they remain set until explicitly cleared). They can be cleared by executing the FCLEX/FNCLEX (clear exceptions) instructions, by reinitializing the x87 FPU with the FINIT/FNINIT or FSAVE/FNSAVE instructions, or by overwriting the flags with an FRSTOR or FLDENV instruction.

The B-bit (bit 15) is included for 8087 compatibility only. It reflects the contents of the ES flag.
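The relationship between the exception flags, the mask bits, and the ES flag can be expressed as one bitwise test. The Python sketch below models only the rule stated above (ES is set when any unmasked exception flag is set); the function name error_summary is hypothetical, and the timing of when the hardware updates ES is not modeled.

```python
EXCEPTION_BITS = 0x3F   # IE, DE, ZE, OE, UE, PE occupy bits 0 through 5


def error_summary(status_word, control_word):
    """Return 1 if any exception flag is set whose mask bit is clear."""
    flags = status_word & EXCEPTION_BITS
    masks = control_word & EXCEPTION_BITS
    return 1 if (flags & ~masks) != 0 else 0


ZE = 1 << 2
masked_case = error_summary(ZE, 0x037F)     # all exceptions masked
unmasked_case = error_summary(ZE, 0x037B)   # zero-divide mask bit cleared
```

In the masked case the ZE flag is still recorded in the status word, but the model reports ES = 0, so no exception handler would be invoked.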


Table 8-1. Condition Code Interpretation

Instruction | C0 | C3 | C2 | C1

FCOM, FCOMP, FCOMPP, FICOM, FICOMP, FTST, FUCOM, FUCOMP, FUCOMPP | Result of comparison | Result of comparison | Operands are not comparable | 0 or #IS

FCOMI, FCOMIP, FUCOMI, FUCOMIP | Undefined | Undefined | Undefined | #IS
(These instructions set the status flags in the EFLAGS register.)

FXAM | Operand class | Operand class | Operand class | Sign

FPREM, FPREM1 | Q2 | Q1 | 0 = reduction complete, 1 = reduction incomplete | Q0 or #IS

F2XM1, FADD, FADDP, FBSTP, FCMOVcc, FIADD, FDIV, FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FIMUL, FIST, FISTP, FISUB, FISUBR, FMUL, FMULP, FPATAN, FRNDINT, FSCALE, FST, FSTP, FSUB, FSUBP, FSUBR, FSUBRP, FSQRT, FYL2X, FYL2XP1 | Undefined | Undefined | Undefined | Roundup or #IS

FCOS, FSIN, FSINCOS, FPTAN | Undefined | Undefined | 0 = source operand within range, 1 = source operand out of range | Roundup or #IS (Undefined if C2 = 1)

FABS, FBLD, FCHS, FDECSTP, FILD, FINCSTP, FLD, Load Constants, FSTP (ext. prec.), FXCH, FXTRACT | Undefined | Undefined | Undefined | 0 or #IS

FLDENV, FRSTOR | Each bit loaded from memory | Each bit loaded from memory | Each bit loaded from memory | Each bit loaded from memory

FFREE, FLDCW, FCLEX/FNCLEX, FNOP, FSTCW/FNSTCW, FSTENV/FNSTENV, FSTSW/FNSTSW | Undefined | Undefined | Undefined | Undefined

FINIT/FNINIT, FSAVE/FNSAVE | 0 | 0 | 0 | 0


8.1.3.4

Stack Fault Flag

The stack fault flag (bit 6 of the x87 FPU status word) indicates that stack overflow or stack underflow has occurred with data in the x87 FPU data register stack. The x87 FPU explicitly sets the SF flag when it detects a stack overflow or underflow condition, but it does not explicitly clear the flag when it detects an invalid-arithmetic-operand condition.

When this flag is set, the condition code flag C1 indicates the nature of the fault: overflow (C1 = 1) and underflow (C1 = 0). The SF flag is a “sticky” flag, meaning that after it is set, the processor does not clear it until it is explicitly instructed to do so (for example, by an FINIT/FNINIT, FCLEX/FNCLEX, or FSAVE/FNSAVE instruction).

See Section 8.1.7, “x87 FPU Tag Word,” for more information on x87 FPU stack faults.

8.1.4

Branching and Conditional Moves on Condition Codes

The x87 FPU (beginning with the P6 family processors) supports two mechanisms for branching and performing conditional moves according to comparisons of two floating-point values. These mechanisms are referred to here as the “old mechanism” and the “new mechanism.”

The old mechanism is available in x87 FPUs prior to the P6 family processors and in P6 family processors. This mechanism uses the floating-point compare instructions (FCOM, FCOMP, FCOMPP, FTST, FUCOMPP, FICOM, and FICOMP) to compare two floating-point values and set the condition code flags (C0 through C3) according to the results. The contents of the condition code flags are then copied into the status flags of the EFLAGS register using a two-step process (see Figure 8-5):

1. The FSTSW AX instruction moves the x87 FPU status word into the AX register.

2. The SAHF instruction copies the upper 8 bits of the AX register, which includes the condition code flags, into the lower 8 bits of the EFLAGS register.

When the condition code flags have been loaded into the EFLAGS register, conditional jumps or conditional moves can be performed based on the new settings of the status flags in the EFLAGS register.
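The two-step transfer above works because the condition code bits in the upper byte of the status word line up with status flag positions in the low byte of EFLAGS: C0 (bit 8) lands on CF (bit 0), C2 (bit 10) on PF (bit 2), and C3 (bit 14) on ZF (bit 6), while C1 (bit 9) falls on a reserved bit. The Python sketch below models that alignment; the function names fstsw_ax and sahf_flags are hypothetical.

```python
def fstsw_ax(status_word):
    """Step 1: copy the 16-bit status word into AX."""
    return status_word & 0xFFFF


def sahf_flags(ax):
    """Step 2: take AH and read the flags that SAHF loads from it."""
    ah = (ax >> 8) & 0xFF
    return {
        "CF": ah & 1,          # from C0 (status word bit 8)
        "PF": (ah >> 2) & 1,   # from C2 (status word bit 10)
        "ZF": (ah >> 6) & 1,   # from C3 (status word bit 14)
    }


C0, C1, C2, C3 = 1 << 8, 1 << 9, 1 << 10, 1 << 14

flags_c0 = sahf_flags(fstsw_ax(C0))            # only CF is set
flags_all = sahf_flags(fstsw_ax(C0 | C2 | C3)) # CF, PF, and ZF are set
flags_c1 = sahf_flags(fstsw_ax(C1))            # C1 maps to no status flag
```

After the transfer, the same conditional jump and conditional move instructions used for unsigned integer comparisons can test CF, PF, and ZF.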


[Figure 8-5 shows the FSTSW AX instruction copying the x87 FPU status word (with C3 in bit 14 and C2, C1, and C0 in bits 10 through 8) into the AX register, and the SAHF instruction copying the upper byte of AX into bits 7 through 0 of the EFLAGS register.]

  Condition Code   Status Flag
  C0               CF
  C1               (none)
  C2               PF
  C3               ZF

Figure 8-5. Moving the Condition Codes to the EFLAGS Register

The new mechanism is available beginning with the P6 family processors. Using this mechanism, the new floating-point compare and set EFLAGS instructions (FCOMI, FCOMIP, FUCOMI, and FUCOMIP) compare two floating-point values and set the ZF, PF, and CF flags in the EFLAGS register directly. A single instruction thus replaces the three instructions required by the old mechanism.

Note also that the FCMOVcc instructions (also new in the P6 family processors) allow conditional moves of floating-point values (values in the x87 FPU data registers) based on the setting of the status flags (ZF, PF, and CF) in the EFLAGS register. These instructions eliminate the need for an IF statement to perform conditional moves of floating-point values.

8.1.5 x87 FPU Control Word

The 16-bit x87 FPU control word (see Figure 8-6) controls the precision of the x87 FPU and rounding method used. It also contains the x87 FPU floating-point exception mask bits. The control word is cached in the x87 FPU control register. The contents of this register can be loaded with the FLDCW instruction and stored in memory with the FSTCW/FNSTCW instructions.


Figure 8-6. x87 FPU Control Word
(Diagram of the 16-bit control word: bit 12, infinity control (X); bits 11-10, rounding control (RC); bits 9-8, precision control (PC); bits 5 through 0, the exception masks PM (precision), UM (underflow), OM (overflow), ZM (zero divide), DM (denormal operand), and IM (invalid operation). The remaining bits are reserved.)

When the x87 FPU is initialized with either an FINIT/FNINIT or FSAVE/FNSAVE instruction, the x87 FPU control word is set to 037FH, which masks all floating-point exceptions, sets rounding to nearest, and sets the x87 FPU precision to 64 bits.
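As a rough illustration of the layout in Figure 8-6, the following Python sketch decodes a control word value into its fields. The names are hypothetical, and the RC encodings (00B nearest, 01B down, 10B up, 11B toward zero) are assumed from Section 4.8.4.1 rather than given here.

```python
from typing import NamedTuple

MASK_NAMES = ("IM", "DM", "ZM", "OM", "UM", "PM")  # bits 0 through 5
PRECISION = {
    0b00: "single (24 bits)",
    0b01: "reserved",
    0b10: "double (53 bits)",
    0b11: "double extended (64 bits)",
}
# Assumed RC encodings (see Section 4.8.4.1).
ROUNDING = {0b00: "nearest", 0b01: "down", 0b10: "up", 0b11: "toward zero"}


class ControlWord(NamedTuple):
    masked: tuple          # names of the exceptions that are masked
    precision: str         # PC field, bits 8-9
    rounding: str          # RC field, bits 10-11
    infinity_control: int  # bit 12


def decode_control_word(cw):
    """Split a 16-bit x87 FPU control word into its documented fields."""
    masked = tuple(name for bit, name in enumerate(MASK_NAMES) if (cw >> bit) & 1)
    return ControlWord(
        masked=masked,
        precision=PRECISION[(cw >> 8) & 0b11],
        rounding=ROUNDING[(cw >> 10) & 0b11],
        infinity_control=(cw >> 12) & 1,
    )
```

Decoding the initialization value 037FH with this sketch reports all six exceptions masked, round to nearest, and double extended precision, which agrees with the description above.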

8.1.5.1 x87 FPU Floating-Point Exception Mask Bits

The exception-flag mask bits (bits 0 through 5 of the x87 FPU control word) mask the 6 floating-point exception flags in the x87 FPU status word. When one of these mask bits is set, its corresponding x87 FPU floating-point exception is blocked from being generated.

8.1.5.2 Precision Control Field

The precision-control (PC) field (bits 8 and 9 of the x87 FPU control word) determines the precision (64, 53, or 24 bits) of floating-point calculations made by the x87 FPU (see Table 8-2). The default precision is double extended precision, which uses the full 64-bit significand available with the double extended-precision floating-point format of the x87 FPU data registers. This setting is best suited for most applications, because it allows applications to take full advantage of the maximum precision available with the x87 FPU data registers.


Table 8-2. Precision Control Field (PC)

Precision                              PC Field
Single Precision (24 bits)             00B
Reserved                               01B
Double Precision (53 bits)             10B
Double Extended Precision (64 bits)    11B

The double precision and single precision settings reduce the size of the significand to 53 bits and 24 bits, respectively. These settings are provided to support IEEE Standard 754 and to provide compatibility with the specifications of certain existing programming languages. Using these settings nullifies the advantages of the double extended-precision floating-point format's 64-bit significand length. When reduced precision is specified, the rounding of the significand value clears the unused bits on the right to zeros.

The precision-control bits only affect the results of the following floating-point instructions: FADD, FADDP, FIADD, FSUB, FSUBP, FISUB, FSUBR, FSUBRP, FISUBR, FMUL, FMULP, FIMUL, FDIV, FDIVP, FIDIV, FDIVR, FDIVRP, FIDIVR, and FSQRT.

8.1.5.3 Rounding Control Field

The rounding-control (RC) field of the x87 FPU control register (bits 10 and 11) controls how the results of x87 FPU floating-point instructions are rounded. See Section 4.8.4, “Rounding,” for a discussion of rounding of floating-point values; see Section 4.8.4.1, “Rounding Control (RC) Fields,” for the encodings of the RC field.

8.1.6 Infinity Control Flag

The infinity control flag (bit 12 of the x87 FPU control word) is provided for compatibility with the Intel 287 Math Coprocessor; it is not meaningful for later version x87 FPU coprocessors or IA-32 processors. See Section 4.8.3.3, “Signed Infinities,” for information on how the x87 FPUs handle infinity values.

8.1.7 x87 FPU Tag Word

The 16-bit tag word (see Figure 8-7) indicates the contents of each of the 8 registers in the x87 FPU data-register stack (one 2-bit tag per register). The tag codes indicate whether a register contains a valid number, zero, or a special floating-point number (NaN, infinity, denormal, or unsupported format), or whether it is empty. The x87 FPU tag word is cached in the x87 FPU in the x87 FPU tag word register. When the x87 FPU is initialized with either an FINIT/FNINIT or FSAVE/FNSAVE instruction, the x87 FPU tag word is set to FFFFH, which marks all the x87 FPU data registers as empty.


Figure 8-7. x87 FPU Tag Word
(Diagram of the 16-bit tag word: eight 2-bit tags, from TAG(7) in bits 15-14 down to TAG(0) in bits 1-0.)

TAG Values
00 = Valid
01 = Zero
10 = Special: invalid (NaN, unsupported), infinity, or denormal
11 = Empty

Each tag in the x87 FPU tag word corresponds to a physical register (numbers 0 through 7). The current top-of-stack (TOP) pointer stored in the x87 FPU status word can be used to associate tags with registers relative to ST(0).

The x87 FPU uses the tag values to detect stack overflow and underflow conditions (see Section 8.5.1.1, “Stack Overflow or Underflow Exception (#IS)”).

Application programs and exception handlers can use this tag information to check the contents of an x87 FPU data register without performing complex decoding of the actual data in the register. To read the tag register, it must be stored in memory using either the FSTENV/FNSTENV or FSAVE/FNSAVE instructions. The location of the tag word in memory after being saved with one of these instructions is shown in Figures 8-9 through 8-12.

Software cannot directly load or modify the tags in the tag register. The FLDENV and FRSTOR instructions load an image of the tag register into the x87 FPU; however, the x87 FPU uses those tag values only to determine if the data registers are empty (11B) or non-empty (00B, 01B, or 10B). If the tag register image indicates that a data register is empty, the tag in the tag register for that data register is marked empty (11B); if the tag register image indicates that the data register is non-empty, the x87 FPU reads the actual value in the data register and sets the tag for the register accordingly. This action prevents a program from setting the values in the tag register to incorrectly represent the actual contents of non-empty data registers.
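The relationship between the physical tags and the stack-relative registers can be sketched as follows. This is a minimal model with hypothetical function names; it assumes that TOP occupies bits 11 through 13 of the status word and that ST(i) is physical register (TOP + i) modulo 8, as described for the data-register stack elsewhere in this chapter.

```python
TAG_NAMES = {0b00: "valid", 0b01: "zero", 0b10: "special", 0b11: "empty"}


def physical_tags(tag_word):
    """Return the tag names for physical registers 0 through 7."""
    return [TAG_NAMES[(tag_word >> (2 * reg)) & 0b11] for reg in range(8)]


def top_of_stack(status_word):
    """Extract the TOP field (assumed to be bits 11-13 of the status word)."""
    return (status_word >> 11) & 0b111


def stack_tags(tag_word, status_word):
    """Return the tag names for ST(0) through ST(7)."""
    tags = physical_tags(tag_word)
    top = top_of_stack(status_word)
    return [tags[(top + i) % 8] for i in range(8)]
```

For example, after a single load into a freshly initialized FPU, TOP is 7 and only TAG(7) is non-empty, so a saved tag word of 3FFFH together with a status word of 3800H describes a stack whose ST(0) is valid and whose other registers are empty.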

8.1.8 x87 FPU Instruction and Data (Operand) Pointers

The x87 FPU stores pointers to the instruction and data (operand) for the last non-control instruction executed. These pointers are stored in two 48-bit registers: the x87 FPU instruction pointer and x87 FPU operand (data) pointer registers (see Figure 8-1). (These pointers are saved to provide state information for exception handlers.)


Note that the value in the x87 FPU data pointer register is always a pointer to a memory operand. If the last non-control instruction that was executed did not have a memory operand, the value in the data pointer register is undefined (reserved).

The contents of the x87 FPU instruction and data pointer registers remain unchanged when any of the control instructions (FINIT/FNINIT, FCLEX/FNCLEX, FLDCW, FSTCW/FNSTCW, FSTSW/FNSTSW, FSTENV/FNSTENV, FLDENV, FSAVE/FNSAVE, FRSTOR, and WAIT/FWAIT) are executed.

The pointers stored in the x87 FPU instruction and data pointer registers consist of an offset (stored in bits 0 through 31) and a segment selector (stored in bits 32 through 47). These registers can be accessed by the FSTENV/FNSTENV, FLDENV, FINIT/FNINIT, FSAVE/FNSAVE, FRSTOR, FXSAVE, and FXRSTOR instructions. The FINIT/FNINIT and FSAVE/FNSAVE instructions clear these registers.

For all the x87 FPUs and NPXs except the 8087, the x87 FPU instruction pointer points to any prefixes that preceded the instruction. For the 8087, the x87 FPU instruction pointer points only to the actual opcode.

8.1.9 Last Instruction Opcode

The x87 FPU stores the opcode of the last non-control instruction executed in an 11-bit x87 FPU opcode register. (This information provides state information for exception handlers.) Only the first and second opcode bytes (after all prefixes) are stored in the x87 FPU opcode register. Figure 8-8 shows the encoding of these two bytes. Since the upper 5 bits of the first opcode byte are the same for all floating-point opcodes (11011B), only the lower 3 bits of this byte are stored in the opcode register.
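The packing of two opcode bytes into the 11-bit opcode register can be illustrated with a short Python sketch. The function names are hypothetical; the example bytes D9H FAH are used on the assumption that they encode FSQRT.

```python
def is_esc_opcode(first_byte):
    """True if the upper 5 bits are 11011B, that is, the byte is D8H-DFH."""
    return (first_byte >> 3) == 0b11011


def fopcode(first_byte, second_byte):
    """Pack two opcode bytes into the 11-bit value held in the opcode register."""
    if not is_esc_opcode(first_byte):
        raise ValueError("not an x87 FPU (ESC) opcode")
    # Low 3 bits of byte 1 go to bits 10-8; byte 2 fills bits 7-0.
    return ((first_byte & 0b111) << 8) | (second_byte & 0xFF)


def opcode_bytes(fop):
    """Recover the two opcode bytes from an 11-bit opcode register value."""
    return (0xD8 | ((fop >> 8) & 0b111), fop & 0xFF)
```

Because the dropped bits are constant, the mapping is reversible: `opcode_bytes(fopcode(0xD9, 0xFA))` returns the original pair of bytes.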

8.1.9.1 Fopcode Compatibility Sub-mode

Beginning with the Pentium 4 and Intel Xeon processors, the IA-32 architecture provides program control over the storing of the last instruction opcode (sometimes referred to as the fopcode). Here, bit 2 of the IA32_MISC_ENABLE MSR enables (set) or disables (clear) the fopcode compatibility mode.

If fopcode compatibility mode is enabled, the FOP is defined as it has always been in previous IA-32 implementations (always defined as the FOP of the last non-transparent FP instruction executed before an FSAVE/FSTENV/FXSAVE). If fopcode compatibility mode is disabled (default), the FOP is valid only if the last non-transparent FP instruction executed before an FSAVE/FSTENV/FXSAVE had an unmasked exception.


Figure 8-8. Contents of x87 FPU Opcode Registers
(Diagram: bits 2-0 of the first instruction byte are stored in bits 10-8 of the x87 FPU opcode register; bits 7-0 of the second instruction byte are stored in bits 7-0.)

The fopcode compatibility mode should be enabled only when x87 FPU floating-point exception handlers are designed to use the fopcode to analyze program performance or restart a program after an exception has been handled.

8.1.10 Saving the x87 FPU’s State with FSTENV/FNSTENV and FSAVE/FNSAVE

The FSTENV/FNSTENV and FSAVE/FNSAVE instructions store x87 FPU state information in memory for use by exception handlers and other system and application software. The FSTENV/FNSTENV instruction saves the contents of the status, control, tag, x87 FPU instruction pointer, x87 FPU operand pointer, and opcode registers. The FSAVE/FNSAVE instruction stores that information plus the contents of the x87 FPU data registers. Note that the FSAVE/FNSAVE instruction also initializes the x87 FPU to default values (just as the FINIT/FNINIT instruction does) after it has saved the original state of the x87 FPU.

The manner in which this information is stored in memory depends on the operating mode of the processor (protected mode or real-address mode) and on the operand-size attribute in effect (32-bit or 16-bit). See Figures 8-9 through 8-12. In virtual-8086 mode or SMM, the real-address mode format shown in Figure 8-12 is used. See Chapter 24, “System Management,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, for information on using the x87 FPU while in SMM.

The FLDENV and FRSTOR instructions allow x87 FPU state information to be loaded from memory into the x87 FPU. Here, the FLDENV instruction loads only the status, control, tag, x87 FPU instruction pointer, x87 FPU operand pointer, and opcode registers, and the FRSTOR instruction loads all the x87 FPU registers, including the x87 FPU stack registers.


32-Bit Protected Mode Format

Byte offset   Bits 31-16                       Bits 15-0
0             (reserved)                       Control Word
4             (reserved)                       Status Word
8             (reserved)                       Tag Word
12            FPU Instruction Pointer Offset (bits 31-0)
16            00000B, Opcode 10...00           FPU Instruction Pointer Selector
20            FPU Operand Pointer Offset (bits 31-0)
24            (reserved)                       FPU Operand Pointer Selector

For instructions that also store x87 FPU data registers, the eight 80-bit registers (R0-R7) follow the above structure in sequence.

Figure 8-9. Protected Mode x87 FPU State Image in Memory, 32-Bit Format
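To make the 32-bit protected mode layout concrete, the sketch below unpacks a 28-byte environment image of the kind FSTENV/FNSTENV would store. It is an illustration under stated assumptions: the class and function names are hypothetical, the image is little-endian, and the opcode is taken to occupy bits 26-16 of the doubleword at offset 16.

```python
import struct
from typing import NamedTuple


class FpuEnvironment(NamedTuple):
    control_word: int
    status_word: int
    tag_word: int
    instruction_offset: int
    instruction_selector: int
    opcode: int
    operand_offset: int
    operand_selector: int


def parse_fstenv_32bit_protected(image):
    """Decode a 28-byte, 32-bit protected mode x87 FPU environment image."""
    if len(image) < 28:
        raise ValueError("environment image must be at least 28 bytes")
    # Seven little-endian doublewords at byte offsets 0, 4, ..., 24.
    cw, sw, tw, ip_off, sel_op, dp_off, dp_sel = struct.unpack_from("<7I", image)
    return FpuEnvironment(
        control_word=cw & 0xFFFF,
        status_word=sw & 0xFFFF,
        tag_word=tw & 0xFFFF,
        instruction_offset=ip_off,
        instruction_selector=sel_op & 0xFFFF,
        opcode=(sel_op >> 16) & 0x7FF,   # 11-bit opcode field
        operand_offset=dp_off,
        operand_selector=dp_sel & 0xFFFF,
    )
```

The upper halves of the control, status, and tag doublewords are masked off rather than checked, since the figure marks them as reserved.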

32-Bit Real-Address Mode Format

Byte offset   Bits 31-16                                      Bits 15-0
0             (reserved)                                      Control Word
4             (reserved)                                      Status Word
8             (reserved)                                      Tag Word
12            (reserved)                                      FPU Instruction Pointer 15...00
16            Bits 31-28: 0000; bits 27-12: FPU Instruction Pointer 31...16; bit 11: 0; bits 10-0: Opcode 10...00
20            (reserved)                                      FPU Operand Pointer 15...00
24            Bits 31-28: 0000; bits 27-12: FPU Operand Pointer 31...16; bits 11-0: 000000000000

For instructions that also store x87 FPU data registers, the eight 80-bit registers (R0-R7) follow the above structure in sequence.

Figure 8-10. Real Mode x87 FPU State Image in Memory, 32-Bit Format


16-Bit Protected Mode Format

Byte offset   Bits 15-0
0             Control Word
2             Status Word
4             Tag Word
6             FPU Instruction Pointer Offset
8             FPU Instruction Pointer Selector
10            FPU Operand Pointer Offset
12            FPU Operand Pointer Selector

Figure 8-11. Protected Mode x87 FPU State Image in Memory, 16-Bit Format

16-Bit Real-Address Mode and Virtual-8086 Mode Format

Byte offset   Bits 15-0
0             Control Word
2             Status Word
4             Tag Word
6             FPU Instruction Pointer 15...00
8             Bits 15-12: IP 19..16; bit 11: 0; bits 10-0: Opcode 10...00
10            FPU Operand Pointer 15...00
12            Bits 15-12: OP 19..16; bits 11-0: 000000000000

Figure 8-12. Real Mode x87 FPU State Image in Memory, 16-Bit Format

8.1.11 Saving the x87 FPU’s State with FXSAVE

The FXSAVE and FXRSTOR instructions save and restore, respectively, the x87 FPU state along with the state of the XMM registers and the MXCSR register. Using the FXSAVE instruction to save the x87 FPU state has two benefits: (1) FXSAVE executes faster than FSAVE, and (2) FXSAVE saves the entire x87 FPU, MMX, and XMM state in one operation. See Section 10.5, “FXSAVE and FXRSTOR Instructions,” for additional information about these instructions.


8.2 X87 FPU DATA TYPES

The x87 FPU recognizes and operates on the following seven data types (see Figure 8-13): single-precision floating point, double-precision floating point, double extended-precision floating point, signed word integer, signed doubleword integer, signed quadword integer, and packed BCD decimal integers. For detailed information about these data types, see Section 4.2.2, “Floating-Point Data Types,” Section 4.2.1.2, “Signed Integers,” and Section 4.7, “BCD and Packed BCD Integers.”

With the exception of the 80-bit double extended-precision floating-point format, all of these data types exist in memory only. When they are loaded into x87 FPU data registers, they are converted into double extended-precision floating-point format and operated on in that format.

Denormal values are also supported in each of the floating-point types, as required by IEEE Standard 754. When a denormal number in single-precision or double-precision floating-point format is used as a source operand and the denormal exception is masked, the x87 FPU automatically normalizes the number when it is converted to double extended-precision format.

When stored in memory, the least significant byte of an x87 FPU data-type value is stored at the initial address specified for the value. Successive bytes from the value are then stored in successively higher addresses in memory. The floating-point instructions load and store memory operands using only the initial address of the operand.


Figure 8-13. x87 FPU Data Type Formats
(Diagram of the seven formats:
Single-precision floating point: sign, bit 31; exponent, bits 30-23; fraction, bits 22-0; implied integer bit.
Double-precision floating point: sign, bit 63; exponent, bits 62-52; fraction, bits 51-0; implied integer bit.
Double extended-precision floating point: sign, bit 79; exponent, bits 78-64; integer, bit 63; fraction, bits 62-0.
Word integer: sign, bit 15; bits 14-0.
Doubleword integer: sign, bit 31; bits 30-0.
Quadword integer: sign, bit 63; bits 62-0.
Packed BCD integers: sign, bit 79; bits 78-72 unused (X); digits D17 through D0 in bits 71-0, 4 bits per BCD digit.)

As a general rule, values should be stored in memory in double-precision format. This format provides sufficient range and precision to return correct results with a minimum of programmer attention. The single-precision format is useful for debugging algorithms, because rounding problems will manifest themselves more quickly in this format. The double extended-precision format is normally reserved for holding intermediate results in the x87 FPU registers and constants. Its extra length is designed to shield final results from the effects of rounding and overflow/underflow in intermediate calculations. However, when an application requires the maximum range and precision of the x87 FPU (for data storage, computations, and results), values can be stored in memory in double extended-precision format.
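As an example of how one of these formats looks in memory, the following sketch converts between a Python integer and the 10-byte packed BCD image shown in Figure 8-13 (18 digits in the low nine bytes, sign in bit 7 of the highest byte, least significant byte first). The helper names are hypothetical and the code only approximates what FBSTP and FBLD do; it performs no rounding and ignores the unused bits.

```python
def encode_packed_bcd(value):
    """Return the 10-byte little-endian packed BCD image of an integer."""
    if abs(value) >= 10**18:
        raise ValueError("packed BCD holds at most 18 decimal digits")
    digits = f"{abs(value):018d}"          # digits[0] is D17, digits[17] is D0
    # Byte k holds D(2k+1) in its high nibble and D(2k) in its low nibble.
    body = bytes(int(digits[16 - 2 * k:18 - 2 * k], 16) for k in range(9))
    sign = 0x80 if value < 0 else 0x00
    return body + bytes([sign])


def decode_packed_bcd(image):
    """Return the integer represented by a 10-byte packed BCD image."""
    if len(image) != 10:
        raise ValueError("packed BCD image must be 10 bytes")
    magnitude = 0
    for byte in reversed(image[:9]):       # most significant digits first
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("invalid BCD digit")
        magnitude = magnitude * 100 + high * 10 + low
    return -magnitude if image[9] & 0x80 else magnitude
```

Encoding -12345 this way yields the bytes 45H, 23H, 01H, six zero bytes, and 80H, which shows the least significant digits at the lowest address.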

8.2.1 Indefinites

For each x87 FPU data type, one unique encoding is reserved for representing the special value indefinite. The x87 FPU produces indefinite values as responses to some masked floating-point invalid-operation exceptions. See Tables 4-1, 4-3, and 4-4 for the encoding of the integer indefinite, QNaN floating-point indefinite, and packed BCD integer indefinite, respectively.

The binary integer encoding 100..00B represents either of two things, depending on the circumstances of its use:

• The largest negative number supported by the format ($-2^{15}$, $-2^{31}$, or $-2^{63}$)
• The integer indefinite value

If this encoding is used as a source operand (as in an integer load or integer arithmetic instruction), the x87 FPU interprets it as the largest negative number representable in the format being used. If the x87 FPU detects an invalid operation when storing an integer value in memory with an FIST/FISTP instruction and the invalid-operation exception is masked, the x87 FPU stores the integer indefinite encoding in the destination operand as a masked response to the exception. In situations where the origin of a value with this encoding may be ambiguous, the invalid-operation exception flag can be examined to see if the value was produced as a response to an exception.
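The ambiguity described above can be shown with a simplified model of a masked integer store. The function names are hypothetical, and the model assumes round-to-nearest and treats any non-finite or out-of-range value as an invalid operation; it is a sketch of the idea, not a description of FIST internals.

```python
import math


def integer_indefinite(bits):
    """Encoding 100..00B for a word (16), doubleword (32), or quadword (64)."""
    return 1 << (bits - 1)


def masked_fist(value, bits=16):
    """Return (stored encoding, invalid-operation flag) for a masked store."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    if math.isfinite(value):
        rounded = round(value)            # Python rounds halves to even
        if lowest <= rounded <= highest:
            # Two's-complement encoding of an in-range result.
            return rounded & ((1 << bits) - 1), False
    # Masked response: store the integer indefinite and flag the exception.
    return integer_indefinite(bits), True
```

Storing -32768.0 and storing 40000.0 as a word both produce 8000H in this model; only the returned flag, which stands in for the invalid-operation exception flag, tells the two cases apart.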

8.2.2 Unsupported Double Extended-Precision Floating-Point Encodings and Pseudo-Denormals

The double extended-precision floating-point format permits many encodings that do not fall into any of the categories shown in Table 4-3. Table 8-3 shows these unsupported encodings. Some of these encodings were supported by the Intel 287 math coprocessor; however, most of them are not supported by the Intel 387 math coprocessor and later IA-32 processors. These encodings are no longer supported due to changes made in the final version of IEEE Standard 754 that eliminated these encodings.

Specifically, the categories of encodings formerly known as pseudo-NaNs, pseudo-infinities, and un-normal numbers are not supported and should not be used as operand values. The Intel 387 math coprocessor and later IA-32 processors generate an invalid-operation exception when these encodings are encountered as operands.

Beginning with the Intel 387 math coprocessor, the encodings formerly known as pseudo-denormal numbers are not generated by IA-32 processors. When encountered as operands, however, they are handled correctly; that is, they are treated as denormals and a denormal exception is generated. Pseudo-denormal numbers should not be used as operand values. They are supported by current IA-32 processors (as described here) to support legacy code.


Table 8-3. Unsupported Double Extended-Precision Floating-Point Encodings and Pseudo-Denormals

                                                                     Significand
Class                                        Sign   Biased Exponent    Integer   Fraction
Positive Pseudo-NaNs      Quiet              0      11..11             0         11..11 to 10..00
                          Signaling          0      11..11             0         01..11 to 00..01
Positive Floating Point   Pseudo-infinity    0      11..11             0         00..00
                          Unnormals          0      11..10 to 00..01   0         11..11 to 00..00
                          Pseudo-denormals   0      00..00             1         11..11 to 00..00
Negative Floating Point   Pseudo-denormals   1      00..00             1         11..11 to 00..00
                          Unnormals          1      11..10 to 00..01   0         11..01 to 00..00
                          Pseudo-infinity    1      11..11             0         00..00
Negative Pseudo-NaNs      Signaling          1      11..11             0         01..11 to 00..01
                          Quiet              1      11..11             0         11..11 to 10..00

Field widths: biased exponent, 15 bits; fraction, 63 bits.
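The classes in Table 8-3 can be told apart from the exponent, integer bit, and fraction alone. The sketch below, with hypothetical function names, classifies an 80-bit encoding passed as a Python integer; anything that falls outside the table is simply reported as "supported".

```python
def make_extended(sign, exponent, integer_bit, fraction):
    """Assemble an 80-bit double extended-precision encoding from its fields."""
    return (sign << 79) | (exponent << 64) | (integer_bit << 63) | fraction


def classify_unsupported(encoding):
    """Name the Table 8-3 class of an 80-bit encoding, or 'supported'."""
    exponent = (encoding >> 64) & 0x7FFF       # 15-bit biased exponent
    integer_bit = (encoding >> 63) & 1         # explicit integer bit
    fraction = encoding & ((1 << 63) - 1)      # 63-bit fraction
    if integer_bit == 0:
        if exponent == 0x7FFF:
            return "pseudo-infinity" if fraction == 0 else "pseudo-NaN"
        if exponent != 0:
            return "unnormal"
    elif exponent == 0:
        return "pseudo-denormal"
    return "supported"
```

A zero exponent with a zero integer bit is deliberately left as "supported", because that combination covers the ordinary zeros and denormals.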


8.3 X87 FPU INSTRUCTION SET

The floating-point instructions that the x87 FPU supports can be grouped into six functional categories:

• Data transfer instructions
• Basic arithmetic instructions
• Comparison instructions
• Transcendental instructions
• Load constant instructions
• x87 FPU control instructions

See Section 5.2, “x87 FPU Instructions,” for a list of the floating-point instructions by category. The following sections briefly describe the instructions in each category. Detailed descriptions of the floating-point instructions are given in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B.

8.3.1 Escape (ESC) Instructions

All of the instructions in the x87 FPU instruction set fall into a class of instructions known as escape (ESC) instructions. All of these instructions have a common opcode format, where the first byte of the opcode is one of the numbers from D8H through DFH.

8.3.2 x87 FPU Instruction Operands

Most floating-point instructions require one or two operands, located on the x87 FPU data-register stack or in memory. (None of the floating-point instructions accept immediate operands.)

When an operand is located in a data register, it is referenced relative to the ST(0) register (the register at the top of the register stack), rather than by a physical register number. Often the ST(0) register is an implied operand.

Operands in memory can be referenced using the same operand addressing methods described in Section 3.7, “Operand Addressing.”

8.3.3 Data Transfer Instructions

The data transfer instructions (see Table 8-4) perform the following operations:

• Load a floating-point, integer, or packed BCD operand from memory into the ST(0) register.
• Store the value in an ST(0) register to memory in floating-point, integer, or packed BCD format.
• Move values between registers in the x87 FPU register stack.

The FLD (load floating point) instruction pushes a floating-point operand from memory onto the top of the x87 FPU data-register stack. If the operand is in single-precision or double-precision floating-point format, it is automatically converted to double extended-precision floating-point format. This instruction can also be used to push the value in a selected x87 FPU data register onto the top of the register stack.

The FILD (load integer) instruction converts an integer operand in memory into double extended-precision floating-point format and pushes the value onto the top of the register stack. The FBLD (load packed decimal) instruction performs the same load operation for a packed BCD operand in memory.

Table 8-4. Data Transfer Instructions

Floating Point                            Integer                          Packed Decimal
FLD      Load Floating Point              FILD   Load Integer              FBLD   Load Packed Decimal
FST      Store Floating Point             FIST   Store Integer
FSTP     Store Floating Point and Pop     FISTP  Store Integer and Pop     FBSTP  Store Packed Decimal and Pop
FXCH     Exchange Register Contents
FCMOVcc  Conditional Move

The FST (store floating point) and FIST (store integer) instructions store the value in register ST(0) in memory in the destination format (floating point or integer, respectively). Again, the format conversion is carried out automatically.

The FSTP (store floating point and pop), FISTP (store integer and pop), and FBSTP (store packed decimal and pop) instructions store the value in the ST(0) register into memory in the destination format (floating point, integer, or packed BCD), then perform a pop operation on the register stack. A pop operation causes the ST(0) register to be marked empty and the stack pointer (TOP) in the x87 FPU status word to be incremented by 1. The FSTP instruction can also be used to copy the value in the ST(0) register to another x87 FPU register [ST(i)].

The FXCH (exchange register contents) instruction exchanges the value in a selected register in the stack [ST(i)] with the value in ST(0).

The FCMOVcc (conditional move) instructions move the value in a selected register in the stack [ST(i)] to register ST(0) if a condition specified with a condition code (cc) is satisfied (see Table 8-5). The condition being tested for is represented by the status flags in the EFLAGS register. The condition code mnemonics are appended to the letters “FCMOV” to form the mnemonic for an FCMOVcc instruction.

Table 8-5. Floating-Point Conditional Move Instructions

Instruction Mnemonic    Status Flag States    Condition Description
FCMOVB                  CF=1                  Below
FCMOVNB                 CF=0                  Not below
FCMOVE                  ZF=1                  Equal
FCMOVNE                 ZF=0                  Not equal
FCMOVBE                 CF=1 or ZF=1          Below or equal
FCMOVNBE                CF=0 and ZF=0         Not below nor equal
FCMOVU                  PF=1                  Unordered
FCMOVNU                 PF=0                  Not unordered

Like the CMOVcc instructions, the FCMOVcc instructions are useful for optimizing small IF constructions. They also help eliminate branching overhead for IF operations and the possibility of branch mispredictions by the processor.

Software can check if the FCMOVcc instructions are supported by checking the processor’s feature information with the CPUID instruction.
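The conditions in Table 8-5 can be written as predicates over the three status flags, which makes it easy to see how a conditional move stands in for a small IF construction. This is a sketch with hypothetical names; FCMOVNBE is taken here to require both CF=0 and ZF=0, which is what "not below nor equal" implies.

```python
FCMOV_CONDITIONS = {
    "FCMOVB":   lambda f: f["CF"] == 1,
    "FCMOVNB":  lambda f: f["CF"] == 0,
    "FCMOVE":   lambda f: f["ZF"] == 1,
    "FCMOVNE":  lambda f: f["ZF"] == 0,
    "FCMOVBE":  lambda f: f["CF"] == 1 or f["ZF"] == 1,
    "FCMOVNBE": lambda f: f["CF"] == 0 and f["ZF"] == 0,
    "FCMOVU":   lambda f: f["PF"] == 1,
    "FCMOVNU":  lambda f: f["PF"] == 0,
}


def fcmov(mnemonic, flags, st0, sti):
    """Return the new ST(0): ST(i) if the condition holds, else ST(0) unchanged."""
    return sti if FCMOV_CONDITIONS[mnemonic](flags) else st0


# Flags as a compare would leave them when ST(0) < ST(i).
below = {"ZF": 0, "PF": 0, "CF": 1}
larger = fcmov("FCMOVB", below, st0=1.0, sti=2.0)   # picks the larger value
```

Used after a compare that sets EFLAGS, FCMOVB in this model selects the larger of ST(0) and ST(i) without any branch.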

8.3.4 Load Constant Instructions

The following instructions push commonly used constants onto the top [ST(0)] of the x87 FPU register stack:

FLDZ      Load +0.0
FLD1      Load +1.0
FLDPI     Load π
FLDL2T    Load $\log_2 10$
FLDL2E    Load $\log_2 e$
FLDLG2    Load $\log_{10} 2$
FLDLN2    Load $\log_e 2$

The constant values have full double extended-precision floating-point precision (64 bits) and are accurate to approximately 19 decimal digits. They are stored internally in a format more precise than double extended-precision floating point. When loading the constant, the x87 FPU rounds the more precise internal constant according to the RC (rounding control) field of the x87 FPU control word. The inexact-result exception (#P) is not generated as a result of this rounding, nor is the C1 flag set in the x87 FPU status word if the value is rounded up. See Section 8.3.8, “Pi,” for information on the π constant.

8.3.5 Basic Arithmetic Instructions

The following floating-point instructions perform basic arithmetic operations on floating-point numbers. Where applicable, these instructions match IEEE Standard 754:

FADD/FADDP      Add floating point
FIADD           Add integer to floating point
FSUB/FSUBP      Subtract floating point
FISUB           Subtract integer from floating point
FSUBR/FSUBRP    Reverse subtract floating point
FISUBR          Reverse subtract floating point from integer
FMUL/FMULP      Multiply floating point
FIMUL           Multiply integer by floating point
FDIV/FDIVP      Divide floating point
FIDIV           Divide floating point by integer
FDIVR/FDIVRP    Reverse divide
FIDIVR          Reverse divide integer by floating point
FABS            Absolute value
FCHS            Change sign
FSQRT           Square root
FPREM           Partial remainder
FPREM1          IEEE partial remainder
FRNDINT         Round to integral value
FXTRACT         Extract exponent and significand

The add, subtract, multiply and divide instructions operate on the following types of operands:

• Two x87 FPU data registers
• An x87 FPU data register and a floating-point or integer value in memory

See Section 8.1.2, “x87 FPU Data Registers,” for a description of how operands are referenced on the data register stack.

Operands in memory can be in single-precision floating-point, double-precision floating-point, word-integer, or doubleword-integer format. They are converted to double extended-precision floating-point format automatically.


Reverse versions of the subtract (FSUBR) and divide (FDIVR) instructions enable efficient coding. For example, the following options are available with the FSUB and FSUBR instructions for operating on values in a specified x87 FPU data register ST(i) and the ST(0) register:

FSUB:
ST(0) ← ST(0) − ST(i)
ST(i) ← ST(i) − ST(0)

FSUBR:
ST(0) ← ST(i) − ST(0)
ST(i) ← ST(0) − ST(i)

These instructions eliminate the need to exchange values between the ST(0) register and another x87 FPU register to perform a subtraction or division.

The pop versions of the add, subtract, multiply, and divide instructions offer the option of popping the x87 FPU register stack following the arithmetic operation. These instructions operate on values in the ST(i) and ST(0) registers, store the result in the ST(i) register, and pop the ST(0) register.

The FPREM instruction computes the remainder from the division of two operands in the manner used by the Intel 8087 and Intel 287 math coprocessors; the FPREM1 instruction computes the remainder in the manner specified in IEEE Standard 754.

The FSQRT instruction computes the square root of the source operand.

The FRNDINT instruction returns a floating-point value that is the integral value closest to the source value in the direction of the rounding mode specified in the RC field of the x87 FPU control word.

The FABS, FCHS, and FXTRACT instructions perform convenient arithmetic operations. The FABS instruction produces the absolute value of the source operand. The FCHS instruction changes the sign of the source operand. The FXTRACT instruction separates the source operand into its exponent and fraction and stores each value in a register in floating-point format.
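The four FSUB/FSUBR forms and the pop variant can be modelled on a plain Python list in which index 0 stands for ST(0). The function names and the `to_st0` parameter are hypothetical, and the model ignores tags, exceptions, and rounding.

```python
def fsub(stack, i, to_st0=True):
    """FSUB: ST(0) <- ST(0) - ST(i), or ST(i) <- ST(i) - ST(0)."""
    st = list(stack)
    if to_st0:
        st[0] = st[0] - st[i]
    else:
        st[i] = st[i] - st[0]
    return st


def fsubr(stack, i, to_st0=True):
    """FSUBR: ST(0) <- ST(i) - ST(0), or ST(i) <- ST(0) - ST(i)."""
    st = list(stack)
    if to_st0:
        st[0] = st[i] - st[0]
    else:
        st[i] = st[0] - st[i]
    return st


def fsubp(stack, i):
    """FSUBP: store ST(i) - ST(0) in ST(i), then pop ST(0)."""
    return fsub(stack, i, to_st0=False)[1:]
```

With ST(0) = 5.0 and ST(1) = 2.0, `fsub` leaves 3.0 in ST(0) while `fsubr` leaves -3.0 there, so either operand order is available without first exchanging the registers.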

8.3.6 Comparison and Classification Instructions

The following instructions compare or classify floating-point values:

FCOM/FCOMP/FCOMPP       Compare floating point and set x87 FPU condition code flags.
FUCOM/FUCOMP/FUCOMPP    Unordered compare floating point and set x87 FPU condition code flags.
FICOM/FICOMP            Compare integer and set x87 FPU condition code flags.
FCOMI/FCOMIP            Compare floating point and set EFLAGS status flags.
FUCOMI/FUCOMIP          Unordered compare floating point and set EFLAGS status flags.
FTST                    Test (compare floating point with 0.0).
FXAM                    Examine.

Comparison of floating-point values differs from comparison of integers because floating-point values have four (rather than three) mutually exclusive relationships: less than, equal, greater than, and unordered.

The unordered relationship is true when at least one of the two values being compared is a NaN or in an unsupported format. This additional relationship is required because, by definition, NaNs are not numbers, so they cannot have less than, equal, or greater than relationships with other floating-point values.

The FCOM, FCOMP, and FCOMPP instructions compare the value in register ST(0) with a floating-point source operand and set the condition code flags (C0, C2, and C3) in the x87 FPU status word according to the results (see Table 8-6).

If an unordered condition is detected (one or both of the values are NaNs or in an undefined format), a floating-point invalid-operation exception is generated.

The pop versions of the instruction pop the x87 FPU register stack once or twice after the comparison operation is complete.

The FUCOM, FUCOMP, and FUCOMPP instructions operate the same as the FCOM, FCOMP, and FCOMPP instructions. The only difference is that with the FUCOM, FUCOMP, and FUCOMPP instructions, if an unordered condition is detected because one or both of the operands are QNaNs, the floating-point invalid-operation exception is not generated.

Table 8-6. Setting of x87 FPU Condition Code Flags for Floating-Point Number Comparisons

Condition                   C3    C2    C0
ST(0) > Source Operand      0     0     0
ST(0) < Source Operand      0     0     1
ST(0) = Source Operand      1     0     0
Unordered                   1     1     1
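The four mutually exclusive relationships and the flag settings of Table 8-6 can be sketched in Python as follows. This is only an illustrative model, not a description of how the x87 FPU is implemented: the function name `fcom_condition_codes` is hypothetical, Python floats (IEEE double precision) stand in for x87 operands, and unsupported formats and exception signaling are not modeled.

```python
import math


def fcom_condition_codes(st0: float, src: float) -> tuple:
    """Return (C3, C2, C0) as listed in Table 8-6 for comparing st0 with src."""
    if math.isnan(st0) or math.isnan(src):
        return (1, 1, 1)  # unordered: at least one operand is a NaN
    if st0 > src:
        return (0, 0, 0)
    if st0 < src:
        return (0, 0, 1)
    return (1, 0, 0)      # equal (this also covers +0.0 compared with -0.0)


print(fcom_condition_codes(2.0, 1.0))       # (0, 0, 0)
print(fcom_condition_codes(1.0, 2.0))       # (0, 0, 1)
print(fcom_condition_codes(1.0, 1.0))       # (1, 0, 0)
print(fcom_condition_codes(math.nan, 1.0))  # (1, 1, 1)
```

Exactly one of the four branches is taken for any pair of inputs, which mirrors the statement that the relationships are mutually exclusive.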

The FICOM and FICOMP instructions also operate the same as the FCOM and FCOMP instructions, except that the source operand is an integer value in memory. The integer value is automatically converted into a double extended-precision floating-point value prior to making the comparison. The FICOMP instruction pops the x87 FPU register stack following the comparison operation.

The FTST instruction performs the same operation as the FCOM instruction, except that the value in register ST(0) is always compared with the value 0.0.

The FCOMI and FCOMIP instructions were introduced into the IA-32 architecture in the P6 family processors. They perform the same comparison as the FCOM and FCOMP instructions, except that they set the status flags (ZF, PF, and CF) in the EFLAGS register to indicate the results of the comparison (see Table 8-7) instead of the x87 FPU condition code flags. The FCOMI and FCOMIP instructions allow conditional branch instructions (Jcc) to be executed directly from the results of their comparison.

Table 8-7. Setting of EFLAGS Status Flags for Floating-Point Number Comparisons

Comparison Results    ZF    PF    CF
ST0 > ST(i)           0     0     0
ST0 < ST(i)           0     0     1
ST0 = ST(i)           1     0     0
Unordered             1     1     1

Software can check if the FCOMI and FCOMIP instructions are supported by checking the processor’s feature information with the CPUID instruction.

The FUCOMI and FUCOMIP instructions operate the same as the FCOMI and FCOMIP instructions, except that they do not generate a floating-point invalid-operation exception if the unordered condition is the result of one or both of the operands being a QNaN. The FCOMIP and FUCOMIP instructions pop the x87 FPU register stack following the comparison operation.

The FXAM instruction determines the classification of the floating-point value in the ST(0) register (that is, whether the value is zero, a denormal number, a normal finite number, ∞, a NaN, or an unsupported format) or that the register is empty. It sets the x87 FPU condition code flags to indicate the classification (see “FXAM—Examine” in Chapter 3, “Instruction Set Reference, A-M,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A). It also sets the C1 flag to indicate the sign of the value.

8.3.6.1 Branching on the x87 FPU Condition Codes

The processor does not offer any control-flow instructions that branch on the setting of the condition code flags (C0, C2, and C3) in the x87 FPU status word. To branch on the state of these flags, the x87 FPU status word must first be moved to the AX register in the integer unit. The FSTSW AX (store status word) instruction can be used for this purpose. When these flags are in the AX register, the TEST instruction can be used to control conditional branching as follows:

1. Check for an unordered result. Use the TEST instruction to compare the contents of the AX register with the constant 0400H (see Table 8-8). This operation will clear the ZF flag in the EFLAGS register if the condition code flags indicate an unordered result; otherwise, the ZF flag will be set. The JNZ instruction can then be used to transfer control (if necessary) to a procedure for handling unordered operands.

Table 8-8. TEST Instruction Constants for Conditional Branching

Order                       Constant    Branch
ST(0) > Source Operand      4500H       JZ
ST(0) < Source Operand      0100H       JNZ
ST(0) = Source Operand      4000H       JNZ
Unordered                   0400H       JNZ

2. Check ordered comparison result. Use the constants given in Table 8-8 in the TEST instruction to test for a less than, equal to, or greater than result, then use the corresponding conditional branch instruction to transfer program control to the appropriate procedure or section of code.

If a program or procedure has been thoroughly tested and it incorporates periodic checks for QNaN results, then it is not necessary to check for the unordered result every time a comparison is made.

See Section 8.1.4, “Branching and Conditional Moves on Condition Codes,” for another technique for branching on x87 FPU condition codes.

Some non-comparison x87 FPU instructions update the condition code flags in the x87 FPU status word. To ensure that the status word is not altered inadvertently, store it immediately following a comparison operation.
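The two-step test described above can be modeled in Python to show how the constants of Table 8-8 select the condition code bits. This is a sketch under stated assumptions: the function name `classify_comparison` is hypothetical, the integer argument stands for the value that FSTSW AX would place in AX, and each `&` plays the role of a TEST instruction followed by the branch named in the comment.

```python
def classify_comparison(status_word: int) -> str:
    """Classify a comparison result from a copy of the x87 FPU status word."""
    ax = status_word & 0xFFFF      # value held in AX after FSTSW AX

    # Step 1: TEST AX, 0400H then JNZ (C2 set means unordered).
    if ax & 0x0400:
        return "unordered"

    # Step 2: ordered results.
    if (ax & 0x4500) == 0:         # TEST AX, 4500H then JZ (C3 = C2 = C0 = 0)
        return "greater"
    if ax & 0x0100:                # TEST AX, 0100H then JNZ (C0 set)
        return "less"
    return "equal"                 # only C3 (4000H) can remain set here


print(classify_comparison(0x0000))  # greater
print(classify_comparison(0x0100))  # less
print(classify_comparison(0x4000))  # equal
print(classify_comparison(0x4500))  # unordered
```

Other status word fields (for example the exception flags) do not affect the outcome because every test masks them out.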

8.3.7 Trigonometric Instructions

The following instructions perform four common trigonometric functions:

FSIN       Sine
FCOS       Cosine
FSINCOS    Sine and cosine
FPTAN      Tangent
FPATAN     Arctangent

These instructions operate on the top one or two registers of the x87 FPU register stack and they return their results to the stack. The source operands for the FSIN, FCOS, FSINCOS, and FPTAN instructions must be given in radians; the source operand for the FPATAN instruction is given in rectangular coordinate units.

The FSINCOS instruction returns both the sine and the cosine of a source operand value. It operates faster than executing the FSIN and FCOS instructions in succession.

The FPATAN instruction computes the arctangent of ST(1) divided by ST(0), returning a result in radians. It is useful for converting rectangular coordinates to polar coordinates.
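As an illustration of the rectangular-to-polar conversion mentioned above, the following Python sketch uses `math.atan2(y, x)` as a stand-in for FPATAN with ST(1) = y and ST(0) = x. The function name `rect_to_polar` is hypothetical, and the arithmetic is IEEE double precision rather than the x87 double extended-precision format.

```python
import math


def rect_to_polar(x: float, y: float) -> tuple:
    """Convert rectangular coordinates (x, y) to polar coordinates (r, theta)."""
    r = math.hypot(x, y)
    theta = math.atan2(y, x)  # arctangent of y / x in radians, quadrant-aware
    return r, theta


r, theta = rect_to_polar(1.0, 1.0)
print(r, theta)  # r is close to sqrt(2), theta is close to pi / 4
```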

8.3.8 Pi

When the argument (source operand) of a trigonometric function is within the range of the function, the argument is automatically reduced by the appropriate multiple of 2π through the same reduction mechanism used by the FPREM and FPREM1 instructions. The internal value of π that the x87 FPU uses for argument reduction and other computations is as follows:

$$\pi = 0.f \times 2^{2}$$

where:

f = C90FDAA2 2168C234 C

(The spaces in the fraction above indicate 32-bit boundaries.)

This internal π value has a 66-bit mantissa, which is 2 bits more than is allowed in the significand of a double extended-precision floating-point value. (Since 66 bits is not an even number of hexadecimal digits, two additional zeros have been added to the value so that it can be represented in hexadecimal format. The least-significant hexadecimal digit (C) is thus 1100B, where the two least-significant bits represent bits 67 and 68 of the mantissa.) This value of π has been chosen to guarantee no loss of significance in a source operand, provided the operand is within the specified range for the instruction.

If the results of computations that explicitly use π are to be used in the FSIN, FCOS, FSINCOS, or FPTAN instructions, the full 66-bit fraction of π should be used. This ensures that the results are consistent with the argument-reduction algorithms that these instructions use. Using a rounded version of π can cause inaccuracies in result values, which, if propagated through several calculations, might produce meaningless results.

A common method of representing the full 66-bit fraction of π is to separate the value into two numbers (highπ and lowπ) that when added together give the value for π shown earlier in this section with the full 66-bit fraction:

π = highπ + lowπ

For example, the following two values (given in scientific notation with the fraction in hexadecimal and the exponent in decimal) represent the 33 most-significant and the 33 least-significant bits of the fraction:

highπ (unnormalized) = 0.C90FDAA20 × $2^{+2}$
lowπ (unnormalized) = 0.42D184698 × $2^{-31}$

These values encoded in the IEEE double-precision floating-point format are as follows:

highπ = 400921FB 54400000
lowπ = 3DE0B461 1A600000

(Note that in the IEEE double-precision floating-point format, the exponents are biased (by 1023) and the fractions are normalized.)
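The relationship between the two encodings and the 66-bit internal value can be checked with exact rational arithmetic. The Python sketch below decodes the two double-precision bit patterns given above and compares their exact sum with 0.f × 2^2. The helper name `decode_double` is hypothetical; only the standard `struct` and `fractions` modules are used.

```python
import struct
from fractions import Fraction


def decode_double(hex_digits: str) -> float:
    """Interpret 16 hexadecimal digits as a big-endian IEEE double."""
    return struct.unpack(">d", bytes.fromhex(hex_digits))[0]


high_pi = decode_double("400921FB54400000")
low_pi = decode_double("3DE0B4611A600000")

# Internal value 0.f * 2**2 with f = C90FDAA2 2168C234 C (17 hex digits = 68 bits).
internal_pi = Fraction(0xC90FDAA22168C234C, 2**68) * 4

# Fraction(float) is exact, so this sum carries all 66 bits without rounding.
exact_sum = Fraction(high_pi) + Fraction(low_pi)

print(exact_sum == internal_pi)  # True
print(high_pi + low_pi)          # the same sum rounded to double precision
```

Adding the two parts in double precision rounds the sum back to 53 bits, which is why the text recommends carrying the two parts through a computation separately and combining them only at the end.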

Similar versions of π can also be written in double extended-precision floating-point format. When using this two-part π value in an algorithm, parallel computations should be performed on each part, with the results kept separate. When all the computations are complete, the two results can be added together to form the final result.

The complications of maintaining a consistent value of π for argument reduction can be avoided, either by applying the trigonometric functions only to arguments within the range of the automatic reduction mechanism, or by performing all argument reductions (down to a magnitude less than π/4) explicitly in software.

8.3.9 Logarithmic, Exponential, and Scale

The following instructions provide two different logarithmic functions, an exponential function and a scale function:

FYL2X      Logarithm
FYL2XP1    Logarithm epsilon
F2XM1      Exponential
FSCALE     Scale

The FYL2X and FYL2XP1 instructions perform two different base 2 logarithmic operations. The FYL2X instruction computes $y \cdot \log_2 x$. This operation permits the calculation of the log of any base using the following equation:

$$\log_b x = \frac{1}{\log_2 b} \cdot \log_2 x$$
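The change-of-base identity can be tried out with a small Python model. In this sketch the function `fyl2x` is a hypothetical double-precision stand-in for the instruction, built on `math.log2`; choosing y = 1 / log2(b) turns it into a logarithm of base b.

```python
import math


def fyl2x(y: float, x: float) -> float:
    """Model of the FYL2X result: y * log2(x)."""
    return y * math.log2(x)


def log_base(b: float, x: float) -> float:
    """Logarithm of x in base b, using y = 1 / log2(b)."""
    return fyl2x(1.0 / math.log2(b), x)


print(log_base(10.0, 1000.0))  # close to 3
print(log_base(math.e, 5.0))   # close to math.log(5.0)
```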

The FYL2XP1 instruction computes $y \cdot \log_2 (x + 1)$. This operation provides optimum accuracy for values of x that are close to 0.

The F2XM1 instruction computes $2^{x} - 1$. This instruction only operates on source values in the range −1.0 to +1.0. The FSCALE instruction multiplies the source operand by a power of 2.
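The reason for a separate log(x + 1) operation shows up readily in double-precision arithmetic, where forming 1 + x first discards a very small x entirely. The sketch below models the three operations with the standard library functions `math.log1p`, `math.expm1`, and `math.ldexp`. The function names are hypothetical, the precision is IEEE double rather than double extended, and the model only illustrates the mathematical definitions, not the instructions' algorithms.

```python
import math

LN2 = math.log(2.0)


def fyl2xp1(y: float, x: float) -> float:
    """Model of y * log2(x + 1) that stays accurate for x close to 0."""
    return y * math.log1p(x) / LN2


def f2xm1(x: float) -> float:
    """Model of 2**x - 1 for x in the range -1.0 to +1.0."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("source value must be in the range -1.0 to +1.0")
    return math.expm1(x * LN2)


def fscale(x: float, n: int) -> float:
    """Model of scaling: multiply x by 2**n."""
    return math.ldexp(x, n)


tiny = 1e-18
print(math.log2(1.0 + tiny))  # 0.0, because 1.0 + tiny rounds to 1.0
print(fyl2xp1(1.0, tiny))     # a small positive value
print(fscale(1.5, 4))         # 24.0
```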

8.3.10 Transcendental Instruction Accuracy

New transcendental instruction algorithms were incorporated into the IA-32 architecture beginning with the Pentium processors. These new algorithms (used in transcendental instructions FSIN, FCOS, FSINCOS, FPTAN, FPATAN, F2XM1, FYL2X, and FYL2XP1) allow a higher level of accuracy than was possible in earlier IA-32 processors and x87 math coprocessors. The accuracy of these instructions is measured in terms of units in the last place (ulp). For a given argument x, let f(x) and F(x) be the correct and computed (approximate) function values, respectively. The error in ulps is defined to be:

$$\text{error} = \left| \frac{f(x) - F(x)}{2^{k-63}} \right|$$

where k is an integer such that:

$$1 \le 2^{-k} f(x) < 2.$$

With the Pentium processor and later IA-32 processors, the worst case error on transcendental functions is less than 1 ulp when rounding to the nearest (even) and less than 1.5 ulps when rounding in other modes. The functions are guaranteed to be monotonic, with respect to the input operands, throughout the domain supported by the instruction.

The instructions FYL2X and FYL2XP1 are two operand instructions and are guaranteed to be within 1 ulp only when y equals 1. When y is not equal to 1, the maximum ulp error is always within 1.35 ulps in round to nearest mode. (For the two operand functions, monotonicity was proved by holding one of the operands constant.)
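To make the ulp definition concrete, the following Python sketch evaluates the error formula exactly with rational arithmetic. Several points are assumptions of the sketch rather than statements from the text: the function name `ulp_error` is hypothetical, the inputs are Python floats, and the magnitude of f(x) is used when choosing k so that negative values can be handled.

```python
import math
from fractions import Fraction

SIGNIFICAND_BITS = 64  # significand width of the double extended-precision format


def ulp_error(correct: float, computed: float) -> Fraction:
    """Error of `computed` relative to `correct`, in double extended-precision ulps."""
    if correct == 0.0 or not math.isfinite(correct):
        raise ValueError("k is only defined for a finite, non-zero correct value")
    # math.frexp gives |f(x)| = m * 2**e with 0.5 <= m < 1, so k = e - 1.
    _, e = math.frexp(abs(correct))
    k = e - 1
    difference = abs(Fraction(correct) - Fraction(computed))
    return difference / Fraction(2) ** (k - (SIGNIFICAND_BITS - 1))


# One double-precision step above 1.0 is 2**-52, which is 2**11 extended ulps.
print(ulp_error(1.0, 1.0 + 2.0**-52))  # 2048
```

The example shows the scale of the measure: a result that is off by a single unit in the last place of a double-precision value is already 2048 ulps away in double extended-precision terms.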

8.3.11 x87 FPU Control Instructions

The following instructions control the state and modes of operation of the x87 FPU. They also allow the status of the x87 FPU to be examined:

FINIT/FNINIT      Initialize x87 FPU
FLDCW             Load x87 FPU control word
FSTCW/FNSTCW      Store x87 FPU control word
FSTSW/FNSTSW      Store x87 FPU status word
FCLEX/FNCLEX      Clear x87 FPU exception flags
FLDENV            Load x87 FPU environment
FSTENV/FNSTENV    Store x87 FPU environment
FRSTOR            Restore x87 FPU state
FSAVE/FNSAVE      Save x87 FPU state
FINCSTP           Increment x87 FPU register stack pointer
FDECSTP           Decrement x87 FPU register stack pointer
FFREE             Free x87 FPU register
FNOP              No operation
WAIT/FWAIT        Check for and handle pending unmasked x87 FPU exceptions

The FINIT/FNINIT instructions initialize the x87 FPU and its internal registers to default values.

The FLDCW instruction loads the x87 FPU control word register with a value from memory. The FSTCW/FNSTCW and FSTSW/FNSTSW instructions store the x87 FPU control and status words, respectively, in memory (or for an FSTSW/FNSTSW instruction in a general-purpose register).

The FSTENV/FNSTENV and FSAVE/FNSAVE instructions save the x87 FPU environment and state, respectively, in memory. The x87 FPU environment includes all the x87 FPU’s control and status registers; the x87 FPU state includes the x87 FPU environment and the data registers in the x87 FPU register stack. (The FSAVE/FNSAVE instruction also initializes the x87 FPU to default values, like the FINIT/FNINIT instruction, after it saves the original state of the x87 FPU.)

The FLDENV and FRSTOR instructions load the x87 FPU environment and state, respectively, from memory into the x87 FPU. These instructions are commonly used when switching tasks or contexts.

The WAIT/FWAIT instructions are synchronization instructions. (They are actually mnemonics for the same opcode.) These instructions check the x87 FPU status word for pending unmasked x87 FPU exceptions. If any pending unmasked x87 FPU exceptions are found, they are handled before the processor resumes execution of the instructions (integer, floating-point, or system instruction) in the instruction stream. The WAIT/FWAIT instructions are provided to allow synchronization of instruction execution between the x87 FPU and the processor’s integer unit. See Section 8.6, “x87 FPU Exception Synchronization,” for more information on the use of the WAIT/FWAIT instructions.

8.3.12 Waiting vs. Non-waiting Instructions

All of the x87 FPU instructions except a few special control instructions perform a wait operation (similar to the WAIT/FWAIT instructions), to check for and handle pending unmasked x87 FPU floating-point exceptions, before they perform their primary operation (such as adding two floating-point numbers). These instructions are called waiting instructions. Some of the x87 FPU control instructions, such as FSTSW/FNSTSW, have both a waiting and a non-waiting version. The waiting version (with the “F” prefix) executes a wait operation before it performs its primary operation; whereas, the non-waiting version (with the “FN” prefix) ignores pending unmasked exceptions.

Non-waiting instructions allow software to save the current x87 FPU state without first handling pending exceptions or to reset or reinitialize the x87 FPU without regard for pending exceptions.

NOTES
When operating a Pentium or Intel486 processor in MS-DOS compatibility mode, it is possible (under unusual circumstances) for a non-waiting instruction to be interrupted prior to being executed to handle a pending x87 FPU exception. The circumstances where this can happen and the resulting action of the processor are described in Section D.2.1.3, “No-Wait x87 FPU Instructions Can Get x87 FPU Interrupt in Window.”

When operating a P6 family, Pentium 4, or Intel Xeon processor in MS-DOS compatibility mode, non-waiting instructions cannot be interrupted in this way (see Section D.2.2, “MS-DOS Compatibility Sub-mode in the P6 Family and Pentium 4 Processors”).

8.3.13 Unsupported x87 FPU Instructions

The Intel 8087 instructions FENI and FDISI and the Intel 287 math coprocessor instruction FSETPM perform no function in the Intel 387 math coprocessor and later IA-32 processors. If these opcodes are detected in the instruction stream, the x87 FPU performs no specific operation and no internal x87 FPU states are affected.

8.4 X87 FPU FLOATING-POINT EXCEPTION HANDLING

The x87 FPU detects the six classes of exception conditions described in Section 4.9, “Overview of Floating-Point Exceptions”:

• Invalid operation (#I), with two subclasses:
  - Stack overflow or underflow (#IS)
  - Invalid arithmetic operation (#IA)
• Denormalized operand (#D)
• Divide-by-zero (#Z)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (precision) (#P)

Each of the six exception classes has a corresponding flag bit in the x87 FPU status word and a mask bit in the x87 FPU control word (see Section 8.1.3, “x87 FPU Status Register,” and Section 8.1.5, “x87 FPU Control Word,” respectively). In addition, the exception summary (ES) flag in the status word indicates when one or more unmasked exceptions have been detected. The stack fault (SF) flag (also in the status word) distinguishes between the two types of invalid-operation exceptions.

The mask bits can be set with FLDCW, FRSTOR, or FXRSTOR; they can be read with either FSTCW/FNSTCW, FSAVE/FNSAVE, or FXSAVE. The flag bits can be read with the FSTSW/FNSTSW, FSAVE/FNSAVE, or FXSAVE instruction.
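The pairing of flag bits and mask bits can be illustrated by decoding copies of the two words in Python. This is a sketch under stated assumptions: the function names are hypothetical, and the bit positions follow the x87 status word and control word layouts (IE, DE, ZE, OE, UE, and PE in bits 0 through 5, SF in bit 6, ES in bit 7), several of which are described in the sections that follow.

```python
# Exception flag bits in the status word; the matching mask bits occupy the
# same bit positions in the control word.
EXCEPTION_BITS = {"IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5}
SF_BIT = 6  # stack fault flag (status word only)
ES_BIT = 7  # exception summary flag (status word only)


def unmasked_exceptions(status_word: int, control_word: int) -> list:
    """Names of exceptions that are flagged in the status word and not masked."""
    names = []
    for name, bit in EXCEPTION_BITS.items():
        flagged = (status_word >> bit) & 1
        masked = (control_word >> bit) & 1
        if flagged and not masked:
            names.append(name)
    return names


def invalid_operation_kind(status_word: int):
    """Return '#IS' or '#IA' when IE is set, or None when IE is clear.

    SF is not cleared by a later invalid-arithmetic condition, so a stale SF
    flag can make this report '#IS' unless software clears the flags first.
    """
    if not status_word & 1:
        return None
    return "#IS" if (status_word >> SF_BIT) & 1 else "#IA"


print(unmasked_exceptions(0x0005, 0x037F))  # []
print(unmasked_exceptions(0x0005, 0x037B))  # ['ZE']
print(invalid_operation_kind(0x0041))       # #IS
```

In the first call all six mask bits are set, so nothing is reported; clearing mask bit 2 in the second call exposes the divide-by-zero flag.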

NOTE
Section 4.9.1, “Floating-Point Exception Conditions,” provides a general overview of how the IA-32 processor detects and handles the various classes of floating-point exceptions. This information pertains to x87 FPU as well as SSE/SSE2/SSE3 extensions.

The following sections give specific information about how the x87 FPU handles floating-point exceptions that are unique to the x87 FPU.

8.4.1 Arithmetic vs. Non-arithmetic Instructions

When dealing with floating-point exceptions, it is useful to distinguish between arithmetic instructions and non-arithmetic instructions. Non-arithmetic instructions have no operands or do not make substantial changes to their operands. Arithmetic instructions do make significant changes to their operands; in particular, they make changes that could result in floating-point exceptions being signaled. Table 8-9 lists the non-arithmetic and arithmetic instructions. It should be noted that some non-arithmetic instructions can signal a floating-point stack (fault) exception, but this exception is not the result of an operation on an operand.

Table 8-9. Arithmetic and Non-arithmetic Instructions

Non-arithmetic Instructions             Arithmetic Instructions
FABS                                    F2XM1
FCHS                                    FADD/FADDP
FCLEX                                   FBLD
FDECSTP                                 FBSTP
FFREE                                   FCOM/FCOMP/FCOMPP
FINCSTP                                 FCOS
FINIT/FNINIT                            FDIV/FDIVP/FDIVR/FDIVRP
FLD (register-to-register)              FIADD
FLD (extended format from memory)       FICOM/FICOMP
FLD constant                            FIDIV/FIDIVR
FLDCW                                   FILD
FLDENV                                  FIMUL
FNOP                                    FIST/FISTP (see note 1)
FRSTOR                                  FISUB/FISUBR
FSAVE/FNSAVE                            FLD (single and double)
FST/FSTP (register-to-register)         FMUL/FMULP
FSTP (extended format to memory)        FPATAN
FSTCW/FNSTCW                            FPREM/FPREM1
FSTENV/FNSTENV                          FPTAN
FSTSW/FNSTSW                            FRNDINT
WAIT/FWAIT                              FSCALE
FXAM                                    FSIN
FXCH                                    FSINCOS
                                        FSQRT
                                        FST/FSTP (single and double)
                                        FSUB/FSUBP/FSUBR/FSUBRP
                                        FTST
                                        FUCOM/FUCOMP/FUCOMPP
                                        FXTRACT
                                        FYL2X/FYL2XP1

NOTE:
1. The FISTTP instruction in SSE3 is an arithmetic x87 FPU instruction.

8.5 X87 FPU FLOATING-POINT EXCEPTION CONDITIONS

The following sections describe the various conditions that cause a floating-point exception to be generated by the x87 FPU and the masked response of the x87 FPU when these conditions are detected. Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, list the floating-point exceptions that can be signaled for each floating-point instruction.

See Section 4.9.2, “Floating-Point Exception Priority,” for a description of the rules for exception precedence when more than one floating-point exception condition is detected for an instruction.

8.5.1 Invalid Operation Exception

The floating-point invalid-operation exception occurs in response to two sub-classes of operations:

• Stack overflow or underflow (#IS)
• Invalid arithmetic operand (#IA)

The flag for this exception (IE) is bit 0 of the x87 FPU status word, and the mask bit (IM) is bit 0 of the x87 FPU control word. The stack fault flag (SF) of the x87 FPU status word indicates the type of operation that caused the exception. When the SF flag is set to 1, a stack operation has resulted in stack overflow or underflow; when the flag is cleared to 0, an arithmetic instruction has encountered an invalid operand.

Note that the x87 FPU explicitly sets the SF flag when it detects a stack overflow or underflow condition, but it does not explicitly clear the flag when it detects an invalid-arithmetic-operand condition. As a result, the state of the SF flag can be 1 following an invalid-arithmetic-operation exception, if it was not cleared from the last time a stack overflow or underflow condition occurred. See Section 8.1.3.4, “Stack Fault Flag,” for more information about the SF flag.

8.5.1.1 Stack Overflow or Underflow Exception (#IS)

The x87 FPU tag word keeps track of the contents of the registers in the x87 FPU register stack (see Section 8.1.7, “x87 FPU Tag Word”). It then uses this information to detect two different types of stack faults:

• Stack overflow: An instruction attempts to load a non-empty x87 FPU register from memory. A non-empty register is defined as a register containing a zero (tag value of 01), a valid value (tag value of 00), or a special value (tag value of 10).

• Stack underflow: An instruction references an empty x87 FPU register as a source operand, including attempting to write the contents of an empty register to memory. An empty register has a tag value of 11.

NOTES
The term stack overflow originates from the situation where the program has loaded (pushed) eight values from memory onto the x87 FPU register stack and the next value pushed on the stack causes a stack wraparound to a register that already contains a value. The term stack underflow originates from the opposite situation. Here, a program has stored (popped) eight values from the x87 FPU register stack to memory and the next value popped from the stack causes stack wraparound to an empty register.

When the x87 FPU detects stack overflow or underflow, it sets the IE flag (bit 0) and the SF flag (bit 6) in the x87 FPU status word to 1. It then sets condition-code flag C1 (bit 9) in the x87 FPU status word to 1 if stack overflow occurred or to 0 if stack underflow occurred.

If the invalid-operation exception is masked, the x87 FPU returns the floating point, integer, or packed decimal integer indefinite value to the destination operand, depending on the instruction being executed. This value overwrites the destination register or memory location specified by the instruction.

If the invalid-operation exception is not masked, a software exception handler is invoked (see Section 8.7, “Handling x87 FPU Exceptions in Software”) and the top-of-stack pointer (TOP) and source operands remain unchanged.
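The wraparound behavior described in the notes can be modeled with a small Python class. This is a deliberately simplified sketch, not the processor's implementation: the class name `X87StackModel` and its methods are hypothetical, every pushed value is tagged as valid (the zero and special tags are not modeled), and on a fault the model only sets the flags and leaves TOP unchanged, without producing the masked indefinite response.

```python
EMPTY = 0b11  # tag value for an empty register
VALID = 0b00  # tag value for a valid value

IE_FLAG = 1 << 0  # invalid-operation flag
SF_FLAG = 1 << 6  # stack fault flag
C1_FLAG = 1 << 9  # condition code C1: 1 = overflow, 0 = underflow


class X87StackModel:
    """Eight-register stack with tags, tracking only IE, SF, and C1."""

    def __init__(self):
        self.top = 0
        self.tags = [EMPTY] * 8
        self.regs = [None] * 8
        self.status = 0

    def _stack_fault(self, overflow: bool) -> None:
        self.status |= IE_FLAG | SF_FLAG
        if overflow:
            self.status |= C1_FLAG
        else:
            self.status &= ~C1_FLAG

    def push(self, value: float) -> bool:
        """Load a value; report overflow if the target register is not empty."""
        new_top = (self.top - 1) % 8
        if self.tags[new_top] != EMPTY:
            self._stack_fault(overflow=True)
            return False
        self.top = new_top
        self.regs[new_top] = value
        self.tags[new_top] = VALID
        return True

    def pop(self):
        """Store and pop ST(0); report underflow if the register is empty."""
        if self.tags[self.top] == EMPTY:
            self._stack_fault(overflow=False)
            return None
        value = self.regs[self.top]
        self.tags[self.top] = EMPTY
        self.top = (self.top + 1) % 8
        return value


fpu = X87StackModel()
for i in range(8):
    fpu.push(float(i))
print(fpu.push(8.0), hex(fpu.status))  # False 0x241
```

The ninth push wraps around to a register that still holds a value, so IE, SF, and C1 are all set, giving a status value of 0241H in this model.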

8.5.1.2 Invalid Arithmetic Operand Exception (#IA)

The x87 FPU is able to detect a variety of invalid arithmetic operations that can be coded in a program. These operations are listed in Table 8-10. (This list includes the invalid operations defined in IEEE Standard 754.)

When the x87 FPU detects an invalid arithmetic operand, it sets the IE flag (bit 0) in the x87 FPU status word to 1. If the invalid-operation exception is masked, the x87 FPU then returns an indefinite value or QNaN to the destination operand and/or sets the floating-point condition codes as shown in Table 8-10. If the invalid-operation exception is not masked, a software exception handler is invoked (see Section 8.7, “Handling x87 FPU Exceptions in Software”) and the top-of-stack pointer (TOP) and source operands remain unchanged.

Table 8-10. Invalid Arithmetic Operations and the Masked Responses to Them

Condition: Any arithmetic operation on an operand that is in an unsupported format.
Masked Response: Return the QNaN floating-point indefinite value to the destination operand.

Condition: Any arithmetic operation on a SNaN.
Masked Response: Return a QNaN to the destination operand (see Table 4-7).

Condition: Ordered compare and test operations: one or both operands are NaNs.
Masked Response: Set the condition code flags (C0, C2, and C3) in the x87 FPU status word or the CF, PF, and ZF flags in the EFLAGS register to 111B (not comparable).

Condition: Addition: operands are opposite-signed infinities. Subtraction: operands are like-signed infinities.
Masked Response: Return the QNaN floating-point indefinite value to the destination operand.

Condition: Multiplication: ∞ by 0; 0 by ∞.
Masked Response: Return the QNaN floating-point indefinite value to the destination operand.

Condition: Division: ∞ by ∞; 0 by 0.
Masked Response: Return the QNaN floating-point indefinite value to the destination operand.

Condition: Remainder instructions FPREM, FPREM1: modulus (divisor) is 0 or dividend is ∞.
Masked Response: Return the QNaN floating-point indefinite; clear condition code flag C2 to 0.

Condition: Trigonometric instructions FCOS, FPTAN, FSIN, FSINCOS: source operand is ∞.
Masked Response: Return the QNaN floating-point indefinite; clear condition code flag C2 to 0.

Condition: FSQRT: negative operand (except FSQRT (–0) = –0); FYL2X: negative operand (except FYL2X (–0) = –∞); FYL2XP1: operand more negative than –1.
Masked Response: Return the QNaN floating-point indefinite value to the destination operand.

Condition: FBSTP: Converted value cannot be represented in 18 decimal digits, or source value is an SNaN, QNaN, ±∞, or in an unsupported format.
Masked Response: Store packed BCD integer indefinite value in the destination operand.

Condition: FIST/FISTP: Converted value exceeds representable integer range of the destination operand, or source value is an SNaN, QNaN, ±∞, or in an unsupported format.
Masked Response: Store integer indefinite value in the destination operand.

Condition: FXCH: one or both registers are tagged empty.
Masked Response: Load empty registers with the QNaN floating-point indefinite value, then perform the exchange.

Normally, when one or both of the source operands is a QNaN (and neither is an SNaN or in an unsupported format), an invalid-operand exception is not generated. The exceptions to this rule are most of the compare instructions (such as the FCOM and FCOMI instructions) and the floating-point to integer conversion instructions (FIST/FISTP and FBSTP). With these instructions, a QNaN source operand will generate an invalid-operand exception.
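Several of the IEEE Standard 754 invalid operations listed in Table 8-10 can be observed from Python, whose floats follow IEEE double-precision arithmetic and deliver a quiet NaN as the default (masked-style) result. This is only an analogy to the x87 masked response, under the assumption of a standard CPython build; it does not show the x87 indefinite encoding, and Python raises its own exceptions for 0.0 / 0.0 and for `math.sqrt` of a negative number, so those cases are left out.

```python
import math

inf = math.inf

invalid_results = {
    "addition of opposite-signed infinities": inf + (-inf),
    "subtraction of like-signed infinities": inf - inf,
    "multiplication of infinity by 0": inf * 0.0,
    "multiplication of 0 by infinity": 0.0 * inf,
    "division of infinity by infinity": inf / inf,
}

for condition, result in invalid_results.items():
    print(condition, "->", result)  # each result is a NaN

# The square root of -0 is not invalid: the result is -0.
print(math.sqrt(-0.0))
```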

8.5.2 Denormal Operand Exception (#D)

The x87 FPU signals the denormal-operand exception under the following conditions:

• If an arithmetic instruction attempts to operate on a denormal operand (see Section 4.8.3.2, “Normalized and Denormalized Finite Numbers”).

• If an attempt is made to load a denormal single-precision or double-precision floating-point value into an x87 FPU register. (If the denormal value being loaded is a double extended-precision floating-point value, the denormal-operand exception is not reported.)

The flag (DE) for this exception is bit 1 of the x87 FPU status word, and the mask bit (DM) is bit 1 of the x87 FPU control word.

When a denormal-operand exception occurs and the exception is masked, the x87 FPU sets the DE flag, then proceeds with the instruction. The denormal operand in single- or double-precision floating-point format is automatically normalized when converted to the double extended-precision floating-point format. Subsequent operations will benefit from the additional precision of the internal double extended-precision floating-point format.

When a denormal-operand exception occurs and the exception is not masked, the DE flag is set and a software exception handler is invoked (see Section 8.7, “Handling x87 FPU Exceptions in Software”). The top-of-stack pointer (TOP) and source operands remain unchanged.

For additional information about the denormal-operand exception, see Section 4.9.1.2, “Denormal Operand Exception (#D).”

8.5.3 Divide-By-Zero Exception (#Z)

The x87 FPU reports a floating-point divide-by-zero exception whenever an instruction attempts to divide a finite non-zero operand by 0. The flag (ZE) for this exception is bit 2 of the x87 FPU status word, and the mask bit (ZM) is bit 2 of the x87 FPU control word. The FDIV, FDIVP, FDIVR, FDIVRP, FIDIV, and FIDIVR instructions and the other instructions that perform division internally (FYL2X and FXTRACT) can report the divide-by-zero exception.

When a divide-by-zero exception occurs and the exception is masked, the x87 FPU sets the ZE flag and returns the values shown in Table 8-11. If the divide-by-zero exception is not masked, the ZE flag is set, a software exception handler is invoked (see Section 8.7, “Handling x87 FPU Exceptions in Software”), and the top-of-stack pointer (TOP) and source operands remain unchanged.
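The masked response for a divide or reverse divide operation (the first row of Table 8-11) can be sketched in Python. The function name `masked_divide` and the returned flag are hypothetical; the sketch covers only a finite non-zero dividend with a zero divisor and ordinary division, and it rejects the other zero-divisor cases, which are not divide-by-zero conditions.

```python
import math


def masked_divide(dividend: float, divisor: float) -> tuple:
    """Return (result, ze_flag) for a division with the #Z exception masked."""
    if divisor == 0.0:
        if dividend == 0.0 or not math.isfinite(dividend):
            raise ValueError("not a divide-by-zero condition; outside this sketch")
        # The sign of the infinity is the exclusive OR of the operand signs.
        dividend_negative = math.copysign(1.0, dividend) < 0
        divisor_negative = math.copysign(1.0, divisor) < 0
        negative = dividend_negative != divisor_negative
        return (-math.inf if negative else math.inf), True
    return dividend / divisor, False


print(masked_divide(1.0, 0.0))    # (inf, True)
print(masked_divide(1.0, -0.0))   # (-inf, True)
print(masked_divide(-3.0, -0.0))  # (inf, True)
print(masked_divide(6.0, 3.0))    # (2.0, False)
```

Using `math.copysign` rather than a comparison with zero is what lets the sketch distinguish a divisor of +0 from a divisor of -0.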

Table 8-11. Divide-By-Zero Conditions and the Masked Responses to Them

Condition: Divide or reverse divide operation with a 0 divisor.
Masked Response: Returns an ∞ signed with the exclusive OR of the sign of the two operands to the destination operand.

Condition: FYL2X instruction.
Masked Response: Returns an ∞ signed with the opposite sign of the non-zero operand to the destination operand.

Condition: FXTRACT instruction.
Masked Response: ST(1) is set to –∞; ST(0) is set to 0 with the same sign as the source operand.

8.5.4 Numeric Overflow Exception (#O)

The x87 FPU report s a float ing- point num eric overflow except ion ( # O) whenever t he rounded result of an arit hm et ic inst ruct ion exceeds t he largest allowable finit e value t hat will fit int o t he float ing- point form at of t he dest inat ion operand. ( See Sect ion 4.9.1.4, “ Num eric Overflow Except ion ( # O) ,” for addit ional inform at ion about t he num eric overflow except ion.) When using t he x87 FPU, num eric overflow can occur on arit hm et ic operat ions where t he result is st ored in an x87 FPU dat a regist er. I t can also occur on st ore float ingpoint operat ions ( using t he FST and FSTP inst ruct ions) , where a wit hin- range value in a dat a regist er is st ored in m em ory in a single- precision or double- precision float ing- point form at . The num eric overflow except ion cannot occur when st oring values in an int eger or BCD int eger form at . I nst ead, t he invalid- arit hm et ic- operand except ion is signaled. The flag ( OE) for t he num eric- overflow except ion is bit 3 of t he x87 FPU st at us word, and t he m ask bit ( OM) is bit 3 of t he x87 FPU cont rol word. When a num eric- overflow except ion occurs and t he except ion is m asked, t he x87 FPU set s t he OE flag and ret urns one of t he values shown in Table 4- 10. The value ret urned depends on t he current rounding m ode of t he x87 FPU ( see Sect ion 8.1.5.3, “ Rounding Cont rol Field” ) .


The action that the x87 FPU takes when numeric overflow occurs and the numeric-overflow exception is not masked depends on whether the instruction is supposed to store the result in memory or on the register stack.

• Destination is a memory location: The OE flag is set and a software exception handler is invoked (see Section 8.7, “Handling x87 FPU Exceptions in Software”). The top-of-stack pointer (TOP) and source and destination operands remain unchanged. Because the data in the stack is in double extended-precision format, the exception handler has the option either of re-executing the store instruction after proper adjustment of the operand or of rounding the significand on the stack to the destination's precision as the standard requires. The exception handler should ultimately store a value into the destination location in memory if the program is to continue.

• Destination is the register stack: The significand of the result is rounded according to current settings of the precision and rounding control bits in the x87 FPU control word, and the result is scaled by dividing it by 2^24576 (that is, 24,576 is subtracted from its exponent). (For instructions not affected by the precision field, the significand is rounded to double extended precision.) The resulting value is stored in the destination operand. Condition code bit C1 in the x87 FPU status word (called in this situation the “round-up bit”) is set if the significand was rounded upward and cleared if the result was rounded toward 0. After the result is stored, the OE flag is set and a software exception handler is invoked.

The scaling bias value 24,576 is equal to 3 × 2^13. Biasing the exponent by 24,576 normally translates the number as nearly as possible to the middle of the double extended-precision floating-point exponent range so that, if desired, it can be used in subsequent scaled operations with less risk of causing further exceptions.

When using the FSCALE instruction, massive overflow can occur, where the result is too large to be represented, even with a bias-adjusted exponent. Here, if overflow occurs again after the result has been biased, a properly signed ∞ is stored in the destination operand.

8.5.5

Numeric Underflow Exception (#U)

The x87 FPU detects a floating-point numeric underflow condition whenever the rounded result of an arithmetic instruction is tiny; that is, less than the smallest possible normalized, finite value that will fit into the floating-point format of the destination operand. (See Section 4.9.1.5, “Numeric Underflow Exception (#U),” for additional information about the numeric underflow exception.)

Like numeric overflow, numeric underflow can occur on arithmetic operations where the result is stored in an x87 FPU data register. It can also occur on store floating-point operations (with the FST and FSTP instructions), where a within-range value in a data register is stored in memory in the smaller single-precision or double-precision floating-point formats. A numeric underflow exception cannot occur when storing values in an integer or BCD integer format, because a tiny value is always rounded to an integral value of 0 or 1, depending on the rounding mode in effect.


The flag (UE) for the numeric-underflow exception is bit 4 of the x87 FPU status word, and the mask bit (UM) is bit 4 of the x87 FPU control word.

When a numeric-underflow condition occurs and the exception is masked, the x87 FPU performs the operation described in Section 4.9.1.5, “Numeric Underflow Exception (#U).”

When the exception is not masked, the action of the x87 FPU depends on whether the instruction is supposed to store the result in a memory location or on the x87 FPU register stack.

• Destination is a memory location: (Can occur only with a store instruction.) The UE flag is set and a software exception handler is invoked (see Section 8.7, “Handling x87 FPU Exceptions in Software”). The top-of-stack pointer (TOP) and source and destination operands remain unchanged, and no result is stored in memory. Because the data in the stack is in double extended-precision format, the exception handler has the option either of re-executing the store instruction after proper adjustment of the operand or of rounding the significand on the stack to the destination's precision as the standard requires. The exception handler should ultimately store a value into the destination location in memory if the program is to continue.

• Destination is the register stack: The significand of the result is rounded according to current settings of the precision and rounding control bits in the x87 FPU control word, and the result is scaled by multiplying it by 2^24576 (that is, 24,576 is added to its exponent). (For instructions not affected by the precision field, the significand is rounded to double extended precision.) The resulting value is stored in the destination operand. Condition code bit C1 in the x87 FPU status word (acting here as a “round-up bit”) is set if the significand was rounded upward and cleared if the result was rounded toward 0. After the result is stored, the UE flag is set and a software exception handler is invoked.

The scaling bias value 24,576 is the same as is used for the overflow exception and has the same effect, which is to translate the result as nearly as possible to the middle of the double extended-precision floating-point exponent range.

When using the FSCALE instruction, massive underflow can occur, where the result is too tiny to be represented, even with a bias-adjusted exponent. Here, if underflow occurs again after the result has been biased, a properly signed 0 is stored in the destination operand.
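The exponent adjustment applied to register-stack destinations for unmasked overflow and underflow can be sketched as follows. This is a minimal model, not a description of the hardware: the function names are hypothetical, and the exponent limits of normalized double extended-precision values (-16382 to 16383) are an assumption taken from the format's definition rather than from this section.

```python
# Unbiased exponent range of normalized double extended-precision values
# (assumption: taken from the format definition, not from this section).
EXT_EMIN, EXT_EMAX = -16382, 16383

# Scaling bias named in the text: 24,576 = 3 * 2**13.
SCALE_BIAS = 3 * 2**13


def bias_unmasked_result(exponent, overflow):
    """Return the exponent stored in the register for an unmasked #O or #U.

    Overflow divides the result by 2**24576 (the exponent decreases);
    underflow multiplies it by 2**24576 (the exponent increases).
    """
    return exponent - SCALE_BIAS if overflow else exponent + SCALE_BIAS


def in_range(exponent):
    """True if the exponent fits a normalized double extended-precision value."""
    return EXT_EMIN <= exponent <= EXT_EMAX
```

In this model, a result that barely overflows (exponent 16384) is stored with exponent -8192, near the middle of the range. An FSCALE result with an exponent such as 50000 is still out of range after biasing, which corresponds to the massive-overflow case in which ∞ is stored.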

8.5.6

Inexact-Result (Precision) Exception (#P)

The inexact-result exception (also called the precision exception) occurs if the result of an operation is not exactly representable in the destination format. (See Section 4.9.1.6, “Inexact-Result (Precision) Exception (#P),” for additional information about the inexact-result exception.) Note that the transcendental instructions (FSIN, FCOS, FSINCOS, FPTAN, FPATAN, F2XM1, FYL2X, and FYL2XP1) by nature produce inexact results.


The inexact-result exception flag (PE) is bit 5 of the x87 FPU status word, and the mask bit (PM) is bit 5 of the x87 FPU control word.

If the inexact-result exception is masked when an inexact-result condition occurs and a numeric overflow or underflow condition has not occurred, the x87 FPU handles the exception as described in Section 4.9.1.6, “Inexact-Result (Precision) Exception (#P),” with one additional action. The C1 (round-up) bit in the x87 FPU status word is set to indicate whether the inexact result was rounded up (C1 is set) or “not rounded up” (C1 is cleared). In the “not rounded up” case, the least-significant bits of the inexact result are truncated so that the result fits in the destination format.

If the inexact-result exception is not masked when an inexact result occurs and numeric overflow or underflow has not occurred, the x87 FPU handles the exception as described in the previous paragraph and, in addition, invokes a software exception handler.

If an inexact result occurs in conjunction with numeric overflow or underflow, the x87 FPU carries out one of the following operations:

• If an inexact result occurs in conjunction with masked overflow or underflow, the OE or UE flag and the PE flag are set and the result is stored as described for the overflow or underflow exceptions (see Section 8.5.4, “Numeric Overflow Exception (#O),” or Section 8.5.5, “Numeric Underflow Exception (#U)”). If the inexact-result exception is unmasked, the x87 FPU also invokes a software exception handler.

• If an inexact result occurs in conjunction with unmasked overflow or underflow and the destination operand is a register, the OE or UE flag and the PE flag are set, the result is stored as described for the overflow or underflow exceptions (see Section 8.5.4, “Numeric Overflow Exception (#O),” or Section 8.5.5, “Numeric Underflow Exception (#U)”), and a software exception handler is invoked.

• If an unmasked numeric overflow or underflow exception occurs and the destination operand is a memory location (which can happen only for a floating-point store), the inexact-result condition is not reported and the C1 flag is cleared.
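The flag and mask positions given in Sections 8.5.3 through 8.5.6 can be combined into a small decoder, sketched below. The function and table names are hypothetical. The position of C1 (bit 9 of the status word) is an assumption taken from the status word layout described elsewhere in this chapter, and the control word values in the usage note assume the 037FH value that FINIT loads, with all exceptions masked.

```python
# Bit positions shared by the status word flags and control word masks,
# as stated in Sections 8.5.3 through 8.5.6.
EXCEPTION_BITS = {"ZE": 2, "OE": 3, "UE": 4, "PE": 5}

# Assumption: C1 is bit 9 of the x87 FPU status word.
C1_BIT = 9


def decode_status(status_word, control_word):
    """Return (raised, unmasked, c1) for the given status and control words.

    raised   -- names of exception flags set in the status word
    unmasked -- subset of raised whose mask bit is clear in the control word
    c1       -- True if the round-up bit is set
    """
    raised = [name for name, bit in EXCEPTION_BITS.items()
              if (status_word >> bit) & 1]
    unmasked = [name for name in raised
                if not (control_word >> EXCEPTION_BITS[name]) & 1]
    c1 = bool((status_word >> C1_BIT) & 1)
    return raised, unmasked, c1
```

For example, a status word of 0228H (OE, PE, and C1 set) with a control word of 037FH reports both flags as raised but none as unmasked. Clearing bit 3 of the control word (0377H) reports OE as unmasked.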

8.6

X87 FPU EXCEPTION SYNCHRONIZATION

Because the integer unit and x87 FPU are separate execution units, it is possible for the processor to execute floating-point, integer, and system instructions concurrently. No special programming techniques are required to gain the advantages of concurrent execution. (Floating-point instructions are placed in the instruction stream along with the integer and system instructions.) However, concurrent execution can cause problems for floating-point exception handlers.

This problem is related to the way the x87 FPU signals the existence of unmasked floating-point exceptions. (Special exception synchronization is not required for masked floating-point exceptions, because the x87 FPU always returns a masked result to the destination operand.)

When a floating-point exception is unmasked and the exception condition occurs, the x87 FPU stops further execution of the floating-point instruction and signals the exception event. On the next occurrence of a floating-point instruction or a WAIT/FWAIT instruction in the instruction stream, the processor checks the ES flag in the x87 FPU status word for pending floating-point exceptions. If floating-point exceptions are pending, the x87 FPU makes an implicit call (traps) to the floating-point software exception handler. The exception handler can then execute recovery procedures for selected or all floating-point exceptions.

Synchronization problems occur in the time between the moment when the exception is signaled and when it is actually handled. Because of concurrent execution, integer or system instructions can be executed during this time. It is thus possible for the source or destination operands for a floating-point instruction that faulted to be overwritten in memory, making it impossible for the exception handler to analyze or recover from the exception.

To solve this problem, an exception synchronizing instruction (either a floating-point instruction or a WAIT/FWAIT instruction) can be placed immediately after any floating-point instruction that might present a situation where state information pertaining to a floating-point exception might be lost or corrupted. Floating-point instructions that store data in memory are prime candidates for synchronization. For example, the following three lines of code have the potential for exception synchronization problems:

    FILD COUNT      ;Floating-point instruction
    INC COUNT       ;Integer instruction
    FSQRT           ;Subsequent floating-point instruction

In this example, the INC instruction modifies the source operand of the floating-point instruction, FILD. If an exception is signaled during the execution of the FILD instruction, the INC instruction would be allowed to overwrite the value stored in the COUNT memory location before the floating-point exception handler is called. With the COUNT variable modified, the floating-point exception handler would not be able to recover from the error.

Rearranging the instructions, as follows, so that the FSQRT instruction follows the FILD instruction, synchronizes floating-point exception handling and eliminates the possibility of the COUNT variable being overwritten before the floating-point exception handler is invoked.

    FILD COUNT      ;Floating-point instruction
    FSQRT           ;Subsequent floating-point instruction synchronizes
                    ;any exceptions generated by the FILD instruction.
    INC COUNT       ;Integer instruction
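The difference between the two orderings can be illustrated with a toy simulation of deferred exception delivery. This is a sketch under simplifying assumptions, not a model of the processor: the run function and the SYNCHRONIZING set are hypothetical, and the caller simply nominates which instruction raises an unmasked exception.

```python
# Instructions that check for a pending exception in this toy model
# (assumption: only the mnemonics used in the example, plus WAIT).
SYNCHRONIZING = {"FILD", "FSQRT", "WAIT"}


def run(program, count, faulting_index):
    """Return the value of COUNT that the exception handler observes.

    program        -- list of mnemonics, all operating on COUNT
    count          -- initial value of the COUNT memory location
    faulting_index -- index of the instruction that raises an unmasked exception
    """
    pending = False
    for index, op in enumerate(program):
        if pending and op in SYNCHRONIZING:
            return count          # handler runs here, before op executes
        if op == "INC":
            count += 1            # integer unit keeps running while pending
        if index == faulting_index:
            pending = True        # exception is signaled but not yet handled
    return count if pending else None


unsynchronized = ["FILD", "INC", "FSQRT"]
synchronized = ["FILD", "FSQRT", "INC"]
```

With COUNT initially 5 and FILD faulting, the handler in the first ordering sees 6 (the operand has already been overwritten), while in the second ordering it sees the original value 5.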

The FSQRT instruction does not require any synchronization, because the results of this instruction are stored in the x87 FPU data registers and will remain there, undisturbed, until the next floating-point or WAIT/FWAIT instruction is executed. To absolutely ensure that any exceptions emanating from the FSQRT instruction are handled (for example, prior to a procedure call), a WAIT instruction can be placed directly after the FSQRT instruction.

Note that some floating-point instructions (non-waiting instructions) do not check for pending unmasked exceptions (see Section 8.3.11, “x87 FPU Control Instructions”). They include the FNINIT, FNSTENV, FNSAVE, FNSTSW, FNSTCW, and FNCLEX instructions. When an FNINIT, FNSTENV, FNSAVE, or FNCLEX instruction is executed, all pending exceptions are essentially lost (either the x87 FPU status register is cleared or all exceptions are masked). The FNSTSW and FNSTCW instructions do not check for pending exceptions, but they do not modify the x87 FPU status and control registers. A subsequent “waiting” floating-point instruction can then handle any pending exceptions.

8.7

HANDLING X87 FPU EXCEPTIONS IN SOFTWARE

The x87 FPU in Pentium and later IA-32 processors provides two different modes of operation for invoking a software exception handler for floating-point exceptions: native mode and MS-DOS compatibility mode. The mode of operation is selected by CR0.NE[bit 5]. (See Chapter 2, “System Architecture Overview,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information about the NE flag.)

8.7.1

Native Mode

The native mode for handling floating-point exceptions is selected by setting CR0.NE[bit 5] to 1. In this mode, if the x87 FPU detects an exception condition while executing a floating-point instruction and the exception is unmasked (the mask bit for the exception is cleared), the x87 FPU sets the flag for the exception and the ES flag in the x87 FPU status word. It then invokes the software exception handler through the floating-point-error exception (#MF, vector 16), immediately before execution of any of the following instructions in the processor’s instruction stream:

• The next floating-point instruction, unless it is one of the non-waiting instructions (FNINIT, FNCLEX, FNSTSW, FNSTCW, FNSTENV, and FNSAVE).

• The next WAIT/FWAIT instruction.

• The next MMX instruction.

If the next floating-point instruction in the instruction stream is a non-waiting instruction, the x87 FPU executes the instruction without invoking the software exception handler.
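The rule for which instruction triggers delivery of a pending #MF in native mode can be written as a predicate. In this sketch, the function name and the kind argument (a caller-supplied classification of 'x87', 'mmx', or 'other') are hypothetical; the set of non-waiting instructions is the one listed above.

```python
# Non-waiting x87 instructions, as listed in the text.
NON_WAITING = {"FNINIT", "FNCLEX", "FNSTSW", "FNSTCW", "FNSTENV", "FNSAVE"}


def invokes_pending_handler(mnemonic, kind):
    """True if this instruction causes a pending #MF to be delivered first.

    kind is a hypothetical classification supplied by the caller:
    'x87', 'mmx', or 'other'.
    """
    if mnemonic in ("WAIT", "FWAIT"):
        return True
    if kind == "mmx":
        return True
    if kind == "x87":
        return mnemonic not in NON_WAITING
    return False
```

Under these assumptions, FSQRT and PADDW both cause the handler to run first, while FNSTSW and an integer INC do not.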

8.7.2

MS-DOS* Compatibility Sub-mode

If CR0.NE[bit 5] is 0, the MS-DOS compatibility mode for handling floating-point exceptions is selected. In this mode, the software exception handler for floating-point exceptions is invoked externally using the processor’s FERR#, INTR, and IGNNE# pins. This method of reporting floating-point errors and invoking an exception handler is provided to support the floating-point exception handling mechanism used in PC systems that are running the MS-DOS or Windows* 95 operating system.

The MS-DOS compatibility mode is typically used as follows to invoke the floating-point exception handler:

1. If the x87 FPU detects an unmasked floating-point exception, it sets the flag for the exception and the ES flag in the x87 FPU status word.

2. If the IGNNE# pin is deasserted, the x87 FPU then asserts the FERR# pin either immediately, or else delayed (deferred) until just before the execution of the next waiting floating-point instruction or MMX instruction. Whether the FERR# pin is asserted immediately or delayed depends on the type of processor, the instruction, and the type of exception.

3. If a preceding floating-point instruction has set the exception flag for an unmasked x87 FPU exception, the processor freezes just before executing the next WAIT instruction, waiting floating-point instruction, or MMX instruction. Whether the FERR# pin was asserted at the preceding floating-point instruction or is just now being asserted, the freezing of the processor assures that the x87 FPU exception handler will be invoked before the new floating-point (or MMX) instruction gets executed.

4. The FERR# pin is connected through external hardware to IRQ13 of a cascaded, programmable interrupt controller (PIC). When the FERR# pin is asserted, the PIC is programmed to generate an interrupt 75H.

5. The PIC asserts the INTR pin on the processor to signal the interrupt 75H.

6. The BIOS for the PC system handles the interrupt 75H by branching to the interrupt 02H (NMI) interrupt handler.

7. The interrupt 02H handler determines if the interrupt is the result of an NMI interrupt or a floating-point exception.

8. If a floating-point exception is detected, the interrupt 02H handler branches to the floating-point exception handler.

If the IGNNE# pin is asserted, the processor ignores floating-point error conditions. This pin is provided to inhibit floating-point exceptions from being generated while the floating-point exception handler is servicing a previously signaled floating-point exception.

Appendix D, “Guidelines for Writing x87 FPU Exception Handlers,” describes the MS-DOS compatibility mode in much greater detail. This mode is somewhat more complicated in the Intel486 and Pentium processor implementations, as described in Appendix D.
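The mode selection by CR0.NE and the mapping from IRQ13 to interrupt 75H can be checked with a short sketch. The function names are hypothetical, and the vector base of 70H for the cascaded PIC is an assumption based on the conventional PC/AT programming of that controller (IRQ8 through IRQ15 mapped to 70H through 77H); it is not stated in this section.

```python
CR0_NE_BIT = 5  # CR0.NE[bit 5], as given in the text

# Assumption: conventional PC/AT vector base for the cascaded PIC.
CASCADED_PIC_BASE = 0x70


def fp_exception_mode(cr0):
    """Return the x87 exception reporting mode selected by CR0.NE."""
    return "native" if (cr0 >> CR0_NE_BIT) & 1 else "ms-dos-compatibility"


def cascaded_irq_vector(irq):
    """Return the interrupt vector for IRQ8 through IRQ15 under the assumed base."""
    if not 8 <= irq <= 15:
        raise ValueError("IRQ is not on the cascaded PIC")
    return CASCADED_PIC_BASE + (irq - 8)
```

With these assumptions, cascaded_irq_vector(13) evaluates to 75H, which matches the interrupt number named in step 4.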

8.7.3

Handling x87 FPU Exceptions in Software

Section 4.9.3, “Typical Actions of a Floating-Point Exception Handler,” shows actions that may be carried out by a floating-point exception handler. The state of the x87 FPU can be saved with the FSTENV/FNSTENV or FSAVE/FNSAVE instructions (see Section 8.1.10, “Saving the x87 FPU’s State with FSTENV/FNSTENV and FSAVE/FNSAVE”).

If the faulting floating-point instruction is followed by one or more non-floating-point instructions, it may not be useful to re-execute the faulting instruction. See Section 8.6, “x87 FPU Exception Synchronization,” for more information on synchronizing floating-point exceptions.

In cases where the handler needs to restart program execution with the faulting instruction, the IRET instruction cannot be used directly. Because the exception is not generated until the next floating-point or WAIT/FWAIT instruction following the faulting floating-point instruction, the return instruction pointer on the stack may not point to the faulting instruction. To restart program execution at the faulting instruction, the exception handler must obtain a pointer to the instruction from the saved x87 FPU state information, load it into the return instruction pointer location on the stack, and then execute the IRET instruction.

See Section D.3.4, “x87 FPU Exception Handling Examples,” for general examples of floating-point exception handlers and for specific examples of how to write a floating-point exception handler when using the MS-DOS compatibility mode.
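The restart procedure described above (obtain the faulting instruction pointer from the saved state, then place it in the return frame) can be sketched as follows. The sketch assumes the 28-byte, 32-bit protected-mode environment image written by FSTENV/FNSTENV, with the instruction pointer offset at byte 12 and the selector in the low 16 bits of the doubleword at byte 16; other operating modes use different layouts. The frame dictionary is a hypothetical stand-in for the interrupt stack frame.

```python
import struct

# Assumption: 32-bit protected-mode FSTENV/FNSTENV image (seven doublewords).
ENV_SIZE = 28


def parse_fpu_environment(env):
    """Extract the fields a handler needs from a saved x87 FPU environment."""
    if len(env) != ENV_SIZE:
        raise ValueError("expected a 28-byte environment image")
    (_control, status, _tag, fip_offset,
     fip_selector_opcode, _fdp_offset, _fdp_selector) = struct.unpack("<7I", env)
    return {
        "status_word": status & 0xFFFF,
        "fip_offset": fip_offset,                      # faulting instruction offset
        "fip_selector": fip_selector_opcode & 0xFFFF,  # low 16 bits hold the selector
    }


def retarget_return(frame, env):
    """Return a copy of a (hypothetical) saved frame that resumes at the faulting instruction."""
    saved = parse_fpu_environment(env)
    patched = dict(frame)
    patched["eip"] = saved["fip_offset"]
    return patched
```

A real handler would perform the equivalent update on the stack before executing IRET; the dictionary copy here only illustrates which saved field replaces the return instruction pointer.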


CHAPTER 9
PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY

The Intel MMX technology was introduced into the IA-32 architecture in the Pentium II processor family and Pentium processor with MMX technology. The extensions introduced in MMX technology support a single-instruction, multiple-data (SIMD) execution model that is designed to accelerate the performance of advanced media and communications applications. This chapter describes MMX technology.

9.1

OVERVIEW OF MMX TECHNOLOGY

MMX technology defines a simple and flexible SIMD execution model to handle 64-bit packed integer data. This model adds the following features to the IA-32 architecture, while maintaining backwards compatibility with all IA-32 applications and operating-system code:

• Eight new 64-bit data registers, called MMX registers

• Three new packed data types:
  - 64-bit packed byte integers (signed and unsigned)
  - 64-bit packed word integers (signed and unsigned)
  - 64-bit packed doubleword integers (signed and unsigned)

• Instructions that support the new data types and handle MMX state management

• Extensions to the CPUID instruction

MMX technology is accessible from all the IA-32 architecture execution modes (protected mode, real address mode, and virtual 8086 mode). It does not add any new modes to the architecture.

The following sections of this chapter describe MMX technology’s programming environment, including MMX register set, data types, and instruction set. Additional instructions that operate on MMX registers have been added to the IA-32 architecture by the SSE/SSE2 extensions. For more information, see:

• Section 10.4.4, “SSE 64-Bit SIMD Integer Instructions,” describes MMX instructions added to the IA-32 architecture with the SSE extensions.

• Section 11.4.2, “SSE2 64-Bit and 128-Bit SIMD Integer Instructions,” describes MMX instructions added to the IA-32 architecture with SSE2 extensions.

• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, give detailed descriptions of MMX instructions.

• Chapter 11, “Intel® MMX™ Technology System Programming,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B, describes the manner in which MMX technology is integrated into the IA-32 system programming model.

9.2

THE MMX TECHNOLOGY PROGRAMMING ENVIRONMENT

Figure 9-1 shows the execution environment for MMX technology. All MMX instructions operate on MMX registers, the general-purpose registers, and/or memory as follows:

• MMX registers: These eight registers (see Figure 9-1) are used to perform operations on 64-bit packed integer data. They are named MM0 through MM7.

• General-purpose registers: The eight general-purpose registers (see Figure 3-5) are used with existing IA-32 addressing modes to address operands in memory. (MMX registers cannot be used to address memory.) General-purpose registers are also used to hold operands for some MMX technology operations. They are EAX, EBX, ECX, EDX, EBP, ESI, EDI, and ESP.

Figure 9-1. MMX Technology Execution Environment (address space from 0 to 2^32 - 1; eight 64-bit MMX registers; eight 32-bit general-purpose registers)

9.2.1

MMX Technology in 64-Bit Mode and Compatibility Mode

In compatibility mode and 64-bit mode, MMX instructions function as they do in protected mode. Memory operands are specified using the ModR/M, SIB encoding described in Section 3.7.5.


9.2.2

MMX Registers

The MMX register set consists of eight 64-bit registers (see Figure 9-2) that are used to perform calculations on the MMX packed integer data types. Values in MMX registers have the same format as a 64-bit quantity in memory.

The MMX registers have two data access modes: 64-bit access mode and 32-bit access mode. The 64-bit access mode is used for:

• 64-bit memory accesses

• 64-bit transfers between MMX registers

• All pack, logical, and arithmetic instructions

• Some unpack instructions

The 32-bit access mode is used for:

• 32-bit memory accesses

• 32-bit transfers between general-purpose registers and MMX registers

• Some unpack instructions

Figure 9-2. MMX Register Set (registers MM7 through MM0, each spanning bits 63 through 0)

Although MMX registers are defined in the IA-32 architecture as separate registers, they are aliased to the registers in the FPU data register stack (R0 through R7). See also Section 9.5, “Compatibility with x87 FPU Architecture.”


9.2.3

MMX Data Types

MMX technology introduced the following 64-bit data types to the IA-32 architecture (see Figure 9-3):

• 64-bit packed byte integers: eight packed bytes

• 64-bit packed word integers: four packed words

• 64-bit packed doubleword integers: two packed doublewords

MMX instructions move 64-bit packed data types (packed bytes, packed words, or packed doublewords) and the quadword data type between MMX registers and memory or between MMX registers in 64-bit blocks. However, when performing arithmetic or logical operations on the packed data types, MMX instructions operate in parallel on the individual bytes, words, or doublewords contained in MMX registers (see Section 9.2.5, “Single Instruction, Multiple Data (SIMD) Execution Model”).

Figure 9-3. Data Types Introduced with the MMX Technology (packed byte integers, packed word integers, and packed doubleword integers, each spanning bits 63 through 0)

9.2.4

Memory Data Formats

When stored in memory, the bytes, words, and doublewords in the packed data types are stored in consecutive addresses. The least significant byte, word, or doubleword is stored at the lowest address and the most significant byte, word, or doubleword is stored at the highest address. The ordering of bytes, words, or doublewords in memory is always little endian. That is, the bytes with the low addresses are less significant than the bytes with high addresses.
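The memory ordering described above can be seen with a short sketch that lays out a packed word value byte by byte. The function names are hypothetical; the standard struct module's "<" prefix is used to request little-endian order.

```python
import struct


def pack_words(words):
    """Store four 16-bit words in memory order; words[0] is least significant."""
    return struct.pack("<4H", *words)


def as_quadword(memory):
    """Interpret eight bytes of memory as one 64-bit little-endian quantity."""
    return int.from_bytes(memory, "little")


# Element 0 (0201H) is the least significant word of the packed value.
memory_image = pack_words([0x0201, 0x0403, 0x0605, 0x0807])
```

The byte at the lowest address of memory_image is 01H, the low byte of the least significant word, and the same eight bytes read as a quadword give 0807060504030201H.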

9.2.5

Single Instruction, Multiple Data (SIMD) Execution Model

MMX technology uses the single instruction, multiple data (SIMD) technique for performing arithmetic and logical operations on bytes, words, or doublewords packed into MMX registers (see Figure 9-4). For example, the PADDSW instruction adds 4 signed word integers from one source operand to 4 signed word integers in a second source operand and stores 4 word integer results in a destination operand. This SIMD technique speeds up software performance by allowing the same operation to be carried out on multiple data elements in parallel. MMX technology supports parallel operations on byte, word, and doubleword data elements when contained in MMX registers.

The SIMD execution model supported in the MMX technology directly addresses the needs of modern media, communications, and graphics applications, which often use sophisticated algorithms that perform the same operations on a large number of small data types (bytes, words, and doublewords). For example, most audio data is represented in 16-bit (word) quantities. The MMX instructions can operate on 4 words simultaneously with one instruction. Video and graphics information is commonly represented as palettized 8-bit (byte) quantities. In Figure 9-4, one MMX instruction operates on 8 bytes simultaneously.

Figure 9-4. SIMD Execution Model (each pair of elements Xn and Yn from Source 1 and Source 2 is combined by OP to produce destination element Xn OP Yn, for n = 3 through 0)
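The execution model in Figure 9-4 can be sketched as a function that applies one operation independently to each element of two 64-bit operands. This is an illustrative model with hypothetical names, not the behavior of any specific instruction; results are truncated to the element width, so it corresponds to wraparound arithmetic.

```python
def simd_apply(op, source1, source2, lane_bits):
    """Apply op to each pair of lane_bits-wide elements of two 64-bit values.

    Each result is truncated to the element width, so a carry out of one
    element never reaches its neighbor.
    """
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        x = (source1 >> shift) & mask
        y = (source2 >> shift) & mask
        result |= (op(x, y) & mask) << shift
    return result


def add(x, y):
    return x + y


# Four word elements per operand; the lowest element of a is FFFFH.
a = 0x0001_0002_0003_FFFF
b = 0x0001_0001_0001_0001
word_sum = simd_apply(add, a, b, 16)
```

Here word_sum is 0002000300040000H: the lowest element wraps from FFFFH to 0000H, and the element above it is unaffected by that carry.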

9.3

SATURATION AND WRAPAROUND MODES

When performing integer arithmetic, an operation may result in an out-of-range condition, where the true result cannot be represented in the destination format. For example, when performing arithmetic on signed word integers, positive overflow can occur when the true signed result is larger than 16 bits.

The MMX technology provides three ways of handling out-of-range conditions:

• Wraparound arithmetic: With wraparound arithmetic, a true out-of-range result is truncated (that is, the carry or overflow bit is ignored and only the least significant bits of the result are returned to the destination). Wraparound arithmetic is suitable for applications that control the range of operands to prevent out-of-range results. If the range of operands is not controlled, however, wraparound arithmetic can lead to large errors. For example, adding two large signed numbers can cause positive overflow and produce a negative result.

• Signed saturation arithmetic: With signed saturation arithmetic, out-of-range results are limited to the representable range of signed integers for the integer size being operated on (see Table 9-1). For example, if positive overflow occurs when operating on signed word integers, the result is “saturated” to 7FFFH, which is the largest positive integer that can be represented in 16 bits; if negative overflow occurs, the result is saturated to 8000H.

• Unsigned saturation arithmetic: With unsigned saturation arithmetic, out-of-range results are limited to the representable range of unsigned integers for the integer size. So, positive overflow when operating on unsigned byte integers results in FFH being returned and negative overflow results in 00H being returned.

Table 9-1. Data Range Limits for Saturation

Data Type | Lower Limit (Hexadecimal) | Lower Limit (Decimal) | Upper Limit (Hexadecimal) | Upper Limit (Decimal)
Signed Byte | 80H | -128 | 7FH | 127
Signed Word | 8000H | -32,768 | 7FFFH | 32,767
Unsigned Byte | 00H | 0 | FFH | 255
Unsigned Word | 0000H | 0 | FFFFH | 65,535

Saturation arithmetic provides an answer for many overflow situations. For example, in color calculations, saturation causes a color to remain pure black or pure white without allowing inversion. It also prevents wraparound artifacts from entering into computations when range checking of source operands is not used.

MMX instructions do not indicate overflow or underflow occurrence by generating exceptions or setting flags in the EFLAGS register.
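The three ways of handling an out-of-range result can be compared with two small helper functions. This is a minimal sketch with hypothetical names; it operates on a single element's true (unbounded) result and uses the limits from Table 9-1.

```python
def wrap(value, bits, signed):
    """Truncate a true result to the element width (wraparound arithmetic)."""
    value &= (1 << bits) - 1
    if signed and value >> (bits - 1):
        value -= 1 << bits        # reinterpret the top bit as the sign
    return value


def saturate(value, bits, signed):
    """Clamp a true result to the representable range (saturation arithmetic)."""
    if signed:
        lower, upper = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lower, upper = 0, (1 << bits) - 1
    return max(lower, min(upper, value))
```

For signed words, the true result 32,767 + 1 wraps to -32,768 but saturates to 32,767 (7FFFH). For unsigned bytes, 255 + 1 wraps to 0 but saturates to 255 (FFH), and 0 - 1 saturates to 0 (00H).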

9.4

MMX INSTRUCTIONS

The MMX instruction set consists of 47 instructions, grouped into the following categories:

• Data transfer

• Arithmetic

• Comparison

• Conversion

• Unpacking

• Logical

• Shift

• Empty MMX state instruction (EMMS)

Table 9-2 gives a summary of the instructions in the MMX instruction set. The following sections give a brief overview of the instructions within each group.

9-6 Vol. 1

PROGRAMMING WITH INTEL® MMX™ TECHNOLOGY

NOTES

The MMX instructions described in this chapter are those instructions that are available in an IA-32 processor when CPUID.01H:EDX.MMX[bit 23] = 1.

Section 10.4.4, “SSE 64-Bit SIMD Integer Instructions,” and Section 11.4.2, “SSE2 64-Bit and 128-Bit SIMD Integer Instructions,” list additional instructions included with SSE/SSE2 extensions that operate on the MMX registers but are not considered part of the MMX instruction set.

Table 9-2. MMX Instruction Set Summary

Category         Operation                  Wraparound                         Signed Saturation     Unsigned Saturation
Arithmetic       Addition                   PADDB, PADDW, PADDD                PADDSB, PADDSW        PADDUSB, PADDUSW
                 Subtraction                PSUBB, PSUBW, PSUBD                PSUBSB, PSUBSW        PSUBUSB, PSUBUSW
                 Multiplication             PMULLW, PMULHW
                 Multiply and Add           PMADDWD
Comparison       Compare for Equal          PCMPEQB, PCMPEQW, PCMPEQD
                 Compare for Greater Than   PCMPGTB, PCMPGTW, PCMPGTD
Conversion       Pack                                                          PACKSSWB, PACKSSDW    PACKUSWB
Unpack           Unpack High                PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ
                 Unpack Low                 PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ

Category         Operation                  Packed                             Full Quadword
Logical          And                                                           PAND
                 And Not                                                       PANDN
                 Or                                                            POR
                 Exclusive OR                                                  PXOR
Shift            Shift Left Logical         PSLLW, PSLLD                       PSLLQ
                 Shift Right Logical        PSRLW, PSRLD                       PSRLQ
                 Shift Right Arithmetic     PSRAW, PSRAD

Category         Operation                  Doubleword Transfers               Quadword Transfers
Data Transfer    Register to Register       MOVD                               MOVQ
                 Load from Memory           MOVD                               MOVQ
                 Store to Memory            MOVD                               MOVQ
Empty MMX State                             EMMS

9.4.1 Data Transfer Instructions

The MOVD (Move 32 Bits) instruction transfers 32 bits of packed data from memory to an MMX register and vice versa; or from a general-purpose register to an MMX register and vice versa.

The MOVQ (Move 64 Bits) instruction transfers 64 bits of packed data from memory to an MMX register and vice versa; or transfers data between MMX registers.

9.4.2 Arithmetic Instructions

The arithmetic instructions perform addition, subtraction, multiplication, and multiply/add operations on packed data types.

The PADDB/PADDW/PADDD (add packed integers) instructions and the PSUBB/PSUBW/PSUBD (subtract packed integers) instructions add or subtract the corresponding signed or unsigned data elements of the source and destination operands in wraparound mode. These instructions operate on packed byte, word, and doubleword data types.

The PADDSB/PADDSW (add packed signed integers with signed saturation) instructions and the PSUBSB/PSUBSW (subtract packed signed integers with signed saturation) instructions add or subtract the corresponding signed data elements of the source and destination operands and saturate the result to the limits of the signed data-type range. These instructions operate on packed byte and word data types.

The PADDUSB/PADDUSW (add packed unsigned integers with unsigned saturation) instructions and the PSUBUSB/PSUBUSW (subtract packed unsigned integers with unsigned saturation) instructions add or subtract the corresponding unsigned data elements of the source and destination operands and saturate the result to the limits of the unsigned data-type range. These instructions operate on packed byte and word data types.

The PMULHW (multiply packed signed integers and store high result) and PMULLW (multiply packed signed integers and store low result) instructions perform a signed multiply of the corresponding words of the source and destination operands and write the high-order or low-order 16 bits of each of the results, respectively, to the destination operand.

The PMADDWD (multiply and add packed integers) instruction computes the products of the corresponding signed words of the source and destination operands. The four intermediate 32-bit doubleword products are summed in pairs (high-order pair and low-order pair) to produce two 32-bit doubleword results.
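The multiply instructions can be pictured with a portable C sketch that emulates their effect on individual word elements. This is an illustration under stated assumptions rather than MMX code: the helper names are hypothetical, packed operands are modeled as arrays of four words, and the pairwise sum is assumed to wrap around on overflow.

```c
#include <stdint.h>

/* Full 32-bit signed product of one pair of corresponding words. */
static int32_t mul_s16(int16_t a, int16_t b) {
    return (int32_t)a * (int32_t)b;
}

/* PMULHW-style result for one element: high-order 16 bits of the product. */
static int16_t mul_high(int16_t a, int16_t b) {
    return (int16_t)((uint32_t)mul_s16(a, b) >> 16);
}

/* PMULLW-style result for one element: low-order 16 bits of the product. */
static int16_t mul_low(int16_t a, int16_t b) {
    return (int16_t)((uint32_t)mul_s16(a, b) & 0xFFFFu);
}

/* PMADDWD-style operation: four products summed in pairs (low-order pair
   and high-order pair) to give two doubleword results. The sum is done in
   unsigned arithmetic so that overflow wraps instead of being undefined. */
static void madd_words(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)((uint32_t)mul_s16(a[0], b[0]) + (uint32_t)mul_s16(a[1], b[1]));
    out[1] = (int32_t)((uint32_t)mul_s16(a[2], b[2]) + (uint32_t)mul_s16(a[3], b[3]));
}
```

A typical use of the multiply-and-add form is a dot product of 16-bit samples, where out[0] and out[1] are added together after the call.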

9.4.3 Comparison Instructions

The PCMPEQB/PCMPEQW/PCMPEQD (compare packed data for equal) instructions and the PCMPGTB/PCMPGTW/PCMPGTD (compare packed signed integers for greater than) instructions compare the corresponding signed data elements (bytes, words, or doublewords) in the source and destination operands for equal to or greater than, respectively.

These instructions generate a mask of ones or zeros which are written to the destination operand. Logical operations can use the mask to select packed elements. This can be used to implement a packed conditional move operation without a branch or a set of branch instructions. No flags in the EFLAGS register are affected.
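The mask-and-select idea can be sketched in portable C. The code below emulates a greater-than compare on four word elements and then combines the mask with AND, AND NOT, and OR to pick the larger element of each pair. The function names and the four-element array model are assumptions made for this illustration only.

```c
#include <stdint.h>

#define LANES 4

/* Compare step: each mask element is all ones (FFFFH) if a > b as signed
   words, otherwise all zeros. */
static void cmpgt_words(const int16_t a[LANES], const int16_t b[LANES],
                        uint16_t mask[LANES]) {
    for (int i = 0; i < LANES; i++) {
        mask[i] = (uint16_t)(0u - (unsigned)(a[i] > b[i]));
    }
}

/* Select step: (a AND mask) OR (b AND NOT mask), element by element.
   No decision is taken on the data values themselves. */
static void select_words(const int16_t a[LANES], const int16_t b[LANES],
                         const uint16_t mask[LANES], int16_t out[LANES]) {
    for (int i = 0; i < LANES; i++) {
        uint16_t keep_a = (uint16_t)((uint16_t)a[i] & mask[i]);
        uint16_t keep_b = (uint16_t)((uint16_t)b[i] & (uint16_t)~mask[i]);
        out[i] = (int16_t)(keep_a | keep_b);
    }
}
```

Calling cmpgt_words and then select_words with the same operands yields the element-wise maximum, which is one common form of the packed conditional move.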

9.4.4 Conversion Instructions

The PACKSSWB (pack words into bytes with signed saturation) and PACKSSDW (pack doublewords into words with signed saturation) instructions convert signed words into signed bytes and signed doublewords into signed words, respectively, using signed saturation.

PACKUSWB (pack words into bytes with unsigned saturation) converts signed words into unsigned bytes, using unsigned saturation.
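A portable C sketch of the two word-to-byte conversions follows. The helper names are hypothetical. The placement of the destination operand's elements in the low-order half of the result and the source operand's elements in the high-order half is taken from the instruction reference rather than from this section, so treat it as an assumption of the sketch.

```c
#include <stdint.h>

/* Signed word to signed byte, using signed saturation (80H..7FH). */
static int8_t pack_ss_word(int16_t w) {
    if (w > INT8_MAX) return INT8_MAX;
    if (w < INT8_MIN) return INT8_MIN;
    return (int8_t)w;
}

/* Signed word to unsigned byte, using unsigned saturation (00H..FFH). */
static uint8_t pack_us_word(int16_t w) {
    if (w > UINT8_MAX) return UINT8_MAX;
    if (w < 0) return 0;
    return (uint8_t)w;
}

/* PACKSSWB-style pack of two operands of four words into eight bytes.
   Assumption: destination words fill bytes 0-3, source words bytes 4-7. */
static void pack_ss_words(const int16_t dst[4], const int16_t src[4], int8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[i] = pack_ss_word(dst[i]);
        out[i + 4] = pack_ss_word(src[i]);
    }
}
```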

9.4.5 Unpack Instructions

The PUNPCKHBW/PUNPCKHWD/PUNPCKHDQ (unpack high-order data elements) instructions and the PUNPCKLBW/PUNPCKLWD/PUNPCKLDQ (unpack low-order data elements) instructions unpack bytes, words, or doublewords from the high- or low-order data elements of the source and destination operands and interleave them in the destination operand. By placing all 0s in the source operand, these instructions can be used to convert byte integers to word integers, word integers to doubleword integers, or doubleword integers to quadword integers.
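The zero-extension use of the unpack instructions can be sketched in portable C. The code emulates a low-order byte unpack on operands modeled as arrays of eight bytes (byte 0 is the low-order byte). The helper names are hypothetical, and the interleave order (destination byte first, then source byte) is an assumption taken from the instruction reference.

```c
#include <stdint.h>

/* PUNPCKLBW-style interleave of the four low-order bytes of dst and src:
   result bytes are dst0, src0, dst1, src1, dst2, src2, dst3, src3. */
static void unpack_low_bytes(const uint8_t dst[8], const uint8_t src[8],
                             uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[2 * i] = dst[i];
        out[2 * i + 1] = src[i];
    }
}

/* Read word element i of a packed value stored low-order byte first. */
static uint16_t word_lane(const uint8_t bytes[8], int i) {
    return (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
}
```

When src holds all 0s, each resulting word has a zero high-order byte, so the four low-order bytes of dst become four unsigned word integers.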

9.4.6 Logical Instructions

PAND (bitwise logical AND), PANDN (bitwise logical AND NOT), POR (bitwise logical OR), and PXOR (bitwise logical exclusive OR) perform bitwise logical operations on the quadword source and destination operands.

9.4.7 Shift Instructions

The logical shift left, logical shift right, and arithmetic shift right instructions shift each element by a specified number of bit positions.

The PSLLW/PSLLD/PSLLQ (shift packed data left logical) instructions and the PSRLW/PSRLD/PSRLQ (shift packed data right logical) instructions perform a logical left or right shift of the data elements and fill the empty high or low order bit positions with zeros. These instructions operate on packed words, doublewords, and quadwords.

The PSRAW/PSRAD (shift packed data right arithmetic) instructions perform an arithmetic right shift, copying the sign bit for each data element into empty bit positions on the upper end of each data element. These instructions operate on packed words and doublewords.
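The difference between the logical and arithmetic right shifts can be seen in a portable C sketch that emulates the operation on a single word element. The helper names are hypothetical. The treatment of shift counts greater than 15 (all zeros for logical shifts, all sign bits for the arithmetic shift) follows the instruction reference and is an assumption as far as this section is concerned.

```c
#include <stdint.h>

/* Logical right shift of one word: vacated high-order bits are zero. */
static uint16_t srl_word(uint16_t x, unsigned n) {
    return n > 15 ? 0 : (uint16_t)(x >> n);
}

/* Logical left shift of one word: vacated low-order bits are zero. */
static uint16_t sll_word(uint16_t x, unsigned n) {
    return n > 15 ? 0 : (uint16_t)(x << n);
}

/* Arithmetic right shift of one word: vacated high-order bits are
   filled with copies of the sign bit. */
static int16_t sra_word(int16_t x, unsigned n) {
    if (n > 15) n = 15;                        /* result is all sign bits */
    uint16_t r = (uint16_t)((uint16_t)x >> n);
    if (x < 0 && n > 0) {
        r |= (uint16_t)(0xFFFFu << (16 - n));  /* replicate the sign bit */
    }
    return (int16_t)r;
}
```

For the bit pattern FFFCH (-4 as a signed word), a right shift by one gives 7FFEH with the logical form and FFFEH (-2) with the arithmetic form.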

9.4.8 EMMS Instruction

The EMMS instruction empties the MMX state by setting the tags in the x87 FPU tag word to 11B, indicating empty registers. This instruction must be executed at the end of an MMX routine before calling other routines that can execute floating-point instructions. See Section 9.6.3, “Using the EMMS Instruction,” for more information on the use of this instruction.

9.5 COMPATIBILITY WITH X87 FPU ARCHITECTURE

The MMX state is aliased to the x87 FPU state. No new states or modes have been added to IA-32 architecture to support the MMX technology. The same floating-point instructions that save and restore the x87 FPU state also handle the MMX state (for example, during context switching).

MMX technology uses the same interface techniques between the x87 FPU and the operating system (primarily for task switching purposes). For more details, see Chapter 11, “Intel® MMX™ Technology System Programming,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

9.5.1 MMX Instructions and the x87 FPU Tag Word

After each MMX instruction, the entire x87 FPU tag word is set to valid (00B). The EMMS instruction (empty MMX state) sets the entire x87 FPU tag word to empty (11B).

Chapter 11, “Intel® MMX™ Technology System Programming,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, provides additional information about the effects of x87 FPU and MMX instructions on the x87 FPU tag word. For a description of the tag word, see Section 8.1.7, “x87 FPU Tag Word.”

9.6 WRITING APPLICATIONS WITH MMX CODE

The following sections give guidelines for writing application code that uses MMX technology.

9.6.1 Checking for MMX Technology Support

Before an application attempts to use the MMX technology, it should check that it is present on the processor. Check by following these steps:

1. Check that the processor supports the CPUID instruction by attempting to execute the CPUID instruction. If the processor does not support the CPUID instruction, this will generate an invalid-opcode exception (#UD).
2. Check that the processor supports the MMX technology (if CPUID.01H:EDX.MMX[bit 23] = 1).
3. Check that emulation of the x87 FPU is disabled (if CR0.EM[bit 2] = 0).

If the processor attempts to execute an unsupported MMX instruction or attempts to execute an MMX instruction with CR0.EM[bit 2] set, this generates an invalid-opcode exception (#UD).

Example 9-1 illustrates how to use the CPUID instruction to detect the MMX technology. This example does not represent the entire CPUID sequence, but shows the portion used for detection of MMX technology.

Example 9-1. Partial Routine for Detecting MMX Technology with the CPUID Instruction

    ...                          ; identify existence of CPUID instruction
    ...                          ; identify Intel processor
    mov   EAX, 1                 ; request for feature flags
    CPUID                        ; 0FH, 0A2H CPUID instruction
    test  EDX, 00800000H         ; Is IA MMX technology bit (Bit 23 of EDX) set?
    jnz   MMX_Technology_Found
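Steps 2 and 3 reduce to two bit tests, which the following portable C sketch makes explicit. It assumes the caller has already obtained the EDX value returned by CPUID with EAX = 1 and, where available, a copy of CR0; how those values are obtained is outside the sketch (CR0 is normally readable only by privileged code, so an application usually relies on the operating system for step 3). The function and macro names are hypothetical.

```c
#include <stdint.h>

#define CPUID_01H_EDX_MMX_BIT 23u   /* CPUID.01H:EDX.MMX[bit 23] */
#define CR0_EM_BIT            2u    /* CR0.EM[bit 2] */

/* Step 2: returns 1 if the feature flags report MMX technology. */
static int edx_reports_mmx(uint32_t edx) {
    return (int)((edx >> CPUID_01H_EDX_MMX_BIT) & 1u);
}

/* Steps 2 and 3 together: MMX is reported and x87 FPU emulation is
   disabled (CR0.EM = 0). */
static int mmx_usable(uint32_t edx, uint32_t cr0) {
    return edx_reports_mmx(edx) && ((cr0 >> CR0_EM_BIT) & 1u) == 0;
}
```

The constant 00800000H used by the test instruction in Example 9-1 is the same mask as 1 shifted left by 23.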

9.6.2 Transitions Between x87 FPU and MMX Code

Applications can contain both x87 FPU floating-point and MMX instructions. However, because the MMX registers are aliased to the x87 FPU register stack, care must be taken when making transitions between x87 FPU instructions and MMX instructions to prevent incoherent or unexpected results.

When an MMX instruction (other than the EMMS instruction) is executed, the processor changes the x87 FPU state as follows:

• The TOS (top of stack) value of the x87 FPU status word is set to 0.
• The entire x87 FPU tag word is set to the valid state (00B in all tag fields).
• When an MMX instruction writes to an MMX register, it writes ones (11B) to the exponent part of the corresponding floating-point register (bits 64 through 79).

The net result of these actions is that any x87 FPU state prior to the execution of the MMX instruction is essentially lost.

When an x87 FPU instruction is executed, the processor assumes that the current state of the x87 FPU register stack and control registers is valid and executes the instruction without any preparatory modifications to the x87 FPU state.

If the application contains both x87 FPU floating-point and MMX instructions, the following guidelines are recommended:

• When transitioning between x87 FPU and MMX code, save the state of any x87 FPU data or control registers that need to be preserved for future use. The FSAVE and FXSAVE instructions save the entire x87 FPU state.
• When transitioning between MMX and x87 FPU code, do the following:
  - Save any data in the MMX registers that needs to be preserved for future use. FSAVE and FXSAVE also save the state of MMX registers.
  - Execute the EMMS instruction to clear the MMX state from the x87 data and control registers.

The following sections describe the use of the EMMS instruction and give additional guidelines for mixing x87 FPU and MMX code.
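The state changes listed in this section can be modeled with a small C structure. This is a conceptual model only, not a description of processor internals: the type and function names are hypothetical, each 80-bit physical register is split into a 64-bit part (aliased to an MMX register) and bits 64 through 79, and the emms function models only the tag-word effect described in this chapter.

```c
#include <stdint.h>

typedef struct {
    uint64_t low64[8];      /* bits 0-63 of each register, aliased to MM0-MM7 */
    uint16_t bits64_79[8];  /* bits 64-79 of each register */
    uint16_t tag_word;      /* eight 2-bit tags: 00B valid, 11B empty */
    uint8_t  tos;           /* TOS field of the x87 FPU status word */
} fpu_model;

/* Effect of an MMX instruction that writes MMX register reg (0-7). */
static void mmx_write(fpu_model *s, int reg, uint64_t value) {
    s->low64[reg] = value;
    s->bits64_79[reg] = 0xFFFFu;  /* ones written to bits 64 through 79 */
    s->tag_word = 0x0000u;        /* all tags set to valid (00B) */
    s->tos = 0;                   /* TOS set to 0 */
}

/* Tag-word effect of EMMS: all tags set to empty (11B). */
static void emms(fpu_model *s) {
    s->tag_word = 0xFFFFu;
}
```

The model shows why earlier x87 FPU state is lost after a single MMX write: the stack top and every tag change, regardless of which register was written.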

9.6.3 Using the EMMS Instruction

As described in Section 9.6.2, “Transitions Between x87 FPU and MMX Code,” when an MMX instruction executes, the x87 FPU tag word is marked valid (00B). In this state, the execution of subsequent x87 FPU instructions may produce unexpected x87 FPU floating-point exceptions and/or incorrect results because the x87 FPU register stack appears to contain valid data. The EMMS instruction is provided to prevent this problem by marking the x87 FPU tag word as empty.

The EMMS instruction should be used in each of the following cases:

• When an application using the x87 FPU instructions calls an MMX technology library/DLL (use the EMMS instruction at the end of the MMX code).
• When an application using MMX instructions calls an x87 FPU floating-point library/DLL (use the EMMS instruction before calling the x87 FPU code).
• When a switch is made between MMX code in a task or thread and other tasks or threads in cooperative operating systems, unless it is certain that more MMX instructions will be executed before any x87 FPU code.

EMMS is not required when mixing MMX technology instructions with SSE/SSE2/SSE3 instructions (see Section 11.6.7, “Interaction of SSE/SSE2 Instructions with x87 FPU and MMX Instructions”).

9.6.4 Mixing MMX and x87 FPU Instructions

An application can contain both x87 FPU floating-point and MMX instructions. However, frequent transitions between MMX and x87 FPU instructions are not recommended, because they can degrade performance in some processor implementations. When mixing MMX code with x87 FPU code, follow these guidelines:

• Keep the code in separate modules, procedures, or routines.
• Do not rely on register contents across transitions between x87 FPU and MMX code modules.
• When transitioning between MMX code and x87 FPU code, save the MMX register state (if it will be needed in the future) and execute an EMMS instruction to empty the MMX state.
• When transitioning between x87 FPU code and MMX code, save the x87 FPU state if it will be needed in the future.

9.6.5 Interfacing with MMX Code

MMX technology enables direct access to all the MMX registers. This means that all existing interface conventions that apply to the use of the processor’s general-purpose registers (EAX, EBX, etc.) also apply to the use of MMX registers.

An efficient interface to MMX routines might pass parameters and return values through the MMX registers or through a combination of memory locations (via the stack) and MMX registers. Do not use the EMMS instruction or mix MMX and x87 FPU code when using the MMX registers to pass parameters.

If a high-level language that does not support the MMX data types directly is used, the MMX data types can be defined as a 64-bit structure containing packed data types.

When implementing MMX instructions in high-level languages, other approaches can be taken, such as:

• Passing parameters to an MMX routine by passing a pointer to a structure via the stack.
• Returning a value from a function by returning a pointer to a structure.
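One way to express a 64-bit structure containing packed data types in C is a union of arrays, passed to routines by pointer. The sketch below is an assumption-laden illustration: the type name packed64 and the routine add_packed_words are hypothetical, and the routine body is plain C standing in for what would be MMX code in a real library.

```c
#include <stdint.h>

/* A 64-bit container viewed as packed bytes, words, doublewords,
   or a single quadword. */
typedef union {
    uint8_t  b[8];
    uint16_t w[4];
    uint32_t d[2];
    uint64_t q;
} packed64;

/* Parameters and result are passed as pointers to structures.
   The body performs a wraparound add of the four word elements. */
static void add_packed_words(const packed64 *a, const packed64 *b,
                             packed64 *result) {
    for (int i = 0; i < 4; i++) {
        result->w[i] = (uint16_t)(a->w[i] + b->w[i]);
    }
}
```

Passing pointers keeps the interface independent of how the compiler passes 64-bit aggregates, at the cost of an extra memory access on each side of the call.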

9.6.6 Using MMX Code in a Multitasking Operating System Environment

An application needs to identify the nature of the multitasking operating system on which it runs. Each task retains its own state which must be saved when a task switch occurs. The processor state (context) consists of the general-purpose registers and the floating-point and MMX registers.

Operating systems can be classified into two types:

• Cooperative multitasking operating system
• Preemptive multitasking operating system

Cooperative multitasking operating systems do not save the FPU or MMX state when performing a context switch. Therefore, the application needs to save the relevant state before relinquishing direct or indirect control to the operating system.

Preemptive multitasking operating systems are responsible for saving and restoring the FPU and MMX state when performing a context switch. Therefore, the application does not have to save or restore the FPU and MMX state.

9.6.7 Exception Handling in MMX Code

MMX instructions generate the same type of memory-access exceptions as other IA-32 instructions (page fault, segment not present, and limit violations). Existing exception handlers do not have to be modified to handle these types of exceptions for MMX code.

Unless there is a pending floating-point exception, MMX instructions do not generate numeric exceptions. Therefore, there is no need to modify existing exception handlers or add new ones to handle numeric exceptions.

If a floating-point exception is pending, the subsequent MMX instruction generates a numeric error exception (interrupt 16 and/or assertion of the FERR# pin). The MMX instruction resumes execution upon return from the exception handler.

9.6.8 Register Mapping

MMX registers and their tags are mapped to physical locations of the floating-point registers and their tags. Register aliasing and mapping is described in more detail in Chapter 11, “Intel® MMX™ Technology System Programming,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

9.6.9 Effect of Instruction Prefixes on MMX Instructions

Table 9-3 describes the effect of instruction prefixes on MMX instructions. Unpredictable behavior can range from being treated as a reserved operation on one generation of IA-32 processors to generating an invalid opcode exception on another generation of processors.

Table 9-3. Effect of Prefixes on MMX Instructions

Prefix Type                                         Effect on MMX Instructions
Address Size Prefix (67H)                           Affects instructions with a memory operand. Reserved for instructions without a memory operand and may result in unpredictable behavior.
Operand Size (66H)                                  Reserved and may result in unpredictable behavior.
Segment Override (2EH, 36H, 3EH, 26H, 64H, 65H)     Affects instructions with a memory operand. Reserved for instructions without a memory operand and may result in unpredictable behavior.
Repeat Prefix (F3H)                                 Reserved and may result in unpredictable behavior.
Repeat NE Prefix (F2H)                              Reserved and may result in unpredictable behavior.
Lock Prefix (F0H)                                   Reserved; generates invalid opcode exception (#UD).
Branch Hint Prefixes (2EH and 3EH)                  Reserved and may result in unpredictable behavior.

See “Instruction Prefixes” in Chapter 2, “Instruction Format,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for a description of the instruction prefixes.

CHAPTER 10
PROGRAMMING WITH STREAMING SIMD EXTENSIONS (SSE)

The streaming SIMD extensions (SSE) were introduced into the IA-32 architecture in the Pentium III processor family. These extensions enhance the performance of IA-32 processors for advanced 2-D and 3-D graphics, motion video, image processing, speech recognition, audio synthesis, telephony, and video conferencing.

This chapter describes SSE. Chapter 11, “Programming with Streaming SIMD Extensions 2 (SSE2),” provides information to assist in writing application programs that use SSE2 extensions. Chapter 12, “Programming with SSE3 and Supplemental SSE3,” provides this information for SSE3 extensions.

10.1 OVERVIEW OF SSE EXTENSIONS

Intel MMX technology introduced single-instruction multiple-data (SIMD) capability into the IA-32 architecture, with the 64-bit MMX registers, 64-bit packed integer data types, and instructions that allowed SIMD operations to be performed on packed integers. SSE extensions expand the SIMD execution model by adding facilities for handling packed and scalar single-precision floating-point values contained in 128-bit registers.

If CPUID.01H:EDX.SSE[bit 25] = 1, SSE extensions are present.

SSE extensions add the following features to the IA-32 architecture, while maintaining backward compatibility with all existing IA-32 processors, applications and operating systems.

• Eight 128-bit data registers (called XMM registers) in non-64-bit modes; sixteen XMM registers are available in 64-bit mode.
• The 32-bit MXCSR register, which provides control and status bits for operations performed on XMM registers.
• The 128-bit packed single-precision floating-point data type (four IEEE single-precision floating-point values packed into a double quadword).
• Instructions that perform SIMD operations on single-precision floating-point values and that extend SIMD operations that can be performed on integers:
  - 128-bit packed and scalar single-precision floating-point instructions that operate on data located in XMM registers
  - 64-bit SIMD integer instructions that support additional operations on packed integer operands located in MMX registers
• Instructions that save and restore the state of the MXCSR register.
• Instructions that support explicit prefetching of data, control of the cacheability of data, and control of the ordering of store operations.
• Extensions to the CPUID instruction.

These features extend the IA-32 architecture’s SIMD programming model in four important ways:

• The ability to perform SIMD operations on four packed single-precision floating-point values enhances the performance of IA-32 processors for advanced media and communications applications that use computation-intensive algorithms to perform repetitive operations on large arrays of simple, native data elements.
• The ability to perform SIMD single-precision floating-point operations in XMM registers and SIMD integer operations in MMX registers provides greater flexibility and throughput for executing applications that operate on large arrays of floating-point and integer data.
• Cache control instructions provide the ability to stream data in and out of XMM registers without polluting the caches and the ability to prefetch data to selected cache levels before it is actually used. Applications that require regular access to large amounts of data benefit from these prefetching and streaming store capabilities.
• The SFENCE (store fence) instruction provides greater control over the ordering of store operations when using weakly-ordered memory types.

SSE extensions are fully compatible with all software written for IA-32 processors. All existing software continues to run correctly, without modification, on processors that incorporate SSE extensions. Enhancements to CPUID permit detection of SSE extensions. SSE extensions are accessible from all IA-32 execution modes: protected mode, real address mode, and virtual-8086 mode.

The following sections of this chapter describe the programming environment for SSE extensions, including: XMM registers, the packed single-precision floating-point data type, and SSE instructions. For additional information, see:

• Section 11.6, “Writing Applications with SSE/SSE2 Extensions”.
• Section 11.5, “SSE, SSE2, and SSE3 Exceptions,” describes the exceptions that can be generated with SSE/SSE2/SSE3 instructions.
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, provide a detailed description of these instructions.
• Chapter 12, “System Programming for Streaming SIMD Instruction Sets,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, gives guidelines for integrating these extensions into an operating-system environment.

10.2 SSE PROGRAMMING ENVIRONMENT

Figure 10-1 shows the execution environment for the SSE extensions. All SSE instructions operate on the XMM registers, MMX registers, and/or memory as follows:

• XMM registers: These eight registers (see Figure 10-2 and Section 10.2.2, “XMM Registers”) are used to operate on packed or scalar single-precision floating-point data. Scalar operations are operations performed on individual (unpacked) single-precision floating-point values stored in the low doubleword of an XMM register. XMM registers are referenced by the names XMM0 through XMM7.
• MXCSR register: This 32-bit register (see Figure 10-3 and Section 10.2.3, “MXCSR Control and Status Register”) provides status and control bits used in SIMD floating-point operations.
• MMX registers: These eight registers (see Figure 9-2) are used to perform operations on 64-bit packed integer data. They are also used to hold operands for some operations performed between the MMX and XMM registers. MMX registers are referenced by the names MM0 through MM7.
• General-purpose registers: The eight general-purpose registers (see Figure 3-5) are used along with the existing IA-32 addressing modes to address operands in memory. (MMX and XMM registers cannot be used to address memory.) The general-purpose registers are also used to hold operands for some SSE instructions and are referenced as EAX, EBX, ECX, EDX, EBP, ESI, EDI, and ESP.
• EFLAGS register: This 32-bit register (see Figure 3-8) is used to record the results of some compare operations.

Figure 10-1. SSE Execution Environment (address space from 0 to 2^32 - 1; XMM registers, eight 128-bit; MXCSR register, 32 bits; MMX registers, eight 64-bit; general-purpose registers, eight 32-bit; EFLAGS register, 32 bits)

10.2.1 SSE in 64-Bit Mode and Compatibility Mode

In compatibility mode, SSE extensions function like they do in protected mode. In 64-bit mode, eight additional XMM registers are accessible. Registers XMM8-XMM15 are accessed by using REX prefixes.

Memory operands are specified using the ModR/M, SIB encoding described in Section 3.7.5.

Some SSE instructions may be used to operate on general-purpose registers. Use the REX.W prefix to access 64-bit general-purpose registers. Note that if a REX prefix is used when it has no meaning, the prefix is ignored.

10.2.2 XMM Registers

Eight 128-bit XMM data registers were introduced into the IA-32 architecture with SSE extensions (see Figure 10-2). These registers can be accessed directly using the names XMM0 to XMM7; and they can be accessed independently from the x87 FPU and MMX registers and the general-purpose registers (that is, they are not aliased to any other of the processor’s registers).

Figure 10-2. XMM Registers (XMM0 through XMM7, each with bits 127 through 0)

SSE instructions use the XMM registers only to operate on packed single-precision floating-point operands. SSE2 extensions expand the functions of the XMM registers to operate on packed or scalar double-precision floating-point operands and packed integer operands (see Section 11.2, “SSE2 Programming Environment,” and Section 12.1, “SSE3/SSSE3 Programming Environment and Data types”).

XMM registers can only be used to perform calculations on data; they cannot be used to address memory. Addressing memory is accomplished by using the general-purpose registers.

Data can be loaded into XMM registers or written from the registers to memory in 32-bit, 64-bit, and 128-bit increments. When storing the entire contents of an XMM register in memory (128-bit store), the data is stored in 16 consecutive bytes, with the low-order byte of the register being stored in the first byte in memory.

10.2.3 MXCSR Control and Status Register

The 32-bit MXCSR register (see Figure 10-3) contains control and status information for SSE, SSE2, and SSE3 SIMD floating-point operations. This register contains:

• flag and mask bits for SIMD floating-point exceptions
• rounding control field for SIMD floating-point operations
• flush-to-zero flag that provides a means of controlling underflow conditions on SIMD floating-point operations
• denormals-are-zeros flag that controls how SIMD floating-point instructions handle denormal source operands

The contents of this register can be loaded from memory with the LDMXCSR and FXRSTOR instructions and stored in memory with STMXCSR and FXSAVE.

Bits 16 through 31 of the MXCSR register are reserved and are cleared on a power-up or reset of the processor; attempting to write a non-zero value to these bits, using either the FXRSTOR or LDMXCSR instructions, will result in a general-protection exception (#GP) being generated.

Figure 10-3. MXCSR Control/Status Register

Bits     Name       Description
31-16               Reserved
15       FZ         Flush to Zero
14-13    RC         Rounding Control
12       PM         Precision Mask
11       UM         Underflow Mask
10       OM         Overflow Mask
9        ZM         Divide-by-Zero Mask
8        DM         Denormal Operation Mask
7        IM         Invalid Operation Mask
6        DAZ        Denormals Are Zeros*
5        PE         Precision Flag
4        UE         Underflow Flag
3        OE         Overflow Flag
2        ZE         Divide-by-Zero Flag
1        DE         Denormal Flag
0        IE         Invalid Operation Flag

* The denormals-are-zeros flag was introduced in the Pentium 4 and Intel Xeon processor.

10.2.3.1 SIMD Floating-Point Mask and Flag Bits

Bits 0 through 5 of the MXCSR register indicate whether a SIMD floating-point exception has been detected. They are “sticky” flags. That is, after a flag is set, it remains set until explicitly cleared. To clear these flags, use the LDMXCSR or the FXRSTOR instruction to write zeroes to them.

Bits 7 through 12 provide individual mask bits for the SIMD floating-point exceptions. An exception type is masked if the corresponding mask bit is set, and it is unmasked if the bit is clear. These mask bits are set upon a power-up or reset. This causes all SIMD floating-point exceptions to be initially masked.

If LDMXCSR or FXRSTOR clears a mask bit and sets the corresponding exception flag bit, a SIMD floating-point exception will not be generated as a result of this change. The unmasked exception will be generated only upon the execution of the next SSE/SSE2/SSE3 instruction that detects the unmasked exception condition.

For more information about the use of the SIMD floating-point exception mask and flag bits, see Section 11.5, “SSE, SSE2, and SSE3 Exceptions,” and Section 12.8, “SSE3/SSSE3 Exceptions.”
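The relationship between the flag bits and the mask bits can be expressed as bit arithmetic on a copy of the MXCSR value. The C sketch below works on a plain 32-bit integer; reading or writing the actual register (for example with STMXCSR and LDMXCSR) is outside the sketch, and the macro and function names are hypothetical.

```c
#include <stdint.h>

#define MXCSR_FLAG_BITS  0x003Fu  /* bits 0-5: IE, DE, ZE, OE, UE, PE */
#define MXCSR_MASK_BITS  0x1F80u  /* bits 7-12: IM, DM, ZM, OM, UM, PM */
#define MXCSR_MASK_SHIFT 7u       /* each mask bit sits 7 above its flag */

/* Value to load in order to clear all sticky exception flags. */
static uint32_t mxcsr_clear_flags(uint32_t mxcsr) {
    return mxcsr & ~(uint32_t)MXCSR_FLAG_BITS;
}

/* Flags that are set while their corresponding mask bit is clear. */
static uint32_t mxcsr_unmasked_flags(uint32_t mxcsr) {
    uint32_t flags = mxcsr & MXCSR_FLAG_BITS;
    uint32_t masks = (mxcsr >> MXCSR_MASK_SHIFT) & MXCSR_FLAG_BITS;
    return flags & ~masks;
}
```

A value with all six mask bits set (1F80H) never reports an unmasked flag, which matches the statement that all SIMD floating-point exceptions are initially masked.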

10.2.3.2

SIMD Floating-Point Rounding Control Field

Bits 13 and 14 of the MXCSR register (the rounding control [RC] field) control how the results of SIMD floating-point instructions are rounded. See Section 4.8.4, “Rounding,” for a description of the function and encoding of the rounding control bits.

10.2.3.3

Flush-To-Zero

Bit 15 (FZ) of the MXCSR register enables the flush-to-zero mode, which controls the masked response to a SIMD floating-point underflow condition. When the underflow exception is masked and the flush-to-zero mode is enabled, the processor performs the following operations when it detects a floating-point underflow condition:

• Returns a zero result with the sign of the true result
• Sets the precision and underflow exception flags

If the underflow exception is not masked, the flush-to-zero bit is ignored.

The flush-to-zero mode is not compatible with IEEE Standard 754. The IEEE-mandated masked response to underflow is to deliver the denormalized result (see Section 4.8.3.2, “Normalized and Denormalized Finite Numbers”). The flush-to-zero mode is provided primarily for performance reasons. At the cost of a slight precision loss, faster execution can be achieved for applications where underflows are common and rounding the underflow result to zero can be tolerated.

The flush-to-zero bit is cleared upon a power-up or reset of the processor, disabling the flush-to-zero mode.

10.2.3.4

Denormals-Are-Zeros

Bit 6 (DAZ) of the MXCSR register enables the denormals-are-zeros mode, which controls the processor’s response to a SIMD floating-point denormal operand condition. When the denormals-are-zeros flag is set, the processor converts all denormal source operands to a zero with the sign of the original operand before performing any computations on them. The processor does not set the denormal-operand exception flag (DE), regardless of the setting of the denormal-operand exception mask bit (DM); and it does not generate a denormal-operand exception if the exception is unmasked.

The denormals-are-zeros mode is not compatible with IEEE Standard 754 (see Section 4.8.3.2, “Normalized and Denormalized Finite Numbers”). The denormals-are-zeros mode is provided to improve processor performance for applications such as streaming media processing, where rounding a denormal operand to zero does not appreciably affect the quality of the processed data.

The denormals-are-zeros flag is cleared upon a power-up or reset of the processor, disabling the denormals-are-zeros mode.

The denormals-are-zeros mode was introduced in the Pentium 4 and Intel Xeon processor with the SSE2 extensions; however, it is fully compatible with the SSE SIMD floating-point instructions (that is, the denormals-are-zeros flag affects the operation of the SSE SIMD floating-point instructions). In earlier IA-32 processors and in some models of the Pentium 4 processor, this flag (bit 6) is reserved. See Section 11.6.3, “Checking for the DAZ Flag in the MXCSR Register,” for instructions for detecting the availability of this feature.

Attempting to set bit 6 of the MXCSR register on processors that do not support the DAZ flag will cause a general-protection exception (#GP). See Section 11.6.6, “Guidelines for Writing to the MXCSR Register,” for instructions for preventing such general-protection exceptions by using the MXCSR_MASK value returned by the FXSAVE instruction.
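The operand conversion described for the denormals-are-zeros mode can be sketched in software. The C model below works on raw IEEE 754 single-precision bit patterns so that it does not depend on the floating-point settings of the machine running it. The function names are hypothetical, and the sketch models only the operand replacement, not the exception-flag behavior.

```c
#include <stdint.h>

#define F32_SIGN_MASK     0x80000000u
#define F32_EXPONENT_MASK 0x7F800000u
#define F32_FRACTION_MASK 0x007FFFFFu

/* True if the single-precision bit pattern encodes a denormal:
   zero exponent field and non-zero fraction. */
int f32_is_denormal(uint32_t bits) {
    return (bits & F32_EXPONENT_MASK) == 0 && (bits & F32_FRACTION_MASK) != 0;
}

/* Model of how one source operand is treated: with DAZ enabled, a denormal
   becomes a zero that keeps the sign of the original operand. */
uint32_t daz_operand_model(uint32_t bits, int daz_enabled) {
    if (daz_enabled && f32_is_denormal(bits))
        return bits & F32_SIGN_MASK;
    return bits;
}
```

With `daz_enabled` set, the smallest positive denormal (0x00000001) maps to +0 and its negative counterpart (0x80000001) maps to -0 (0x80000000); normal values such as 1.0 (0x3F800000) pass through unchanged.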

10.2.4

Compatibility of SSE Extensions with SSE2/SSE3/MMX and the x87 FPU

The state (XMM registers and MXCSR register) introduced into the IA-32 execution environment with the SSE extensions is shared with SSE2 and SSE3 extensions. SSE/SSE2/SSE3 instructions are fully compatible; they can be executed together in the same instruction stream with no need to save state when switching between instruction sets.

XMM registers are independent of the x87 FPU and MMX registers, so SSE/SSE2/SSE3 operations performed on the XMM registers can be performed in parallel with operations on the x87 FPU and MMX registers (see Section 11.6.7, “Interaction of SSE/SSE2 Instructions with x87 FPU and MMX Instructions”).

The FXSAVE and FXRSTOR instructions save and restore the SSE/SSE2/SSE3 states along with the x87 FPU and MMX state.

10.3

SSE DATA TYPES

SSE extensions introduced one data type, the 128-bit packed single-precision floating-point data type, to the IA-32 architecture (see Figure 10-4). This data type consists of four IEEE 32-bit single-precision floating-point values packed into a double quadword. (See Figure 4-3 for the layout of a single-precision floating-point value; refer to Section 4.2.2, “Floating-Point Data Types,” for a detailed description of the single-precision floating-point format.)

Contains 4 Single-Precision Floating-Point Values

Bits 127-96 | Bits 95-64 | Bits 63-32 | Bits 31-0

Figure 10-4. 128-Bit Packed Single-Precision Floating-Point Data Type

This 128-bit packed single-precision floating-point data type is operated on in the XMM registers or in memory. Conversion instructions are provided to convert two packed single-precision floating-point values into two packed doubleword integers or a scalar single-precision floating-point value into a doubleword integer (see Figure 11-8).

SSE extensions provide conversion instructions between XMM registers and MMX registers, and between XMM registers and general-purpose registers. See Figure 11-8.

The address of a 128-bit packed memory operand must be aligned on a 16-byte boundary, except in the following cases:

• The MOVUPS instruction supports unaligned accesses.
• Scalar instructions that use a 4-byte memory operand are not subject to alignment requirements.
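The 16-byte alignment requirement amounts to the low four bits of the operand address being zero. The following C sketch shows that check on a numeric address; both helper names are hypothetical and are not part of any Intel-provided interface.

```c
#include <stdint.h>

/* True if a 128-bit memory operand at this address meets the
   16-byte alignment requirement. */
int is_16_byte_aligned(uintptr_t address) {
    return (address & 0xFu) == 0;
}

/* Hypothetical helper: bytes to skip to reach the next 16-byte boundary
   (0 if the address is already aligned). */
uintptr_t bytes_to_16_byte_boundary(uintptr_t address) {
    return (16u - (address & 0xFu)) & 0xFu;
}
```

A routine could use such a check to choose between an aligned access and the unaligned form (MOVUPS) for a given buffer.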

Figure 4-2 shows the byte order of 128-bit (double quadword) data types in memory.

10.4

SSE INSTRUCTION SET

SSE instructions are divided into four functional groups:

• Packed and scalar single-precision floating-point instructions
• 64-bit SIMD integer instructions
• State management instructions
• Cacheability control, prefetch, and memory ordering instructions

The following sections give an overview of each of the instructions in these groups.

10.4.1

SSE Packed and Scalar Floating-Point Instructions

The packed and scalar single-precision floating-point instructions are divided into the following subgroups:

• Data movement instructions
• Arithmetic instructions
• Logical instructions
• Comparison instructions
• Shuffle instructions
• Conversion instructions

The packed single-precision floating-point instructions perform SIMD operations on packed single-precision floating-point operands (see Figure 10-5). Each source operand contains four single-precision floating-point values, and the destination operand contains the results of the operation (OP) performed in parallel on the corresponding values (X0 and Y0, X1 and Y1, X2 and Y2, and X3 and Y3) in each operand.

Source 1:      X3          X2          X1          X0
Source 2:      Y3          Y2          Y1          Y0
Destination:   X3 OP Y3    X2 OP Y2    X1 OP Y1    X0 OP Y0

Figure 10-5. Packed Single-Precision Floating-Point Operation

The scalar single-precision floating-point instructions operate on the low (least significant) doublewords of the two source operands (X0 and Y0); see Figure 10-6. The three most significant doublewords (X1, X2, and X3) of the first source operand are passed through to the destination. The scalar operations are similar to the floating-point operations performed in the x87 FPU data registers with the precision control field in the x87 FPU control word set for single precision (24-bit significand), except that x87 stack operations use a 15-bit exponent range for the result, while SSE operations use an 8-bit exponent range.

Source 1:      X3          X2          X1          X0
Source 2:      Y3          Y2          Y1          Y0
Destination:   X3          X2          X1          X0 OP Y0

Figure 10-6. Scalar Single-Precision Floating-Point Operation
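The difference between the packed and scalar forms shown in Figures 10-5 and 10-6 can be sketched in plain C, taking OP to be addition. The `xmm_t` type and the function names are hypothetical stand-ins for this example; the code models the lane behavior only and does not use the actual instructions.

```c
/* Software stand-in for a 128-bit XMM register: f[0] is the low doubleword. */
typedef struct { float f[4]; } xmm_t;

/* Packed operation (Figure 10-5) with OP = add: all four lanes are computed. */
xmm_t addps_model(xmm_t dst, xmm_t src) {
    for (int i = 0; i < 4; i++)
        dst.f[i] = dst.f[i] + src.f[i];
    return dst;
}

/* Scalar operation (Figure 10-6) with OP = add: only lane 0 is computed;
   lanes 1-3 of the first source operand pass through unchanged. */
xmm_t addss_model(xmm_t dst, xmm_t src) {
    dst.f[0] = dst.f[0] + src.f[0];
    return dst;
}
```

Given X = {1, 2, 3, 4} and Y = {10, 20, 30, 40} (low lane first), the packed model yields {11, 22, 33, 44}, while the scalar model yields {11, 2, 3, 4}.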

10.4.1.1

SSE Data Movement Instructions

SSE data movement instructions move single-precision floating-point data between XMM registers and between an XMM register and memory.

The MOVAPS (move aligned packed single-precision floating-point values) instruction transfers a double quadword operand containing four packed single-precision floating-point values from memory to an XMM register and vice versa, or between XMM registers. The memory address must be aligned to a 16-byte boundary; otherwise, a general-protection exception (#GP) is generated.

The MOVUPS (move unaligned packed single-precision floating-point values) instruction performs the same operations as the MOVAPS instruction, except that 16-byte alignment of a memory address is not required.

The MOVSS (move scalar single-precision floating-point) instruction transfers a 32-bit single-precision floating-point operand from memory to the low doubleword of an XMM register and vice versa, or between XMM registers.

The MOVLPS (move low packed single-precision floating-point) instruction moves two packed single-precision floating-point values from memory to the low quadword of an XMM register and vice versa. The high quadword of the register is left unchanged.

The MOVHPS (move high packed single-precision floating-point) instruction moves two packed single-precision floating-point values from memory to the high quadword of an XMM register and vice versa. The low quadword of the register is left unchanged.

The MOVLHPS (move packed single-precision floating-point low to high) instruction moves two packed single-precision floating-point values from the low quadword of the source XMM register into the high quadword of the destination XMM register. The low quadword of the destination register is left unchanged.

The MOVHLPS (move packed single-precision floating-point high to low) instruction moves two packed single-precision floating-point values from the high quadword of the source XMM register into the low quadword of the destination XMM register. The high quadword of the destination register is left unchanged.

The MOVMSKPS (move packed single-precision floating-point mask) instruction transfers the most significant bit of each of the four packed single-precision floating-point numbers in an XMM register to a general-purpose register. This 4-bit value can then be used as a condition to perform branching.
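As an illustration of the MOVMSKPS description, the following C sketch gathers the most significant (sign) bit of each of four single-precision values into a 4-bit mask. It is a software model under the assumption that lane i maps to mask bit i; `xmm_t` and `movmskps_model` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Software stand-in for a 128-bit XMM register: f[0] is the low doubleword. */
typedef struct { float f[4]; } xmm_t;

/* Returns a 4-bit value in which bit i is the most significant bit
   (the sign bit) of lane i. */
unsigned movmskps_model(xmm_t src) {
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &src.f[i], sizeof bits);  /* read the raw encoding */
        mask |= (unsigned)(bits >> 31) << i;
    }
    return mask;
}
```

A caller can branch on the result, for example testing whether the mask is non-zero to find out if any lane is negative.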

10.4.1.2

SSE Arithmetic Instructions

SSE arithmetic instructions perform addition, subtraction, multiply, divide, reciprocal, square root, reciprocal of square root, and maximum/minimum operations on packed and scalar single-precision floating-point values.

The ADDPS (add packed single-precision floating-point values) and SUBPS (subtract packed single-precision floating-point values) instructions add and subtract, respectively, two packed single-precision floating-point operands.

The ADDSS (add scalar single-precision floating-point values) and SUBSS (subtract scalar single-precision floating-point values) instructions add and subtract, respectively, the low single-precision floating-point values of two operands and store the result in the low doubleword of the destination operand.

The MULPS (multiply packed single-precision floating-point values) instruction multiplies two packed single-precision floating-point operands.

The MULSS (multiply scalar single-precision floating-point values) instruction multiplies the low single-precision floating-point values of two operands and stores the result in the low doubleword of the destination operand.

The DIVPS (divide packed single-precision floating-point values) instruction divides two packed single-precision floating-point operands.

The DIVSS (divide scalar single-precision floating-point values) instruction divides the low single-precision floating-point values of two operands and stores the result in the low doubleword of the destination operand.

The RCPPS (compute reciprocals of packed single-precision floating-point values) instruction computes the approximate reciprocals of values in a packed single-precision floating-point operand.

The RCPSS (compute reciprocal of scalar single-precision floating-point values) instruction computes the approximate reciprocal of the low single-precision floating-point value in the source operand and stores the result in the low doubleword of the destination operand.

The SQRTPS (compute square roots of packed single-precision floating-point values) instruction computes the square roots of the values in a packed single-precision floating-point operand.

The SQRTSS (compute square root of scalar single-precision floating-point values) instruction computes the square root of the low single-precision floating-point value in the source operand and stores the result in the low doubleword of the destination operand.

The RSQRTPS (compute reciprocals of square roots of packed single-precision floating-point values) instruction computes the approximate reciprocals of the square roots of the values in a packed single-precision floating-point operand.

The RSQRTSS (reciprocal of square root of scalar single-precision floating-point value) instruction computes the approximate reciprocal of the square root of the low single-precision floating-point value in the source operand and stores the result in the low doubleword of the destination operand.

The MAXPS (return maximum of packed single-precision floating-point values) instruction compares the corresponding values from two packed single-precision floating-point operands and returns the numerically greater value from each comparison to the destination operand.

The MAXSS (return maximum of scalar single-precision floating-point values) instruction compares the low values from two packed single-precision floating-point operands and returns the numerically greater value from the comparison to the low doubleword of the destination operand.

The MINPS (return minimum of packed single-precision floating-point values) instruction compares the corresponding values from two packed single-precision floating-point operands and returns the numerically lesser value from each comparison to the destination operand.

The MINSS (return minimum of scalar single-precision floating-point values) instruction compares the low values from two packed single-precision floating-point operands and returns the numerically lesser value from the comparison to the low doubleword of the destination operand.

10.4.2

SSE Logical Instructions

SSE logical instructions perform AND, AND NOT, OR, and XOR operations on packed single-precision floating-point values.

The ANDPS (bitwise logical AND of packed single-precision floating-point values) instruction returns the logical AND of two packed single-precision floating-point operands.

The ANDNPS (bitwise logical AND NOT of packed single-precision floating-point values) instruction returns the logical AND NOT of two packed single-precision floating-point operands.

The ORPS (bitwise logical OR of packed single-precision floating-point values) instruction returns the logical OR of two packed single-precision floating-point operands.

The XORPS (bitwise logical XOR of packed single-precision floating-point values) instruction returns the logical XOR of two packed single-precision floating-point operands.

10.4.2.1

SSE Comparison Instructions

The compare instructions compare packed and scalar single-precision floating-point values and return the results of the comparison either to the destination operand or to the EFLAGS register.

The CMPPS (compare packed single-precision floating-point values) instruction compares the corresponding values from two packed single-precision floating-point operands, using an immediate operand as a predicate, and returns a 32-bit mask result of all 1s or all 0s for each comparison to the destination operand. The value of the immediate operand allows the selection of any of 8 compare conditions: equal, less than, less than or equal, unordered, not equal, not less than, not less than or equal, or ordered.

The CMPSS (compare scalar single-precision floating-point values) instruction compares the low values from two packed single-precision floating-point operands, using an immediate operand as a predicate, and returns a 32-bit mask result of all 1s or all 0s for the comparison to the low doubleword of the destination operand. The immediate operand selects the compare conditions as with the CMPPS instruction.
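The predicate-and-mask behavior of CMPPS and CMPSS can be sketched for a single lane in C. The numeric predicate values 0 through 7 are assumed here to follow the order in which the eight conditions are listed above; the instruction reference is the authority for the encoding. The names `cmp_predicate` and `cmp_lane_model` are hypothetical.

```c
#include <math.h>
#include <stdint.h>

/* Predicate values, assumed to follow the order in which the text
   lists the eight compare conditions. */
enum cmp_predicate {
    CMP_EQ = 0, CMP_LT = 1, CMP_LE = 2, CMP_UNORD = 3,
    CMP_NEQ = 4, CMP_NLT = 5, CMP_NLE = 6, CMP_ORD = 7
};

/* One lane of CMPPS (or the low lane of CMPSS): returns all 1s if the
   predicate holds for (a, b), otherwise all 0s. */
uint32_t cmp_lane_model(float a, float b, enum cmp_predicate pred) {
    int unordered = isnan(a) || isnan(b);  /* at least one operand is a NaN */
    int holds = 0;
    switch (pred) {
    case CMP_EQ:    holds = (a == b);   break;
    case CMP_LT:    holds = (a < b);    break;
    case CMP_LE:    holds = (a <= b);   break;
    case CMP_UNORD: holds = unordered;  break;
    case CMP_NEQ:   holds = !(a == b);  break;
    case CMP_NLT:   holds = !(a < b);   break;
    case CMP_NLE:   holds = !(a <= b);  break;
    case CMP_ORD:   holds = !unordered; break;
    }
    return holds ? 0xFFFFFFFFu : 0u;
}
```

Because the result is a full-width mask rather than a flag, it can be combined with the logical instructions (for example, AND with one operand and AND NOT with another) to select values per lane without branching.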

The COMISS (compare scalar single-precision floating-point values and set EFLAGS) and UCOMISS (unordered compare scalar single-precision floating-point values and set EFLAGS) instructions compare the low values of two packed single-precision floating-point operands and set the ZF, PF, and CF flags in the EFLAGS register to show the result (greater than, less than, equal, or unordered). These two instructions differ as follows: the COMISS instruction signals a floating-point invalid-operation (#I) exception when a source operand is either a QNaN or an SNaN; the UCOMISS instruction only signals an invalid-operation exception when a source operand is an SNaN.
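The four outcomes reported through ZF, PF, and CF can be sketched as a small C model. The specific flag encoding used below is not given in this section; it is assumed from the COMISS/UCOMISS instruction reference. The sketch does not model the difference in exception signaling between the two instructions, and the type and function names are hypothetical.

```c
#include <math.h>

/* Hypothetical container for the three flags the instructions set. */
typedef struct { int zf, pf, cf; } eflags_result_t;

/* Flag encoding assumed from the instruction reference:
   unordered -> ZF=PF=CF=1, greater than -> all 0,
   less than -> CF=1, equal -> ZF=1. */
eflags_result_t comiss_flags_model(float a, float b) {
    eflags_result_t r = {0, 0, 0};
    if (isnan(a) || isnan(b)) {
        r.zf = 1; r.pf = 1; r.cf = 1;   /* unordered */
    } else if (a < b) {
        r.cf = 1;                        /* less than */
    } else if (a == b) {
        r.zf = 1;                        /* equal */
    }                                    /* greater than: all flags stay 0 */
    return r;
}
```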

10.4.2.2

SSE Shuffle and Unpack Instructions

SSE shuffle and unpack instructions shuffle or interleave the contents of two packed single-precision floating-point operands and store the results in the destination operand.

The SHUFPS (shuffle packed single-precision floating-point values) instruction places any two of the four packed single-precision floating-point values from the destination operand into the two low-order doublewords of the destination operand, and places any two of the four packed single-precision floating-point values from the source operand in the two high-order doublewords of the destination operand (see Figure 10-7). By using the same register for the source and destination operands, the SHUFPS instruction can shuffle four single-precision floating-point values into any order.

DEST:            X3           X2           X1           X0
SRC:             Y3           Y2           Y1           Y0
DEST (result):   Y3 ... Y0    Y3 ... Y0    X3 ... X0    X3 ... X0

Figure 10-7. SHUFPS Instruction, Packed Shuffle Operation

The UNPCKHPS (unpack and interleave high packed single-precision floating-point values) instruction performs an interleaved unpack of the high-order single-precision floating-point values from the source and destination operands and stores the result in the destination operand (see Figure 10-8).

DEST:            X3    X2    X1    X0
SRC:             Y3    Y2    Y1    Y0
DEST (result):   Y3    X3    Y2    X2

Figure 10-8. UNPCKHPS Instruction, High Unpack and Interleave Operation

The UNPCKLPS (unpack and interleave low packed single-precision floating-point values) instruction performs an interleaved unpack of the low-order single-precision floating-point values from the source and destination operands and stores the result in the destination operand (see Figure 10-9).

DEST:            X3    X2    X1    X0
SRC:             Y3    Y2    Y1    Y0
DEST (result):   Y1    X1    Y0    X0

Figure 10-9. UNPCKLPS Instruction, Low Unpack and Interleave Operation
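The lane movements in Figures 10-7 through 10-9 can be sketched as C software models. For SHUFPS, the layout of the 8-bit immediate (four 2-bit selector fields, lowest field selecting the lowest result lane) is assumed from the instruction reference and is not stated in this section. `xmm_t` and the function names are hypothetical.

```c
/* Software stand-in for a 128-bit XMM register: f[0] is the low doubleword. */
typedef struct { float f[4]; } xmm_t;

/* SHUFPS model: each 2-bit field of imm8 selects one lane. The two low
   result lanes come from the destination operand, the two high result
   lanes from the source operand. */
xmm_t shufps_model(xmm_t dst, xmm_t src, unsigned imm8) {
    xmm_t r;
    r.f[0] = dst.f[imm8 & 3];
    r.f[1] = dst.f[(imm8 >> 2) & 3];
    r.f[2] = src.f[(imm8 >> 4) & 3];
    r.f[3] = src.f[(imm8 >> 6) & 3];
    return r;
}

/* UNPCKHPS model: interleave the high lanes (high to low: Y3 X3 Y2 X2). */
xmm_t unpckhps_model(xmm_t dst, xmm_t src) {
    xmm_t r = {{ dst.f[2], src.f[2], dst.f[3], src.f[3] }};
    return r;
}

/* UNPCKLPS model: interleave the low lanes (high to low: Y1 X1 Y0 X0). */
xmm_t unpcklps_model(xmm_t dst, xmm_t src) {
    xmm_t r = {{ dst.f[0], src.f[0], dst.f[1], src.f[1] }};
    return r;
}
```

Passing the same value as both operands of `shufps_model` with an immediate of 0x1B reverses the four lanes, which illustrates the remark that SHUFPS can reorder four values arbitrarily when source and destination are the same register.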

10.4.3

SSE Conversion Instructions

SSE conversion instructions (see Figure 11-8) support packed and scalar conversions between single-precision floating-point and doubleword integer formats.

The CVTPI2PS (convert packed doubleword integers to packed single-precision floating-point values) instruction converts two packed signed doubleword integers into two packed single-precision floating-point values. When the conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register.

The CVTSI2SS (convert doubleword integer to scalar single-precision floating-point value) instruction converts a signed doubleword integer into a single-precision floating-point value. When the conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register.

The CVTPS2PI (convert packed single-precision floating-point values to packed doubleword integers) instruction converts two packed single-precision floating-point values into two packed signed doubleword integers. When the conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register.

The CVTTPS2PI (convert with truncation packed single-precision floating-point values to packed doubleword integers) instruction is similar to the CVTPS2PI instruction, except that truncation is used to round a source value to an integer value (see Section 4.8.4.2, “Truncation with SSE and SSE2 Conversion Instructions”).

The CVTSS2SI (convert scalar single-precision floating-point value to doubleword integer) instruction converts a single-precision floating-point value into a signed doubleword integer. When the conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register.

The CVTTSS2SI (convert with truncation scalar single-precision floating-point value to doubleword integer) instruction is similar to the CVTSS2SI instruction, except that truncation is used to round the source value to an integer value (see Section 4.8.4.2, “Truncation with SSE and SSE2 Conversion Instructions”).
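The difference between the truncating and the rounding conversions can be sketched in C. The model below covers a single case for CVTSS2SI, namely that the rounding-control field selects round to nearest with ties to even; other rounding modes and out-of-range inputs are not modeled. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* CVTTSS2SI model: truncation (round toward zero). A C float-to-integer
   cast also truncates, provided the value fits in the destination type. */
int32_t cvttss2si_model(float x) {
    /* Out-of-range inputs and NaNs are outside the scope of this sketch. */
    assert(x >= -2147483648.0f && x < 2147483648.0f);
    return (int32_t)x;
}

/* CVTSS2SI model for one rounding mode only: round to nearest, ties to even. */
int32_t cvtss2si_nearest_model(float x) {
    int32_t t = cvttss2si_model(x);
    float frac = x - (float)t;  /* what truncation discarded, in (-1, 1) */
    if (frac > 0.5f || (frac == 0.5f && (t % 2 != 0)))
        return t + 1;
    if (frac < -0.5f || (frac == -0.5f && (t % 2 != 0)))
        return t - 1;
    return t;
}
```

For an input of 2.7 the truncating model returns 2 and the rounding model returns 3; for the tie value 2.5 the rounding model returns 2 (the even neighbor), while 3.5 rounds to 4.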

10.4.4

SSE 64-Bit SIMD Integer Instructions

SSE extensions add the following 64-bit packed integer instructions to the IA-32 architecture. These instructions operate on data in MMX registers and 64-bit memory locations.

NOTE
When SSE2 extensions are present in an IA-32 processor, these instructions are extended to operate on 128-bit operands in XMM registers and 128-bit memory locations.

The PAVGB (compute average of packed unsigned byte integers) and PAVGW (compute average of packed unsigned word integers) instructions compute a SIMD average of two packed unsigned byte or word integer operands, respectively. For each corresponding pair of data elements in the packed source operands, the elements are added together, a 1 is added to the temporary sum, and that result is shifted right one bit position.

The PEXTRW (extract word) instruction copies a selected word from an MMX register into a general-purpose register.

The PINSRW (insert word) instruction copies a word from a general-purpose register or from memory into a selected word location in an MMX register.

The PMAXUB (maximum of packed unsigned byte integers) instruction compares the corresponding unsigned byte integers in two packed operands and returns the greater of each comparison to the destination operand.

The PMINUB (minimum of packed unsigned byte integers) instruction compares the corresponding unsigned byte integers in two packed operands and returns the lesser of each comparison to the destination operand.

The PMAXSW (maximum of packed signed word integers) instruction compares the corresponding signed word integers in two packed operands and returns the greater of each comparison to the destination operand.

The PMINSW (minimum of packed signed word integers) instruction compares the corresponding signed word integers in two packed operands and returns the lesser of each comparison to the destination operand.

The PMOVMSKB (move byte mask) instruction creates an 8-bit mask from the packed byte integers in an MMX register and stores the result in the low byte of a general-purpose register. The mask contains the most significant bit of each byte in the MMX register. (When operating on 128-bit operands, a 16-bit mask is created.)

The PMULHUW (multiply packed unsigned word integers and store high result) instruction performs a SIMD unsigned multiply of the words in the two source operands and returns the high word of each result to an MMX register.

The PSADBW (compute sum of absolute differences) instruction computes the SIMD absolute differences of the corresponding unsigned byte integers in two source operands, sums the differences, and stores the sum in the low word of the destination operand.

The PSHUFW (shuffle packed word integers) instruction shuffles the words in the source operand according to the order specified by an 8-bit immediate operand and returns the result to the destination operand.
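Three of the 64-bit SIMD integer operations described in this section (PAVGB, PSADBW, and PMOVMSKB) can be sketched in C on a `uint64_t` that stands in for an MMX register, with byte 0 in the least significant position. The function names are hypothetical, and the models cover only the 64-bit operand forms.

```c
#include <stdint.h>

/* Extract unsigned byte i (0 = least significant) from a 64-bit operand. */
static unsigned byte_of(uint64_t v, int i) {
    return (unsigned)((v >> (8 * i)) & 0xFF);
}

/* PAVGB model: per byte, add the elements, add 1, shift right one bit.
   The sum is formed in a wider temporary so the carry is not lost. */
uint64_t pavgb_model(uint64_t a, uint64_t b) {
    uint64_t result = 0;
    for (int i = 0; i < 8; i++) {
        uint64_t avg = (byte_of(a, i) + byte_of(b, i) + 1u) >> 1;
        result |= avg << (8 * i);
    }
    return result;
}

/* PSADBW model: sum of the eight absolute byte differences, placed in the
   low word (the remaining bits of this model's result are zero). */
uint64_t psadbw_model(uint64_t a, uint64_t b) {
    unsigned sum = 0;
    for (int i = 0; i < 8; i++) {
        unsigned x = byte_of(a, i), y = byte_of(b, i);
        sum += (x > y) ? (x - y) : (y - x);
    }
    return sum;
}

/* PMOVMSKB model: bit i of the 8-bit mask is the most significant bit of byte i. */
unsigned pmovmskb_model(uint64_t a) {
    unsigned mask = 0;
    for (int i = 0; i < 8; i++)
        mask |= (byte_of(a, i) >> 7) << i;
    return mask;
}
```

The added 1 in the average makes the result round up on odd sums: averaging the bytes 1 and 2 gives 2, and averaging 255 with 255 gives 255 without overflow.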

10.4.5

MXCSR State Management Instructions

The MXCSR state management instructions (LDMXCSR and STMXCSR) load and save the state of the MXCSR register, respectively. The LDMXCSR instruction loads the MXCSR register from memory, while the STMXCSR instruction stores the contents of the register to memory.

10.4.6

Cacheability Control, Prefetch, and Memory Ordering Instructions

SSE extensions introduce several new instructions to give programs more control over the caching of data. They also introduce the PREFETCHh instructions, which provide the ability to prefetch data to a specified cache level, and the SFENCE instruction, which enforces program ordering on stores. These instructions are described in the following sections.

10.4.6.1

Cacheability Control Instructions

The following three instructions enable data from the MMX and XMM registers to be stored to memory using a non-temporal hint. The non-temporal hint directs the processor to store the data to memory, when possible, without writing the data into the cache hierarchy. See Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data,” for information about non-temporal stores and hints.

The MOVNTQ (store quadword using non-temporal hint) instruction stores packed integer data from an MMX register to memory, using a non-temporal hint.

The MOVNTPS (store packed single-precision floating-point values using non-temporal hint) instruction stores packed floating-point data from an XMM register to memory, using a non-temporal hint.

The MASKMOVQ (store selected bytes of quadword) instruction stores selected byte integers from an MMX register to memory, using a byte mask to selectively write the individual bytes. This instruction also uses a non-temporal hint.

10.4.6.2

Caching of Temporal vs. Non-Temporal Data

Data referenced by a program can be temporal (data will be used again) or non-temporal (data will be referenced once and not reused in the immediate future). For example, program code is generally temporal, whereas multimedia data, such as the display list in a 3-D graphics application, is often non-temporal. To make efficient use of the processor’s caches, it is generally desirable to cache temporal data and not cache non-temporal data. Overloading the processor’s caches with non-temporal data is sometimes referred to as “polluting the caches.” The SSE and SSE2 cacheability control instructions enable a program to write non-temporal data to memory in a manner that minimizes pollution of caches.

These SSE and SSE2 non-temporal store instructions minimize cache pollution by treating the memory being accessed as the write combining (WC) type. If a program specifies a non-temporal store with one of these instructions and the destination region is mapped as cacheable memory (write back [WB], write through [WT] or WC memory type), the processor will do the following:



• If the memory location being written to is present in the cache hierarchy, the data in the caches is evicted.
• The non-temporal data is written to memory with WC semantics.

See also: Chapter 10, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.

Using the WC semantics, the store transaction will be weakly ordered, meaning that the data may not be written to memory in program order, and the store will not write allocate (that is, the processor will not fetch the corresponding cache line into the cache hierarchy, prior to performing the store). Also, different processor implementations may choose to collapse and combine these stores.

The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in uncacheable memory. Uncacheable as referred to here means that the region being written to has been mapped with either an uncacheable (UC) or write protected (WP) memory type.

In general, WC semantics require software to ensure coherence, with respect to other processors and other system agents (such as graphics cards). Appropriate use of synchronization and fencing must be performed for producer-consumer usage models. Fencing ensures that all system agents have global visibility of the stored data; for instance, failure to fence may result in a written cache line staying within a processor and not being visible to other agents.

For processors that implement non-temporal stores by updating data in-place that already resides in the cache hierarchy, the destination region should also be mapped as WC. If mapped as WB or WT, there is the potential for speculative processor reads to bring the data into the caches; in this case, non-temporal stores would then update in place, and data would not be flushed from the processor by a subsequent fencing operation.

The memory type visible on the bus in the presence of memory type aliasing is implementation specific. As one possible example, the memory type written to the bus may reflect the memory type for the first store to this line, as seen in program order; other alternatives are possible. This behavior should be considered reserved, and dependence on the behavior of any particular implementation risks future incompatibility.

10.4.6.3 PREFETCHh Instructions

The PREFETCHh instructions permit programs to load data into the processor at a suggested cache level, so that the data is closer to the processor’s load and store unit when it is needed. These instructions fetch 32 aligned bytes (or more, depending on the implementation) containing the addressed byte to a location in the cache hierarchy specified by the temporal locality hint (see Table 10-1). In this table, the first-level cache is closest to the processor and second-level cache is farther away from the processor than the first-level cache. The hints specify a prefetch of either temporal or non-temporal data (see Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data”).

Subsequent accesses to temporal data are treated like normal accesses, while those to non-temporal data will continue to minimize cache pollution. If the data is already present at a level of the cache hierarchy that is closer to the processor, the PREFETCHh instruction will not result in any data movement. The PREFETCHh instructions do not affect functional behavior of the program.

See Section 11.6.13, “Cacheability Hint Instructions,” for additional information about the PREFETCHh instructions.


Table 10-1. PREFETCHh Instructions Caching Hints

PREFETCHh Instruction Mnemonic    Actions

PREFETCHT0     Temporal data: fetch data into all levels of cache hierarchy
               • Pentium III processor: 1st-level cache or 2nd-level cache
               • Pentium 4 and Intel Xeon processor: 2nd-level cache

PREFETCHT1     Temporal data: fetch data into level 2 cache and higher
               • Pentium III processor: 2nd-level cache
               • Pentium 4 and Intel Xeon processor: 2nd-level cache

PREFETCHT2     Temporal data: fetch data into level 2 cache and higher
               • Pentium III processor: 2nd-level cache
               • Pentium 4 and Intel Xeon processor: 2nd-level cache

PREFETCHNTA    Non-temporal data: fetch data into location close to the processor, minimizing cache pollution
               • Pentium III processor: 1st-level cache
               • Pentium 4 and Intel Xeon processor: 2nd-level cache
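
As a rough illustration of a temporal prefetch hint, the following C sketch sums an array while requesting data a fixed distance ahead through the `_mm_prefetch` intrinsic from `<xmmintrin.h>`, which compilers typically map to a PREFETCHh instruction. The function name `sum_with_prefetch` and the distance `PREFETCH_AHEAD` are illustrative assumptions, not values taken from this manual; a useful distance depends on the processor implementation.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

/* Hypothetical tuning value: how many elements ahead to prefetch. */
#define PREFETCH_AHEAD 16

/* Sum n floats, hinting that data a short distance ahead will be needed soon. */
static float sum_with_prefetch(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* A hint only: it does not change the result of the program. */
            _mm_prefetch((const char *)&data[i + PREFETCH_AHEAD], _MM_HINT_T0);
        }
        total += data[i];
    }
    return total;
}
```

Because the hint does not affect functional behavior, the function returns the same sum with or without the `_mm_prefetch` call; only the timing may differ.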

10.4.6.4 SFENCE Instruction

The SFENCE (Store Fence) instruction controls write ordering by creating a fence for memory store operations. This instruction guarantees that the result of every store instruction that precedes the store fence in program order is globally visible before any store instruction that follows the fence. The SFENCE instruction provides an efficient way of ensuring ordering between procedures that produce weakly-ordered data and procedures that consume that data.
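
The producer side of such a usage model might look like the following C sketch, which writes a buffer with non-temporal stores and then issues a store fence. It uses the `_mm_stream_ps` and `_mm_sfence` intrinsics from `<xmmintrin.h>`; the function name `fill_streaming`, the requirement that `count` be a multiple of four, and the 16-byte alignment of `dst` are assumptions made to keep the example small.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* _mm_set1_ps, _mm_stream_ps, _mm_sfence */

/* Fill a 16-byte-aligned buffer using non-temporal stores, then fence.
   In this simplified sketch, count must be a multiple of 4 floats. */
static void fill_streaming(float *dst, size_t count, float value)
{
    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < count; i += 4) {
        _mm_stream_ps(dst + i, v);  /* weakly ordered, non-temporal store */
    }
    /* Make the preceding stores globally visible before any later store,
       for example a "data ready" flag written by the producer. */
    _mm_sfence();
}
```

A consumer would typically wait on a flag that the producer writes after the fence; that synchronization is outside the scope of this sketch.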

10.5 FXSAVE AND FXRSTOR INSTRUCTIONS

The FXSAVE and FXRSTOR instructions were introduced into the IA-32 architecture in the Pentium II processor family (prior to the introduction of the SSE extensions). The original versions of these instructions performed a fast save and restore, respectively, of the x87 FPU register state. (By saving the state of the x87 FPU data registers, the FXSAVE and FXRSTOR instructions implicitly save and restore the state of the MMX registers.)

The SSE extensions expanded the scope of these instructions to save and restore the states of the XMM registers and the MXCSR register, along with the x87 FPU and MMX state.

The FXSAVE and FXRSTOR instructions can be used in place of the FSAVE/FNSAVE and FRSTOR instructions; however, the operation of the FXSAVE and FXRSTOR instructions is not identical to the operation of FSAVE/FNSAVE and FRSTOR.


NOTE
The FXSAVE and FXRSTOR instructions are not considered part of the SSE instruction group. They have a separate CPUID feature bit to indicate whether they are present (if CPUID.01H:EDX.FXSR[bit 24] = 1). The CPUID feature bit for SSE extensions does not indicate the presence of FXSAVE and FXRSTOR.
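
Since the two features are reported separately, software can test each bit on its own. The following C sketch does so with the `__get_cpuid` helper from the GCC/Clang header `<cpuid.h>` (an assumption about the toolchain; other compilers expose CPUID differently). The helper names `cpuid01_edx_bit`, `has_fxsr`, `has_sse`, and `has_sse2` are hypothetical.

```c
#include <cpuid.h>  /* GCC/Clang helper: __get_cpuid */

/* Return 1 if the given bit of CPUID.01H:EDX is set, 0 otherwise. */
static int cpuid01_edx_bit(unsigned int bit)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;  /* CPUID leaf 01H is not available */
    }
    return (int)((edx >> bit) & 1u);
}

static int has_fxsr(void) { return cpuid01_edx_bit(24); }  /* FXSAVE/FXRSTOR */
static int has_sse(void)  { return cpuid01_edx_bit(25); }  /* SSE extensions */
static int has_sse2(void) { return cpuid01_edx_bit(26); }  /* SSE2 extensions */
```

Code that intends to save SSE state with FXSAVE would check `has_fxsr()` in addition to `has_sse()`, rather than inferring one from the other.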

10.6 HANDLING SSE INSTRUCTION EXCEPTIONS

See Section 11.5, “SSE, SSE2, and SSE3 Exceptions,” for a detailed discussion of the general and SIMD floating-point exceptions that can be generated with the SSE instructions and for guidelines for handling these exceptions when they occur.

10.7 WRITING APPLICATIONS WITH THE SSE EXTENSIONS

See Section 11.6, “Writing Applications with SSE/SSE2 Extensions,” for additional information about writing applications and operating-system code using the SSE extensions.


CHAPTER 11
PROGRAMMING WITH STREAMING SIMD EXTENSIONS 2 (SSE2)

The streaming SIMD extensions 2 (SSE2) were introduced into the IA-32 architecture in the Pentium 4 and Intel Xeon processors. These extensions enhance the performance of IA-32 processors for advanced 3-D graphics, video decoding/encoding, speech recognition, E-commerce, Internet, scientific, and engineering applications.

This chapter describes the SSE2 extensions and provides information to assist in writing application programs that use these and the SSE extensions.

11.1 OVERVIEW OF SSE2 EXTENSIONS

SSE2 extensions use the single instruction multiple data (SIMD) execution model that is used with MMX technology and SSE extensions. They extend this model with support for packed double-precision floating-point values and for 128-bit packed integers.

If CPUID.01H:EDX.SSE2[bit 26] = 1, SSE2 extensions are present.

SSE2 extensions add the following features to the IA-32 architecture, while maintaining backward compatibility with all existing IA-32 processors, applications and operating systems.



• Five data types:
  - 128-bit packed double-precision floating-point (two IEEE Standard 754 double-precision floating-point values packed into a double quadword)
  - 128-bit packed byte integers
  - 128-bit packed word integers
  - 128-bit packed doubleword integers
  - 128-bit packed quadword integers

• Instructions to support the additional data types and extend existing SIMD integer operations:
  - Packed and scalar double-precision floating-point instructions
  - Additional 64-bit and 128-bit SIMD integer instructions
  - 128-bit versions of SIMD integer instructions introduced with the MMX technology and the SSE extensions
  - Additional cacheability-control and instruction-ordering instructions




• Modifications to existing IA-32 instructions to support SSE2 features:
  - Extensions and modifications to the CPUID instruction
  - Modifications to the RDPMC instruction

These new features extend the IA-32 architecture’s SIMD programming model in three important ways:

• They provide the ability to perform SIMD operations on pairs of packed double-precision floating-point values. This permits higher precision computations to be carried out in XMM registers, which enhances processor performance in scientific and engineering applications and in applications that use advanced 3-D geometry techniques (such as ray tracing). Additional flexibility is provided with instructions that operate on single (scalar) double-precision floating-point values located in the low quadword of an XMM register.

• They provide the ability to operate on 128-bit packed integers (bytes, words, doublewords, and quadwords) in XMM registers. This provides greater flexibility and greater throughput when performing SIMD operations on packed integers. The capability is particularly useful for applications such as RSA authentication and RC5 encryption. Using the full set of SIMD registers, data types, and instructions provided with the MMX technology and SSE/SSE2 extensions, programmers can develop algorithms that finely mix packed single- and double-precision floating-point data and 64- and 128-bit packed integer data.

• SSE2 extensions enhance the support introduced with SSE extensions for controlling the cacheability of SIMD data. SSE2 cache control instructions provide the ability to stream data in and out of the XMM registers without polluting the caches and the ability to prefetch data before it is actually used.

SSE2 extensions are fully compatible with all software written for IA-32 processors. All existing software continues to run correctly, without modification, on processors that incorporate SSE2 extensions, as well as in the presence of applications that incorporate these extensions. Enhancements to the CPUID instruction permit detection of the SSE2 extensions. Also, because the SSE2 extensions use the same registers as the SSE extensions, no new operating-system support is required for saving and restoring program state during a context switch beyond that provided for the SSE extensions.

SSE2 extensions are accessible from all IA-32 execution modes: protected mode, real address mode, and virtual 8086 mode.

The following sections in this chapter describe the programming environment for SSE2 extensions, including the 128-bit XMM floating-point register set, data types, and SSE2 instructions. The chapter also describes exceptions that can be generated with the SSE and SSE2 instructions and gives guidelines for writing applications with SSE and SSE2 extensions.

For additional information about SSE2 extensions, see:



• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, provide a detailed description of individual SSE2 instructions.

• Chapter 12, “System Programming for Streaming SIMD Instruction Sets,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, gives guidelines for integrating the SSE and SSE2 extensions into an operating-system environment.

11.2 SSE2 PROGRAMMING ENVIRONMENT

Figure 11-1 shows the programming environment for SSE2 extensions. No new registers or other instruction execution state are defined with SSE2 extensions. SSE2 instructions use the XMM registers, the MMX registers, and/or IA-32 general-purpose registers, as follows:



• XMM registers: These eight registers (see Figure 10-2) are used to operate on packed or scalar double-precision floating-point data. Scalar operations are operations performed on individual (unpacked) double-precision floating-point values stored in the low quadword of an XMM register. XMM registers are also used to perform operations on 128-bit packed integer data. They are referenced by the names XMM0 through XMM7.

[Figure 11-1. Streaming SIMD Extensions 2 Execution Environment. The figure shows the address space (0 to 2^32 - 1), eight 128-bit XMM registers, the 32-bit MXCSR register, eight 64-bit MMX registers, eight 32-bit general-purpose registers, and the 32-bit EFLAGS register.]



• MXCSR register: This 32-bit register (see Figure 10-3) provides status and control bits used in floating-point operations. The denormals-are-zeros and flush-to-zero flags in this register provide a higher performance alternative for the handling of denormal source operands and denormal (underflow) results. For more information on the functions of these flags see Section 10.2.3.4, “Denormals-Are-Zeros,” and Section 10.2.3.3, “Flush-To-Zero.”



• MMX registers: These eight registers (see Figure 9-2) are used to perform operations on 64-bit packed integer data. They are also used to hold operands for some operations performed between MMX and XMM registers. MMX registers are referenced by the names MM0 through MM7.

• General-purpose registers: The eight general-purpose registers (see Figure 3-5) are used along with the existing IA-32 addressing modes to address operands in memory. MMX and XMM registers cannot be used to address memory. The general-purpose registers are also used to hold operands for some SSE2 instructions. These registers are referenced by the names EAX, EBX, ECX, EDX, EBP, ESI, EDI, and ESP.

• EFLAGS register: This 32-bit register (see Figure 3-8) is used to record the results of some compare operations.

11.2.1 SSE2 in 64-Bit Mode and Compatibility Mode

In compatibility mode, SSE2 extensions function like they do in protected mode. In 64-bit mode, eight additional XMM registers are accessible. Registers XMM8-XMM15 are accessed by using REX prefixes.

Memory operands are specified using the ModR/M, SIB encoding described in Section 3.7.5.

Some SSE2 instructions may be used to operate on general-purpose registers. Use the REX.W prefix to access 64-bit general-purpose registers. Note that if a REX prefix is used when it has no meaning, the prefix is ignored.

11.2.2 Compatibility of SSE2 Extensions with SSE, MMX Technology and x87 FPU Programming Environment

SSE2 extensions do not introduce any new state to the IA-32 execution environment beyond that of SSE. SSE2 extensions represent an enhancement of SSE extensions; they are fully compatible and share the same state information. SSE and SSE2 instructions can be executed together in the same instruction stream without the need to save state when switching between instruction sets.

XMM registers are independent of the x87 FPU and MMX registers; so SSE and SSE2 operations performed on XMM registers can be performed in parallel with x87 FPU or MMX technology operations (see Section 11.6.7, “Interaction of SSE/SSE2 Instructions with x87 FPU and MMX Instructions”).

The FXSAVE and FXRSTOR instructions save and restore the SSE and SSE2 states along with the x87 FPU and MMX states.


11.2.3 Denormals-Are-Zeros Flag

The denormals-are-zeros flag (bit 6 in the MXCSR register) was introduced into the IA-32 architecture with the SSE2 extensions. See Section 10.2.3.4, “Denormals-Are-Zeros,” for a description of this flag.
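
To make the effect of the flag concrete, the following C sketch multiplies a denormal value by 2.0 with bit 6 of MXCSR either set or clear, using the `_mm_getcsr` and `_mm_setcsr` intrinsics. The function name `double_a_denormal` and the constant `MXCSR_DAZ` are hypothetical, and the sketch assumes a processor that supports the denormals-are-zeros flag; production code would confirm support before setting the bit.

```c
#include <float.h>
#include <xmmintrin.h>  /* _mm_getcsr, _mm_setcsr */
#include <emmintrin.h>  /* _mm_set_sd, _mm_mul_sd, _mm_cvtsd_f64 */

#define MXCSR_DAZ 0x0040u  /* bit 6: denormals-are-zeros */

/* Multiply a denormal input by 2.0 with the DAZ flag set or clear. */
static double double_a_denormal(int daz_enabled)
{
    unsigned int saved = _mm_getcsr();
    _mm_setcsr(daz_enabled ? (saved | MXCSR_DAZ) : (saved & ~MXCSR_DAZ));

    volatile double tiny = DBL_MIN / 2.0;  /* a denormal double */
    volatile double two = 2.0;
    /* Scalar double-precision multiply in an XMM register. */
    volatile double result =
        _mm_cvtsd_f64(_mm_mul_sd(_mm_set_sd(tiny), _mm_set_sd(two)));

    _mm_setcsr(saved);  /* restore the caller's MXCSR */
    return result;
}
```

With the flag set, the denormal source operand is treated as zero and the product is zero; with the flag clear, the product is the smallest normal double, `DBL_MIN`.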

11.3 SSE2 DATA TYPES

SSE2 extensions introduced one 128-bit packed floating-point data type and four 128-bit SIMD integer data types to the IA-32 architecture (see Figure 11-2).



• Packed double-precision floating-point: This 128-bit data type consists of two IEEE 64-bit double-precision floating-point values packed into a double quadword. (See Figure 4-3 for the layout of a 64-bit double-precision floating-point value; refer to Section 4.2.2, “Floating-Point Data Types,” for a detailed description of double-precision floating-point values.)

• 128-bit packed integers: The four 128-bit packed integer data types can contain 16 byte integers, 8 word integers, 4 doubleword integers, or 2 quadword integers. (Refer to Section 4.6.2, “128-Bit Packed SIMD Data Types,” for a detailed description of the 128-bit packed integers.)

[Figure 11-2. Data Types Introduced with the SSE2 Extensions. Each type occupies bits 127 through 0: 128-bit packed double-precision floating-point (two values, in bits 127-64 and 63-0), 128-bit packed byte integers, 128-bit packed word integers, 128-bit packed doubleword integers, and 128-bit packed quadword integers.]

All of these data types are operated on in XMM registers or memory. Instructions are provided to convert between these 128-bit data types and the 64-bit and 32-bit data types.


The address of a 128-bit packed memory operand must be aligned on a 16-byte boundary, except in the following cases:

• a MOVUPD instruction, which supports unaligned accesses
• scalar instructions that use an 8-byte memory operand that is not subject to alignment requirements

Figure 4-2 shows the byte order of 128-bit (double quadword) and 64-bit (quadword) data types in memory.

11.4 SSE2 INSTRUCTIONS

The SSE2 instructions are divided into four functional groups:

• Packed and scalar double-precision floating-point instructions
• 64-bit and 128-bit SIMD integer instructions
• 128-bit extensions of SIMD integer instructions introduced with the MMX technology and the SSE extensions
• Cacheability-control and instruction-ordering instructions

The following sections provide more information about each group.

11.4.1 Packed and Scalar Double-Precision Floating-Point Instructions

The packed and scalar double-precision floating-point instructions are divided into the following sub-groups:

• Data movement instructions
• Arithmetic instructions
• Comparison instructions
• Conversion instructions
• Logical instructions
• Shuffle instructions

The packed double-precision floating-point instructions perform SIMD operations similarly to the packed single-precision floating-point instructions (see Figure 11-3). Each source operand contains two double-precision floating-point values, and the destination operand contains the results of the operation (OP) performed in parallel on the corresponding values (X0 and Y0, and X1 and Y1) in each operand.


[Figure 11-3. Packed Double-Precision Floating-Point Operations. The source operands hold (X1, X0) and (Y1, Y0); the destination receives (X1 OP Y1, X0 OP Y0).]

The scalar double-precision floating-point instructions operate on the low (least significant) quadwords of two source operands (X0 and Y0), as shown in Figure 11-4. The high quadword (X1) of the first source operand is passed through to the destination. The scalar operations are similar to the floating-point operations performed in x87 FPU data registers with the precision control field in the x87 FPU control word set for double precision (53-bit significand), except that x87 stack operations use a 15-bit exponent range for the result while SSE2 operations use an 11-bit exponent range.

See Section 11.6.8, “Compatibility of SIMD and x87 FPU Floating-Point Data Types,” for more information about obtaining compatible results when performing both scalar double-precision floating-point operations in XMM registers and in x87 FPU data registers.

[Figure 11-4. Scalar Double-Precision Floating-Point Operations. The source operands hold (X1, X0) and (Y1, Y0); the destination receives (X1, X0 OP Y0).]
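
The difference between the packed and scalar forms can be sketched in C with the `_mm_add_pd` and `_mm_add_sd` intrinsics from `<emmintrin.h>`, which compilers typically translate to ADDPD and ADDSD. The function names `add_packed` and `add_scalar` are hypothetical; array index 0 corresponds to the low quadword (X0 or Y0) and index 1 to the high quadword.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Packed add: both element pairs are summed (X1 + Y1, X0 + Y0). */
static void add_packed(const double x[2], const double y[2], double out[2])
{
    _mm_storeu_pd(out, _mm_add_pd(_mm_loadu_pd(x), _mm_loadu_pd(y)));
}

/* Scalar add: only the low elements are summed; the high element of the
   first source operand (X1) is passed through to the result. */
static void add_scalar(const double x[2], const double y[2], double out[2])
{
    _mm_storeu_pd(out, _mm_add_sd(_mm_loadu_pd(x), _mm_loadu_pd(y)));
}
```

For x = {1.0, 10.0} and y = {2.0, 20.0}, the packed form produces {3.0, 30.0}, while the scalar form produces {3.0, 10.0}.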


11.4.1.1 Data Movement Instructions

Data movement instructions move double-precision floating-point data between XMM registers and between XMM registers and memory.

The MOVAPD (move aligned packed double-precision floating-point) instruction transfers a 128-bit packed double-precision floating-point operand from memory to an XMM register or vice versa, or between XMM registers. The memory address must be aligned to a 16-byte boundary; if not, a general-protection exception (#GP) is generated.

The MOVUPD (move unaligned packed double-precision floating-point) instruction transfers a 128-bit packed double-precision floating-point operand from memory to an XMM register or vice versa, or between XMM registers. Alignment of the memory address is not required.

The MOVSD (move scalar double-precision floating-point) instruction transfers a 64-bit double-precision floating-point operand from memory to the low quadword of an XMM register or vice versa, or between XMM registers. Alignment of the memory address is not required, unless alignment checking is enabled.

The MOVHPD (move high packed double-precision floating-point) instruction transfers a 64-bit double-precision floating-point operand from memory to the high quadword of an XMM register or vice versa. The low quadword of the register is left unchanged. Alignment of the memory address is not required, unless alignment checking is enabled.

The MOVLPD (move low packed double-precision floating-point) instruction transfers a 64-bit double-precision floating-point operand from memory to the low quadword of an XMM register or vice versa. The high quadword of the register is left unchanged. Alignment of the memory address is not required, unless alignment checking is enabled.

The MOVMSKPD (move packed double-precision floating-point mask) instruction extracts the sign bit of each of the two packed double-precision floating-point numbers in an XMM register and saves them in a general-purpose register. This 2-bit value can then be used as a condition to perform branching.
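
A few of these moves can be combined in a short C sketch. It assembles an XMM value from two separate doubles with `_mm_loadl_pd` and `_mm_loadh_pd` and then extracts the sign mask with `_mm_movemask_pd`; these intrinsics correspond to MOVLPD, MOVHPD, and MOVMSKPD, although a compiler is free to choose equivalent instructions. The function name `sign_mask_of_pair` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Build a packed value from two doubles in memory and return its
   2-bit sign mask: bit 0 is the sign of the low element, bit 1 the
   sign of the high element. */
static int sign_mask_of_pair(const double *low, const double *high)
{
    __m128d v = _mm_setzero_pd();
    v = _mm_loadl_pd(v, low);    /* replace low quadword, keep high */
    v = _mm_loadh_pd(v, high);   /* replace high quadword, keep low */
    return _mm_movemask_pd(v);
}
```

A caller might branch on the result, for example taking a slow path only when the mask is nonzero (at least one element is negative).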

11.4.1.2 SSE2 Arithmetic Instructions

SSE2 arithmetic instructions perform addition, subtraction, multiply, divide, square root, and maximum/minimum operations on packed and scalar double-precision floating-point values.

The ADDPD (add packed double-precision floating-point values) and SUBPD (subtract packed double-precision floating-point values) instructions add and subtract, respectively, two packed double-precision floating-point operands.

The ADDSD (add scalar double-precision floating-point values) and SUBSD (subtract scalar double-precision floating-point values) instructions add and subtract, respectively, the low double-precision floating-point values of two operands and store the result in the low quadword of the destination operand.


The MULPD (multiply packed double-precision floating-point values) instruction multiplies two packed double-precision floating-point operands.

The MULSD (multiply scalar double-precision floating-point values) instruction multiplies the low double-precision floating-point values of two operands and stores the result in the low quadword of the destination operand.

The DIVPD (divide packed double-precision floating-point values) instruction divides two packed double-precision floating-point operands.

The DIVSD (divide scalar double-precision floating-point values) instruction divides the low double-precision floating-point values of two operands and stores the result in the low quadword of the destination operand.

The SQRTPD (compute square roots of packed double-precision floating-point values) instruction computes the square roots of the values in a packed double-precision floating-point operand.

The SQRTSD (compute square root of scalar double-precision floating-point values) instruction computes the square root of the low double-precision floating-point value in the source operand and stores the result in the low quadword of the destination operand.

The MAXPD (return maximum of packed double-precision floating-point values) instruction compares the corresponding values in two packed double-precision floating-point operands and returns the numerically greater value from each comparison to the destination operand.

The MAXSD (return maximum of scalar double-precision floating-point values) instruction compares the low double-precision floating-point values from two packed double-precision floating-point operands and returns the numerically higher value from the comparison to the low quadword of the destination operand.

The MINPD (return minimum of packed double-precision floating-point values) instruction compares the corresponding values from two packed double-precision floating-point operands and returns the numerically lesser value from each comparison to the destination operand.

The MINSD (return minimum of scalar double-precision floating-point values) instruction compares the low values from two packed double-precision floating-point operands and returns the numerically lesser value from the comparison to the low quadword of the destination operand.
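
As one way these operations combine, the following C sketch clamps two values to a range with `_mm_max_pd` and `_mm_min_pd` and then takes their square roots with `_mm_sqrt_pd`, the intrinsics that correspond to MAXPD, MINPD, and SQRTPD. The function name `clamped_sqrt` and its parameters are illustrative assumptions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Clamp both elements of v to [lo, hi], then take their square roots. */
static void clamped_sqrt(const double v[2], double lo, double hi, double out[2])
{
    __m128d x = _mm_loadu_pd(v);
    x = _mm_max_pd(x, _mm_set1_pd(lo));   /* raise values below lo */
    x = _mm_min_pd(x, _mm_set1_pd(hi));   /* lower values above hi */
    _mm_storeu_pd(out, _mm_sqrt_pd(x));   /* square root of each element */
}
```

For v = {-4.0, 100.0} with lo = 0.0 and hi = 16.0, the clamped values are {0.0, 16.0} and the result is {0.0, 4.0}.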

11.4.1.3 SSE2 Logical Instructions

SSE2 logical instructions perform AND, AND NOT, OR, and XOR operations on packed double-precision floating-point values.

The ANDPD (bitwise logical AND of packed double-precision floating-point values) instruction returns the logical AND of two packed double-precision floating-point operands.


The ANDNPD (bitwise logical AND NOT of packed double-precision floating-point values) instruction returns the logical AND NOT of two packed double-precision floating-point operands.

The ORPD (bitwise logical OR of packed double-precision floating-point values) instruction returns the logical OR of two packed double-precision floating-point operands.

The XORPD (bitwise logical XOR of packed double-precision floating-point values) instruction returns the logical XOR of two packed double-precision floating-point operands.
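
A common use of these bitwise operations is manipulating the sign bit directly. The C sketch below computes absolute values by clearing bit 63 of each element with `_mm_andnot_pd`, the intrinsic that corresponds to ANDNPD and computes (NOT first operand) AND second operand. The function name `abs_pair` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Absolute value of two doubles, computed by clearing each sign bit. */
static void abs_pair(const double in[2], double out[2])
{
    __m128d sign_bits = _mm_set1_pd(-0.0);  /* only bit 63 set per element */
    __m128d x = _mm_loadu_pd(in);
    _mm_storeu_pd(out, _mm_andnot_pd(sign_bits, x));  /* (~sign_bits) & x */
}
```

Replacing `_mm_andnot_pd` with `_mm_xor_pd` in the same pattern would instead negate both elements.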

11.4.1.4 SSE2 Comparison Instructions

SSE2 compare instructions compare packed and scalar double-precision floating-point values and return the results of the comparison either to the destination operand or to the EFLAGS register.

The CMPPD (compare packed double-precision floating-point values) instruction compares the corresponding values from two packed double-precision floating-point operands, using an immediate operand as a predicate, and returns a 64-bit mask result of all 1s or all 0s for each comparison to the destination operand. The value of the immediate operand allows the selection of any of eight compare conditions: equal, less than, less than or equal, unordered, not equal, not less than, not less than or equal, or ordered.

The CMPSD (compare scalar double-precision floating-point values) instruction compares the low values from two packed double-precision floating-point operands, using an immediate operand as a predicate, and returns a 64-bit mask result of all 1s or all 0s for the comparison to the low quadword of the destination operand. The immediate operand selects the compare condition as with the CMPPD instruction.

The COMISD (compare scalar double-precision floating-point values and set EFLAGS) and UCOMISD (unordered compare scalar double-precision floating-point values and set EFLAGS) instructions compare the low values of two packed double-precision floating-point operands and set the ZF, PF, and CF flags in the EFLAGS register to show the result (greater than, less than, equal, or unordered). These two instructions differ as follows: the COMISD instruction signals a floating-point invalid-operation (#I) exception when a source operand is either a QNaN or an SNaN; the UCOMISD instruction only signals an invalid-operation exception when a source operand is an SNaN.
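
The all-1s or all-0s masks returned by the packed compare are usually consumed by logical instructions rather than by branches. The C sketch below builds an element-wise selection from a less-than mask (`_mm_cmplt_pd`, one of the CMPPD predicates) and, separately, shows a scalar compare whose result comes back through EFLAGS (`_mm_comilt_sd`, corresponding to COMISD). The function names `select_smaller` and `scalar_less` are hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* For each element, keep a[i] where a[i] < b[i], otherwise keep b[i]. */
static void select_smaller(const double a[2], const double b[2], double out[2])
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    __m128d mask = _mm_cmplt_pd(va, vb);             /* all 1s where a < b */
    __m128d r = _mm_or_pd(_mm_and_pd(mask, va),      /* a where mask is set */
                          _mm_andnot_pd(mask, vb));  /* b everywhere else */
    _mm_storeu_pd(out, r);
}

/* Scalar compare of the low elements: returns 1 if a < b, otherwise 0. */
static int scalar_less(double a, double b)
{
    return _mm_comilt_sd(_mm_set_sd(a), _mm_set_sd(b));
}
```

For a = {1.0, 5.0} and b = {2.0, 3.0}, `select_smaller` produces {1.0, 3.0}. The mask-and-merge pattern avoids a data-dependent branch per element.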

11.4.1.5 SSE2 Shuffle and Unpack Instructions

SSE2 shuffle instructions shuffle the contents of two packed double-precision floating-point values and store the results in the destination operand.

The SHUFPD (shuffle packed double-precision floating-point values) instruction places either of the two packed double-precision floating-point values from the destination operand in the low quadword of the destination operand, and places either of the two packed double-precision floating-point values from source operand in the high quadword of the destination operand (see Figure 11-5). By using the same register for the source and destination operands, the SHUFPD instruction can swap two packed double-precision floating-point values.

[Figure 11-5. SHUFPD Instruction, Packed Shuffle Operation. DEST holds (X1, X0) and SRC holds (Y1, Y0); the result in DEST is (Y1 or Y0, X1 or X0).]

The UNPCKHPD (unpack and interleave high packed double-precision floating-point values) instruction performs an interleaved unpack of the high values from the source and destination operands and stores the result in the destination operand (see Figure 11-6).

The UNPCKLPD (unpack and interleave low packed double-precision floating-point values) instruction performs an interleaved unpack of the low values from the source and destination operands and stores the result in the destination operand (see Figure 11-7).

[Figure 11-6. UNPCKHPD Instruction, High Unpack and Interleave Operation. DEST holds (X1, X0) and SRC holds (Y1, Y0); the result in DEST is (Y1, X1).]


[Figure 11-7. UNPCKLPD Instruction, Low Unpack and Interleave Operation. DEST holds (X1, X0) and SRC holds (Y1, Y0); the result in DEST is (Y0, X0).]
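
The shuffle and unpack operations can be tried out with the `_mm_shuffle_pd`, `_mm_unpackhi_pd`, and `_mm_unpacklo_pd` intrinsics, which correspond to SHUFPD, UNPCKHPD, and UNPCKLPD. In the C sketch below, the function names are hypothetical, the first intrinsic argument plays the role of the destination operand (X), and the second plays the role of the source operand (Y); array index 0 is the low quadword.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Swap the two elements by shuffling a register with itself.
   Immediate 1: low result = high element of the first operand,
   high result = low element of the second operand. */
static void swap_halves(const double in[2], double out[2])
{
    __m128d x = _mm_loadu_pd(in);
    _mm_storeu_pd(out, _mm_shuffle_pd(x, x, 1));
}

/* High unpack: result is (Y1, X1), that is, out[0] = x[1], out[1] = y[1]. */
static void interleave_high(const double x[2], const double y[2], double out[2])
{
    _mm_storeu_pd(out, _mm_unpackhi_pd(_mm_loadu_pd(x), _mm_loadu_pd(y)));
}

/* Low unpack: result is (Y0, X0), that is, out[0] = x[0], out[1] = y[0]. */
static void interleave_low(const double x[2], const double y[2], double out[2])
{
    _mm_storeu_pd(out, _mm_unpacklo_pd(_mm_loadu_pd(x), _mm_loadu_pd(y)));
}
```

For x = {1.0, 2.0} and y = {3.0, 4.0}, `swap_halves(x)` gives {2.0, 1.0}, `interleave_high` gives {2.0, 4.0}, and `interleave_low` gives {1.0, 3.0}.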

11.4.1.6 SSE2 Conversion Instructions

SSE2 conversion instructions (see Figure 11-8) support packed and scalar conversions between:

• Double-precision and single-precision floating-point formats
• Double-precision floating-point and doubleword integer formats
• Single-precision floating-point and doubleword integer formats

Conversion between double-precision and single-precision floating-point values: The following instructions convert operands between double-precision and single-precision floating-point formats. The operands being operated on are contained in XMM registers or memory (at most, one operand can reside in memory; the destination is always an XMM register).

The CVTPS2PD (convert packed single-precision floating-point values to packed double-precision floating-point values) instruction converts two packed single-precision floating-point values to two double-precision floating-point values.

The CVTPD2PS (convert packed double-precision floating-point values to packed single-precision floating-point values) instruction converts two packed double-precision floating-point values to two single-precision floating-point values. When a conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register.

The CVTSS2SD (convert scalar single-precision floating-point value to scalar double-precision floating-point value) instruction converts a single-precision floating-point value to a double-precision floating-point value.

The CVTSD2SS (convert scalar double-precision floating-point value to scalar single-precision floating-point value) instruction converts a double-precision floating-point value to a single-precision floating-point value. When the conversion is inexact, the result is rounded according to the rounding mode selected in the MXCSR register.


[Figure 11-8. SSE and SSE2 Conversion Instructions. The figure shows the conversion instructions that connect these formats: Single-Precision Floating-Point (XMM/mem), Double-Precision Floating-Point (XMM/mem), 4 Doubleword Integer (XMM/mem), 2 Doubleword Integer (XMM/mem), 2 Doubleword Integer (MMX/mem), and Doubleword Integer (r32/mem).]

Conversion between double-precision floating-point values and doubleword integers: The following instructions convert operands between double-precision floating-point and doubleword integer formats. Operands are housed in XMM registers, MMX registers, general registers or memory (at most one operand can reside in memory; the destination is always an XMM, MMX, or general register).

The CVTPD2PI (convert packed double-precision floating-point values to packed doubleword integers) instruction converts two packed double-precision floating-point numbers to two packed signed doubleword integers, with the result stored in an MMX register. When rounding to an integer value, the source value is rounded according to the rounding mode in the MXCSR register.

The CVTTPD2PI (convert with truncation packed double-precision floating-point values to packed doubleword integers) instruction is similar to the CVTPD2PI instruction except that truncation is used to round a source value to an integer value (see Section 4.8.4.2, “Truncation with SSE and SSE2 Conversion Instructions”).

The CVTPI2PD (convert packed doubleword integers to packed double-precision floating-point values) instruction converts two packed signed doubleword integers to two double-precision floating-point values.


The CVTPD2DQ (convert packed double-precision floating-point values to packed doubleword integers) instruction converts two packed double-precision floating-point numbers to two packed signed doubleword integers, with the result stored in the low quadword of an XMM register. When rounding to an integer value, the source value is rounded according to the rounding mode selected in the MXCSR register.

The CVTTPD2DQ (convert with truncation packed double-precision floating-point values to packed doubleword integers) instruction is similar to the CVTPD2DQ instruction except that truncation is used to round a source value to an integer value (see Section 4.8.4.2, “Truncation with SSE and SSE2 Conversion Instructions”).

The CVTDQ2PD (convert packed doubleword integers to packed double-precision floating-point values) instruction converts two packed signed doubleword integers located in the low-order doublewords of an XMM register to two double-precision floating-point values.

The CVTSD2SI (convert scalar double-precision floating-point value to doubleword integer) instruction converts a double-precision floating-point value to a doubleword integer, and stores the result in a general-purpose register. When rounding to an integer value, the source value is rounded according to the rounding mode selected in the MXCSR register.

The CVTTSD2SI (convert with truncation scalar double-precision floating-point value to doubleword integer) instruction is similar to the CVTSD2SI instruction except that truncation is used to round the source value to an integer value (see Section 4.8.4.2, “Truncation with SSE and SSE2 Conversion Instructions”).
The CVTSI2SD (convert doubleword integer to scalar double-precision floating-point value) instruction converts a signed doubleword integer in a general-purpose register to a double-precision floating-point number, and stores the result in an XMM register.

Conversion between single-precision floating-point and doubleword integer formats: These instructions convert between packed single-precision floating-point and packed doubleword integer formats. Operands are housed in XMM registers, MMX registers, general registers, or memory (the latter for at most one source operand). The destination is always an XMM, MMX, or general register. These SSE2 instructions supplement conversion instructions (CVTPI2PS, CVTPS2PI, CVTTPS2PI, CVTSI2SS, CVTSS2SI, and CVTTSS2SI) introduced with SSE extensions.

The CVTPS2DQ (convert packed single-precision floating-point values to packed doubleword integers) instruction converts four packed single-precision floating-point values to four packed signed doubleword integers, with the source and destination operands in XMM registers or memory (the latter for at most one source operand). When the conversion is inexact, the value returned is rounded according to the rounding mode selected in the MXCSR register.

The CVTTPS2DQ (convert with truncation packed single-precision floating-point values to packed doubleword integers) instruction is similar to the CVTPS2DQ instruction except that truncation is used to round a source value to an integer value (see Section 4.8.4.2, “Truncation with SSE and SSE2 Conversion Instructions”).

The CVTDQ2PS (convert packed doubleword integers to packed single-precision floating-point values) instruction converts four packed signed doubleword integers to four packed single-precision floating-point numbers, with the source and destination operands in XMM registers or memory (the latter for at most one source operand). When the conversion is inexact, the value returned is rounded according to the rounding mode selected in the MXCSR register.
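The difference between a rounding conversion (such as CVTSD2SI) and its truncating counterpart (such as CVTTSD2SI) can be modeled in portable C. The sketch below is a software model, not the processor's implementation. It assumes the default round-to-nearest (even) mode in MXCSR, and it assumes that NaN and out-of-range inputs yield the doubleword integer indefinite value 80000000H (the masked response listed in Table 11-1). The function names are hypothetical.

```c
#include <stdint.h>

#define INTEGER_INDEFINITE INT32_MIN   /* 80000000H */

/* Model of CVTTSD2SI: truncate toward zero. */
int32_t model_cvttsd2si(double x)
{
    if (x != x || x >= 2147483648.0 || x <= -2147483649.0)
        return INTEGER_INDEFINITE;            /* NaN, infinity, or out of range */
    return (int32_t)x;                        /* C casts truncate toward zero */
}

/* Model of CVTSD2SI with the rounding mode set to round to nearest (even). */
int32_t model_cvtsd2si_nearest(double x)
{
    if (x != x || x >= 2147483647.5 || x < -2147483648.5)
        return INTEGER_INDEFINITE;            /* rounded result would not fit */
    int64_t t = (int64_t)x;                   /* integer part, toward zero */
    double frac = x - (double)t;              /* exact for values in this range */
    if (frac > 0.5 || (frac == 0.5 && (t & 1)))
        t += 1;                               /* round up; ties go to even */
    else if (frac < -0.5 || (frac == -0.5 && (t & 1)))
        t -= 1;                               /* round down; ties go to even */
    return (int32_t)t;
}
```

For example, 2.9 truncates to 2 but rounds to 3, and the tie 2.5 rounds to the even value 2 while 3.5 rounds to 4.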

11.4.2

SSE2 64-Bit and 128-Bit SIMD Integer Instructions

SSE2 extensions add several 128-bit packed integer instructions to the IA-32 architecture. Where appropriate, a 64-bit version of each of these instructions is also provided. The 128-bit versions of instructions operate on data in XMM registers; 64-bit versions operate on data in MMX registers. The instructions follow.

The MOVDQA (move aligned double quadword) instruction transfers a double quadword operand from memory to an XMM register or vice versa; or between XMM registers. The memory address must be aligned to a 16-byte boundary; otherwise, a general-protection exception (#GP) is generated.

The MOVDQU (move unaligned double quadword) instruction performs the same operations as the MOVDQA instruction, except that 16-byte alignment of a memory address is not required.

The PADDQ (packed quadword add) instruction adds two packed quadword integer operands or two single quadword integer operands, and stores the results in an XMM or MMX register, respectively. This instruction can operate on either unsigned or signed (two’s complement notation) integer operands.

The PSUBQ (packed quadword subtract) instruction subtracts two packed quadword integer operands or two single quadword integer operands, and stores the results in an XMM or MMX register, respectively. Like the PADDQ instruction, PSUBQ can operate on either unsigned or signed (two’s complement notation) integer operands.

The PMULUDQ (multiply packed unsigned doubleword integers) instruction performs an unsigned multiply of unsigned doubleword integers and returns a quadword result. Both 64-bit and 128-bit versions of this instruction are available. The 64-bit version operates on two doubleword integers stored in the low doubleword of each source operand, and the quadword result is returned to an MMX register. The 128-bit version performs a packed multiply of two pairs of doubleword integers. Here, the doublewords are packed in the first and third doublewords of the source operands, and the quadword results are stored in the low and high quadwords of an XMM register.

The PSHUFLW (shuffle packed low words) instruction shuffles the word integers packed into the low quadword of the source operand and stores the shuffled result in the low quadword of the destination operand. An 8-bit immediate operand specifies the shuffle order.

The PSHUFHW (shuffle packed high words) instruction shuffles the word integers packed into the high quadword of the source operand and stores the shuffled result in the high quadword of the destination operand. An 8-bit immediate operand specifies the shuffle order.
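The operand layout of the 128-bit PMULUDQ form can be sketched as a scalar model in C. This is an illustration of the data flow only; the function name and the array representation of an XMM register (element 0 is the lowest doubleword) are assumptions made for the example.

```c
#include <stdint.h>

/* Model of the 128-bit PMULUDQ: only the first and third doublewords of each
 * source take part; the second and fourth doublewords are ignored. */
void model_pmuludq128(const uint32_t a[4], const uint32_t b[4], uint64_t result[2])
{
    result[0] = (uint64_t)a[0] * b[0];   /* first doublewords -> low quadword  */
    result[1] = (uint64_t)a[2] * b[2];   /* third doublewords -> high quadword */
}
```

Because each product is widened to 64 bits before the multiply, the full result is kept; for instance FFFFFFFFH times FFFFFFFFH yields FFFFFFFE00000001H.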


The PSHUFD (shuffle packed doubleword integers) instruction shuffles the doubleword integers packed into the source operand and stores the shuffled result in the destination operand. An 8-bit immediate operand specifies the shuffle order.

The PSLLDQ (shift double quadword left logical) instruction shifts the contents of the source operand to the left by the number of bytes specified by an immediate operand. The empty low-order bytes are cleared (set to 0).

The PSRLDQ (shift double quadword right logical) instruction shifts the contents of the source operand to the right by the number of bytes specified by an immediate operand. The empty high-order bytes are cleared (set to 0).

The PUNPCKHQDQ (unpack high quadwords) instruction interleaves the high quadword of the source operand and the high quadword of the destination operand and writes them to the destination register.

The PUNPCKLQDQ (unpack low quadwords) instruction interleaves the low quadword of the source operand and the low quadword of the destination operand and writes them to the destination register.

Two additional SSE2 instructions enable data movement between the MMX registers and the XMM registers.

The MOVQ2DQ (move quadword integer from MMX to XMM registers) instruction moves the quadword integer from an MMX source register to an XMM destination register.

The MOVDQ2Q (move quadword integer from XMM to MMX registers) instruction moves the low quadword integer from an XMM source register to an MMX destination register.
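The shuffle and byte-shift operations can be pictured with scalar models in C. The sketch below assumes the usual encoding of the PSHUFD immediate (two selector bits per destination doubleword, lowest bits first) and assumes that a PSLLDQ count greater than 15 clears the whole operand; both details come from the instruction reference rather than from this section. The function names and the array representation (element 0 is least significant) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Model of PSHUFD: destination doubleword i receives the source doubleword
 * selected by bits [2i+1:2i] of the 8-bit immediate. */
void model_pshufd(const uint32_t src[4], uint8_t imm8, uint32_t dst[4])
{
    uint32_t tmp[4];
    for (int i = 0; i < 4; i++)
        tmp[i] = src[(imm8 >> (2 * i)) & 3];
    memcpy(dst, tmp, sizeof tmp);            /* temporary allows dst == src */
}

/* Model of PSLLDQ: shift a 16-byte value left by 'count' bytes and clear
 * the vacated low-order bytes. bytes[0] is the least significant byte. */
void model_pslldq(uint8_t bytes[16], unsigned count)
{
    if (count > 15) {                        /* everything shifted out */
        memset(bytes, 0, 16);
        return;
    }
    memmove(bytes + count, bytes, 16 - count);
    memset(bytes, 0, count);
}
```

With this encoding, an immediate of 1BH reverses the four doublewords and an immediate of 00H broadcasts the lowest doubleword to all four positions.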

11.4.3

128-Bit SIMD Integer Instruction Extensions

All of the 64-bit SIMD integer instructions introduced with MMX technology and SSE extensions (with the exception of the PSHUFW instruction) have been extended by SSE2 extensions to operate on 128-bit packed integer operands located in XMM registers. The 128-bit versions of these instructions follow the same SIMD conventions regarding packed operands as the 64-bit versions. For example, where the 64-bit version of the PADDB instruction operates on 8 packed bytes, the 128-bit version operates on 16 packed bytes.
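The PADDB example can be expressed as a scalar model in which the only difference between the 64-bit and 128-bit forms is the lane count. This is a sketch for illustration; the function name is hypothetical, and the wraparound (modulo 256) behavior of PADDB is taken from the instruction reference.

```c
#include <stddef.h>
#include <stdint.h>

/* Wraparound byte addition over 'lanes' packed bytes:
 * lanes = 8 models the MMX form of PADDB, lanes = 16 models the XMM form. */
void model_paddb(uint8_t *dst, const uint8_t *src, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        dst[i] = (uint8_t)(dst[i] + src[i]);   /* each lane wraps modulo 256 */
}
```

Calling the model with 8 lanes leaves bytes 8 through 15 of a 16-byte buffer untouched, whereas calling it with 16 lanes updates all of them.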

11.4.4

Cacheability Control and Memory Ordering Instructions

The SSE2 extensions that give programs more control over the caching, loading, and storing of data are described below.


11.4.4.1

FLUSH Cache Line

The CLFLUSH (flush cache line) instruction writes and invalidates the cache line associated with a specified linear address. The invalidation is for all levels of the processor’s cache hierarchy, and it is broadcast throughout the cache coherency domain.

NOTE: CLFLUSH was introduced with the SSE2 extensions. However, the instruction can be implemented in IA-32 processors that do not implement the SSE2 extensions. Detect CLFLUSH using the feature bit (if CPUID.01H:EDX.CLFSH[bit 19] = 1).
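The feature test in the note can be sketched as a bit check on the EDX value returned by CPUID leaf 01H. The sketch below takes that value as a parameter so that it stays portable; how the value is obtained (for example, through a compiler-specific CPUID helper) is left out on purpose. The function name is hypothetical, and the SSE2 bit position (bit 26) is included only to show that the two features are reported independently.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_01H_EDX_CLFSH (1u << 19)   /* CLFLUSH supported */
#define CPUID_01H_EDX_SSE2  (1u << 26)   /* SSE2 supported (separate feature bit) */

/* Returns true if the EDX value from CPUID.01H reports CLFLUSH support.
 * The SSE2 bit is deliberately not consulted. */
bool cpu_has_clflush(uint32_t cpuid_01h_edx)
{
    return (cpuid_01h_edx & CPUID_01H_EDX_CLFSH) != 0;
}
```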

11.4.4.2

Cacheability Control Instructions

The following four instructions enable data from XMM and general-purpose registers to be stored to memory using a non-temporal hint. The non-temporal hint directs the processor to store data to memory without writing the data into the cache hierarchy whenever this is possible. See Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data,” for more information about non-temporal stores and hints.

The MOVNTDQ (store double quadword using non-temporal hint) instruction stores packed integer data from an XMM register to memory, using a non-temporal hint.

The MOVNTPD (store packed double-precision floating-point values using non-temporal hint) instruction stores packed double-precision floating-point data from an XMM register to memory, using a non-temporal hint.

The MOVNTI (store doubleword using non-temporal hint) instruction stores integer data from a general-purpose register to memory, using a non-temporal hint.

The MASKMOVDQU (store selected bytes of double quadword) instruction stores selected byte integers from an XMM register to memory, using a byte mask to selectively write the individual bytes. The memory location does not need to be aligned on a natural boundary. This instruction also uses a non-temporal hint.

11.4.4.3

Memory Ordering Instructions

SSE2 extensions introduce two new fence instructions (LFENCE and MFENCE) as companions to the SFENCE instruction introduced with SSE extensions.

The LFENCE instruction establishes a memory fence for loads. It guarantees ordering between two loads and prevents speculative loads from passing the load fence (that is, no speculative loads are allowed until all loads specified before the load fence have been carried out).

The MFENCE instruction combines the functions of LFENCE and SFENCE by establishing a memory fence for both loads and stores. It guarantees that all loads and stores specified before the fence are globally observable prior to any loads or stores being carried out after the fence.


11.4.4.4

Pause

The PAUSE instruction is provided to improve the performance of “spin-wait loops” executed on a Pentium 4 or Intel Xeon processor. On a Pentium 4 processor, it also provides the added benefit of reducing processor power consumption while executing a spin-wait loop. It is recommended that a PAUSE instruction always be included in the code sequence for a spin-wait loop.
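A spin-wait loop that follows this recommendation might look like the C sketch below. It assumes a C11 compiler with <stdatomic.h> and, on x86 targets, the _mm_pause() intrinsic from <immintrin.h> as the way to emit PAUSE; on other targets the hint compiles to nothing. The cpu_relax macro, the function name, and the bounded spin count are hypothetical choices made for the example.

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()      /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)        /* no spin-wait hint on this target */
#endif

/* Spin until *flag becomes non-zero or max_spins iterations have elapsed.
 * Returns the number of iterations spent waiting. */
unsigned spin_until_set(atomic_int *flag, unsigned max_spins)
{
    unsigned spins = 0;
    while (!atomic_load_explicit(flag, memory_order_acquire) && spins < max_spins) {
        cpu_relax();                 /* PAUSE inside the loop body, as recommended */
        ++spins;
    }
    return spins;
}
```

The bound on the spin count is there only to keep the sketch from hanging if the flag is never set; a production loop would typically fall back to a blocking wait instead.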

11.4.5

Branch Hints

SSE2 extensions designate two instruction prefixes (2EH and 3EH) to provide branch hints to the processor (see “Instruction Prefixes” in Chapter 2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A). These prefixes can only be used with the Jcc instruction and only at the machine code level (that is, there are no mnemonics for the branch hints).

11.5

SSE, SSE2, AND SSE3 EXCEPTIONS

SSE/SSE2/SSE3 extensions generate two general types of exceptions:

• Non-numeric exceptions
• SIMD floating-point exceptions (see footnote 1)

SSE/SSE2/SSE3 instructions can generate the same type of memory-access and non-numeric exceptions as other IA-32 architecture instructions. Existing exception handlers can generally handle these exceptions without any code modification. See “Providing Non-Numeric Exception Handlers for Exceptions Generated by the SSE, SSE2 and SSE3 Instructions” in Chapter 12 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a list of the non-numeric exceptions that can be generated by SSE/SSE2/SSE3 instructions and for guidelines for handling these exceptions.

SSE/SSE2/SSE3 instructions do not generate numeric exceptions on packed integer operations; however, they can generate numeric (SIMD floating-point) exceptions on packed single-precision and double-precision floating-point operations. These SIMD floating-point exceptions are defined in the IEEE Standard 754 for Binary Floating-Point Arithmetic and are the same exceptions that are generated for x87 FPU instructions. See Section 11.5.1, “SIMD Floating-Point Exceptions,” for a description of these exceptions.

1. The FISTTP instruction in SSE3 does not generate SIMD floating-point exceptions, but it can generate x87 FPU floating-point exceptions.


11.5.1

SIMD Floating-Point Exceptions

SIMD floating-point exceptions are those exceptions that can be generated by SSE/SSE2/SSE3 instructions that operate on packed or scalar floating-point operands. Six classes of SIMD floating-point exceptions can be generated:

• Invalid operation (#I)
• Divide-by-zero (#Z)
• Denormal operand (#D)
• Numeric overflow (#O)
• Numeric underflow (#U)
• Inexact result (Precision) (#P)

All of these exceptions (except the denormal operand exception) are defined in IEEE Standard 754, and they are the same exceptions that are generated with the x87 floating-point instructions. Section 4.9, “Overview of Floating-Point Exceptions,” gives a detailed description of these exceptions and of how and when they are generated. The following sections discuss the implementation of these exceptions in SSE/SSE2/SSE3 extensions.

All SIMD floating-point exceptions are precise and occur as soon as the instruction completes execution.

Each of the six exception conditions has a corresponding flag (IE, DE, ZE, OE, UE, and PE) and mask bit (IM, DM, ZM, OM, UM, and PM) in the MXCSR register (see Figure 10-3). The mask bits can be set with the LDMXCSR or FXRSTOR instruction; the mask and flag bits can be read with the STMXCSR or FXSAVE instruction.

The OSXMMEXCPT flag (bit 10) of control register CR4 provides additional control over generation of SIMD floating-point exceptions by allowing the operating system to indicate whether or not it supports software exception handlers for SIMD floating-point exceptions. If an unmasked SIMD floating-point exception is generated and the OSXMMEXCPT flag is set, the processor invokes a software exception handler by generating a SIMD floating-point exception (#XF). If the OSXMMEXCPT bit is clear, the processor generates an invalid-opcode exception (#UD) on the first SSE or SSE2 instruction that detects a SIMD floating-point exception condition. See Section 11.6.2, “Checking for SSE/SSE2 Support.”
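The relationship between the flag bits and the mask bits can be sketched as plain bit arithmetic on an MXCSR value, such as one obtained with STMXCSR. The sketch assumes the MXCSR layout of Figure 10-3 (flags in bits 0 through 5, masks in bits 7 through 12) and the power-up value 1F80H, in which all six exceptions are masked. The macro and function names are hypothetical.

```c
#include <stdint.h>

/* Exception flag bits (bits 0-5 of MXCSR). */
#define MXCSR_IE (1u << 0)   /* invalid operation */
#define MXCSR_DE (1u << 1)   /* denormal operand  */
#define MXCSR_ZE (1u << 2)   /* divide-by-zero    */
#define MXCSR_OE (1u << 3)   /* numeric overflow  */
#define MXCSR_UE (1u << 4)   /* numeric underflow */
#define MXCSR_PE (1u << 5)   /* inexact result    */

#define MXCSR_FLAG_BITS  0x3Fu
#define MXCSR_MASK_SHIFT 7        /* IM..PM occupy bits 7-12, in the same order */
#define MXCSR_DEFAULT    0x1F80u  /* power-up value: all exceptions masked */

/* Returns the subset of flags set in 'mxcsr' whose mask bit is clear. */
uint32_t mxcsr_unmasked_flags(uint32_t mxcsr)
{
    uint32_t flags = mxcsr & MXCSR_FLAG_BITS;
    uint32_t masks = (mxcsr >> MXCSR_MASK_SHIFT) & MXCSR_FLAG_BITS;
    return flags & ~masks;
}
```

With the default value, any combination of flags yields zero because every mask bit is set; clearing ZM (bit 9) makes a set ZE flag show up in the result.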

11.5.2

SIMD Floating-Point Exception Conditions

The following sections describe the conditions that cause a SIMD floating-point exception to be generated and the masked response of the processor when these conditions are detected. See Section 4.9.2, “Floating-Point Exception Priority,” for a description of the rules for exception precedence when more than one floating-point exception condition is detected for an instruction.


11.5.2.1

Invalid Operation Exception (#I)

The floating-point invalid-operation exception (#I) occurs in response to an invalid arithmetic operand. The flag (IE) and mask (IM) bits for the invalid operation exception are bits 0 and 7, respectively, in the MXCSR register.

If the invalid-operation exception is masked, the processor returns a QNaN, QNaN floating-point indefinite, integer indefinite, one of the source operands to the destination operand, or it sets the EFLAGS, depending on the operation being performed. When a value is returned to the destination operand, it overwrites the destination register specified by the instruction. Table 11-1 lists the invalid-arithmetic operations that the processor detects for instructions and the masked responses to these operations.

Table 11-1. Masked Responses of SSE/SSE2/SSE3 Instructions to Invalid Arithmetic Operations

Condition: ADDPS, ADDSS, ADDPD, ADDSD, SUBPS, SUBSS, SUBPD, SUBSD, MULPS, MULSS, MULPD, MULSD, DIVPS, DIVSS, DIVPD, DIVSD, ADDSUBPD, ADDSUBPS, HADDPD, HADDPS, HSUBPD or HSUBPS instruction with an SNaN operand
Masked Response: Return the SNaN converted to a QNaN; refer to Table 4-7 for more details

Condition: SQRTPS, SQRTSS, SQRTPD, or SQRTSD with SNaN operands
Masked Response: Return the SNaN converted to a QNaN

Condition: SQRTPS, SQRTSS, SQRTPD, or SQRTSD with negative operands (except zero)
Masked Response: Return the QNaN floating-point Indefinite

Condition: MAXPS, MAXSS, MAXPD, MAXSD, MINPS, MINSS, MINPD, or MINSD instruction with QNaN or SNaN operands
Masked Response: Return the source 2 operand value

Condition: CMPPS, CMPSS, CMPPD or CMPSD instruction with QNaN or SNaN operands
Masked Response: Return a mask of all 0s (except for the predicates “not-equal,” “unordered,” “not-less-than,” or “not-less-than-or-equal,” which return a mask of all 1s)

Condition: CVTPD2PS, CVTSD2SS, CVTPS2PD, CVTSS2SD with SNaN operands
Masked Response: Return the SNaN converted to a QNaN

Condition: COMISS or COMISD with QNaN or SNaN operand(s)
Masked Response: Set EFLAGS values to “not comparable”

Condition: Addition of opposite signed infinities or subtraction of like-signed infinities
Masked Response: Return the QNaN floating-point Indefinite

Condition: Multiplication of infinity by zero
Masked Response: Return the QNaN floating-point Indefinite

Condition: Divide of (0/0) or (∞/∞)
Masked Response: Return the QNaN floating-point Indefinite

Condition: Conversion to integer when the value in the source register is a NaN, ∞, or exceeds the representable range for CVTPS2PI, CVTTPS2PI, CVTSS2SI, CVTTSS2SI, CVTPD2PI, CVTSD2SI, CVTPD2DQ, CVTTPD2PI, CVTTSD2SI, CVTTPD2DQ, CVTPS2DQ, or CVTTPS2DQ
Masked Response: Return the integer Indefinite

If the invalid operation exception is not masked, a software exception handler is invoked and the operands remain unchanged. See Section 11.5.4, “Handling SIMD Floating-Point Exceptions in Software.”

Normally, when one or more of the source operands are QNaNs (and neither is an SNaN or in an unsupported format), an invalid-operation exception is not generated. The following instructions are exceptions to this rule: the COMISS and COMISD instructions; and the CMPPS, CMPSS, CMPPD, and CMPSD instructions (when the predicate is less than, less-than or equal, not less-than, or not less-than or equal). With these instructions, a QNaN source operand will generate an invalid-operation exception.

The invalid-operation exception is not affected by the flush-to-zero mode or by the denormals-are-zeros mode.
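The masked responses of the compare instructions to NaN operands, and the set of predicates that signal #I even for a QNaN, can be summarized in a scalar model of CMPSD. The sketch below assumes the usual predicate encoding of the immediate operand (0 = equal through 7 = ordered), which is defined in the instruction reference rather than in this section. The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding of the CMPSD/CMPPD comparison predicate (imm8 bits 2:0). */
enum cmp_pred {
    CMP_EQ = 0, CMP_LT = 1, CMP_LE = 2, CMP_UNORD = 3,
    CMP_NEQ = 4, CMP_NLT = 5, CMP_NLE = 6, CMP_ORD = 7
};

/* Model of the CMPSD result: all 1s if the predicate holds, all 0s otherwise.
 * A NaN operand makes the pair "unordered". */
uint64_t model_cmpsd_mask(double a, double b, enum cmp_pred pred)
{
    bool unordered = (a != a) || (b != b);
    bool holds;
    switch (pred) {
    case CMP_EQ:    holds = !unordered && a == b;    break;
    case CMP_LT:    holds = !unordered && a < b;     break;
    case CMP_LE:    holds = !unordered && a <= b;    break;
    case CMP_UNORD: holds = unordered;               break;
    case CMP_NEQ:   holds = unordered || a != b;     break;
    case CMP_NLT:   holds = unordered || !(a < b);   break;
    case CMP_NLE:   holds = unordered || !(a <= b);  break;
    default:        holds = !unordered;              break;  /* CMP_ORD */
    }
    return holds ? UINT64_MAX : 0;
}

/* Predicates for which a QNaN operand (not only an SNaN) signals #I. */
bool pred_signals_invalid_on_qnan(enum cmp_pred pred)
{
    return pred == CMP_LT || pred == CMP_LE || pred == CMP_NLT || pred == CMP_NLE;
}
```

With a NaN operand the model returns all 1s for the not-equal, unordered, not-less-than, and not-less-than-or-equal predicates and all 0s for the others, matching the masked response in Table 11-1.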

11.5.2.2

Denormal-Operand Exception (#D)

The processor signals the denormal-operand exception if an arithmetic instruction attempts to operate on a denormal operand. The flag (DE) and mask (DM) bits for the denormal-operand exception are bits 1 and 8, respectively, in the MXCSR register.

The CVTPI2PD, CVTPD2PI, CVTTPD2PI, CVTDQ2PD, CVTPD2DQ, CVTTPD2DQ, CVTSI2SD, CVTSD2SI, CVTTSD2SI, CVTPI2PS, CVTPS2PI, CVTTPS2PI, CVTSS2SI, CVTTSS2SI, CVTSI2SS, CVTDQ2PS, CVTPS2DQ, and CVTTPS2DQ conversion instructions do not signal denormal exceptions. The RCPSS, RCPPS, RSQRTSS, and RSQRTPS instructions do not signal any kind of floating-point exception.

The denormals-are-zero flag (bit 6) of the MXCSR register provides an additional option for handling denormal-operand exceptions. When this flag is set, denormal source operands are automatically converted to zeros with the sign of the source operand (see Section 10.2.3.4, “Denormals-Are-Zeros”). The denormal operand exception is not affected by the flush-to-zero mode.

See Section 4.9.1.2, “Denormal Operand Exception (#D),” for more information about the denormal exception. See Section 11.5.4, “Handling SIMD Floating-Point Exceptions in Software,” for information on handling unmasked exceptions.
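The effect of the denormals-are-zero flag on a source operand can be modeled on the raw bit pattern of a single-precision value. The sketch assumes the IEEE 754 single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits); the function name is hypothetical, and the model covers only the operand substitution, not the rest of the instruction.

```c
#include <stdint.h>

/* Model of denormals-are-zero for one single-precision source operand:
 * a denormal (exponent field 0, fraction non-zero) is replaced by a zero
 * that keeps the operand's sign; every other value passes through. */
uint32_t model_daz_single(uint32_t bits)
{
    uint32_t exponent = bits & 0x7F800000u;
    uint32_t fraction = bits & 0x007FFFFFu;
    if (exponent == 0 && fraction != 0)
        return bits & 0x80000000u;   /* keep only the sign bit */
    return bits;
}
```

For instance, the smallest positive denormal (00000001H) becomes +0, its negative counterpart (80000001H) becomes -0, and the smallest normalized value (00800000H) is left unchanged.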


11.5.2.3

Divide-By-Zero Exception (#Z)

The processor reports a divide-by-zero exception when a DIVPS, DIVSS, DIVPD or DIVSD instruction attempts to divide a finite non-zero operand by 0. The flag (ZE) and mask (ZM) bits for the divide-by-zero exception are bits 2 and 9, respectively, in the MXCSR register.

See Section 4.9.1.3, “Divide-By-Zero Exception (#Z),” for more information about the divide-by-zero exception. See Section 11.5.4, “Handling SIMD Floating-Point Exceptions in Software,” for information on handling unmasked exceptions.

The divide-by-zero exception is not affected by the flush-to-zero mode or by the denormals-are-zeros mode.

11.5.2.4

Numeric Overflow Exception (#O)

The processor reports a numeric overflow exception whenever the rounded result of an arithmetic instruction exceeds the largest allowable finite value that fits in the destination operand. This exception can be generated with the ADDPS, ADDSS, ADDPD, ADDSD, SUBPS, SUBSS, SUBPD, SUBSD, MULPS, MULSS, MULPD, MULSD, DIVPS, DIVSS, DIVPD, DIVSD, CVTPD2PS, CVTSD2SS, ADDSUBPD, ADDSUBPS, HADDPD, HADDPS, HSUBPD and HSUBPS instructions. The flag (OE) and mask (OM) bits for the numeric overflow exception are bits 3 and 10, respectively, in the MXCSR register.

See Section 4.9.1.4, “Numeric Overflow Exception (#O),” for more information about the numeric-overflow exception. See Section 11.5.4, “Handling SIMD Floating-Point Exceptions in Software,” for information on handling unmasked exceptions.

The numeric overflow exception is not affected by the flush-to-zero mode or by the denormals-are-zeros mode.

11.5.2.5

Numeric Underflow Exception (#U)

The processor reports a numeric underflow exception whenever the rounded result of an arithmetic instruction is less than the smallest possible normalized, finite value that will fit in the destination operand and the numeric-underflow exception is not masked. If the numeric underflow exception is masked, both underflow and the inexact-result condition must be detected before numeric underflow is reported. This exception can be generated with the ADDPS, ADDSS, ADDPD, ADDSD, SUBPS, SUBSS, SUBPD, SUBSD, MULPS, MULSS, MULPD, MULSD, DIVPS, DIVSS, DIVPD, DIVSD, CVTPD2PS, CVTSD2SS, ADDSUBPD, ADDSUBPS, HADDPD, HADDPS, HSUBPD, and HSUBPS instructions. The flag (UE) and mask (UM) bits for the numeric underflow exception are bits 4 and 11, respectively, in the MXCSR register.

The flush-to-zero flag (bit 15) of the MXCSR register provides an additional option for handling numeric underflow exceptions. When this flag is set and the numeric underflow exception is masked, tiny results (results that trigger the underflow exception) are returned as a zero with the sign of the true result (see Section 10.2.3.3, “Flush-To-Zero”). The numeric underflow exception is not affected by the denormals-are-zero mode.

See Section 4.9.1.5, “Numeric Underflow Exception (#U),” for more information about the numeric underflow exception. See Section 11.5.4, “Handling SIMD Floating-Point Exceptions in Software,” for information on handling unmasked exceptions.

11.5.2.6

Inexact-Result (Precision) Exception (#P)

The inexact-result exception (also called the precision exception) occurs if the result of an operation is not exactly representable in the destination format. For example, the fraction 1/3 cannot be precisely represented in binary form. This exception occurs frequently and indicates that some (normally acceptable) accuracy has been lost. The exception is supported for applications that need to perform exact arithmetic only. Because the rounded result is generally satisfactory for most applications, this exception is commonly masked.

The flag (PE) and mask (PM) bits for the inexact-result exception are bits 5 and 12, respectively, in the MXCSR register.

See Section 4.9.1.6, “Inexact-Result (Precision) Exception (#P),” for more information about the inexact-result exception. See Section 11.5.4, “Handling SIMD Floating-Point Exceptions in Software,” for information on handling unmasked exceptions.

In flush-to-zero mode, the inexact result exception is reported. The inexact result exception is not affected by the denormals-are-zero mode.

11.5.3

Generating SIMD Floating-Point Exceptions

When the processor executes a packed or scalar floating-point instruction, it looks for and reports on SIMD floating-point exception conditions using two sequential steps:

1. Looks for, reports on, and handles pre-computation exception conditions (invalid-operand, divide-by-zero, and denormal operand)
2. Looks for, reports on, and handles post-computation exception conditions (numeric overflow, numeric underflow, and inexact result)

If both pre- and post-computational exceptions are unmasked, it is possible for the processor to generate a SIMD floating-point exception (#XF) twice during the execution of an SSE, SSE2 or SSE3 instruction: once when it detects and handles a pre-computational exception and once when it detects a post-computational exception.

11.5.3.1

Handling Masked Exceptions

If all exceptions are masked, the processor handles the exceptions it detects by placing the masked result (or results for packed operands) in a destination operand and continuing program execution. The masked result may be a rounded normalized value, signed infinity, a denormal finite number, zero, a QNaN floating-point indefinite, or a QNaN depending on the exception condition detected. In most cases, the corresponding exception flag bit in MXCSR is also set. The one situation where an exception flag is not set is when an underflow condition is detected and it is not accompanied by an inexact result.

When operating on packed floating-point operands, the processor returns a masked result for each of the sub-operand computations and sets a separate set of internal exception flags for each computation. It then performs a logical-OR on the internal exception flag settings and sets the exception flags in the MXCSR register according to the results of OR operations.

For example, Figure 11-9 shows the results of an MULPS instruction. In the example, all SIMD floating-point exceptions are masked. Assume that a denormal exception condition is detected prior to the multiplication of sub-operands X0 and Y0, no exception condition is detected for the multiplication of X1 and Y1, a numeric overflow exception condition is detected for the multiplication of X2 and Y2, and another denormal exception is detected prior to the multiplication of sub-operands X3 and Y3. Because denormal exceptions are masked, the processor uses the denormal source values in the multiplications of (X0 and Y0) and of (X3 and Y3), passing the results of the multiplications through to the destination operand. With the denormal operand, the result of the X0 and Y0 computation is a normalized finite value, with no exceptions detected. However, the X3 and Y3 computation produces a tiny and inexact result. This causes the corresponding internal numeric underflow and inexact-result exception flags to be set.

[Figure 11-9 diagram not reproduced: four MULPS sub-operand multiplications. X3 × Y3 (denormal) yields a tiny, inexact, finite result; X2 × Y2 yields ∞; X1 × Y1 and X0 (denormal) × Y0 each yield a normalized finite result.]

Figure 11-9. Example Masked Response for Packed Operations

For the multiplication of X2 and Y2, the processor stores the floating-point ∞ in the destination operand, and sets the corresponding internal sub-operand numeric overflow flag. The result of the X1 and Y1 multiplication is passed through to the destination operand, with no internal sub-operand exception flags being set. Following the computations, the individual sub-operand exception flags for denormal operand, numeric underflow, inexact result, and numeric overflow are OR’d and the corresponding flags are set in the MXCSR register. The net result of this computation is that:

• Multiplication of X0 and Y0 produces a normalized finite result
• Multiplication of X1 and Y1 produces a normalized finite result
• Multiplication of X2 and Y2 produces a floating-point ∞ result
• Multiplication of X3 and Y3 produces a tiny, inexact, finite result
• Denormal operand, numeric underflow, numeric overflow, and inexact result flags are set in the MXCSR register
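The OR step described above can be sketched in portable C. This is only an illustration that works on plain integers: the flag bit positions are the MXCSR exception flag bits 0 through 5, while the helper names `merge_lane_flags` and `figure_11_9_flags` are hypothetical and the per-lane flag sets are taken from the Figure 11-9 discussion.

```c
#include <stdint.h>

/* MXCSR exception flag bits (bits 0 through 5). */
enum {
    MXCSR_IE = 1u << 0, /* invalid operation */
    MXCSR_DE = 1u << 1, /* denormal operand  */
    MXCSR_ZE = 1u << 2, /* divide-by-zero    */
    MXCSR_OE = 1u << 3, /* numeric overflow  */
    MXCSR_UE = 1u << 4, /* numeric underflow */
    MXCSR_PE = 1u << 5  /* inexact result    */
};

/* Hypothetical helper: OR the internal flag sets of all lanes together,
 * as the processor does before updating the MXCSR flags. */
static uint32_t merge_lane_flags(const uint32_t *lane_flags, int lanes)
{
    uint32_t merged = 0;
    for (int i = 0; i < lanes; i++)
        merged |= lane_flags[i];
    return merged;
}

/* Per-lane flags as described for Figure 11-9 (index = sub-operand). */
static uint32_t figure_11_9_flags(void)
{
    const uint32_t lanes[4] = {
        MXCSR_DE,                      /* X0 * Y0: denormal source        */
        0,                             /* X1 * Y1: no exception           */
        MXCSR_OE,                      /* X2 * Y2: numeric overflow       */
        MXCSR_DE | MXCSR_UE | MXCSR_PE /* X3 * Y3: denormal, tiny, inexact */
    };
    return merge_lane_flags(lanes, 4);
}
```

The merged value has the denormal, overflow, underflow, and inexact bits set, which matches the last item of the list above.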

11.5.3.2 Handling Unmasked Exceptions

If all exceptions are unmasked, the processor:

1. First detects any pre-computation exceptions: it ORs those exceptions, sets the appropriate exception flags, leaves the source and destination operands unaltered, and goes to step 2. If it does not detect any pre-computation exceptions, it goes to step 5.
2. Checks CR4.OSXMMEXCPT[bit 10]. If this flag is set, the processor goes to step 3; if the flag is clear, it generates an invalid-opcode exception (#UD) and makes an implicit call to the invalid-opcode exception handler.
3. Generates a SIMD floating-point exception (#XF) and makes an implicit call to the SIMD floating-point exception handler.
4. If the exception handler is able to fix the source operands that generated the pre-computation exceptions or mask the condition in such a way as to allow the processor to continue executing the instruction, the processor resumes instruction execution as described in step 5.
5. Upon returning from the exception handler (or if no pre-computation exceptions were detected), the processor checks for post-computation exceptions. If the processor detects any post-computation exceptions: it ORs those exceptions, sets the appropriate exception flags, leaves the source and destination operands unaltered, and repeats steps 2, 3, and 4.
6. Upon returning from the exception handler in step 4 (or if no post-computation exceptions were detected), the processor completes the execution of the instruction.

The implication of this procedure is that for unmasked exceptions, the processor can generate a SIMD floating-point exception (#XF) twice: once if it detects pre-computation exception conditions and a second time if it detects post-computation exception conditions.

For example, if SIMD floating-point exceptions are unmasked for the computation shown in Figure 11-9, the processor would generate one SIMD floating-point exception for denormal operand conditions and a second SIMD floating-point exception for overflow and underflow (no inexact result exception would be generated because the multiplications of X0 and Y0 and of X1 and Y1 are exact).
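Steps 2 and 3 amount to a small decision on register values, which the following C sketch models. It is an illustration only: the function name `simd_fp_fault_vector` and the `VEC_NONE` sentinel are hypothetical, and the vector numbers 6 (#UD) and 19 (SIMD floating-point exception) are stated here from the IA-32 vector assignments rather than from this section.

```c
#include <stdint.h>

#define MXCSR_FLAG_BITS  0x003Fu      /* exception flags, bits 0 through 5 */
#define MXCSR_MASK_SHIFT 7            /* mask bits occupy bits 7 through 12 */
#define CR4_OSXMMEXCPT   (1u << 10)   /* OS supports SIMD FP exceptions     */

enum { VEC_NONE = -1, VEC_UD = 6, VEC_XF = 19 };

/* Hypothetical model: given the exception conditions detected for an
 * instruction (same bit layout as the MXCSR flags), return the vector
 * that would be raised, or VEC_NONE if every detected condition is masked. */
static int simd_fp_fault_vector(uint32_t detected, uint32_t mxcsr, uint32_t cr4)
{
    uint32_t masks = (mxcsr >> MXCSR_MASK_SHIFT) & MXCSR_FLAG_BITS;
    uint32_t unmasked = detected & MXCSR_FLAG_BITS & ~masks;

    if (unmasked == 0)
        return VEC_NONE;                       /* masked response instead  */
    return (cr4 & CR4_OSXMMEXCPT) ? VEC_XF     /* step 3                   */
                                  : VEC_UD;    /* step 2, flag clear       */
}
```

The sketch is evaluated once per detection phase, so for an instruction with both pre-computation and post-computation conditions it would be consulted twice, mirroring the two possible #XF deliveries.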

11.5.3.3 Handling Combinations of Masked and Unmasked Exceptions

In situations where both masked and unmasked exceptions are detected, the processor will set exception flags for the masked and the unmasked exceptions. However, it will not return masked results until after the processor has detected and handled unmasked post-computation exceptions and returned from the exception handler (as in step 6 above) to finish executing the instruction.

11.5.4 Handling SIMD Floating-Point Exceptions in Software

Section 4.9.3, “Typical Actions of a Floating-Point Exception Handler,” shows actions that may be carried out by a SIMD floating-point exception handler. The SSE/SSE2/SSE3 state is saved with the FXSAVE instruction (see Section 11.6.5, “Saving and Restoring the SSE/SSE2 State”).

11.5.5 Interaction of SIMD and x87 FPU Floating-Point Exceptions

SIMD floating-point exceptions are generated independently from x87 FPU floating-point exceptions. SIMD floating-point exceptions do not cause assertion of the FERR# pin (independent of the value of CR0.NE[bit 5]). They ignore the assertion and deassertion of the IGNNE# pin.

If applications use SSE/SSE2/SSE3 instructions along with x87 FPU instructions (in the same task or program), consider the following:

• SIMD floating-point exceptions are reported independently from the x87 FPU floating-point exceptions. SIMD and x87 FPU floating-point exceptions can be unmasked independently. Separate x87 FPU and SIMD floating-point exception handlers must be provided if the same exception is unmasked for x87 FPU and for SSE/SSE2/SSE3 operations.
• The rounding mode specified in the MXCSR register does not affect x87 FPU instructions. Likewise, the rounding mode specified in the x87 FPU control word does not affect the SSE/SSE2/SSE3 instructions. To use the same rounding mode, the rounding control bits in the MXCSR register and in the x87 FPU control word must be set explicitly to the same value.
• The flush-to-zero mode set in the MXCSR register for SSE/SSE2/SSE3 instructions has no counterpart in the x87 FPU. For compatibility with the x87 FPU, set the flush-to-zero bit to 0.
• The denormals-are-zeros mode set in the MXCSR register for SSE/SSE2/SSE3 instructions has no counterpart in the x87 FPU. For compatibility with the x87 FPU, set the denormals-are-zeros bit to 0.
• An application that expects to detect x87 FPU exceptions that occur during the execution of x87 FPU instructions will not be notified if exceptions occur during the execution of corresponding SSE/SSE2/SSE3 instructions (see footnote 1), unless the exception masks that are enabled in the x87 FPU control word have also been enabled in the MXCSR register and the application is capable of handling SIMD floating-point exceptions (#XF).
  — Masked exceptions that occur during an SSE/SSE2/SSE3 library call cannot be detected by unmasking the exceptions after the call (in an attempt to generate the fault based on the fact that an exception flag is set). A SIMD floating-point exception flag that is set when the corresponding exception is unmasked will not generate a fault; only the next occurrence of that unmasked exception will generate a fault.
  — An application which checks the x87 FPU status word to determine if any masked exception flags were set during an x87 FPU library call will also need to check the MXCSR register to detect a similar occurrence of a masked exception flag being set during an SSE/SSE2/SSE3 library call.
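Keeping the two rounding controls in step, as the rounding-mode item above requires, comes down to copying a two-bit field. The sketch below assumes the usual field positions (rounding control in bits 10 and 11 of the x87 FPU control word, bits 13 and 14 of MXCSR) and works on plain integer copies of the registers; the function name `mxcsr_with_x87_rounding` is hypothetical.

```c
#include <stdint.h>

#define X87_CW_RC_SHIFT 10   /* RC field: bits 10-11 of the x87 control word */
#define MXCSR_RC_SHIFT  13   /* RC field: bits 13-14 of MXCSR                */
#define RC_FIELD        0x3u

/* Hypothetical helper: return an MXCSR value whose rounding control
 * matches the rounding control of the given x87 FPU control word. */
static uint32_t mxcsr_with_x87_rounding(uint32_t mxcsr, uint16_t x87_cw)
{
    uint32_t rc = ((uint32_t)x87_cw >> X87_CW_RC_SHIFT) & RC_FIELD;

    mxcsr &= ~(RC_FIELD << MXCSR_RC_SHIFT);   /* clear the old MXCSR RC field */
    return mxcsr | (rc << MXCSR_RC_SHIFT);    /* install the x87 setting      */
}
```

In real code the inputs would come from instructions such as FNSTCW and STMXCSR, and the result would be written back with LDMXCSR.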

11.6 WRITING APPLICATIONS WITH SSE/SSE2 EXTENSIONS

The following sections give some guidelines for writing application programs and operating-system code that uses the SSE and SSE2 extensions. Because SSE and SSE2 extensions share the same state and perform companion operations, these guidelines apply to both sets of extensions.

Chapter 12 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, discusses the interface to the processor for context switching as well as other operating system considerations when writing code that uses SSE/SSE2/SSE3 extensions.

11.6.1 General Guidelines for Using SSE/SSE2 Extensions

The following guidelines describe how to take full advantage of the performance gains available with the SSE and SSE2 extensions:

• Ensure that the processor supports the SSE and SSE2 extensions.
• Ensure that your operating system supports the SSE and SSE2 extensions. (Operating system support for the SSE extensions implies support for the SSE2 extensions and vice versa.)
• Use stack and data alignment techniques to keep data properly aligned for efficient memory use.
• Use the non-temporal store instructions offered with the SSE and SSE2 extensions.
• Employ the optimization and scheduling techniques described in the Intel Pentium 4 Optimization Reference Manual (see Section 1.4, “Related Literature,” for the order number for this manual).

Footnote 1. SSE3 refers to ADDSUBPD, ADDSUBPS, HADDPD, HADDPS, HSUBPD and HSUBPS; the only other SSE3 instruction that can raise floating-point exceptions is FISTTP: it can generate x87 FPU invalid operation and inexact result exceptions.

11.6.2 Checking for SSE/SSE2 Support

Before an application attempts to use the SSE and/or SSE2 extensions, it should check that they are present on the processor and that the operating system supports them. The application can make this check by following these steps:

1. Check that the processor supports the CPUID instruction by attempting to execute the CPUID instruction. If the processor does not support the CPUID instruction, it will generate an invalid-opcode exception (#UD).
2. Check that the processor supports the SSE and/or SSE2 extensions (true if CPUID.01H:EDX.SSE[bit 25] = 1 and/or CPUID.01H:EDX.SSE2[bit 26] = 1).
3. Check that the processor supports the FXSAVE and FXRSTOR instructions (true if CPUID.01H:EDX.FXSR[bit 24] = 1).
4. Check that the operating system supports the FXSAVE and FXRSTOR instructions (execute a MOV instruction, true if CR4.OSFXSR[bit 9] = 1).
5. Check that the operating system supports SIMD floating-point exception handling (execute a MOV instruction, true if CR4.OSXMMEXCPT[bit 10] = 1).

NOTE
CR4.OSFXSR[bit 9] and CR4.OSXMMEXCPT[bit 10] must be set by the operating system. The processor has no other way of detecting operating-system support for the FXSAVE and FXRSTOR instructions or for handling SIMD floating-point exceptions.

6. Check that emulation of the x87 FPU is disabled (execute a MOV instruction, true if CR0.EM[bit 2] = 0).

If the processor attempts to execute an unsupported SSE or SSE2 instruction, the processor will generate an invalid-opcode exception (#UD).
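Steps 2 through 6 reduce to bit tests on three register values. The following C sketch takes those values as parameters rather than reading them, since how they are obtained is outside its scope (MOV from a control register is a privileged operation, so the CR0 and CR4 values would normally be supplied by system software). The function name `sse2_usable` is hypothetical, and the sketch requires both SSE and SSE2.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID1_EDX_FXSR (1u << 24)  /* CPUID.01H:EDX.FXSR */
#define CPUID1_EDX_SSE  (1u << 25)  /* CPUID.01H:EDX.SSE  */
#define CPUID1_EDX_SSE2 (1u << 26)  /* CPUID.01H:EDX.SSE2 */
#define CR4_OSFXSR      (1u << 9)
#define CR4_OSXMMEXCPT  (1u << 10)
#define CR0_EM          (1u << 2)

/* Hypothetical helper: true when every check in steps 2-6 passes. */
static bool sse2_usable(uint32_t cpuid1_edx, uint32_t cr4, uint32_t cr0)
{
    const uint32_t cpu_bits = CPUID1_EDX_FXSR | CPUID1_EDX_SSE | CPUID1_EDX_SSE2;
    const uint32_t os_bits  = CR4_OSFXSR | CR4_OSXMMEXCPT;

    return (cpuid1_edx & cpu_bits) == cpu_bits   /* steps 2 and 3 */
        && (cr4 & os_bits) == os_bits            /* steps 4 and 5 */
        && (cr0 & CR0_EM) == 0;                  /* step 6        */
}
```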

11.6.3 Checking for the DAZ Flag in the MXCSR Register

The denormals-are-zero flag in the MXCSR register is available in most of the Pentium 4 processors and in the Intel Xeon processor, with the exception of some early steppings. To check for the presence of the DAZ flag in the MXCSR register, do the following:

1. Establish a 512-byte FXSAVE area in memory.
2. Clear the FXSAVE area to all 0s.
3. Execute the FXSAVE instruction, using the address of the first byte of the cleared FXSAVE area as the destination operand. See “FXSAVE—Save x87 FPU, MMX, SSE, and SSE2 State” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for a description of the FXSAVE instruction and the layout of the FXSAVE image.
4. Check the value in the MXCSR_MASK field in the FXSAVE image (bytes 28 through 31).
  — If the value of the MXCSR_MASK field is 00000000H, the DAZ flag and denormals-are-zero mode are not supported.
  — If the value of the MXCSR_MASK field is non-zero and bit 6 is set, the DAZ flag and denormals-are-zero mode are supported.

If the DAZ flag is not supported, then it is a reserved bit and attempting to write a 1 to it will cause a general-protection exception (#GP). See Section 11.6.6, “Guidelines for Writing to the MXCSR Register,” for general guidelines for preventing general-protection exceptions when writing to the MXCSR register.
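Step 4 can be sketched in C as a read of four bytes from the saved image. The sketch assumes the image has already been filled in by FXSAVE (or left zeroed) and is passed in as a byte array; the helper names `read_mxcsr_mask` and `daz_supported` are hypothetical. The field is assembled byte by byte in little-endian order so that the code does not depend on alignment.

```c
#include <stdbool.h>
#include <stdint.h>

#define FXSAVE_AREA_SIZE  512
#define MXCSR_MASK_OFFSET 28          /* bytes 28 through 31 of the image */
#define MXCSR_DAZ         (1u << 6)   /* denormals-are-zero flag          */

/* Hypothetical helper: read the MXCSR_MASK field from an FXSAVE image. */
static uint32_t read_mxcsr_mask(const uint8_t image[FXSAVE_AREA_SIZE])
{
    return (uint32_t)image[MXCSR_MASK_OFFSET]
         | (uint32_t)image[MXCSR_MASK_OFFSET + 1] << 8
         | (uint32_t)image[MXCSR_MASK_OFFSET + 2] << 16
         | (uint32_t)image[MXCSR_MASK_OFFSET + 3] << 24;
}

/* A zero field has bit 6 clear, so both "not supported" cases of step 4
 * fall out of the same test. */
static bool daz_supported(const uint8_t image[FXSAVE_AREA_SIZE])
{
    return (read_mxcsr_mask(image) & MXCSR_DAZ) != 0;
}
```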

11.6.4 Initialization of SSE/SSE2 Extensions

The SSE and SSE2 state is contained in the XMM and MXCSR registers. Upon a hardware reset of the processor, this state is initialized as follows (see Table 11-2):

• All SIMD floating-point exceptions are masked (bits 7 through 12 of the MXCSR register are set to 1).
• All SIMD floating-point exception flags are cleared (bits 0 through 5 of the MXCSR register are set to 0).
• The rounding control is set to round-nearest (bits 13 and 14 of the MXCSR register are set to 00B).
• The flush-to-zero mode is disabled (bit 15 of the MXCSR register is set to 0).
• The denormals-are-zeros mode is disabled (bit 6 of the MXCSR register is set to 0). If the denormals-are-zeros mode is not supported, this bit is reserved and will be set to 0 on initialization.
• Each of the XMM registers is cleared (set to all zeros).

Table 11-2. SSE and SSE2 State Following a Power-up/Reset or INIT

Registers              Power-Up or Reset    INIT
XMM0 through XMM7      +0.0                 Unchanged
MXCSR                  1F80H                Unchanged

If the processor is reset by asserting the INIT# pin, the SSE and SSE2 state is not changed.
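As a cross-check, the reset value 1F80H in Table 11-2 can be rebuilt from the field settings listed above. The C sketch below does so with hypothetical macro and function names; only the bit positions come from the list.

```c
#include <stdint.h>

#define MXCSR_FLAGS (0x3Fu << 0)   /* exception flags, bits 0 through 5  */
#define MXCSR_DAZ   (1u << 6)      /* denormals-are-zeros                */
#define MXCSR_MASKS (0x3Fu << 7)   /* exception masks, bits 7 through 12 */
#define MXCSR_RC    (0x3u << 13)   /* rounding control, bits 13 and 14   */
#define MXCSR_FZ    (1u << 15)     /* flush-to-zero                      */

/* Hypothetical helper: compose the power-up value field by field. */
static uint32_t mxcsr_reset_value(void)
{
    uint32_t value = 0;

    value |= MXCSR_MASKS;   /* all six exceptions masked                  */
    /* Flags, DAZ, RC (00B = round-nearest) and FZ all stay at 0.         */
    return value;
}
```

The six mask bits alone account for the whole value: 0x3F shifted left by 7 is 0x1F80.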

11.6.5 Saving and Restoring the SSE/SSE2 State

The FXSAVE instruction saves the x87 FPU, MMX, SSE and SSE2 states (which include the contents of the eight XMM registers and the MXCSR register) in a 512-byte block of memory. The FXRSTOR instruction restores the saved SSE and SSE2 state from memory. See the FXSAVE instruction in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for the layout of the 512-byte state block.

In addition to saving and restoring the SSE and SSE2 state, FXSAVE and FXRSTOR also save and restore the x87 FPU state (because MMX registers are aliased to the x87 FPU data registers, this includes saving and restoring the MMX state). For greater code efficiency, it is suggested that FXSAVE and FXRSTOR be substituted for the FSAVE, FNSAVE and FRSTOR instructions in the following situations:

• When a context switch is being made in a multitasking environment
• During calls and returns from interrupt and exception handlers

In situations where the code is switching between x87 FPU and MMX technology computations (without a context switch or a call to an interrupt or exception), the FSAVE/FNSAVE and FRSTOR instructions are more efficient than the FXSAVE and FXRSTOR instructions.

11.6.6 Guidelines for Writing to the MXCSR Register

The MXCSR has several reserved bits, and attempting to write a 1 to any of these bits will cause a general-protection exception (#GP) to be generated. To allow software to identify these reserved bits, the MXCSR_MASK value is provided. Software can determine this mask value as follows:

1. Establish a 512-byte FXSAVE area in memory.
2. Clear the FXSAVE area to all 0s.
3. Execute the FXSAVE instruction, using the address of the first byte of the cleared FXSAVE area as the destination operand. See “FXSAVE—Save x87 FPU, MMX, SSE, and SSE2 State” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for a description of FXSAVE and the layout of the FXSAVE image.
4. Check the value in the MXCSR_MASK field in the FXSAVE image (bytes 28 through 31).
  — If the value of the MXCSR_MASK field is 00000000H, then the MXCSR_MASK value is the default value of 0000FFBFH. Note that this value indicates that bit 6 of the MXCSR register is reserved; this setting indicates that the denormals-are-zero mode is not supported on the processor.
  — If the value of the MXCSR_MASK field is non-zero, the value read from the field should be used as the MXCSR_MASK.

All bits set to 0 in the MXCSR_MASK value indicate reserved bits in the MXCSR register. Thus, if the MXCSR_MASK value is AND’d with a value to be written into the MXCSR register, the resulting value is guaranteed to have all its reserved bits set to 0, which prevents a general-protection exception from being generated when the value is written to the MXCSR register.

For example, the default MXCSR_MASK value when 00000000H is returned in the FXSAVE image is 0000FFBFH. If software ANDs a value to be written to the MXCSR register with 0000FFBFH, bit 6 of the result (the DAZ flag) is guaranteed to be 0, which is the required setting to prevent general-protection exceptions on processors that do not support the denormals-are-zero mode.

To prevent general-protection exceptions, the MXCSR_MASK value should be AND’d with the value to be written into the MXCSR register in the following situations:

• Operating system routines that receive a parameter from an application program and then write that value to the MXCSR register (either with an FXRSTOR or LDMXCSR instruction)
• Any application program that writes to the MXCSR register and that needs to run robustly on several different IA-32 processors

Note that all bits in the MXCSR_MASK value that are set to 1 indicate features that are supported by the MXCSR register; they can be treated as feature flags for identifying processor capabilities.
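The default-mask rule and the AND operation described in this section can be sketched in a few lines of C. The function names are hypothetical, and the `reported_mask` parameter stands for whatever was read from bytes 28 through 31 of the FXSAVE image.

```c
#include <stdint.h>

#define MXCSR_MASK_DEFAULT 0x0000FFBFu  /* used when the field reads 00000000H */

/* Hypothetical helper: apply the default when the field is zero. */
static uint32_t effective_mxcsr_mask(uint32_t reported_mask)
{
    return reported_mask != 0 ? reported_mask : MXCSR_MASK_DEFAULT;
}

/* Hypothetical helper: clear every reserved bit in a value that is about
 * to be written to MXCSR, so the write cannot raise #GP on that account. */
static uint32_t sanitize_mxcsr(uint32_t requested, uint32_t reported_mask)
{
    return requested & effective_mxcsr_mask(reported_mask);
}
```

For instance, a request of 1FC0H (the reset value with DAZ set) becomes 1F80H on a processor that reports a zero mask, and passes through unchanged on one that reports 0000FFFFH.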

11.6.7 Interaction of SSE/SSE2 Instructions with x87 FPU and MMX Instructions

The XMM registers and the x87 FPU and MMX registers represent separate execution environments, which has certain ramifications when executing SSE, SSE2, MMX, and x87 FPU instructions in the same code module or when mixing code modules that contain these instructions:

• Those SSE and SSE2 instructions that operate only on XMM registers (such as the packed and scalar floating-point instructions and the 128-bit SIMD integer instructions) can be executed in the same instruction stream with 64-bit SIMD integer or x87 FPU instructions without any restrictions. For example, an application can perform the majority of its floating-point computations in the XMM registers, using the packed and scalar floating-point instructions, and at the same time use the x87 FPU to perform trigonometric and other transcendental computations. Likewise, an application can perform packed 64-bit and 128-bit SIMD integer operations together without restrictions.
• Those SSE and SSE2 instructions that operate on MMX registers (such as the CVTPS2PI, CVTTPS2PI, CVTPI2PS, CVTPD2PI, CVTTPD2PI, CVTPI2PD, MOVDQ2Q, MOVQ2DQ, PADDQ, and PSUBQ instructions) can also be executed in the same instruction stream as 64-bit SIMD integer or x87 FPU instructions. Here, however, they are subject to the restrictions on the simultaneous use of MMX technology and x87 FPU instructions, which include:
  — Transition from x87 FPU to MMX technology instructions or to SSE or SSE2 instructions that operate on MMX registers should be preceded by saving the state of the x87 FPU.
  — Transition from MMX technology instructions or from SSE or SSE2 instructions that operate on MMX registers to x87 FPU instructions should be preceded by execution of the EMMS instruction.

11.6.8 Compatibility of SIMD and x87 FPU Floating-Point Data Types

SSE and SSE2 extensions operate on the same single-precision and double-precision floating-point data types that the x87 FPU operates on. However, when operating on these data types, the SSE and SSE2 extensions operate on them in their native format (single-precision or double-precision), in contrast to the x87 FPU, which extends them to double extended-precision floating-point format to perform computations and then rounds the result back to a single-precision or double-precision format before writing results to memory. Because the x87 FPU operates on a higher precision format and then rounds the result to a lower precision format, it may return a slightly different result when performing the same operation on the same single-precision or double-precision floating-point values than is returned by the SSE and SSE2 extensions. The difference occurs only in the least-significant bits of the significand.
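The effect can be reproduced in portable C with a small multiply-add. This is a sketch under a stated assumption: `double` stands in for the wider intermediate format (the x87 FPU actually uses double extended-precision), and the function names are hypothetical. The `volatile` qualifiers force each single-precision intermediate to be rounded and keep the compiler from fusing the two operations.

```c
/* Every operation rounded to single precision, as with the scalar
 * single-precision SSE instructions. */
static float multiply_add_native(float a, float b, float c)
{
    volatile float product = a * b;     /* rounded to single precision */
    volatile float sum = product + c;   /* rounded again               */
    return sum;
}

/* Intermediate value kept in a wider format and rounded to single
 * precision once, at the end (double is a stand-in for the wider format). */
static float multiply_add_wide(float a, float b, float c)
{
    double wide = (double)a * (double)b + (double)c;
    return (float)wide;
}
```

With a = b = 1 + 2^-12 and c = 2^-24, the exact answer is 1 + 2^-11 + 2^-23. The wide path returns that value, while the native path loses the 2^-24 term twice to round-to-nearest-even and returns 1 + 2^-11, so the two results differ by one unit in the last place of the significand.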

11.6.9 Mixing Packed and Scalar Floating-Point and 128-Bit SIMD Integer Instructions and Data

SSE and SSE2 extensions define typed operations on packed and scalar floating-point data types and on 128-bit SIMD integer data types, but IA-32 processors do not enforce this typing at the architectural level. They only enforce it at the microarchitectural level. Therefore, when a Pentium 4 or Intel Xeon processor loads a packed or scalar floating-point operand or a 128-bit packed integer operand from memory into an XMM register, it does not check that the actual data being loaded matches the data type specified in the instruction. Likewise, when the processor performs an arithmetic operation on the data in an XMM register, it does not check that the data being operated on matches the data type specified in the instruction.

As a general rule, because data typing of SIMD floating-point and integer data types is not enforced at the architectural level, it is the responsibility of the programmer, assembler, or compiler to ensure that code enforces data typing. Failure to enforce correct data typing can lead to computations that return unexpected results.

For example, in the following code sample, two packed single-precision floating-point operands are moved from memory into XMM registers (using MOVAPS instructions); then a double-precision packed add operation (using the ADDPD instruction) is performed on the operands:

    movaps  xmm0, [eax]    ; EAX register contains pointer to packed
                           ; single-precision floating-point operand
    movaps  xmm1, [ebx]
    addpd   xmm0, xmm1

Pentium 4 and Intel Xeon processors execute these instructions without generating an invalid-opcode exception (#UD) and will produce the expected results in register XMM0 (that is, the high and low 64 bits of each register will be treated as a double-precision floating-point value and the processor will operate on them accordingly). Because the data types operated on and the data type expected by the ADDPD instruction were inconsistent, the instruction may result in a SIMD floating-point exception (such as numeric overflow [#O] or invalid operation [#I]) being generated, but the actual source of the problem (inconsistent data types) is not detected.

The ability to operate on an operand that contains a data type that is inconsistent with the typing of the instruction being executed permits some valid operations to be performed. For example, the following instructions load a packed double-precision floating-point operand from memory to register XMM0, and a mask to register XMM1; then they use XORPD to toggle the sign bits of the two packed values in register XMM0.

    movapd  xmm0, [eax]    ; EAX register contains pointer to packed
                           ; double-precision floating-point operand
    movapd  xmm1, [ebx]    ; EBX register contains pointer to packed
                           ; double-precision floating-point mask
    xorpd   xmm0, xmm1     ; XOR operation toggles sign bits using
                           ; the mask in xmm1

In this example, XORPS or PXOR can be used in place of XORPD and yield the same correct result. However, because of the type mismatch between the operand data type and the instruction data type, a latency penalty will be incurred due to implementations of the instructions at the microarchitecture level.
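The sign toggle performed by XORPD is an ordinary bitwise XOR on the 64-bit encoding of each element, which is why an integer XOR gives the same bits. A scalar C sketch of one element, with the hypothetical name `toggle_sign`, assuming the IEEE 754 double-precision layout with the sign in bit 63:

```c
#include <stdint.h>
#include <string.h>

#define DOUBLE_SIGN_MASK 0x8000000000000000ull  /* bit 63 of the encoding */

/* Hypothetical helper: flip the sign of one double by XORing its raw bits,
 * the per-element effect of XORPD with a sign-bit mask. */
static double toggle_sign(double value)
{
    uint64_t bits;

    memcpy(&bits, &value, sizeof bits);    /* reinterpret without aliasing UB */
    bits ^= DOUBLE_SIGN_MASK;
    memcpy(&value, &bits, sizeof value);
    return value;
}
```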

Latency penalties can also be incurred by using move instructions of the wrong type. For example, MOVAPS and MOVAPD can both be used to move a packed single-precision operand from memory to an XMM register. However, if MOVAPD is used, a latency penalty will be incurred when a correctly typed instruction attempts to use the data in the register.

Note that these latency penalties are not incurred when moving data from XMM registers to memory.

11.6.10 Interfacing with SSE/SSE2 Procedures and Functions

SSE and SSE2 extensions allow direct access to XMM registers. This means that all existing interface conventions between procedures and functions that apply to the use of the general-purpose registers (EAX, EBX, etc.) also apply to XMM register usage.

11.6.10.1 Passing Parameters in XMM Registers

The state of XMM registers is preserved across procedure (or function) boundaries. Parameters can be passed from one procedure to another using XMM registers.

11.6.10.2 Saving XMM Register State on a Procedure or Function Call

The state of XMM registers can be saved in two ways: using an FXSAVE instruction or a move instruction. FXSAVE saves the state of all XMM registers (along with the state of MXCSR and the x87 FPU registers). This instruction is typically used for major changes in the context of the execution environment, such as a task switch. FXRSTOR restores the XMM, MXCSR, and x87 FPU registers stored with FXSAVE.

In cases where only XMM registers must be saved, or where selected XMM registers need to be saved, move instructions (MOVAPS, MOVUPS, MOVSS, MOVAPD, MOVUPD, MOVSD, MOVDQA, and MOVDQU) can be used. These instructions can also be used to restore the contents of XMM registers. To avoid performance degradation when saving XMM registers to memory or when loading XMM registers from memory, be sure to use the appropriately typed move instructions.

The move instructions can also be used to save the contents of XMM registers on the stack. Here, the stack pointer (in the ESP register) can be used as the memory address to the next available byte in the stack. Note that the stack pointer is not automatically decremented when using a move instruction (as it is with PUSH). A move-instruction procedure that saves the contents of an XMM register to the stack is responsible for decrementing the value in the ESP register by 16. Likewise, a move-instruction procedure that loads an XMM register from the stack also needs to increment the ESP register by 16. To avoid performance degradation when moving the contents of XMM registers, use the appropriately typed move instructions.

Use the STMXCSR and LDMXCSR instructions to save and restore, respectively, the contents of the MXCSR register on a procedure call and return.

11.6.10.3 Caller-Save Requirement for Procedure and Function Calls

When making procedure (or function) calls from SSE or SSE2 code, a caller-save convention is recommended for saving the state of the calling procedure. Using this convention, any register whose content must survive intact across a procedure call must be stored in memory by the calling procedure prior to executing the call.

The primary reason for using the caller-save convention is to prevent performance degradation. XMM registers can contain packed or scalar double-precision floating-point, packed single-precision floating-point, and 128-bit packed integer data types. The called procedure has no way of knowing the data types in XMM registers following a call, so it is unlikely to use the correctly typed move instruction to store the contents of XMM registers in memory or to restore the contents of XMM registers from memory.

As described in Section 11.6.9, “Mixing Packed and Scalar Floating-Point and 128-Bit SIMD Integer Instructions and Data,” a move instruction that does not match the type of the data being moved to or from XMM registers will be carried out correctly, but can lead to a greater instruction latency.

11.6.11 Updating Existing MMX Technology Routines Using 128-Bit SIMD Integer Instructions

SSE2 extensions extend all 64-bit MMX SIMD integer instructions to operate on 128-bit SIMD integers using XMM registers. The extended 128-bit SIMD integer instructions operate like the 64-bit SIMD integer instructions; this simplifies the porting of MMX technology applications. However, there are considerations:

• To take advantage of wider 128-bit SIMD integer instructions, MMX technology code must be recompiled to reference the XMM registers instead of MMX registers.
• Computation instructions that reference memory operands that are not aligned on 16-byte boundaries should be replaced with an unaligned 128-bit load (MOVDQU instruction) followed by a version of the same computation operation that uses register instead of memory operands. Use of 128-bit packed integer computation instructions with memory operands that are not 16-byte aligned results in a general-protection exception (#GP).
• Extension of the PSHUFW instruction (shuffle word across 64-bit integer operand) across a full 128-bit operand is emulated by a combination of the following instructions: PSHUFHW, PSHUFLW, and PSHUFD.
• Use of the 64-bit shift by bit instructions (PSRLQ, PSLLQ) can be extended to 128 bits in either of two ways:
  — Use of PSRLQ and PSLLQ, along with masking logic operations.
  — Rewriting the code sequence to use PSRLDQ and PSLLDQ (shift double quadword operand by bytes).
• Loop counters need to be updated, since each 128-bit SIMD integer instruction operates on twice the amount of data as its 64-bit SIMD integer counterpart.
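The first option for extending a bit shift to 128 bits can be modelled in scalar C. This is only a model of the arithmetic: the type `u128_lanes` and the function `shift_right_128` are hypothetical, the two struct members stand for the two 64-bit lanes of an XMM register, and an actual SSE2 sequence would also need an instruction to move the carried bits from the high lane to the low lane.

```c
#include <stdint.h>

/* Two 64-bit lanes standing in for one 128-bit XMM register. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_lanes;

/* Hypothetical model of a 128-bit logical right shift built from 64-bit
 * lane shifts. Valid for count in the range 0 through 63. */
static u128_lanes shift_right_128(u128_lanes v, unsigned count)
{
    u128_lanes r;
    uint64_t carried;

    if (count == 0)
        return v;                        /* avoid an undefined 64-bit shift */

    carried = v.hi << (64 - count);      /* bits that cross the lane boundary
                                            (the PSLLQ half of the sequence) */
    r.lo = (v.lo >> count) | carried;    /* per-lane PSRLQ, then merge        */
    r.hi = v.hi >> count;
    return r;
}
```

Without the merge step, a per-lane shift alone would drop the bits leaving the high lane, which is the difference between the 64-bit and 128-bit behaviour that ported code has to account for.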

11.6.12 Branching on Arithmetic Operations

There are no condition codes in SSE or SSE2 states. A packed-data comparison instruction generates a mask which can then be transferred to an integer register. The following code sequence provides an example of how to perform a conditional branch, based on the result of an SSE2 arithmetic operation.

    cmppd     xmm0, xmm1, 1    ; generates a mask in XMM0 (predicate 1,
                               ; less-than, chosen for illustration)
    movmskpd  eax, xmm0        ; moves a 2-bit mask to EAX
    test      eax, 3           ; compare with desired result
    jne       BRANCH_TARGET    ; branch if either mask bit is set

The COMISD and UCOMISD instructions update the EFLAGS as the result of a scalar comparison. A conditional branch can then be scheduled immediately following COMISD/UCOMISD.
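The compare-then-extract pattern can be modelled in scalar C as follows. The sketch assumes a less-than comparison and uses hypothetical function names; bit 0 of the mask corresponds to the low element and bit 1 to the high element, which is the order in which MOVMSKPD collects the sign bits.

```c
#include <stdbool.h>

/* Hypothetical model of CMPPD (less-than predicate) followed by MOVMSKPD:
 * returns a 2-bit mask, one bit per packed double-precision element. */
static unsigned less_than_mask(const double a[2], const double b[2])
{
    unsigned mask = 0;

    if (a[0] < b[0])
        mask |= 1u;   /* low element  */
    if (a[1] < b[1])
        mask |= 2u;   /* high element */
    return mask;
}

/* The branch condition: taken when the comparison holds for any element. */
static bool any_less_than(const double a[2], const double b[2])
{
    return less_than_mask(a, b) != 0;
}
```

Comparing the mask against 3 instead of 0 would give the "all elements" form of the same branch.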

11.6.13 Cacheability Hint Instructions

SSE and SSE2 cacheability control instructions enable the programmer to control prefetching, caching, loading and storing of data. When correctly used, these instructions improve application performance.

To make efficient use of the processor’s super-scalar microarchitecture, software must supply a steady stream of data to the executing program to avoid stalling the processor. PREFETCHh instructions minimize the latency of data accesses in performance-critical sections of application code by allowing data to be fetched into the processor cache hierarchy in advance of actual usage. PREFETCHh instructions do not change the user-visible semantics of a program, although they may affect performance. The operation of these instructions is implementation-dependent. Programmers may need to tune code for each IA-32 processor implementation. Excessive usage of PREFETCHh instructions may waste memory bandwidth and reduce performance. For more detailed information on the use of prefetch hints, refer to Chapter 6, “Optimizing Cache Usage”, in the Intel® 64 and IA-32 Architectures Optimization Reference Manual.

The non-temporal store instructions (MOVNTI, MOVNTPD, MOVNTPS, MOVNTDQ, MOVNTQ, MASKMOVQ, and MASKMOVDQU) minimize cache pollution when writing non-temporal data to memory (see Section 10.4.6.2, “Caching of Temporal vs. Non-Temporal Data,” and Section 10.4.6.1, “Cacheability Control Instructions”). They prevent non-temporal data from being written into processor caches on a store operation. These instructions are implementation specific. Programmers may have to tune their applications for each IA-32 processor implementation to take advantage of these instructions.

Besides reducing cache pollution, the use of weakly-ordered memory types can be important under certain data sharing relationships, such as a producer-consumer relationship. The use of weakly-ordered memory can make the assembling of data more efficient, but care must be taken to ensure that the consumer obtains the data that the producer intended. Some common usage models that may be affected in this way by weakly-ordered stores are:

• Library functions that use weakly-ordered memory to write results
• Compiler-generated code that writes weakly-ordered results
• Hand-crafted code

The degree to which a consumer of data knows that the data is weakly ordered can vary for these cases. As a result, the SFENCE or MFENCE instruction should be used to ensure ordering between routines that produce weakly-ordered data and routines that consume the data. SFENCE and MFENCE provide a performance-efficient way to ensure ordering by guaranteeing that every store instruction that precedes SFENCE/MFENCE in program order is globally visible before a store instruction that follows the fence.
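The producer-consumer ordering described above can be sketched with C11 atomics. This is a portable analogue under stated assumptions, not the SSE2 mechanism itself: the release fence stands in for the point where SFENCE would be placed after weakly-ordered stores, and the names `payload`, `ready`, `produce`, and `try_consume` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define PAYLOAD_LEN 4

static int payload[PAYLOAD_LEN];   /* data assembled by the producer */
static atomic_int ready;           /* flag published after the data (starts at 0) */

/* Producer: write the data, fence, then publish the flag. The fence keeps
   the data stores ordered before the flag store, which is the role SFENCE
   plays after weakly-ordered stores. */
static void produce(const int *src) {
    assert(src != 0);
    for (int i = 0; i < PAYLOAD_LEN; i++)
        payload[i] = src[i];
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Consumer: returns 1 and copies the data once the flag is visible,
   otherwise returns 0 and leaves out untouched. */
static int try_consume(int *out) {
    assert(out != 0);
    if (!atomic_load_explicit(&ready, memory_order_relaxed))
        return 0;
    atomic_thread_fence(memory_order_acquire);
    for (int i = 0; i < PAYLOAD_LEN; i++)
        out[i] = payload[i];
    return 1;
}
```

When the producer uses actual non-temporal stores on an Intel 64 or IA-32 processor, a language-level release fence may not emit any instruction, so an explicit SFENCE (for example through the `_mm_sfence` intrinsic) would be needed at the position of the release fence.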

11.6.14 Effect of Instruction Prefixes on the SSE/SSE2 Instructions

Table 11-3 describes the effects of instruction prefixes on SSE and SSE2 instructions. (Table 11-3 also applies to SIMD integer and SIMD floating-point instructions in SSE3.) Unpredictable behavior can range from prefixes being treated as a reserved operation on one generation of IA-32 processors to generating an invalid opcode exception on another generation of processors. See also “Instruction Prefixes” in Chapter 2 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A, for a complete description of instruction prefixes.

NOTE
Some SSE/SSE2/SSE3 instructions have two-byte opcodes that are either 2 bytes or 3 bytes in length. Two-byte opcodes that are 3 bytes in length consist of: a mandatory prefix (F2H, F3H, or 66H), 0FH, and an opcode byte. See Table 11-3.


Table 11-3. Effect of Prefixes on SSE, SSE2, and SSE3 Instructions

Prefix Type | Effect on SSE, SSE2 and SSE3 Instructions
Address Size Prefix (67H) | Affects instructions with a memory operand. Reserved for instructions without a memory operand and may result in unpredictable behavior.
Operand Size (66H) | Reserved and may result in unpredictable behavior.
Segment Override (2EH, 36H, 3EH, 26H, 64H, 65H) | Affects instructions with a memory operand. Reserved for instructions without a memory operand and may result in unpredictable behavior.
Repeat Prefixes (F2H and F3H) | Reserved and may result in unpredictable behavior.
Lock Prefix (F0H) | Reserved; generates invalid opcode exception (#UD).
Branch Hint Prefixes (2EH and 3EH) | Reserved and may result in unpredictable behavior.

CHAPTER 12
PROGRAMMING WITH SSE3 AND SUPPLEMENTAL SSE3

The Pentium 4 processor supporting Hyper-Threading Technology introduces Streaming SIMD Extensions 3 (SSE3). The Intel Xeon processor 5100 series and Intel Core 2 processor families introduced Supplemental Streaming SIMD Extensions 3 (SSSE3). This chapter describes SSE3/SSSE3 and provides information to assist in writing application programs that use these extensions.

12.1

SSE3/SSSE3 PROGRAMMING ENVIRONMENT AND DATA TYPES

The programming environment for using SSE3/SSSE3 is unchanged from that shown in Figure 3-1 and Figure 11-1. SSE3/SSSE3 do not introduce new data types. XMM registers are used to operate on packed integer data, single-precision floating-point data, or double-precision floating-point data. One SSE3 instruction uses the x87 FPU for x87-style programming. There are two SSE3 instructions that use the general registers for thread synchronization. The MXCSR register governs SIMD floating-point operations. Note, however, that the x87 FPU control word does not affect the SSE3 instruction that is executed by the x87 FPU (FISTTP), other than by unmasking an invalid operand or inexact result exception.

12.1.1

SSE3/SSSE3 in 64-Bit Mode and Compatibility Mode

In compatibility mode, SSE3/SSSE3 function like they do in protected mode. In 64-bit mode, eight additional XMM registers are accessible. Registers XMM8-XMM15 are accessed by using REX prefixes. Memory operands are specified using the ModR/M, SIB encoding described in Section 3.7.5. Some SSE3 instructions may be used to operate on general-purpose registers. Use the REX.W prefix to access 64-bit general-purpose registers. Note that if a REX prefix is used when it has no meaning, the prefix is ignored.


12.1.2

Compatibility of SSE3/SSSE3 with MMX Technology, the x87 FPU Environment, and SSE/SSE2 Extensions

SSE3/SSSE3 do not introduce any new state to the Intel 64 and IA-32 execution environments. For SIMD and x87 programming, the FXSAVE and FXRSTOR instructions save and restore the architectural states of XMM, MXCSR, x87 FPU, and MMX registers. The MONITOR and MWAIT instructions use general-purpose registers on input; they do not modify the content of those registers.

12.1.3

Horizontal and Asymmetric Processing

Many SSE/SSE2/SSE3/SSSE3 instructions accelerate SIMD data processing using a model referred to as vertical computation. Using this model, data flow is vertical between the data elements of the inputs and the output. Figure 12-1 illustrates the asymmetric processing of the SSE3 instruction ADDSUBPD. Figure 12-2 illustrates the horizontal data movement of the SSE3 instruction HADDPD.

[Figure 12-1: inputs X1, X0 and Y1, Y0; the high elements are added (X1 + Y1) and the low elements are subtracted (X0 - Y0).]

Figure 12-1. Asymmetric Processing in ADDSUBPD

[Figure 12-2: inputs X1, X0 and Y1, Y0; each output is the sum of the two elements of one input: Y0 + Y1 (high) and X0 + X1 (low).]

Figure 12-2. Horizontal Data Movement in HADDPD

12.2

OVERVIEW OF SSE3 INSTRUCTIONS

SSE3 extensions include 13 instructions. See:

• Section 12.3, “SSE3 Instructions,” provides an introduction to individual SSE3 instructions.
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, provide detailed information on individual instructions.
• Chapter 12, “System Programming for Streaming SIMD Instruction Sets,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, gives guidelines for integrating SSE/SSE2/SSE3 extensions into an operating-system environment.

12.3

SSE3 INSTRUCTIONS

SSE3 instructions are grouped as follows:

• x87 FPU instruction
  — One instruction that improves x87 FPU floating-point to integer conversion
• SIMD integer instruction
  — One instruction that provides a specialized 128-bit unaligned data load
• SIMD floating-point instructions
  — Three instructions that enhance LOAD/MOVE/DUPLICATE performance
  — Two instructions that provide packed addition/subtraction
  — Four instructions that provide horizontal addition/subtraction
• Thread synchronization instructions
  — Two instructions that improve synchronization between multi-threaded agents

The instructions are discussed in more detail in the following paragraphs.

12.3.1

x87 FPU Instruction for Integer Conversion

The FISTTP instruction (x87 FPU Store Integer and Pop with Truncation) behaves like FISTP, but uses truncation regardless of what rounding mode is specified in the x87 FPU control word. The instruction converts the top of stack (ST0) to an integer by truncation (rounding toward zero), stores the result, and pops the stack. The FISTTP instruction is available in three precisions: short integer (word or 16-bit), integer (double word or 32-bit), and long integer (64-bit). With FISTTP, applications no longer need to change the x87 FPU control word (FCW) when truncation is required.
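The difference between truncation and the default rounding can be illustrated with a small scalar model. This is a sketch under assumptions: `fisttp32_model` and `fistp32_nearest_model` are hypothetical names, only the 32-bit form is modeled, and out-of-range or NaN inputs are assumed to produce the integer indefinite value (80000000H), as happens when the invalid-operation exception is masked.

```c
#include <assert.h>
#include <stdint.h>

/* Value assumed to be stored when the source cannot be represented. */
#define INTEGER_INDEFINITE INT32_MIN

/* FISTTP model, 32-bit form: always chops toward zero. A C cast from
   double to an integer type also truncates, so it matches in-range inputs. */
static int32_t fisttp32_model(double st0) {
    if (!(st0 > -2147483649.0 && st0 < 2147483648.0))
        return INTEGER_INDEFINITE;              /* out of range or NaN */
    return (int32_t)st0;
}

/* FISTP model with the control word selecting round-to-nearest-even. */
static int32_t fistp32_nearest_model(double st0) {
    if (!(st0 > -4.0e9 && st0 < 4.0e9))
        return INTEGER_INDEFINITE;              /* far out of range or NaN */
    int64_t t = (int64_t)st0;                   /* truncated value */
    double frac = st0 - (double)t;              /* same sign as st0 */
    if (frac > 0.5 || (frac == 0.5 && (t & 1)))
        t++;
    if (frac < -0.5 || (frac == -0.5 && (t & 1)))
        t--;
    if (t < INT32_MIN || t > INT32_MAX)
        return INTEGER_INDEFINITE;
    return (int32_t)t;
}
```

With these models, `assert(fisttp32_model(-2.7) == -2)` and `assert(fistp32_nearest_model(-2.7) == -3)` both hold, which is the case where code without FISTTP has to rewrite the control word to get C-style conversion.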

12.3.2

SIMD Integer Instruction for Specialized 128-bit Unaligned Data Load

The LDDQU instruction is a special 128-bit unaligned load designed to avoid cache line splits. If the address of a 16-byte load is on a 16-byte boundary, LDDQU loads the bytes requested. If the address of the load is not aligned on a 16-byte boundary, LDDQU loads a 32-byte block starting at the 16-byte aligned address immediately below the load request. It then extracts the requested 16 bytes. The instruction provides significant performance improvement on 128-bit unaligned memory accesses at the cost of some usage model restrictions.

12.3.3

SIMD Floating-Point Instructions That Enhance LOAD/MOVE/DUPLICATE Performance

The MOVSHDUP instruction loads/moves 128 bits, duplicating the second and fourth 32-bit data elements.

• MOVSHDUP OperandA, OperandB
  — OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
  — OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
  — Result (stored in OperandA): 3b, 3b, 1b, 1b

The MOVSLDUP instruction loads/moves 128 bits, duplicating the first and third 32-bit data elements.

• MOVSLDUP OperandA, OperandB
  — OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
  — OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
  — Result (stored in OperandA): 2b, 2b, 0b, 0b

The MOVDDUP instruction loads/moves 64 bits, duplicating the 64 bits from the source.

• MOVDDUP OperandA, OperandB
  — OperandA (128 bits, two data elements): 1a, 0a
  — OperandB (64 bits, one data element): 0b
  — Result (stored in OperandA): 0b, 0b
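The three duplication patterns above can be written as a scalar C model. This is an illustrative sketch; the `*_model` function names are hypothetical, and array index 0 is taken to be the least-significant element (0a/0b in the notation of the text).

```c
#include <assert.h>

/* MOVSHDUP: result (high to low) = 3b, 3b, 1b, 1b. */
static void movshdup_model(float dst[4], const float src[4]) {
    assert(dst != 0 && src != 0);
    float lo = src[1], hi = src[3];   /* read first so dst may alias src */
    dst[0] = dst[1] = lo;
    dst[2] = dst[3] = hi;
}

/* MOVSLDUP: result (high to low) = 2b, 2b, 0b, 0b. */
static void movsldup_model(float dst[4], const float src[4]) {
    assert(dst != 0 && src != 0);
    float lo = src[0], hi = src[2];
    dst[0] = dst[1] = lo;
    dst[2] = dst[3] = hi;
}

/* MOVDDUP: the single 64-bit source element fills both result elements. */
static void movddup_model(double dst[2], const double *src) {
    assert(dst != 0 && src != 0);
    double v = *src;
    dst[0] = dst[1] = v;
}
```

Reading the source elements into locals before writing keeps the model correct when the same array is passed as both operands, which mirrors using one register as source and destination.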

12.3.4

SIMD Floating-Point Instructions Provide Packed Addition/Subtraction

The ADDSUBPS instruction has two 128-bit operands. The instruction performs single-precision addition on the second and fourth pairs of 32-bit data elements within the operands; and single-precision subtraction on the first and third pairs.

• ADDSUBPS OperandA, OperandB
  — OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
  — OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
  — Result (stored in OperandA): 3a + 3b, 2a - 2b, 1a + 1b, 0a - 0b

The ADDSUBPD instruction has two 128-bit operands. The instruction performs double-precision addition on the second pair of quadwords, and double-precision subtraction on the first pair.

• ADDSUBPD OperandA, OperandB
  — OperandA (128 bits, two data elements): 1a, 0a
  — OperandB (128 bits, two data elements): 1b, 0b
  — Result (stored in OperandA): 1a + 1b, 0a - 0b
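The alternating subtract/add pattern can be modeled in scalar C as follows. This is a sketch with hypothetical names (`addsubps_model`, `addsubpd_model`, `complex_mul_model`); index 0 is the least-significant element. The complex multiplication helper shows one use that fits this pattern, offered as an illustration rather than as the documented purpose of the instructions.

```c
#include <assert.h>

/* ADDSUBPS: subtract in even positions, add in odd positions. */
static void addsubps_model(float a[4], const float b[4]) {
    assert(a != 0 && b != 0);
    a[0] -= b[0];
    a[1] += b[1];
    a[2] -= b[2];
    a[3] += b[3];
}

/* ADDSUBPD: subtract in the low element, add in the high element. */
static void addsubpd_model(double a[2], const double b[2]) {
    assert(a != 0 && b != 0);
    a[0] -= b[0];
    a[1] += b[1];
}

/* (x + iy)(u + iv) = (xu - yv) + i(xv + yu), stored as {real, imaginary}.
   The final step is exactly one ADDSUBPD-style operation. */
static void complex_mul_model(double out[2], const double p[2], const double q[2]) {
    double t1[2] = { p[0] * q[0], p[0] * q[1] };   /* {xu, xv} */
    double t2[2] = { p[1] * q[1], p[1] * q[0] };   /* {yv, yu} */
    addsubpd_model(t1, t2);                        /* {xu - yv, xv + yu} */
    out[0] = t1[0];
    out[1] = t1[1];
}
```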

12.3.5

SIMD Floating-Point Instructions Provide Horizontal Addition/Subtraction

Most SIMD instructions operate vertically. This means that the result in position i is a function of the elements in position i of both operands. Horizontal addition/subtraction operates horizontally. This means that contiguous data elements in the same source operand are used to produce a result.

The HADDPS instruction performs a single-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the third and fourth elements of the first operand; the third by adding the first and second elements of the second operand; and the fourth by adding the third and fourth elements of the second operand.

• HADDPS OperandA, OperandB
  — OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
  — OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
  — Result (stored in OperandA): 3b + 2b, 1b + 0b, 3a + 2a, 1a + 0a

The HSUBPS instruction performs a single-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the fourth element of the first operand from the third element of the first operand; the third by subtracting the second element of the second operand from the first element of the second operand; and the fourth by subtracting the fourth element of the second operand from the third element of the second operand.

• HSUBPS OperandA, OperandB
  — OperandA (128 bits, four data elements): 3a, 2a, 1a, 0a
  — OperandB (128 bits, four data elements): 3b, 2b, 1b, 0b
  — Result (stored in OperandA): 2b - 3b, 0b - 1b, 2a - 3a, 0a - 1a

The HADDPD instruction performs a double-precision addition on contiguous data elements. The first data element of the result is obtained by adding the first and second elements of the first operand; the second element by adding the first and second elements of the second operand.

• HADDPD OperandA, OperandB
  — OperandA (128 bits, two data elements): 1a, 0a
  — OperandB (128 bits, two data elements): 1b, 0b
  — Result (stored in OperandA): 1b + 0b, 1a + 0a

The HSUBPD instruction performs a double-precision subtraction on contiguous data elements. The first data element of the result is obtained by subtracting the second element of the first operand from the first element of the first operand; the second element by subtracting the second element of the second operand from the first element of the second operand.

• HSUBPD OperandA, OperandB
  — OperandA (128 bits, two data elements): 1a, 0a
  — OperandB (128 bits, two data elements): 1b, 0b
  — Result (stored in OperandA): 0b - 1b, 0a - 1a
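The horizontal operations above can be expressed as a scalar C model. The sketch below uses hypothetical names (`haddps_model`, `hsubps_model`, `horizontal_sum4`) and treats index 0 as the least-significant element. `horizontal_sum4` is an added illustration of reducing a four-element vector with two horizontal adds.

```c
#include <assert.h>

/* HADDPS: low half of the result comes from a, high half from b. */
static void haddps_model(float a[4], const float b[4]) {
    assert(a != 0 && b != 0);
    float r0 = a[0] + a[1], r1 = a[2] + a[3];
    float r2 = b[0] + b[1], r3 = b[2] + b[3];
    a[0] = r0; a[1] = r1; a[2] = r2; a[3] = r3;
}

/* HSUBPS: in each pair, the upper element is subtracted from the lower one. */
static void hsubps_model(float a[4], const float b[4]) {
    assert(a != 0 && b != 0);
    float r0 = a[0] - a[1], r1 = a[2] - a[3];
    float r2 = b[0] - b[1], r3 = b[2] - b[3];
    a[0] = r0; a[1] = r1; a[2] = r2; a[3] = r3;
}

/* Sum of four elements using two horizontal adds of a vector with itself. */
static float horizontal_sum4(const float v[4]) {
    float t[4] = { v[0], v[1], v[2], v[3] };
    haddps_model(t, t);   /* t = {v0+v1, v2+v3, v0+v1, v2+v3} */
    haddps_model(t, t);   /* every element now holds the total */
    return t[0];
}
```

All sums are computed before any element is written, so passing the same array for both operands behaves like using one register as both source and destination.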


12.3.6

Two Thread Synchronization Instructions

The MONITOR instruction sets up an address range that is used to monitor write-back stores. MWAIT enables a logical processor to enter into an optimized state while waiting for a write-back store to the address range set up by MONITOR. MONITOR and MWAIT require the use of general-purpose registers for their inputs. The registers used by MONITOR and MWAIT must be initialized properly; register content is not modified by these instructions.

12.4

WRITING APPLICATIONS WITH SSE3 EXTENSIONS

The following sections give guidelines for writing application programs and operating-system code that use SSE3 instructions.

12.4.1

Guidelines for Using SSE3 Extensions

The following guidelines describe how to maximize the benefits of using SSE3 extensions:

• Ensure that the processor supports SSE3 extensions.
• Ensure that your operating system supports SSE/SSE2/SSE3 extensions. (Operating system support for the SSE extensions implies support for SSE2 extensions and for the x87 and SIMD instructions of SSE3 extensions.)
• Ensure your operating system supports MONITOR and MWAIT.
• Employ the optimization and scheduling techniques described in the Intel® 64 and IA-32 Architectures Optimization Reference Manual (see Section 1.4, “Related Literature”).

12.4.2

Checking for SSE3 Support

Before an application attempts to use the SIMD subset of SSE3 extensions, the application should follow the steps illustrated in Section 11.6.2, “Checking for SSE/SSE2 Support.” Next, use the additional step provided below:

• Check that the processor supports the SIMD and x87 SSE3 extensions (if CPUID.01H:ECX.SSE3[bit 0] = 1). See Example 12-1 for a code example.

Checking support for SSE and SSE2 along with SSE3 gives software the flexibility to use SSE3. To use FISTTP, software can use the step above to detect support for SSE3.

In the initial implementation of MONITOR and MWAIT, these two instructions are available to ring 0 and conditionally available at ring level greater than 0. Before an application attempts to use the MONITOR and MWAIT instructions, the application should use the following steps:

1. Check that the processor supports MONITOR and MWAIT. If CPUID.01H:ECX.MONITOR[bit 3] = 1, MONITOR and MWAIT are available at ring 0.
2. To verify MONITOR and MWAIT are supported at ring level greater than 0, use a routine similar to Example 12-2.
3. Query the smallest and largest line size that MONITOR uses. Use CPUID.05H:EAX.smallest[bits 15:0]; EBX.largest[bits 15:0]. Values are returned in bytes in EAX and EBX.
4. Ensure the memory address range(s) that will be supplied to MONITOR meet memory type requirements.

MONITOR and MWAIT are targeted for system software that supports efficient thread synchronization. See Chapter 12 in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A for details.

Example 12-1. Verifying SSE3 Support

    boolean SSE3_SIMD_works = TRUE;
    try {
        IssueSSE3_SIMD_Instructions(); // Use ADDSUBPD
    } except (UNWIND) {
        // if we get here, SSE3 not available
        SSE3_SIMD_works = FALSE;
    }

Example 12-2. Verifying MONITOR/MWAIT Support

    boolean MONITOR_MWAIT_works = TRUE;
    try {
        _asm {
            xor ecx, ecx
            xor edx, edx
            mov eax, MemArea
            monitor
        }
        // Use monitor
    } except (UNWIND) {
        // if we get here, MONITOR/MWAIT is not available
        MONITOR_MWAIT_works = FALSE;
    }
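The CPUID bit tests named in the steps above can be sketched as plain bit decoding. This sketch assumes the raw register values have already been obtained by executing CPUID (which is not shown here, since the way to issue CPUID depends on the compiler); the names `has_sse3`, `has_monitor`, `monitor_line_size`, and `decode_monitor_leaf` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Feature flags in ECX after CPUID leaf 01H (bit positions from the text). */
#define CPUID1_ECX_SSE3     (1u << 0)
#define CPUID1_ECX_MONITOR  (1u << 3)

static int has_sse3(uint32_t ecx_leaf1) {
    return (ecx_leaf1 & CPUID1_ECX_SSE3) != 0;
}

static int has_monitor(uint32_t ecx_leaf1) {
    return (ecx_leaf1 & CPUID1_ECX_MONITOR) != 0;
}

/* CPUID leaf 05H: bits 15:0 of EAX and EBX hold the smallest and largest
   monitor line size, in bytes. The upper bits are masked off. */
typedef struct { unsigned smallest, largest; } monitor_line_size;

static monitor_line_size decode_monitor_leaf(uint32_t eax, uint32_t ebx) {
    monitor_line_size s;
    s.smallest = eax & 0xFFFFu;
    s.largest  = ebx & 0xFFFFu;
    return s;
}
```

A set MONITOR flag only indicates availability at ring 0; availability at higher ring levels still needs a probe in the style of Example 12-2.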

12.4.3

Enable FTZ and DAZ for SIMD Floating-Point Computation

Enabling the FTZ and DAZ flags in the MXCSR register is likely to accelerate SIMD floating-point computation where strict compliance to the IEEE standard 754-1985 is not required. The FTZ flag is available to Intel 64 and IA-32 processors that support the SSE; DAZ is available to Intel 64 processors and to most IA-32 processors that support SSE/SSE2/SSE3. Software can detect the presence of DAZ, modify the MXCSR register, and save and restore state information by following the techniques discussed in Section 11.6.3 through Section 11.6.6.
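The flag manipulation can be sketched as arithmetic on an MXCSR value. The sketch assumes the usual MXCSR layout (DAZ in bit 6, FTZ in bit 15) and that DAZ support is indicated by bit 6 of the MXCSR_MASK field of the FXSAVE image, with a zero mask meaning the default mask 0000FFBFH. The names `daz_supported` and `enable_ftz_daz` are hypothetical, and reading or writing the actual register is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define MXCSR_DAZ           (1u << 6)    /* denormals-are-zeros */
#define MXCSR_FTZ           (1u << 15)   /* flush-to-zero */
#define MXCSR_DEFAULT_MASK  0x0000FFBFu  /* used when MXCSR_MASK reads as 0 */

/* DAZ may be set only if the corresponding MXCSR_MASK bit is set. */
static int daz_supported(uint32_t mxcsr_mask) {
    if (mxcsr_mask == 0)
        mxcsr_mask = MXCSR_DEFAULT_MASK;
    return (mxcsr_mask & MXCSR_DAZ) != 0;
}

/* Returns the MXCSR value with FTZ set, and DAZ set when it is supported. */
static uint32_t enable_ftz_daz(uint32_t mxcsr, uint32_t mxcsr_mask) {
    mxcsr |= MXCSR_FTZ;
    if (daz_supported(mxcsr_mask))
        mxcsr |= MXCSR_DAZ;
    return mxcsr;
}
```

Starting from the power-on value 1F80H, the result is 9FC0H when DAZ is supported and 9F80H otherwise. The new value would then be written with LDMXCSR, or with a compiler intrinsic such as `_mm_setcsr` where available.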

12.4.4

Programming SSE3 with SSE/SSE2 Extensions

SIMD instructions in SSE3 extensions are intended to complement the use of SSE/SSE2 in programming SIMD applications. Application software that intends to use SSE3 instructions should also check for the availability of SSE/SSE2 instructions. The FISTTP instruction in SSE3 is intended to accelerate x87 style programming where performance is limited by frequent floating-point conversion to integers; this happens when the x87 FPU control word is modified frequently. Use of FISTTP can eliminate the need to access the x87 FPU control word.

12.5

OVERVIEW OF SSSE3 INSTRUCTIONS

SSSE3 provides 32 instructions to accelerate a variety of multimedia and signal processing applications employing SIMD integer data. See:

• Section 12.6, “SSSE3 Instructions,” provides an introduction to individual SSSE3 instructions.
• Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, provide detailed information on individual instructions.
• Chapter 12, “System Programming for Streaming SIMD Instruction Sets,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, gives guidelines for integrating SSE/SSE2/SSE3/SSSE3 extensions into an operating-system environment.


12.6

SSSE3 INSTRUCTIONS

SSSE3 instructions include:

• Twelve instructions that perform horizontal addition or subtraction operations.
• Six instructions that evaluate the absolute values.
• Two instructions that perform multiply and add operations and speed up the evaluation of dot products.
• Two instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
• Two instructions that perform a byte-wise, in-place shuffle according to the second shuffle control operand.
• Six instructions that negate packed integers in the destination operand if the corresponding element in the source operand is less than zero.
• Two instructions that align data from the composite of two operands.

The operands of these instructions are packed integers of byte, word, or double word sizes. The operands are stored as 64-bit or 128-bit data in MMX registers, XMM registers, or memory. The instructions are discussed in more detail in the following paragraphs.

12.6.1

Horizontal Addition/Subtraction

In analogy to the packed, floating-point horizontal add and subtract instructions in SSE3, SSSE3 offers similar capabilities on packed integer data. Data elements of signed words and doublewords are supported. Saturated versions of horizontal add and subtract on signed words are also supported. The horizontal data movement of PHADDD is shown in Figure 12-3.

[Figure 12-3: inputs X3, X2, X1, X0 and Y3, Y2, Y1, Y0; outputs, from high to low: Y2 + Y3, Y0 + Y1, X2 + X3, X0 + X1.]

Figure 12-3. Horizontal Data Movement in PHADDD

There are six horizontal add instructions (represented by three mnemonics); three operate on 128-bit operands and three operate on 64-bit operands. The width of each data element is either 16 bits or 32 bits. The mnemonics are listed below.

• PHADDW adds two adjacent, signed 16-bit integers horizontally from the source and destination operands and packs the signed 16-bit results to the destination operand.
• PHADDSW adds two adjacent, signed 16-bit integers horizontally from the source and destination operands and packs the signed, saturated 16-bit results to the destination operand.
• PHADDD adds two adjacent, signed 32-bit integers horizontally from the source and destination operands and packs the signed 32-bit results to the destination operand.

There are six horizontal subtract instructions (represented by three mnemonics); three operate on 128-bit operands and three operate on 64-bit operands. The width of each data element is either 16 bits or 32 bits. These are listed below.

• PHSUBW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the most significant word from the least significant word of each pair in the source and destination operands. The signed 16-bit results are packed and written to the destination operand.
• PHSUBSW performs horizontal subtraction on each adjacent pair of 16-bit signed integers by subtracting the most significant word from the least significant word of each pair in the source and destination operands. The signed, saturated 16-bit results are packed and written to the destination operand.
• PHSUBD performs horizontal subtraction on each adjacent pair of 32-bit signed integers by subtracting the most significant doubleword from the least significant doubleword of each pair in the source and destination operands. The signed 32-bit results are packed and written to the destination operand.
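The difference between the wrapping and saturating word forms can be seen in a scalar model of the 128-bit variants. This is a sketch with hypothetical names (`hop16`, `phaddw_model`, and so on); index 0 is the least-significant word, and results from the destination operand occupy the low half of the output.

```c
#include <assert.h>
#include <stdint.h>

/* Keep the low 16 bits (wraparound). The final conversion to int16_t is
   implementation-defined in C; common compilers wrap as intended. */
static int16_t wrap16(int32_t v) { return (int16_t)(uint16_t)v; }

/* Clamp to the signed 16-bit range (saturation). */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Shared body: sign = +1 adds each pair, sign = -1 subtracts the upper
   word of each pair from the lower word. */
static void hop16(int16_t dst[8], const int16_t src[8], int sign, int saturate) {
    int16_t r[8];
    assert(sign == 1 || sign == -1);
    for (int i = 0; i < 4; i++) {
        int32_t d = dst[2 * i] + sign * dst[2 * i + 1];
        int32_t s = src[2 * i] + sign * src[2 * i + 1];
        r[i]     = saturate ? sat16(d) : wrap16(d);
        r[i + 4] = saturate ? sat16(s) : wrap16(s);
    }
    for (int i = 0; i < 8; i++)
        dst[i] = r[i];
}

static void phaddw_model(int16_t d[8], const int16_t s[8])  { hop16(d, s, 1, 0); }
static void phaddsw_model(int16_t d[8], const int16_t s[8]) { hop16(d, s, 1, 1); }
static void phsubw_model(int16_t d[8], const int16_t s[8])  { hop16(d, s, -1, 0); }
static void phsubsw_model(int16_t d[8], const int16_t s[8]) { hop16(d, s, -1, 1); }
```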

12.6.2

Packed Absolute Values

There are six packed-absolute-value instructions (represented by three mnemonics). Three operate on 128-bit operands and three operate on 64-bit operands. The widths of data elements are 8 bits, 16 bits or 32 bits. The absolute value of each data element of the source operand is stored as an UNSIGNED result in the destination operand.

• PABSB computes the absolute value of each signed byte data element.
• PABSW computes the absolute value of each signed 16-bit data element.
• PABSD computes the absolute value of each signed 32-bit data element.
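Storing the result as an unsigned value matters for the most negative input, as a scalar model of the byte form shows. This is a minimal sketch; `pabsb_element` and `pabsb_model` are hypothetical names, and the 128-bit form with 16 elements is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Absolute value of one signed byte, returned as an unsigned byte.
   The input -128 yields 128 (80H), which has no signed 8-bit representation. */
static uint8_t pabsb_element(int8_t v) {
    int wide = v;                       /* widen before negating */
    return (uint8_t)(wide < 0 ? -wide : wide);
}

/* 128-bit form: 16 byte elements. */
static void pabsb_model(uint8_t dst[16], const int8_t src[16]) {
    assert(dst != 0 && src != 0);
    for (int i = 0; i < 16; i++)
        dst[i] = pabsb_element(src[i]);
}
```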


12.6.3

Multiply and Add Packed Signed and Unsigned Bytes

There are two multiply-and-add-packed-signed-unsigned-byte instructions (represented by one mnemonic). One operates on 128-bit operands and the other operates on 64-bit operands. Multiplications are performed on each vertical pair of data elements. The data elements in the source operand are signed byte values; the input data elements of the destination operand are unsigned byte values.

• PMADDUBSW multiplies each unsigned byte value with the corresponding signed byte value to produce an intermediate, 16-bit signed integer. Each adjacent pair of 16-bit signed values is added horizontally. The signed, saturated 16-bit results are packed to the destination operand.
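A scalar model makes the unsigned-by-signed multiply and the saturating pair sum concrete. This sketch models the 128-bit form with hypothetical names (`sat16`, `pmaddubsw_model`); the results are written to a separate array because the element width changes from bytes to words.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the signed 16-bit range. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* dst_in holds the unsigned bytes of the destination operand, src the signed
   bytes of the source operand. Each output word is the saturated sum of two
   adjacent byte products. */
static void pmaddubsw_model(int16_t out[8], const uint8_t dst_in[16],
                            const int8_t src[16]) {
    assert(out != 0 && dst_in != 0 && src != 0);
    for (int i = 0; i < 8; i++) {
        int32_t p0 = (int32_t)dst_in[2 * i]     * (int32_t)src[2 * i];
        int32_t p1 = (int32_t)dst_in[2 * i + 1] * (int32_t)src[2 * i + 1];
        out[i] = sat16(p0 + p1);
    }
}
```

Saturation is reachable in practice: two products of 255 × 127 sum to 64770, which clamps to 32767.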

12.6.4

Packed Multiply High with Round and Scale

There are two packed-multiply-high-with-round-and-scale instructions (represented by one mnemonic). One operates on 128-bit operands and the other operates on 64-bit operands. Multiplications are performed on each vertical pair of 16-bit data elements. The data elements in both the source operand and the destination operand are signed integers.

• PMULHRSW multiplies vertically each signed 16-bit integer from the destination operand with the corresponding signed 16-bit integer of the source operand, producing intermediate, signed 32-bit integers. Each intermediate 32-bit integer is truncated to the 18 most significant bits. Rounding is always performed by adding 1 to the least significant bit of the 18-bit intermediate result. The final result is obtained by selecting the 16 bits immediately to the right of the most significant bit of each 18-bit intermediate result and packed to the destination operand.
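The truncate, round, and select steps reduce to two shifts and an add per element. The following is a sketch with the hypothetical name `pmulhrsw_element`; it assumes that right-shifting a negative signed value is an arithmetic shift, which common compilers provide but which C leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* One element of PMULHRSW: result = (((a * b) >> 14) + 1) >> 1.
   With both inputs read as Q15 fixed-point fractions, this is the rounded
   Q15 product. */
static int16_t pmulhrsw_element(int16_t a, int16_t b) {
    int32_t product = (int32_t)a * (int32_t)b;   /* signed 32-bit product  */
    int32_t top18   = product >> 14;             /* 18 most significant bits */
    int32_t rounded = (top18 + 1) >> 1;          /* add 1 at the LSB, drop it */
    return (int16_t)(uint16_t)rounded;           /* keep the low 16 bits */
}
```

For example, 16384 (0.5 in Q15) times 16384 gives 8192 (0.25 in Q15), and 3 times 16384 gives 2, the rounded value of 1.5.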

12.6.5

Packed Shuffle Bytes

There are two packed-shuffle-bytes instructions (represented by one mnemonic). One operates on 128-bit operands and the other operates on 64-bit operands. The shuffle operations are performed bytewise on the destination operand using the source operand as a control mask.

• PSHUFB permutes each byte in place, according to a shuffle control mask. The least significant three or four bits of each shuffle control byte of the control mask form the shuffle index (three bits for 64-bit operands, four bits for 128-bit operands). The shuffle mask is unaffected. If the most significant bit (bit 7) of a shuffle control byte is set, the constant zero is written in the result byte.
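The index and zeroing rules can be written as a short scalar model. This is a sketch with the hypothetical name `pshufb_model`; `width` is 8 for the 64-bit form and 16 for the 128-bit form.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* dst is shuffled in place; ctrl is the shuffle control mask (unchanged). */
static void pshufb_model(uint8_t *dst, const uint8_t *ctrl, int width) {
    uint8_t result[16];
    assert(width == 8 || width == 16);
    for (int i = 0; i < width; i++) {
        if (ctrl[i] & 0x80)
            result[i] = 0;                            /* bit 7 set: write zero */
        else
            result[i] = dst[ctrl[i] & (width - 1)];   /* low 3 or 4 bits index */
    }
    memcpy(dst, result, (size_t)width);
}
```

A control mask of 15, 14, ..., 0 reverses the bytes of a 128-bit operand, and setting bit 7 in any control byte clears the corresponding result byte.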


12.6.6

Packed Sign

There are six packed-sign instructions (represented by three mnemonics). Three operate on 128-bit operands and three operate on 64-bit operands. The data elements for these instructions are 8-bit, 16-bit or 32-bit signed integers.

• PSIGNB/W/D negates each signed integer element of the destination operand if the corresponding data element in the source operand is less than zero. If the source element is zero, the corresponding destination element is set to zero; if the source element is greater than zero, the destination element is unchanged.
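The per-element rule for the word form can be sketched in scalar C. The name `psignw_element` is hypothetical. Negation is done in unsigned arithmetic so that the most negative input wraps instead of overflowing; the model assumes that case leaves the value unchanged, as two's-complement negation does.

```c
#include <assert.h>
#include <stdint.h>

/* One element of PSIGNW: d is the destination element, s the source element. */
static int16_t psignw_element(int16_t d, int16_t s) {
    if (s < 0)   /* negate with 16-bit wraparound */
        return (int16_t)(uint16_t)(0u - (unsigned)(uint16_t)d);
    if (s == 0)  /* a zero source element clears the destination element */
        return 0;
    return d;    /* a positive source element leaves it unchanged */
}
```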

12.6.7

Packed Align Right

There are two packed-align-right instructions (represented by one mnemonic). One operates on 128-bit operands and the other operates on 64-bit operands. These instructions concatenate the destination and source operand into a composite, and extract the result from the composite according to an immediate constant.

• PALIGNR’s source operand is appended after the destination operand forming an intermediate value of twice the width of an operand. The result is extracted from the intermediate value into the destination operand by selecting the 128-bit or 64-bit value that is right-aligned to the byte offset specified by the immediate value.
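The concatenate-and-extract step can be modeled with a byte array. This is a sketch with the hypothetical name `palignr_model`; `width` is 8 or 16 bytes, index 0 is the least-significant byte, and offsets that reach past the composite are assumed to read as zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Composite = destination (high half) followed by source (low half).
   The result is the width-byte window starting imm bytes above the low end. */
static void palignr_model(uint8_t *dst, const uint8_t *src, int width, unsigned imm) {
    uint8_t composite[32];
    assert(width == 8 || width == 16);
    memcpy(composite, src, (size_t)width);           /* low half: source       */
    memcpy(composite + width, dst, (size_t)width);   /* high half: destination */
    for (int i = 0; i < width; i++) {
        unsigned k = imm + (unsigned)i;
        dst[i] = (k < 2u * (unsigned)width) ? composite[k] : 0;
    }
}
```

With the source holding bytes 0..15 and the destination holding 16..31, an immediate of 4 yields bytes 4..19, which is how a misaligned 16-byte window can be assembled from two aligned loads.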

12.7

WRITING APPLICATIONS WITH SSSE3 EXTENSIONS

The following sections give guidelines for writing application programs and operating-system code that use SSSE3 instructions.

12.7.1

Guidelines for Using SSSE3 Extensions

The following guidelines describe how to maximize the benefits of using SSSE3 extensions:

• Ensure that the processor supports SSSE3 extensions.
• Ensure that your operating system supports SSE/SSE2/SSE3/SSSE3 extensions. (Operating system support for the SSE extensions implies support for SSE2, the x87 and SIMD instructions of SSE3, and SSSE3.)
• Employ the optimization and scheduling techniques described in the Intel® 64 and IA-32 Architectures Optimization Reference Manual (see Section 1.4, “Related Literature”).


12.7.2

Checking for SSSE3 Support

Before an application attempts to use the SIMD subset of SSSE3 extensions, the application should follow the steps illustrated in Section 11.6.2, “Checking for SSE/SSE2 Support.” Next, use the additional step provided below:

• Check that the processor supports the SSSE3 extensions (if CPUID.01H:ECX.SSSE3[bit 9] = 1). See Example 12-3 for a code example.

Example 12-3. Verifying SSSE3 Support

    boolean SSSE3_SIMD_works = TRUE;
    try {
        Issue_SSSE3_SIMD_Instructions(); // Use PHADDD
    } except (UNWIND) {
        // if we get here, SSSE3 not available
        SSSE3_SIMD_works = FALSE;
    }

12.8

SSE3/SSSE3 EXCEPTIONS

SSE3/SSSE3 instructions can generate the same type of memory-access and non-numeric exceptions as other Intel 64 or IA-32 instructions. Existing exception handlers generally handle these exceptions without code modification. FISTTP can generate floating-point exceptions. Some SSE3 instructions can also generate SIMD floating-point exceptions. SSE3 additions and changes are noted in the following sections. See also: Section 11.5, “SSE, SSE2, and SSE3 Exceptions”.

12.8.1

Device Not Available (DNA) Exceptions

SSE3/SSSE3 will cause a DNA Exception (#NM) if the processor attempts to execute an SSE3 instruction while CR0.TS[bit 3] = 1. If CPUID.01H:ECX.SSE3[bit 0] = 0, execution of an SSE3 extension will cause an invalid opcode fault regardless of the state of CR0.TS[bit 3].


12.8.2

Numeric Error flag and IGNNE#

Most SSE3 instructions ignore CR0.NE[bit 5] (treating it as if it were always set) and the IGNNE# pin. With one exception, all use the vector 19 software exception for error reporting. The exception is FISTTP; it behaves like other x87-FP instructions. SSSE3 instructions ignore CR0.NE[bit 5] (treating it as if it were always set) and the IGNNE# pin. SSSE3 instructions do not cause floating-point errors.

12.8.3

Emulation

CR0.EM[bit 2], which is used to emulate x87 floating-point instructions, cannot be used for emulation of SSE3/SSSE3. If an SSE3/SSSE3 instruction executes with CR0.EM[bit 2] set, an invalid opcode exception (INT 6) is generated instead of a device not available exception (INT 7).


CHAPTER 13
INPUT/OUTPUT

In addition to transferring data to and from external memory, IA-32 processors can also transfer data to and from input/output ports (I/O ports). I/O ports are created in system hardware by circuitry that decodes the control, data, and address pins on the processor. These I/O ports are then configured to communicate with peripheral devices. An I/O port can be an input port, an output port, or a bidirectional port. Some I/O ports are used for transmitting data, such as to and from the transmit and receive registers, respectively, of a serial interface device. Other I/O ports are used to control peripheral devices, such as the control registers of a disk controller.

This chapter describes the processor’s I/O architecture. The topics discussed include:

• I/O port addressing
• I/O instructions
• I/O protection mechanism

13.1

I/O PORT ADDRESSING

The processor permits applications to access I/O ports in either of two ways:

• Through a separate I/O address space
• Through memory-mapped I/O

Accessing I/O ports through the I/O address space is handled through a set of I/O instructions and a special I/O protection mechanism. Accessing I/O ports through memory-mapped I/O is handled with the processor’s general-purpose move and string instructions, with protection provided through segmentation or paging. I/O ports can be mapped so that they appear in the I/O address space or the physical-memory address space (memory-mapped I/O) or both.

One benefit of using the I/O address space is that writes to I/O ports are guaranteed to be completed before the next instruction in the instruction stream is executed. Thus, I/O writes to control system hardware cause the hardware to be set to its new state before any other instructions are executed. See Section 13.6, “Ordering I/O,” for more information on serializing of I/O operations.

13.2

I/O PORT HARDWARE

From a hardware point of view, I/O addressing is handled through the processor’s address lines. For the P6 family, Pentium 4, and Intel Xeon processors, the request command lines signal whether the address lines are being driven with a memory address or an I/O address; for Pentium processors and earlier IA-32 processors, the M/IO# pin indicates a memory address (1) or an I/O address (0). When the separate I/O address space is selected, it is the responsibility of the hardware to decode the memory-I/O bus transaction to select I/O ports rather than memory. Data is transmitted between the processor and an I/O device through the data lines.

13.3

I/O ADDRESS SPACE

The processor’s I/O address space is separate and distinct from the physical-memory address space. The I/O address space consists of 2^16 (64K) individually addressable 8-bit I/O ports, numbered 0 through FFFFH. I/O port addresses 0F8H through 0FFH are reserved. Do not assign I/O ports to these addresses. The result of an attempt to address beyond the I/O address space limit of FFFFH is implementation-specific; see the Developer’s Manuals for specific processors for more details.

Any two consecutive 8-bit ports can be treated as a 16-bit port, and any four consecutive ports can be a 32-bit port. In this manner, the processor can transfer 8, 16, or 32 bits to or from a device in the I/O address space. Like words in memory, 16-bit ports should be aligned to even addresses (0, 2, 4, ...) so that all 16 bits can be transferred in a single bus cycle. Likewise, 32-bit ports should be aligned to addresses that are multiples of four (0, 4, 8, ...). The processor supports data transfers to unaligned ports, but there is a performance penalty because one or more extra bus cycles must be used.

The exact order of bus cycles used to access unaligned ports is undefined and is not guaranteed to remain the same in future IA-32 processors. If hardware or software requires that I/O ports be written to in a particular order, that order must be specified explicitly. For example, to load a word-length I/O port at address 2H and then another word port at 4H, two word-length writes must be used, rather than a single doubleword write at 2H.

Note that the processor does not mask parity errors for bus cycles to the I/O address space. Accessing I/O ports through the I/O address space is thus a possible source of parity errors.
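The alignment, range, and reserved-address rules above are simple arithmetic on the port number. The following C sketch illustrates them under stated assumptions: the function names are hypothetical, and width_bytes is assumed to be 1, 2, or 4 (an 8-, 16-, or 32-bit port). It only classifies an access; it performs no I/O.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the port is aligned for its width: even addresses for
   16-bit ports, multiples of four for 32-bit ports. */
bool io_port_is_aligned(uint16_t port, unsigned width_bytes)
{
    return (port % width_bytes) == 0;
}

/* True if any byte of the access would lie beyond port FFFFH,
   where the result is implementation-specific. */
bool io_access_exceeds_space(uint16_t port, unsigned width_bytes)
{
    return (uint32_t)port + width_bytes - 1u > 0xFFFFu;
}

/* True if any byte of the access falls in the reserved range
   0F8H through 0FFH. */
bool io_access_touches_reserved(uint16_t port, unsigned width_bytes)
{
    uint32_t first = port;
    uint32_t last = first + width_bytes - 1u;
    return first <= 0xFFu && last >= 0xF8u;
}
```

For instance, a 32-bit port at 2H fails the alignment test, which matches the advice to use two word-length writes when ports at 2H and 4H must be written in a defined order.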

13.3.1

Memory-Mapped I/O

I/O devices that respond like memory components can be accessed through the processor’s physical-memory address space (see Figure 13-1). When using memory-mapped I/O, any of the processor’s instructions that reference memory can be used to access an I/O port located at a physical-memory address. For example, the MOV instruction can transfer data between any register and a memory-mapped I/O port. The AND, OR, and TEST instructions may be used to manipulate bits in the control and status registers of a memory-mapped peripheral device.

When using memory-mapped I/O, caching of the address space mapped for I/O operations must be prevented. With the Pentium 4, Intel Xeon, and P6 family processors, caching of I/O accesses can be prevented by using memory type range registers (MTRRs) to map the address space used for the memory-mapped I/O as uncacheable (UC). See Chapter 10, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a complete discussion of the MTRRs.

The Pentium and Intel486 processors do not support MTRRs. Instead, they provide the KEN# pin, which when held inactive (high) prevents caching of all addresses sent out on the system bus. To use this pin, external address decoding logic is required to block caching in specific address spaces.

(Figure: physical memory from address 0 to FFFF FFFFH, with EPROM at the top, I/O ports in the middle, and RAM at the bottom.)

Figure 13-1. Memory-Mapped I/O

All the IA-32 processors that have on-chip caches also provide the PCD (page-level cache disable) flag in page table and page directory entries. This flag allows caching to be disabled on a page-by-page basis. See “Page-Directory and Page-Table Entries” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A.
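In a high-level language, the AND, OR, and TEST manipulation of a memory-mapped control or status register described in this section is usually written through a volatile pointer. The C sketch below is a minimal illustration with hypothetical helper names. It assumes a 32-bit register, and it says nothing about caching: the volatile qualifier only keeps the compiler from eliding or merging the accesses, while the MTRR, KEN#, or PCD settings described above remain the job of system software.

```c
#include <stdint.h>

/* Sets the bits in mask (read, OR, write back). */
void mmio_set_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg |= mask;
}

/* Clears the bits in mask (read, AND with the complement, write back). */
void mmio_clear_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg &= ~mask;
}

/* Returns 1 if any bit in mask is set, 0 otherwise (read only). */
int mmio_test_bits(const volatile uint32_t *reg, uint32_t mask)
{
    return (*reg & mask) != 0;
}
```

On real hardware the pointer would refer to the physical address at which the device is mapped; for experimentation an ordinary variable can stand in for the register.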

13.4

I/O INSTRUCTIONS

The processor’s I/O instructions provide access to I/O ports through the I/O address space. (These instructions cannot be used to access memory-mapped I/O ports.) There are two groups of I/O instructions:

• Those that transfer a single item (byte, word, or doubleword) between an I/O port and a general-purpose register
• Those that transfer strings of items (strings of bytes, words, or doublewords) between an I/O port and memory

The register I/O instructions IN (input from I/O port) and OUT (output to I/O port) move data between I/O ports and the EAX register (32-bit I/O), the AX register (16-bit I/O), or the AL (8-bit I/O) register. The address of the I/O port can be given with an immediate value or a value in the DX register.

The string I/O instructions INS (input string from I/O port) and OUTS (output string to I/O port) move data between an I/O port and a memory location. The address of the I/O port being accessed is given in the DX register; the source or destination memory address is given in the DS:ESI or ES:EDI register, respectively.

When used with one of the repeat prefixes (such as REP), the INS and OUTS instructions perform string (or block) input or output operations. The repeat prefix REP modifies the INS and OUTS instructions to transfer blocks of data between an I/O port and memory. Here, the ESI or EDI register is incremented or decremented (according to the setting of the DF flag in the EFLAGS register) after each byte, word, or doubleword is transferred between the selected I/O port and memory.

See the references for IN, INS, OUT, and OUTS in Chapter 3 and Chapter 4 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volumes 2A & 2B, for more information on these instructions.
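The effect of the REP prefix and the DF flag on a string input can be shown with a software model. The C sketch below is not how the instruction is implemented and executes no I/O instructions. It assumes a byte-sized transfer (INSB), represents EDI as an index into a buffer, and uses a hypothetical callback, port_in8_fn, together with a hypothetical demo device in place of real hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware: returns one byte read from a port. */
typedef uint8_t (*port_in8_fn)(uint16_t port);

/* Models REP INSB: reads count bytes from port dx into mem[], starting at
   index edi and stepping by +1 when df is 0 or by -1 when df is nonzero.
   Returns the final index, mirroring the updated EDI register. */
size_t rep_insb_model(uint8_t *mem, size_t edi, size_t count,
                      int df, uint16_t dx, port_in8_fn in8)
{
    while (count-- > 0) {
        mem[edi] = in8(dx);
        edi = df ? edi - 1 : edi + 1;
    }
    return edi;
}

/* Hypothetical demo device: successive reads return 0, 1, 2, ... */
uint8_t demo_counter_port(uint16_t port)
{
    static uint8_t next;
    (void)port;
    return next++;
}
```

With DF clear the buffer fills toward higher addresses; with DF set the same call fills it toward lower addresses, so the starting index must then be the last element to be written.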

13.5

PROTECTED-MODE I/O

When the processor is running in protected mode, the following protection mechanisms regulate access to I/O ports:

• When accessing I/O ports through the I/O address space, two protection devices control access:
  - The I/O privilege level (IOPL) field in the EFLAGS register
  - The I/O permission bit map of a task state segment (TSS)
• When accessing memory-mapped I/O ports, the normal segmentation and paging protection and the MTRRs (in processors that support them) also affect access to I/O ports. See Chapter 4, “Protection,” and Chapter 10, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for a complete discussion of memory protection.

The following sections describe the protection mechanisms available when accessing I/O ports in the I/O address space with the I/O instructions.

13.5.1

I/O Privilege Level

In systems where I/O protection is used, the IOPL field in the EFLAGS register controls access to the I/O address space by restricting use of selected instructions. This protection mechanism permits the operating system or executive to set the privilege level needed to perform I/O. In a typical protection ring model, access to the I/O address space is restricted to privilege levels 0 and 1. Here, the kernel and the device drivers are allowed to perform I/O, while less privileged device drivers and application programs are denied access to the I/O address space. Application programs must then make calls to the operating system to perform I/O.

The following instructions can be executed only if the current privilege level (CPL) of the program or task currently executing is less than or equal to the IOPL: IN, INS, OUT, OUTS, CLI (clear interrupt-enable flag), and STI (set interrupt-enable flag). These instructions are called I/O sensitive instructions, because they are sensitive to the IOPL field. Any attempt by a less privileged program or task to use an I/O sensitive instruction results in a general-protection exception (#GP) being signaled. Because each task has its own copy of the EFLAGS register, each task can have a different IOPL.

The I/O permission bit map in the TSS can be used to modify the effect of the IOPL on I/O sensitive instructions, allowing access to some I/O ports by less privileged programs or tasks (see Section 13.5.2, “I/O Permission Bit Map”).

A program or task can change its IOPL only with the POPF and IRET instructions; however, such changes are privileged. No procedure may change the current IOPL unless it is running at privilege level 0. An attempt by a less privileged procedure to change the IOPL does not result in an exception; the IOPL simply remains unchanged.

The POPF instruction also may be used to change the state of the IF flag (as can the CLI and STI instructions); however, the POPF instruction in this case is also I/O sensitive. A procedure may use the POPF instruction to change the setting of the IF flag only if the CPL is less than or equal to the current IOPL. An attempt by a less privileged procedure to change the IF flag does not result in an exception; the IF flag simply remains unchanged.
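The IOPL rules in this section can be expressed as a few bit operations on an EFLAGS image. The C sketch below is a simplified model with hypothetical function names. It assumes the standard EFLAGS layout (IF in bit 9, IOPL in bits 12 and 13), and popf_model deliberately models only the IF and IOPL fields, leaving every other flag untouched, which is not what the real POPF instruction does.

```c
#include <stdbool.h>
#include <stdint.h>

#define EFLAGS_IF         (1u << 9)   /* interrupt-enable flag */
#define EFLAGS_IOPL_SHIFT 12          /* IOPL occupies bits 12 and 13 */
#define EFLAGS_IOPL_MASK  (3u << EFLAGS_IOPL_SHIFT)

/* Extracts the IOPL field (0 through 3) from an EFLAGS image. */
unsigned eflags_iopl(uint32_t eflags)
{
    return (eflags & EFLAGS_IOPL_MASK) >> EFLAGS_IOPL_SHIFT;
}

/* I/O sensitive instructions are permitted when CPL <= IOPL. */
bool io_sensitive_allowed(unsigned cpl, uint32_t eflags)
{
    return cpl <= eflags_iopl(eflags);
}

/* Models how POPF treats IF and IOPL only: changes that are not
   permitted are silently dropped, with no exception. */
uint32_t popf_model(uint32_t current, uint32_t popped, unsigned cpl)
{
    uint32_t writable = 0;
    if (cpl == 0) {
        writable |= EFLAGS_IOPL_MASK;  /* only privilege level 0 may change IOPL */
    }
    if (cpl <= eflags_iopl(current)) {
        writable |= EFLAGS_IF;         /* IF may change only when CPL <= IOPL */
    }
    return (current & ~writable) | (popped & writable);
}
```

The silent-drop behavior is the point of the model: a privilege level 3 caller with IOPL 0 gets back the same IF and IOPL values it started with, whatever image it pops.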

13.5.2

I/O Permission Bit Map

The I/O permission bit map is a device for permitting limited access to I/O ports by less privileged programs or tasks and for tasks operating in virtual-8086 mode. The I/O permission bit map is located in the TSS (see Figure 13-2) for the currently running task or program. The address of the first byte of the I/O permission bit map is given in the I/O map base address field of the TSS. The size of the I/O permission bit map and its location in the TSS are variable.


(Figure: the Task State Segment (TSS), drawn 32 bits wide. The I/O Map Base field, shown at offset 64H, points to the I/O permission bit map within the TSS. The last byte of the bit map must be followed by a byte with all bits set. The I/O map base must not exceed DFFFH.)

Figure 13-2. I/O Permission Bit Map

Because each task has its own TSS, each task has its own I/O permission bit map. Access to individual I/O ports can thus be granted to individual tasks.

If in protected mode and the CPL is less than or equal to the current IOPL, the processor allows all I/O operations to proceed. If the CPL is greater than the IOPL or if the processor is operating in virtual-8086 mode, the processor checks the I/O permission bit map to determine if access to a particular I/O port is allowed. Each bit in the map corresponds to an I/O port byte address. For example, the control bit for I/O port address 29H in the I/O address space is found at bit position 1 of the sixth byte in the bit map. Before granting I/O access, the processor tests all the bits corresponding to the I/O port being addressed. For a doubleword access, for example, the processor tests the four bits corresponding to the four adjacent 8-bit port addresses. If any tested bit is set, a general-protection exception (#GP) is signaled. If all tested bits are clear, the I/O operation is allowed to proceed.

Because I/O port addresses are not necessarily aligned to word and doubleword boundaries, the processor reads two bytes from the I/O permission bit map for every access to an I/O port. To prevent exceptions from being generated when the ports with the highest addresses are accessed, an extra byte needs to be included in the TSS immediately after the table. This byte must have all of its bits set, and it must be within the segment limit.

It is not necessary for the I/O permission bit map to represent all the I/O addresses. I/O addresses not spanned by the map are treated as if they had set bits in the map. For example, if the TSS segment limit is 10 bytes past the bit-map base address, the map has 11 bytes and the first 80 I/O ports are mapped. Higher addresses in the I/O address space generate exceptions.
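The bit map lookup described above can be sketched in C as follows. This models only the check made when the CPL is greater than the IOPL or the processor is in virtual-8086 mode. The function name and parameters are hypothetical, and map_len is assumed to count the bit map bytes without the trailing all-ones byte, so that ports at or beyond map_len * 8 are treated as denied.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* map     : bytes of the I/O permission bit map (terminator byte excluded)
   map_len : number of bytes in map
   port    : first port address of the access
   width   : access size in bytes (1, 2, or 4)
   Returns true if the access may proceed, false where #GP would be signaled. */
bool io_bitmap_permits(const uint8_t *map, size_t map_len,
                       uint16_t port, unsigned width)
{
    for (unsigned i = 0; i < width; i++) {
        uint32_t p = (uint32_t)port + i;  /* each adjacent 8-bit port */
        size_t byte = p / 8;              /* byte within the bit map */
        unsigned bit = p % 8;             /* bit within that byte */
        if (byte >= map_len) {
            return false;                 /* not spanned by the map */
        }
        if (map[byte] & (1u << bit)) {
            return false;                 /* a set bit denies access */
        }
    }
    return true;
}
```

For port 29H the computation gives byte index 5 (the sixth byte) and bit 1, matching the example in the text. With a 10-byte map, ports 0 through 79 are covered and port 80 is denied.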


If the I/O bit map base address is greater than or equal to the TSS segment limit, there is no I/O permission map, and all I/O instructions generate exceptions when the CPL is greater than the current IOPL.

13.6

ORDERING I/O

When controlling I/O devices it is often important that memory and I/O operations be carried out in precisely the order programmed. For example, a program may write a command to an I/O port, then read the status of the I/O device from another I/O port. It is important that the status returned be the status of the device after it receives the command, not before.

When using memory-mapped I/O, caution should be taken to avoid situations in which the programmed order is not preserved by the processor. To optimize performance, the processor allows cacheable memory reads to be reordered ahead of buffered writes in most situations. Internally, processor reads (cache hits) can be reordered around buffered writes. When using memory-mapped I/O, therefore, it is possible that an I/O read might be performed before the memory write of a previous instruction. The recommended method of enforcing program ordering of memory-mapped I/O accesses with the Pentium 4, Intel Xeon, and P6 family processors is to use the MTRRs to make the memory-mapped I/O address space uncacheable; for the Pentium and Intel486 processors, either the KEN# pin or the PCD flags can be used for this purpose (see Section 13.3.1, “Memory-Mapped I/O”).

When the target of a read or write is in an uncacheable region of memory, memory reordering does not occur externally at the processor’s pins (that is, reads and writes appear in-order). Designating a memory-mapped I/O region of the address space as uncacheable ensures that reads and writes of I/O devices are carried out in program order. See Chapter 10, “Memory Cache Control,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information on using MTRRs.

Another method of enforcing program order is to insert one of the serializing instructions, such as the CPUID instruction, between operations. See Chapter 7, “Multiple-Processor Management,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for more information on serialization of instructions.

It should be noted that the chip set being used to support the processor (bus controller, memory controller, and/or I/O controller) may post writes to uncacheable memory, which can lead to out-of-order execution of memory accesses. In situations where out-of-order processing of memory accesses by the chip set can potentially cause faulty memory-mapped I/O processing, code must be written to force synchronization and ordering of I/O operations. Serializing instructions can often be used for this purpose.


When the I/O address space is used instead of memory-mapped I/O, the situation is different in two respects:

• The processor never buffers I/O writes. Therefore, strict ordering of I/O operations is enforced by the processor. (As with memory-mapped I/O, it is possible for a chip set to post writes in certain I/O ranges.)
• The processor synchronizes I/O instruction execution with external bus activity (see Table 13-1).

Table 13-1. I/O Instruction Serialization

                     Processor Delays Execution of ...     Until Completion of ...
Instruction Being    Current           Next                Pending        Current
Executed             Instruction?      Instruction?        Stores?        Store?

IN                   Yes                                   Yes
INS                  Yes                                   Yes
REP INS              Yes                                   Yes
OUT                                    Yes                 Yes            Yes
OUTS                                   Yes                 Yes            Yes
REP OUTS                               Yes                 Yes            Yes


CHAPTER 14
PROCESSOR IDENTIFICATION AND FEATURE DETERMINATION

When writing software intended to run on IA-32 processors, it is necessary to identify the type of processor present in a system and the processor features that are available to an application.

14.1

USING THE CPUID INSTRUCTION

Use the CPUID instruction for processor identification in the Pentium M processor family, Pentium 4 processor family, Intel Xeon processor family, P6 family, Pentium processor, and later Intel486 processors. This instruction returns the family, model and (for some processors) a brand string for the processor that executes the instruction. It also indicates the features that are present in the processor and gives information about the processor’s caches and TLB.

The ID flag (bit 21) in the EFLAGS register indicates support for the CPUID instruction. If a software procedure can set and clear this flag, the processor executing the procedure supports the CPUID instruction. The CPUID instruction will cause the invalid opcode exception (#UD) if executed on a processor that does not support it.

To obtain processor identification information, a source operand value is placed in the EAX register to select the type of information to be returned. When the CPUID instruction is executed, selected information is returned in the EAX, EBX, ECX, and EDX registers. For a complete description of the CPUID instruction, tables indicating values returned, and example code, see “CPUID—CPUID Identification” in Chapter 3 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.

14.1.1

Notes on Where to Start

For detailed application notes on the instruction, see AP-485, Intel Processor Identification and the CPUID Instruction (Order Number 241618). This publication provides additional information and example source code for use in identifying IA-32 processors. It also contains guidelines for using the CPUID instruction to help maintain the widest range of software compatibility. The following guidelines are among the most important, and should always be followed when using the CPUID instruction to determine available features:

• Always begin by testing for the “GenuineIntel” message in the EBX, EDX, and ECX registers when the CPUID instruction is executed with EAX equal to 0. If the processor is not genuine Intel, the feature identification flags may have different meanings than are described in Intel documentation.
• Test feature identification flags individually and do not make assumptions about undefined bits.
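The first guideline can be sketched in C as follows. The example assumes that the caller has already executed CPUID with EAX equal to 0 and saved EBX, EDX, and ECX; the function names are hypothetical. Each register is taken to hold four ASCII characters with the first character in the least-significant byte, so the string is assembled in the order EBX, EDX, ECX.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Copies the four characters held in reg to dst, least-significant byte first. */
static void unpack_register(uint32_t reg, char *dst)
{
    for (int i = 0; i < 4; i++) {
        dst[i] = (char)((reg >> (8 * i)) & 0xFFu);
    }
}

/* Builds the 12-character vendor string plus terminator from the
   register values returned by CPUID with EAX = 0. */
void cpuid_vendor_string(uint32_t ebx, uint32_t edx, uint32_t ecx, char out[13])
{
    unpack_register(ebx, out);
    unpack_register(edx, out + 4);
    unpack_register(ecx, out + 8);
    out[12] = '\0';
}

/* True if the three registers spell "GenuineIntel". */
bool cpuid_is_genuine_intel(uint32_t ebx, uint32_t edx, uint32_t ecx)
{
    char vendor[13];
    cpuid_vendor_string(ebx, edx, ecx, vendor);
    return strcmp(vendor, "GenuineIntel") == 0;
}
```

Extracting the bytes with shifts rather than copying the registers directly keeps the sketch independent of the byte order of the machine on which it is compiled. Individual feature flags would then be tested one bit at a time, in line with the second guideline.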

14.1.2

Identification of Earlier IA-32 Processors

The CPUID instruction is not available in earlier IA-32 processors up through the earlier Intel486 processors. For these processors, several other architectural features can be exploited to identify the processor.

The settings of bits 12 and 13 (IOPL), 14 (NT), and 15 (reserved) in the EFLAGS register are different for Intel’s 32-bit processors than for the Intel 8086 and Intel 286 processors. By examining the settings of these bits (with the PUSHF/PUSHFD and POPF/POPFD instructions), an application program can determine whether the processor is an 8086, Intel 286, or one of the Intel 32-bit processors:

• 8086 processor: Bits 12 through 15 of the EFLAGS register are always set.
• Intel 286 processor: Bits 12 through 15 are always clear in real-address mode.
• 32-bit processors: In real-address mode, bit 15 is always clear and bits 12 through 14 have the last value loaded into them. In protected mode, bit 15 is always clear, bit 14 has the last value loaded into it, and the IOPL bits depend on the current privilege level (CPL). The IOPL field can be changed only if the CPL is 0.

Other EFLAGS register bits that can be used to differentiate between the 32-bit processors:

• Bit 18 (AC): Implemented only on the Pentium 4, Intel Xeon, P6 family, Pentium, and Intel486 processors. The inability to set or clear this bit distinguishes an Intel386 processor from the later IA-32 processors.
• Bit 21 (ID): Determines if the processor is able to execute the CPUID instruction. The ability to set and clear this bit indicates that it is a Pentium 4, Intel Xeon, P6 family, Pentium, or later-version Intel486 processor.
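The flag-based identification steps above can be summarized as a classification over probe results. The C sketch below does not execute PUSHF or POPF itself. It assumes the caller has run the probes in real-address mode and passes in what was read back; all names are hypothetical, and the ordering of the tests is one reasonable reading of the rules in this section rather than a prescribed algorithm.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    CPU_CLASS_8086,
    CPU_CLASS_286,
    CPU_CLASS_386,          /* 32-bit, AC flag cannot be changed */
    CPU_CLASS_486_NO_CPUID, /* AC flag works, ID flag cannot be changed */
    CPU_CLASS_HAS_CPUID     /* ID flag can be set and cleared */
} cpu_class;

#define FLAGS_BITS_12_15 0xF000u

/* flags_after_clear : FLAGS read back after attempting to clear bits 12-15
   flags_after_set   : FLAGS read back after attempting to set bits 12-15
   ac_toggles        : bit 18 (AC) could be both set and cleared
   id_toggles        : bit 21 (ID) could be both set and cleared */
cpu_class classify_cpu(uint16_t flags_after_clear, uint16_t flags_after_set,
                       bool ac_toggles, bool id_toggles)
{
    if ((flags_after_clear & FLAGS_BITS_12_15) == FLAGS_BITS_12_15) {
        return CPU_CLASS_8086;   /* bits 12-15 are always set */
    }
    if ((flags_after_set & FLAGS_BITS_12_15) == 0) {
        return CPU_CLASS_286;    /* bits 12-15 are always clear */
    }
    if (!ac_toggles) {
        return CPU_CLASS_386;
    }
    if (!id_toggles) {
        return CPU_CLASS_486_NO_CPUID;
    }
    return CPU_CLASS_HAS_CPUID;
}
```

Once the result is CPU_CLASS_HAS_CPUID, the CPUID instruction itself is the preferred source of family and model information.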

To determine whether an x87 FPU or NPX is present in a system, applications can write to the x87 FPU status and control registers using the FNINIT instruction and then verify that the correct values are read back using the FNSTENV instruction. After determining that an x87 FPU or NPX is present, its type can then be determined. In most cases, the processor type will determine the type of FPU or NPX; however, an Intel386 processor is compatible with either an Intel 287 or Intel 387 math coprocessor.

The method the coprocessor uses to represent ∞ (after the execution of the FINIT, FNINIT, or RESET instruction) indicates which coprocessor is present. The Intel 287 math coprocessor uses the same bit representation for +∞ and −∞; whereas, the Intel 387 math coprocessor uses different representations for +∞ and −∞.


APPENDIX A
EFLAGS CROSS-REFERENCE

A.1

EFLAGS AND INSTRUCTIONS

Table A-2 summarizes how the instructions affect the flags in the EFLAGS register. The following codes describe how the flags are affected.

Table A-1. Codes Describing Flags

T        Instruction tests flag.
M        Instruction modifies flag (either sets or resets depending on operands).
0        Instruction resets flag.
1        Instruction sets flag.
—        Instruction's effect on flag is undefined.
R        Instruction restores prior value of flag.
Blank    Instruction does not affect flag.

Table A-2. EFLAGS Cross-Reference Instruction

OF

SF

ZF

AAA







AAD



M

AAM



AAS

PF

CF

TM



M

M



M



M

M



M









TM



M

ADC

M

M

M

M

M

TM

ADD

M

M

M

M

M

M

AND

0

M

M



M

0

ARPL

AF

TF

IF

DF

NT

RF

M

BOUND BSF/BSR





M

















M

BSWAP BT/BTS/BTR/BTC CALL CBW


Table A-2. EFLAGS Cross-Reference (Contd.) Instruction

OF

SF

ZF

AF

PF

CLC

CF

TF

IF

DF

NT

0

CLD

0

CLI

0

CLTS CMC

M

CMOVcc

T

T

T

T

T

CMP

M

M

M

M

M

M

CMPS

M

M

M

M

M

M

CMPXCHG

M

M

M

M

M

M

CMPXCHG8B

T

M

COMISD

0

0

M

0

M

M

COMISS

0

0

M

0

M

M

DAA



M

M

TM

M

TM

DAS



M

M

TM

M

TM

DEC

M

M

M

M

M

DIV











CPUID CWD



ENTER ESC FCMOVcc

T

T

T

FCOMI, FCOMIP, FUCOMI, FUCOMIP

M

M

M

HLT IDIV













IMUL

M









M

M

M

M

M

M

IN INC INS

T

INT INTO


T

0

0

0

0

RF


Table A-2. EFLAGS Cross-Reference (Contd.) Instruction

OF

SF

ZF

AF

PF

CF

UCOMISD

0

0

M

0

M

M

UCOMISS

0

0

M

0

M

M

IRET

R

R

R

R

R

R

Jcc

T

T

T

T

T

TF

IF

DF

NT

R

T

RF

INVD INVLPG

R

R

JCXZ JMP LAHF LAR

M

LDS/LES/LSS/LFS/LGS LEA LEAVE LGDT/LIDT/LLDT/LMSW LOCK LODS

T

LOOP LOOPE/LOOPNE

T

LSL

M

LTR MONITOR MWAIT MOV MOV control, debug, test













MOVS

T

MOVSX/MOVZX MUL

M









M

NEG

M

M

M

M

M

M

0

M

M



M

0

NOP NOT OR


Table A-2. EFLAGS Cross-Reference (Contd.) Instruction

OF

SF

ZF

AF

PF

CF

TF

IF

DF

NT

RF

OUT OUTS

T

POP/POPA POPF

R

R

R

R

R

R

R

R

R

R

M

M

M

M

PUSH/PUSHA/PUSHF RCL/RCR 1

M

TM

RCL/RCR count



TM

ROL/ROR 1

M

M

ROL/ROR count



M

RSM

M

RDMSR RDPMC RDTSC REP/REPE/REPNE RET

SAHF

M

M

M

M

M

R

R

R

R

R

SAL/SAR/SHL/SHR 1

M

M

M



M

M

SAL/SAR/SHL/SHR count



M

M



M

M

SBB

M

M

M

M

M

TM

SCAS

M

M

M

M

M

M

SETcc

T

T

T

T

T



M

M

M

M

T

SGDT/SIDT/SLDT/SMSW SHLD/SHRD



STC

1

STD

1

STI

1

STOS

T

STR SUB

M

M

M

M

M

M

TEST

0

M

M



M

0


M


Table A-2. EFLAGS Cross-Reference (Contd.) Instruction

OF

SF

ZF

AF

PF

CF

TF

IF

DF

NT

RF

UD2 VERR/VERRW

M

WAIT WBINVD WRMSR XADD

M

M

M

M

M

M

0

M

M



M

0

XCHG XLAT XOR


APPENDIX B
EFLAGS CONDITION CODES

B.1

CONDITION CODES

Table B-1 lists condition codes that can be queried using CMOVcc, FCMOVcc, Jcc, and SETcc. Condition codes refer to the setting of one or more status flags (CF, OF, SF, ZF, and PF) in the EFLAGS register. In the table below:

• The “Mnemonic” column provides the suffix (cc) added to the instruction to specify a test condition.
• “Condition Tested For” describes the targeted condition.
• “Instruction Subcode” provides the opcode suffix added to the main opcode to specify the test condition.
• “Status Flags Setting” describes the flag setting.

Table B-1. EFLAGS Condition Codes

Mnemonic (cc)   Condition Tested For                  Instruction Subcode   Status Flags Setting

O               Overflow                              0000                  OF = 1
NO              No overflow                           0001                  OF = 0
B, NAE          Below; Neither above nor equal        0010                  CF = 1
NB, AE          Not below; Above or equal             0011                  CF = 0
E, Z            Equal; Zero                           0100                  ZF = 1
NE, NZ          Not equal; Not zero                   0101                  ZF = 0
BE, NA          Below or equal; Not above             0110                  (CF OR ZF) = 1
NBE, A          Neither below nor equal; Above        0111                  (CF OR ZF) = 0
S               Sign                                  1000                  SF = 1
NS              No sign                               1001                  SF = 0
P, PE           Parity; Parity even                   1010                  PF = 1
NP, PO          No parity; Parity odd                 1011                  PF = 0
L, NGE          Less; Neither greater nor equal       1100                  (SF XOR OF) = 1
NL, GE          Not less; Greater or equal            1101                  (SF XOR OF) = 0
LE, NG          Less or equal; Not greater            1110                  ((SF XOR OF) OR ZF) = 1
NLE, G          Neither less nor equal; Greater       1111                  ((SF XOR OF) OR ZF) = 0

Many of the test conditions are described in two different ways. For example, LE (less or equal) and NG (not greater) describe the same test condition. Alternate mnemonics are provided to make code more intelligible.

The terms “above” and “below” are associated with the CF flag and refer to the relation between two unsigned integer values. The terms “greater” and “less” are associated with the SF and OF flags and refer to the relation between two signed integer values.
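The sixteen entries of Table B-1 follow a regular pattern: the upper three bits of the subcode select a base condition, and the low bit inverts it. The C sketch below evaluates a condition from an EFLAGS image on that basis. The function name is hypothetical, and the flag masks assume the standard EFLAGS bit positions (CF bit 0, PF bit 2, ZF bit 6, SF bit 7, OF bit 11), which are not listed in this appendix.

```c
#include <stdbool.h>
#include <stdint.h>

#define FLAG_CF (1u << 0)
#define FLAG_PF (1u << 2)
#define FLAG_ZF (1u << 6)
#define FLAG_SF (1u << 7)
#define FLAG_OF (1u << 11)

/* Returns true if the condition with the given 4-bit instruction subcode
   (0000 through 1111 in Table B-1) holds for the EFLAGS image. */
bool condition_holds(unsigned subcode, uint32_t eflags)
{
    bool cf = (eflags & FLAG_CF) != 0;
    bool pf = (eflags & FLAG_PF) != 0;
    bool zf = (eflags & FLAG_ZF) != 0;
    bool sf = (eflags & FLAG_SF) != 0;
    bool of = (eflags & FLAG_OF) != 0;
    bool base;

    switch ((subcode >> 1) & 7u) {
    case 0:  base = of;               break; /* O:  OF = 1 */
    case 1:  base = cf;               break; /* B:  CF = 1 */
    case 2:  base = zf;               break; /* E:  ZF = 1 */
    case 3:  base = cf || zf;         break; /* BE: (CF OR ZF) = 1 */
    case 4:  base = sf;               break; /* S:  SF = 1 */
    case 5:  base = pf;               break; /* P:  PF = 1 */
    case 6:  base = (sf != of);       break; /* L:  (SF XOR OF) = 1 */
    default: base = (sf != of) || zf; break; /* LE: ((SF XOR OF) OR ZF) = 1 */
    }
    /* An odd subcode is the negated form (NO, NB, NE, NBE, NS, NP, NL, NLE). */
    return (subcode & 1u) ? !base : base;
}
```

The signed conditions (L, GE, LE, G) use SF and OF, and the unsigned conditions (B, AE, BE, A) use CF, as the paragraph above states.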


APPENDIX C
FLOATING-POINT EXCEPTIONS SUMMARY

C.1

OVERVIEW

This appendix shows which of the floating-point exceptions can be generated for:

• x87 FPU instructions: see Table C-2
• SSE instructions: see Table C-3
• SSE2 instructions: see Table C-4
• SSE3 instructions: see Table C-5

Table C-1 lists types of floating-point exceptions that potentially can be generated by the x87 FPU and by SSE/SSE2/SSE3 instructions.

Table C-1. x87 FPU and SIMD Floating-Point Exceptions

Floating-point Exception   Description

#IS                        Invalid-operation exception for stack underflow or stack overflow (can only be generated for x87 FPU instructions)*
#IA or #I                  Invalid-operation exception for invalid arithmetic operands and unsupported formats*
#D                         Denormal-operand exception
#Z                         Divide-by-zero exception
#O                         Numeric-overflow exception
#U                         Numeric-underflow exception
#P                         Inexact-result (precision) exception

NOTE:
* The x87 FPU instruction set generates two types of invalid-operation exceptions: #IS (stack underflow or stack overflow) and #IA (invalid arithmetic operation due to invalid arithmetic operands or unsupported formats). SSE/SSE2/SSE3 instructions potentially generate #I (invalid operation exceptions due to invalid arithmetic operands or unsupported formats).

The floating-point exceptions shown in Table C-1 (except for #D and #IS) are defined in IEEE Standard 754-1985 for Binary Floating-Point Arithmetic. See Section 4.9.1, “Floating-Point Exception Conditions,” for a detailed discussion of floating-point exceptions.


C.2

X87 FPU INSTRUCTIONS

Table C-2 lists the x87 FPU instructions in alphabetical order. For each instruction, it summarizes the floating-point exceptions that the instruction can generate.

Table C-2. Exceptions Generated with x87 FPU Floating-Point Instructions Mnemonic

Instruction

#IS #IA

#D

F2XM1

Exponential

Y

Y

Y

FABS

Absolute value

Y

FADD(P)

Add floating-point

Y

Y

Y

FBLD

BCD load

Y

FBSTP

BCD store and pop

Y

FCHS

Change sign

Y

FCLEX

Clear exceptions

FCMOVcc

Floating-point conditional move

Y

FCOM, FCOMP, FCOMPP

Compare floating-point

Y

Y

Y

FCOMI, FCOMIP, FUCOMI, FUCOMIP

Compare floating-point and set EFLAGS

Y

Y

Y

FCOS

Cosine

Y

Y

Y

FDECSTP

Decrement stack pointer

FDIV(R)(P)

Divide floating-point

Y

Y

Y

FFREE

Free register

FIADD

Integer add

Y

Y

Y

FICOM(P)

Integer compare

Y

Y

Y

#Z

#O

Y

#U

#P

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

FIDIV

Integer divide

Y

Y

Y

Y

FIDIVR

Integer divide reversed

Y

Y

Y

Y

FILD

Integer load

Y

FIMUL

Integer multiply

Y

Y

Y

FINCSTP

Increment stack pointer

FINIT

Initialize processor

FIST(P)

Integer store

Y

Y

Y

FISTTP

Truncate to integer (SSE3 instruction)

Y

Y

Y

FISUB(R)

Integer subtract

Y

Y


Y

Y

Y

Y


Table C-2. Exceptions Generated with x87 FPU Floating-Point Instructions (Contd.) Mnemonic FLD extended or stack

Instruction Load floating-point

#IS #IA

#D

#Z

#O

#U

#P

Y

FLD single or double

Load floating-point

Y

FLD1

Load + 1.0

Y

Y

Y

FLDCW

Load Control word

Y

Y

Y

Y

Y

Y

Y

FLDENV

Load environment

Y

Y

Y

Y

Y

Y

Y

FLDL2E

Load log₂e

Y

FLDL2T

Load log₂10

Y

FLDLG2

Load log₁₀2

Y

FLDLN2

Load logₑ2

Y

Y

Y

Y

Load π

Y

FLDPI FLDZ

Load + 0.0

Y

FMUL(P)

Multiply floating-point

Y

Y

Y

FNOP

No operation

FPATAN

Partial arctangent

Y

Y

Y

Y

FPREM

Partial remainder

Y

Y

Y

Y

FPREM1

IEEE partial remainder

Y

Y

Y

Y

FPTAN

Partial tangent

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y

FRNDINT

Round to integer

Y

Y

Y

FRSTOR

Restore state

Y

Y

Y

FSAVE

Save state

Y Y

FSCALE

Scale

Y

Y

Y

Y

Y

FSIN

Sine

Y

Y

Y

Y

Y

FSINCOS

Sine and cosine

Y

Y

Y

Y

Y

Y

Y

FSQRT

Square root

Y

FST(P) stack or extended

Store floating-point

Y

FST(P) single or double

Store floating-point

Y

Y

FSTCW

Store control word

FSTENV

Store environment

FSTSW (AX)

Store status word

FSUB(R)(P)

Subtract floating-point

Y

Y

Y

FTST

Test

Y

Y

Y

Y

Y

Y

Y

Y

Y

Y


Table C-2. Exceptions Generated with x87 FPU Floating-Point Instructions (Contd.) Mnemonic

Instruction

#IS #IA

FUCOM(P)(P)

Unordered compare floating-point

FWAIT

CPU Wait

FXAM

Examine

FXCH

Exchange registers

Y

FXTRACT

Extract

FYL2X FYL2XP1

C.3

Y

#D

#Z

Y

Y

Y

Y

Y

Y

Logarithm

Y

Y

Y

Y

Logarithm epsilon

Y

Y

Y

#O

#U

#P

Y

Y

Y

Y

Y

Y

SSE INSTRUCTIONS

Table C-3 lists SSE instructions with at least one of the following characteristics:

• have floating-point operands
• generate floating-point results
• read or write floating-point status and control information

The table also summarizes the floating-point exceptions that each instruction can generate.

Table C-3. Exceptions Generated with SSE Instructions Mnemonic

Instruction

#I

#D

#Z

#O

#U

#P

ADDPS

Packed add.

Y

Y

Y

Y

Y

ADDSS

Scalar add.

Y

Y

Y

Y

Y

ANDNPS

Packed logical INVERT and AND.

ANDPS

Packed logical AND.

CMPPS

Packed compare.

Y

Y

CMPSS

Scalar compare.

Y

Y

COMISS

Scalar ordered compare lower SP FP numbers and set the status flags.

Y

Y

CVTPI2PS

Convert two 32-bit signed integers from MM2/Mem to two SP FP.


Y


Table C-3. Exceptions Generated with SSE Instructions (Contd.) Mnemonic

Instruction

#I

#D

#Z

#O

#U

#P

CVTPS2PI

Convert lower two SP FP from XMM/Mem to two 32-bit signed integers in MM using rounding specified by MXCSR.

Y

CVTSI2SS

Convert one 32-bit signed integer from Integer Reg/Mem to one SP FP.

CVTSS2SI

Convert one SP FP from XMM/Mem to one 32-bit signed integer using rounding mode specified by MXCSR, and move the result to an integer register.

Y

Y

CVTTPS2PI

Convert two SP FP from XMM2/Mem to two 32-bit signed integers in MM1 using truncate.

Y

Y

CVTTSS2SI

Convert lowest SP FP from XMM/Mem to one 32-bit signed integer using truncate, and move the result to an integer register.

Y

Y

DIVPS

Packed divide.

Y

Y

Y

Y

Y

Y

DIVSS

Scalar divide.

Y

Y

Y

Y

Y

Y

LDMXCSR

Load control/status word.

MAXPS

Packed maximum.

Y

Y

MAXSS

Scalar maximum.

Y

Y

MINPS

Packed minimum.

Y

Y

MINSS

Scalar minimum.

Y

Y

MOVAPS

Move four packed SP values.

MOVHLPS

Move packed SP high to low.

MOVHPS

Move two packed SP values between memory and the high half of an XMM register.

MOVLHPS

Move packed SP low to high.

Y

Y


Table C-3. Exceptions Generated with SSE Instructions (Contd.) Mnemonic

Instruction

#I

#D

#Z

#O

#U

#P

MOVLPS

Move two packed SP values between memory and the low half of an XMM register.

MOVMSKPS

Move sign mask to r32.

MOVSS

Move scalar SP number between an XMM register and memory or a second XMM register.

MOVUPS

Move unaligned packed data.

MULPS

Packed multiply.

Y

Y

Y

Y

Y

MULSS

Scalar multiply.

Y

Y

Y

Y

Y

ORPS

Packed OR.

RCPPS

Packed reciprocal.

RCPSS

Scalar reciprocal.

RSQRTPS

Packed reciprocal square root.

RSQRTSS

Scalar reciprocal square root.

SHUFPS

Shuffle.

SQRTPS

Square Root of the packed SP FP numbers.

Y

Y

Y

SQRTSS

Scalar square root.

Y

Y

Y

STMXCSR

Store control/status word.

SUBPS

Packed subtract.

Y

Y

Y

Y

Y

SUBSS

Scalar subtract.

Y

Y

Y

Y

Y

UCOMISS

Unordered compare lower SP FP numbers and set the status flags.

Y

Y

UNPCKHPS

Interleave SP FP numbers.

UNPCKLPS

Interleave SP FP numbers.

XORPS

Packed XOR.


C.4 SSE2 INSTRUCTIONS

Table C-4 lists SSE2 instructions with at least one of the following characteristics:

• floating-point operands
• floating-point results

For each instruction, the table summarizes the floating-point exceptions that the instruction can generate.

Table C-4. Exceptions Generated with SSE2 Instructions Instruction

Description

#I

#D

ADDPD

Add two packed DP FP numbers from XMM2/Mem to XMM1.

Y

ADDSD

Add the lower DP FP number from XMM2/Mem to XMM1.

ANDNPD

Invert the 128 bits in XMM1 and then AND the result with 128 bits from XMM2/Mem.

ANDPD

Logical And of 128 bits from XMM2/Mem to XMM1 register.

CMPPD

#Z

#O

#U

#P

Y

Y

Y

Y

Y

Y

Y

Y

Y

Compare packed DP FP numbers from XMM2/Mem to packed DP FP numbers in XMM1 register using imm8 as predicate.

Y

Y

CMPSD

Compare lowest DP FP number from XMM2/Mem to lowest DP FP number in XMM1 register using imm8 as predicate.

Y

Y

COMISD

Compare lower DP FP number in XMM1 register with lower DP FP number in XMM2/Mem and set the status flags accordingly

Y

Y

CVTDQ2PS

Convert four 32-bit signed integers from XMM/Mem to four SP FP.

CVTPS2DQ

Convert four SP FP from XMM/Mem to four 32-bit signed integers in XMM using rounding specified by MXCSR.

Y

Y

Y


Table C-4. Exceptions Generated with SSE2 Instructions (Contd.) Instruction

Description

CVTTPS2DQ

Convert four SP FP from XMM/Mem to four 32-bit signed integers in XMM using truncate.

CVTDQ2PD

Convert two 32-bit signed integers in XMM2/Mem to 2 DP FP in xmm1 using rounding specified by MXCSR.

CVTPD2DQ

#I

#D

#Z

#O

#U

#P

Y

Y

Convert two DP FP from XMM2/Mem to two 32-bit signed integers in xmm1 using rounding specified by MXCSR.

Y

Y

CVTPD2PI

Convert lower two DP FP from XMM/Mem to two 32-bit signed integers in MM using rounding specified by MXCSR.

Y

Y

CVTPD2PS

Convert two DP FP to two SP FP.

Y

Y

CVTPI2PD

Convert two 32-bit signed integers from MM2/Mem to two DP FP.

CVTPS2PD

Convert two SP FP to two DP FP.

Y

Y

CVTSD2SI

Convert one DP FP from XMM/Mem to one 32 bit signed integer using rounding mode specified by MXCSR, and move the result to an integer register.

Y

CVTSD2SS

Convert scalar DP FP to scalar SP FP.

Y

Y

CVTSI2SD

Convert one 32-bit signed integer from Integer Reg/Mem to one DP FP.

CVTSS2SD

Convert scalar SP FP to scalar DP FP.

Y

Y


Y

Y

Y

Y

Y

Y

Y


Table C-4. Exceptions Generated with SSE2 Instructions (Contd.) Instruction

Description

#I

#D

#Z

#O

#U

#P

CVTTPD2DQ

Convert two DP FP from XMM2/Mem to two 32-bit signed integers in XMM1 using truncate.

Y

Y

CVTTPD2PI

Convert two DP FP from XMM2/Mem to two 32-bit signed integers in MM1 using truncate.

Y

Y

CVTTSD2SI

Convert lowest DP FP from XMM/Mem to one 32 bit signed integer using truncate, and move the result to an integer register.

Y

Y

DIVPD

Divide packed DP FP numbers in XMM1 by XMM2/Mem

Y

Y

Y

Y

Y

Y

DIVSD

Divide lower DP FP numbers in XMM1 by XMM2/Mem

Y

Y

Y

Y

Y

Y

MAXPD

Return the maximum DP FP numbers between XMM2/Mem and XMM1.

Y

Y

MAXSD

Return the maximum DP FP number between the lower DP FP numbers from XMM2/Mem and XMM1.

Y

Y

MINPD

Return the minimum DP numbers between XMM2/Mem and XMM1.

Y

Y

MINSD

Return the minimum DP FP number between the lowest DP FP numbers from XMM2/Mem and XMM1.

Y

Y

MOVAPD

Move 128 bits representing 2 packed DP data from XMM2/Mem to XMM1 register.

Or Move 128 bits representing 2 packed DP from XMM1 register to XMM2/Mem.


Table C-4. Exceptions Generated with SSE2 Instructions (Contd.) Instruction MOVHPD

Description

#I

#D

#Z

#O

#U

#P

Move 64 bits representing one DP operand from Mem to upper field of XMM register. Or move 64 bits representing one DP operand from upper field of XMM register to Mem.

MOVLPD

Move 64 bits representing one DP operand from Mem to lower field of XMM register. Or move 64 bits representing one DP operand from lower field of XMM register to Mem.

MOVMSKPD

Move the sign mask to r32.

MOVSD

Move 64 bits representing one scalar DP operand from XMM2/Mem to XMM1 register. Or move 64 bits representing one scalar DP operand from XMM1 register to XMM2/Mem.

MOVUPD

Move 128 bits representing 2 DP data from XMM2/Mem to XMM1 register. Or move 128 bits representing 2 DP data from XMM1 register to XMM2/Mem.

MULPD

Multiply packed DP FP numbers in XMM2/Mem to XMM1.

Y

Y

Y

Y

Y

MULSD

Multiply the lowest DP FP number in XMM2/Mem to XMM1.

Y

Y

Y

Y

Y

ORPD

OR 128 bits from XMM2/Mem to XMM1 register.

SHUFPD

Shuffle Double.

SQRTPD

Square Root Packed Double-Precision

Y

Y

Y

SQRTSD

Square Root Scalar Double-Precision

Y

Y

Y


Table C-4. Exceptions Generated with SSE2 Instructions (Contd.) Instruction

Description

#I

#D

#Z

#O

#U

#P

SUBPD

Subtract Packed Double-Precision.

Y

Y

Y

Y

Y

SUBSD

Subtract Scalar Double-Precision.

Y

Y

Y

Y

Y

UCOMISD

Compare lower DP FP number in XMM1 register with lower DP FP number in XMM2/Mem and set the status flags accordingly.

Y

Y

UNPCKHPD

Interleaves DP FP numbers from the high halves of XMM1 and XMM2/Mem into XMM1 register.

UNPCKLPD

Interleaves DP FP numbers from the low halves of XMM1 and XMM2/Mem into XMM1 register.

XORPD

XOR 128 bits from XMM2/Mem to XMM1 register.

C.5 SSE3 INSTRUCTIONS

Table C-5 lists the SSE3 instructions that have at least one of the following characteristics:

• have floating-point operands
• generate floating-point results

For each instruction, the table summarizes the floating-point exceptions that the instruction can generate.

Table C-5. Exceptions Generated with SSE3 Instructions Instruction

Description

#I

#D

ADDSUBPD

Add /Sub packed DP FP numbers from XMM2/Mem to XMM1.

Y

ADDSUBPS

Add /Sub packed SP FP numbers from XMM2/Mem to XMM1.

Y

#Z

#O

#U

#P

Y

Y

Y

Y

Y

Y

Y

Y


Table C-5. Exceptions Generated with SSE3 Instructions (Contd.) Instruction

Description

#O

#U

#P

FISTTP

See Table C-2.

Y

HADDPD

Add horizontally packed DP FP numbers XMM2/Mem to XMM1.

Y

Y

Y

Y

Y

HADDPS

Add horizontally packed SP FP numbers XMM2/Mem to XMM1

Y

Y

Y

Y

Y

HSUBPD

Sub horizontally packed DP FP numbers XMM2/Mem to XMM1

Y

Y

Y

Y

Y

HSUBPS

Sub horizontally packed SP FP numbers XMM2/Mem to XMM1

Y

Y

Y

Y

Y

LDDQU

Load unaligned integer 128-bit.

MOVDDUP

Move 64 bits representing one DP data from XMM2/Mem to XMM1 and duplicate.

MOVSHDUP

Move 128 bits representing 4 SP data from XMM2/Mem to XMM1 and duplicate high.

MOVSLDUP

Move 128 bits representing 4 SP data from XMM2/Mem to XMM1 and duplicate low.

C.6

#I

#D

#Z

Y

SSSE3 INSTRUCTIONS

SSSE3 instructions operate on integer data elements. They do not generate floating-point exceptions.


APPENDIX D
GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS

As described in Chapter 8, “Programming with the x87 FPU,” the IA-32 Architecture supports two mechanisms for accessing exception handlers to handle unmasked x87 FPU exceptions: native mode and MS-DOS compatibility mode. The primary purpose of this appendix is to provide detailed information to help software engineers design and write x87 FPU exception-handling facilities to run on PC systems that use the MS-DOS compatibility mode (see note 1) for handling x87 FPU exceptions. Some of the information in this appendix will also be of interest to engineers who are writing native-mode x87 FPU exception handlers. The information provided is as follows:

• Discussion of the origin of the MS-DOS x87 FPU exception handling mechanism and its relationship to the x87 FPU’s native exception handling mechanism.
• Description of the IA-32 flags and processor pins that control the MS-DOS x87 FPU exception handling mechanism.
• Description of the external hardware typically required to support the MS-DOS exception handling mechanism.
• Description of the x87 FPU’s exception handling mechanism and the typical protocol for x87 FPU exception handlers.
• Code examples that demonstrate various levels of x87 FPU exception handlers.
• Discussion of x87 FPU considerations in multitasking environments.
• Discussion of native mode x87 FPU exception handling.

The information given is oriented toward the most recent generations of IA-32 processors, starting with the Intel486. It is intended to augment the reference information given in Chapter 8, “Programming with the x87 FPU.” A more extensive version of this appendix is available in the application note AP-578, Software and Hardware Considerations for x87 FPU Exception Handlers for Intel Architecture Processors (Order Number 243291), which is available from Intel.

1. Microsoft Windows* 95 and Windows 3.1 (and earlier versions) operating systems use almost the same x87 FPU exception handling interface as MS-DOS. The recommendations in this appendix for an MS-DOS compatible exception handler thus apply to all three operating systems.


D.1 MS-DOS COMPATIBILITY SUB-MODE FOR HANDLING X87 FPU EXCEPTIONS

The first generations of IA-32 processors (starting with the Intel 8086 and 8088 processors and going through the Intel 286 and Intel386 processors) did not have an on-chip floating-point unit. Instead, floating-point capability was provided on a separate numeric coprocessor chip. The first of these numeric coprocessors was the Intel 8087, which was followed by the Intel 287 and Intel 387 numeric coprocessors.

To allow the 8087 to signal floating-point exceptions to its companion 8086 or 8088, the 8087 has an output pin, INT, which it asserts when an unmasked floating-point exception occurs. The designers of the 8087 recommended that the output from this pin be routed through a programmable interrupt controller (PIC) such as the Intel 8259A to the INTR pin of the 8086 or 8088. The accompanying interrupt vector number could then be used to access the floating-point exception handler.

However, the original IBM* PC design and MS-DOS operating system used a different mechanism for handling the INT output from the 8087. It connected the INT pin directly to the NMI input pin of the 8086 or 8088. The NMI interrupt handler then had to determine if the interrupt was caused by a floating-point exception or another NMI event. This mechanism is the origin of what is now called the “MS-DOS compatibility mode.” The decision to use this latter floating-point exception handling mechanism came about because when the IBM PC was first designed, the 8087 was not available. When the 8087 did become available, other functions had already been assigned to the eight inputs to the PIC. One of these functions was a BIOS video interrupt, which was assigned to interrupt number 16 for the 8086 and 8088.

The Intel 286 processor created the “native mode” for handling floating-point exceptions by providing a dedicated input pin (ERROR#) for receiving floating-point exception signals and a dedicated interrupt number, 16. Interrupt 16 was used to signal floating-point errors (also called math faults). It was intended that the ERROR# pin on the Intel 286 be connected to a corresponding ERROR# pin on the Intel 287 numeric coprocessor. When the Intel 287 signals a floating-point exception using this mechanism, the Intel 286 generates an interrupt 16 to invoke the floating-point exception handler.

To maintain compatibility with existing PC software, the native floating-point exception handling mode of the Intel 286 and 287 was not used in the IBM PC AT system design. Instead, the ERROR# pin on the Intel 286 was tied permanently high, and the ERROR# pin from the Intel 287 was routed to a second (cascaded) PIC. The resulting output of this PIC was routed through an exception handler and eventually caused an interrupt 2 (NMI interrupt). Here the NMI interrupt was shared with the IBM PC AT’s new parity checking feature. Interrupt 16 remained assigned to the BIOS video interrupt handler.

The external hardware for the MS-DOS compatibility mode must prevent the Intel 286 processor from executing past the next x87 FPU instruction when an unmasked exception has been generated. To do this, it asserts the BUSY# signal into the Intel 286 when the ERROR# signal is asserted by the Intel 287.

The Intel386 processor and its companion Intel 387 numeric coprocessor provided the same hardware mechanism for signaling and handling floating-point exceptions as the Intel 286 and 287 processors. And again, to maintain compatibility with existing MS-DOS software, basically the same MS-DOS compatibility floating-point exception handling mechanism that was used in the IBM PC AT was used in PCs based on the Intel386 processor.

D.2 IMPLEMENTATION OF THE MS-DOS COMPATIBILITY SUB-MODE IN THE INTEL486, PENTIUM, AND P6 PROCESSOR FAMILY, AND PENTIUM 4 PROCESSORS

Beginning with the Intel486 processor, the IA-32 architecture provided a dedicated mechanism for enabling the MS-DOS compatibility mode for x87 FPU exceptions and for generating external x87 FPU-exception signals while operating in this mode. The following sections describe the implementation of the MS-DOS compatibility mode in the Intel486 and Pentium processors and in the P6 family and Pentium 4 processors. Also described is the recommended external hardware to support this mode of operation.

D.2.1 MS-DOS Compatibility Sub-mode in the Intel486 and Pentium Processors

In the Intel486 processor, several things were done to enhance and speed up the numeric coprocessor, now called the floating-point unit (x87 FPU). The most important enhancement was that the x87 FPU was included in the same chip as the processor, for increased speed in x87 FPU computations and reduced latency for x87 FPU exception handling. Also, for the first time, the MS-DOS compatibility mode was built into the chip design, with the addition of the NE bit in control register CR0 and the addition of the FERR# (Floating-point ERRor) and IGNNE# (IGNore Numeric Error) pins.

The NE bit selects the native x87 FPU exception handling mode (NE = 1) or the MS-DOS compatibility mode (NE = 0). When native mode is selected, all signaling of floating-point exceptions is handled internally in the Intel486 chip, resulting in the generation of an interrupt 16.

When MS-DOS compatibility mode is selected, the FERR# and IGNNE# pins are used to signal floating-point exceptions. The FERR# output pin, which replaces the ERROR# pin from the previous generations of IA-32 numeric coprocessors, is connected to a PIC. A new input signal, IGNNE#, is provided to allow the x87 FPU exception handler to execute x87 FPU instructions, if desired, without first clearing the error condition and without triggering the interrupt a second time. This IGNNE# feature is needed to replicate the capability that was provided on MS-DOS compatible Intel 286 and Intel 287 and Intel386 and Intel 387 systems by turning off the BUSY# signal, when inside the x87 FPU exception handler, before clearing the error condition.
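As an illustration of the NE-bit selection described above, the following C sketch classifies a CR0 value as selecting native or MS-DOS compatibility mode. It assumes that NE is bit 5 of CR0; the type and function names are hypothetical. Reading CR0 itself requires privileged code, so the sketch takes an already-captured value as its argument.

```c
#include <stdint.h>

/* CR0.NE (Numeric Error) is assumed to be bit 5 of control register CR0. */
#define CR0_NE (1u << 5)

typedef enum {
    X87_MSDOS_COMPAT_MODE, /* NE = 0: errors signaled through FERR#/IGNNE# */
    X87_NATIVE_MODE        /* NE = 1: errors reported internally as interrupt 16 */
} x87_error_mode;

/* Classify the x87 FPU error-reporting mode selected by a captured CR0 value. */
x87_error_mode x87_error_mode_from_cr0(uint32_t cr0)
{
    return (cr0 & CR0_NE) ? X87_NATIVE_MODE : X87_MSDOS_COMPAT_MODE;
}
```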


Note that Intel, in order to provide Intel486 processors for market segments that had no need for an x87 FPU, created the “SX” versions. These Intel486 SX processors did not contain the floating-point unit. Intel also produced Intel 487 SX processors for end users who later decided to upgrade to a system with an x87 FPU. These Intel 487 SX processors are similar to standard Intel486 processors with a working x87 FPU on board. Thus, the external circuitry necessary to support the MS-DOS compatibility mode for Intel 487 SX processors is the same as for standard Intel486 DX processors.

The Pentium, P6 family, and Pentium 4 processors offer the same mechanism (the NE bit and the FERR# and IGNNE# pins) as the Intel486 processors for generating x87 FPU exceptions in MS-DOS compatibility mode. The actions of these mechanisms are slightly different and more straightforward for the P6 family and Pentium 4 processors, as described in Section D.2.2, “MS-DOS Compatibility Sub-mode in the P6 Family and Pentium 4 Processors.”

For Pentium, P6 family, and Pentium 4 processors, it is important to note that the special DP (Dual Processing) mode for Pentium processors and also the more general Intel MultiProcessor Specification for systems with multiple Pentium, P6 family, or Pentium 4 processors support x87 FPU exception handling only in the native mode. Intel does not recommend using the MS-DOS compatibility x87 FPU mode for systems using more than one processor.

D.2.1.1 Basic Rules: When FERR# Is Generated

When MS-DOS compatibility mode is enabled for the Intel486 or Pentium processors (NE bit is set to 0) and the IGNNE# input pin is de-asserted, the FERR# signal is generated as follows:

1. When an x87 FPU instruction causes an unmasked x87 FPU exception, the processor (in most cases) uses a “deferred” method of reporting the error. This means that the processor does not respond immediately, but rather freezes just before executing the next WAIT or x87 FPU instruction (except for “no-wait” instructions, which the x87 FPU executes regardless of an error condition).
2. When the processor freezes, it also asserts the FERR# output.
3. The frozen processor waits for an external interrupt, which must be supplied by external hardware in response to the FERR# assertion.
4. In MS-DOS compatibility systems, FERR# is fed to the IRQ13 input in the cascaded PIC. The PIC generates interrupt 75H, which then branches to interrupt 2, as described earlier in this appendix for systems using the Intel 286 and Intel 287 or Intel386 and Intel 387 processors.

The deferred method of error reporting is used for all exceptions caused by the basic arithmetic instructions (including FADD, FSUB, FMUL, FDIV, FSQRT, FCOM and FUCOM), for precision exceptions caused by all types of x87 FPU instructions, and for numeric underflow and overflow exceptions caused by all types of x87 FPU instructions except stores to memory.


Some x87 FPU instructions with some x87 FPU exceptions use an “immediate” method of reporting errors. Here, the FERR# is asserted immediately, at the time that the exception occurs. The immediate method of error reporting is used for x87 FPU stack fault, invalid operation and denormal exceptions caused by all transcendental instructions, FSCALE, FXTRACT, FPREM and others, and all exceptions (except precision) when caused by x87 FPU store instructions. Like deferred error reporting, immediate error reporting will cause the processor to freeze just before executing the next WAIT or x87 FPU instruction if the error condition has not been cleared by that time.

Note that in general, whether deferred or immediate error reporting is used for an x87 FPU exception depends both on which exception occurred and which instruction caused that exception. A complete specification of these cases, which applies to both the Pentium and the Intel486 processors, is given in Section 5.1.21 in the Pentium Processor Family Developer’s Manual: Volume 1.

If NE = 0 but the IGNNE# input is active while an unmasked x87 FPU exception is in effect, the processor disregards the exception, does not assert FERR#, and continues. If IGNNE# is then de-asserted and the x87 FPU exception has not been cleared, the processor will respond as described above. (That is, an immediate exception case will assert FERR# immediately. A deferred exception case will assert FERR# and freeze just before the next x87 FPU or WAIT instruction.) The assertion of IGNNE# is intended for use only inside the x87 FPU exception handler, where it is needed if one wants to execute non-control x87 FPU instructions for diagnosis, before clearing the exception condition. When IGNNE# is asserted inside the exception handler, a preceding x87 FPU exception has already caused FERR# to be asserted, and the external interrupt hardware has responded, but IGNNE# assertion still prevents the freeze at x87 FPU instructions.

Note that if IGNNE# is left active outside of the x87 FPU exception handler, additional x87 FPU instructions may be executed after a given instruction has caused an x87 FPU exception. In this case, if the x87 FPU exception handler ever did get invoked, it could not determine which instruction caused the exception.

To properly manage the interface between the processor’s FERR# output, its IGNNE# input, and the IRQ13 input of the PIC, additional external hardware is needed. A recommended configuration is described in the following section.
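The rules in this section can be condensed into a small decision function. The C sketch below is a simplified model under stated assumptions, not a description of the processor's actual logic: it only answers what happens when the next x87 FPU or WAIT instruction is reached, it treats all no-wait instructions as executing unconditionally, and it ignores both the earlier FERR# assertion of the immediate-reporting case and the brief FERR# pulse that no-wait instructions can produce. All type and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum {
    FP_ACTION_EXECUTE,        /* the instruction simply runs */
    FP_ACTION_INTERRUPT_16,   /* native mode: the error is reported as interrupt 16 */
    FP_ACTION_FREEZE_AND_FERR /* MS-DOS compatibility mode: processor freezes with
                                 FERR# asserted and waits for an external interrupt */
} fp_action;

/* Decide what happens when the next x87 FPU or WAIT instruction is reached.
 *   ne            value of the CR0 NE bit
 *   ignne_active  true if the IGNNE# input is asserted
 *   pending       true if an unmasked x87 FPU exception is pending
 *   is_no_wait    true for no-wait instructions such as FNCLEX or FNSTSW */
fp_action action_at_next_x87_instruction(bool ne, bool ignne_active,
                                         bool pending, bool is_no_wait)
{
    if (!pending || is_no_wait) {
        return FP_ACTION_EXECUTE;       /* nothing to report, or not checked */
    }
    if (ne) {
        return FP_ACTION_INTERRUPT_16;  /* native mode, handled on chip */
    }
    if (ignne_active) {
        return FP_ACTION_EXECUTE;       /* NE = 0 and IGNNE# active: exception disregarded */
    }
    return FP_ACTION_FREEZE_AND_FERR;   /* NE = 0 and IGNNE# de-asserted */
}
```

The third branch is the one the text warns about: with IGNNE# left active outside the handler, further x87 FPU instructions run after the faulting one, so a handler invoked later cannot identify the instruction that caused the exception.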

D.2.1.2 Recommended External Hardware to Support the MS-DOS Compatibility Sub-mode

Figure D-1 provides an external circuit that will assure proper handling of FERR# and IGNNE# when an x87 FPU exception occurs. In particular, it assures that IGNNE# will be active only inside the x87 FPU exception handler without depending on the order of actions by the exception handler. Some hardware implementations have been less robust because they have depended on the exception handler to clear the x87 FPU exception interrupt request to the PIC (FP_IRQ signal) before the handler causes FERR# to be de-asserted by clearing the exception from the x87 FPU itself. Figure D-2 shows the details of how IGNNE# will behave when the circuit in Figure D-1 is implemented. The temporal regions within the x87 FPU exception handler activity are described as follows:

1. The FERR# signal is activated by an x87 FPU exception and sends an interrupt request through the PIC to the processor’s INTR pin.
2. During the x87 FPU interrupt service routine (exception handler) the processor will need to clear the interrupt request latch (Flip Flop #1). It may also want to execute non-control x87 FPU instructions before the exception is cleared from the x87 FPU. For this purpose the IGNNE# must be driven low. Typically in the PC environment an I/O access to Port 0F0H clears the external x87 FPU exception interrupt request (FP_IRQ). In the recommended circuit, this access also is used to activate IGNNE#. With IGNNE# active, the x87 FPU exception handler may execute any x87 FPU instruction without being blocked by an active x87 FPU exception.
3. Clearing the exception within the x87 FPU will cause the FERR# signal to be deactivated, and then there is no further need for IGNNE# to be active. In the recommended circuit, the deactivation of FERR# is used to deactivate IGNNE#. If another circuit is used, the software and circuit together must assure that IGNNE# is deactivated no later than the exit from the x87 FPU exception handler.


[Figure D-1 is a circuit diagram that did not survive text extraction. Recoverable labels: RESET; I/O Port 0F0H Address Decode; Flip Flop #1 and Flip Flop #2 with PR and CLR inputs; the FERR#, IGNNE#, and INTR pins of the Intel486, Pentium, or Pentium Pro processor; Interrupt Controller; FP_IRQ. Legend: FF #n = Flip Flop #n; CLR = Clear or Reset.]

Figure D-1. Recommended Circuit for MS-DOS Compatibility x87 FPU Exception Handling

In the circuit in Figure D-1, when the x87 FPU exception handler accesses I/O port 0F0H it clears the IRQ13 interrupt request output from Flip Flop #1 and also clocks out the IGNNE# signal (active) from Flip Flop #2. So the handler can activate IGNNE#, if needed, by doing this 0F0H access before clearing the x87 FPU exception condition (which de-asserts FERR#).

However, the circuit does not depend on the order of actions by the x87 FPU exception handler to guarantee the correct hardware state upon exit from the handler. Flip Flop #2, which drives IGNNE# to the processor, has its CLEAR input attached to the inverted FERR#. This ensures that IGNNE# can never be active when FERR# is inactive. So if the handler clears the x87 FPU exception condition before the 0F0H access, IGNNE# does not get activated and left on after exit from the handler.
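The order-independence claimed for the recommended circuit can be checked with a small logic-level model. The C sketch below is a simplification built only from the behavior described in the text: it represents each signal as a boolean (true meaning asserted, regardless of electrical polarity), ignores timing, and uses hypothetical names throughout. It is not a substitute for the actual schematic.

```c
#include <stdbool.h>

/* Logic-level model of the recommended circuit (true = signal asserted). */
typedef struct {
    bool ferr;   /* FERR# output of the processor */
    bool fp_irq; /* Flip Flop #1: interrupt request to the PIC (IRQ13) */
    bool ignne;  /* Flip Flop #2: IGNNE# input of the processor */
} fpu_circuit;

/* An unmasked x87 FPU exception asserts FERR#, which latches FP_IRQ. */
void circuit_raise_exception(fpu_circuit *c)
{
    c->ferr = true;
    c->fp_irq = true;
}

/* An I/O access to port 0F0H clears FP_IRQ and clocks IGNNE# active.
 * Flip Flop #2 is held clear while FERR# is inactive, so IGNNE# can
 * become active only if FERR# is still asserted at this point. */
void circuit_access_port_f0(fpu_circuit *c)
{
    c->fp_irq = false;
    c->ignne = c->ferr;
}

/* Clearing the exception in the x87 FPU (for example with FNCLEX)
 * de-asserts FERR#, which clears Flip Flop #2 and so drops IGNNE#. */
void circuit_clear_exception(fpu_circuit *c)
{
    c->ferr = false;
    c->ignne = false;
}
```

Calling `circuit_access_port_f0` and then `circuit_clear_exception` leaves all three signals inactive, and so does the reverse order; the only difference is that IGNNE# is active in between in the first ordering and never becomes active in the second.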

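[Figure D-2 is a timing diagram that did not survive text extraction; the only recoverable signal label is 0F0H Address Decode.]

Figure D-2. Behavior of Signals During x87 FPU Exception Handling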

D.2.1.3 No-Wait x87 FPU Instructions Can Get x87 FPU Interrupt in Window

The Pentium and Intel486 processors implement the “no-wait” floating-point instructions (FNINIT, FNCLEX, FNSTENV, FNSAVE, FNSTSW, FNSTCW, FNENI, FNDISI or FNSETPM) in the MS-DOS compatibility mode in the following manner. (See Section 8.3.11, “x87 FPU Control Instructions,” and Section 8.3.12, “Waiting vs. Non-waiting Instructions,” for a discussion of the no-wait instructions.)

If an unmasked numeric exception is pending from a preceding x87 FPU instruction, a member of the no-wait class of instructions will, at the beginning of its execution, assert the FERR# pin in response to that exception just like other x87 FPU instructions, but then, unlike the other x87 FPU instructions, FERR# will be de-asserted. This de-assertion was implemented to allow the no-wait class of instructions to proceed without an interrupt due to any pending numeric exception. However, the brief assertion of FERR# is sufficient to latch the x87 FPU exception request into most hardware interface implementations (including Intel’s recommended circuit).

All the x87 FPU instructions are implemented such that during their execution, there is a window in which the processor will sample and accept external interrupts. If there is a pending interrupt, the processor services the interrupt first before resuming the execution of the instruction. Consequently, it is possible that the no-wait floating-point instruction may accept the external interrupt caused by its own assertion of the FERR# pin in the event of a pending unmasked numeric exception, which is not an explicitly documented behavior of a no-wait instruction. This process is illustrated in Figure D-3.

[Timing diagram not reproduced. Labels: Exception Generating Floating-Point Instruction; Assertion of FERR# by the Processor; Start of the “No-Wait” Floating-Point Instruction; System Dependent Delay; External Interrupt Sampling Window; Assertion of INTR Pin by the System (Case 1, Case 2); Window Closed.]

Figure D-3. Timing of Receipt of External Interrupt

Figure D-3 assumes that a floating-point instruction that generates a “deferred” error (as defined in Section D.2.1.1, “Basic Rules: When FERR# Is Generated”), which asserts the FERR# pin only on encountering the next floating-point instruction, causes an unmasked numeric exception. Assume that the next floating-point instruction following this instruction is one of the no-wait floating-point instructions. The FERR# pin is asserted by the processor to indicate the pending exception on encountering the no-wait floating-point instruction. After the assertion of the FERR# pin, the no-wait floating-point instruction opens a window where the pending external interrupts are sampled. Two cases are then possible, depending on when the processor receives the interrupt via the INTR pin (asserted by the system in response to the FERR# pin).

Case 1

If the system responds to the assertion of the FERR# pin by the no-wait floating-point instruction via the INTR pin during this window, then the interrupt is serviced first, before resuming the execution of the no-wait floating-point instruction.

Case 2

If the system responds via the INTR pin after the window has closed, then the interrupt is recognized only at the next instruction boundary.


There are two other ways, in addition to Case 1 above, in which a no-wait floating-point instruction can service a numeric exception inside its interrupt window. First, the first floating-point error condition could be of the “immediate” category (as defined in Section D.2.1.1, “Basic Rules: When FERR# Is Generated”) that asserts FERR# immediately. If the system delay before asserting INTR is long enough, relative to the time elapsed before the no-wait floating-point instruction, INTR can be asserted inside the interrupt window for the latter. Second, consider two no-wait x87 FPU instructions in close sequence, and assume that a previous x87 FPU instruction has caused an unmasked numeric exception. Then if the INTR timing is too long for an FERR# signal triggered by the first no-wait instruction to hit the first instruction’s interrupt window, it could catch the interrupt window of the second.

The possible malfunction of a no-wait x87 FPU instruction explained above cannot happen if the instruction is being used in the manner for which Intel originally designed it. The no-wait instructions were intended to be used inside the x87 FPU exception handler, to allow manipulation of the x87 FPU before the error condition is cleared, without hanging the processor because of the x87 FPU error condition, and without the need to assert IGNNE#. They will perform this function correctly, since before the error condition is cleared, the assertion of FERR# that caused the x87 FPU error handler to be invoked is still active. Thus the logic that would assert FERR# briefly at a no-wait instruction causes no change since FERR# is already asserted. The no-wait instructions may also be used without problem in the handler after the error condition is cleared, since now they will not cause FERR# to be asserted at all.

If a no-wait instruction is used outside of the x87 FPU exception handler, it may malfunction as explained above, depending on the details of the hardware interface implementation and which particular processor is involved. The actual interrupt inside the window in the no-wait instruction may be blocked by surrounding it with the instructions: PUSHFD, CLI, no-wait, then POPFD. (CLI blocks interrupts, and the push and pop of flags preserves and restores the original value of the interrupt flag.) However, if FERR# was triggered by the no-wait, its latched value and the PIC response will still be in effect. Further code can be used to check for and correct such a condition, if needed. Section D.3.6, “Considerations When x87 FPU Shared Between Tasks,” discusses an important example of this type of problem and gives a solution.

D.2.2

MS-DOS Compatibility Sub-mode in the P6 Family and Pentium 4 Processors

When bit NE = 0 in CR0, the MS-DOS compatibility mode of the P6 family and Pentium 4 processors provides FERR# and IGNNE# functionality that is almost identical to the Intel486 and Pentium processors. The same external hardware described in Section D.2.1.2, “Recommended External Hardware to Support the MS-DOS Compatibility Sub-mode,” is recommended for the P6 family and Pentium 4 processors as well as the two previous generations. The only change to MS-DOS compatibility x87 FPU exception handling with the P6 family and Pentium 4 processors is that all exceptions for all x87 FPU instructions cause immediate error reporting. That is, FERR# is asserted as soon as the x87 FPU detects an unmasked exception; there are no cases in which error reporting is deferred to the next x87 FPU or WAIT instruction. (As is discussed in Section D.2.1.1, “Basic Rules: When FERR# Is Generated,” most exception cases in the Intel486 and Pentium processors are of the deferred type.)

Although FERR# is asserted immediately upon detection of an unmasked x87 FPU error, this certainly does not mean that the requested interrupt will always be serviced before the next instruction in the code sequence is executed. To begin with, the P6 family and Pentium 4 processors execute several instructions simultaneously. There also will be a delay, which depends on the external hardware implementation, between the FERR# assertion from the processor and the responding INTR assertion to the processor. Further, the interrupt request to the PICs (IRQ13) may be temporarily blocked by the operating system, or delayed by higher priority interrupts, and processor response to INTR itself is blocked if the operating system has cleared the IF bit in EFLAGS. Note that Streaming SIMD Extensions numeric exceptions will not cause assertion of FERR# (independent of the value of CR0.NE). In addition, they ignore the assertion/deassertion of IGNNE#.

However, just as with the Intel486 and Pentium processors, if the IGNNE# input is inactive, a floating-point exception which occurred in the previous x87 FPU instruction and is unmasked causes the processor to freeze immediately when encountering the next WAIT or x87 FPU instruction (except for no-wait instructions). This means that if the x87 FPU exception handler has not already been invoked due to the earlier exception (and therefore, the handler has not cleared that exception state from the x87 FPU), the processor is forced to wait for the handler to be invoked and handle the exception, before the processor can execute another WAIT or x87 FPU instruction.

As explained in Section D.2.1.3, “No-Wait x87 FPU Instructions Can Get x87 FPU Interrupt in Window,” if a no-wait instruction is used outside of the x87 FPU exception handler, in the Intel486 and Pentium processors, it may accept an unmasked exception from a previous x87 FPU instruction which happens to fall within the external interrupt sampling window that is opened near the beginning of execution of all x87 FPU instructions. This will not happen in the P6 family and Pentium 4 processors, because this sampling window has been removed from the no-wait group of x87 FPU instructions.

D.3

RECOMMENDED PROTOCOL FOR MS-DOS* COMPATIBILITY HANDLERS

The activities of numeric programs can be split into two major areas: program control and arithmetic. The program control part performs activities such as deciding what functions to perform, calculating addresses of numeric operands, and loop control. The arithmetic part simply adds, subtracts, multiplies, and performs other operations on the numeric operands. The processor is designed to handle these two parts separately and efficiently. An x87 FPU exception handler, if a system chooses to implement one, is often one of the most complicated parts of the program control code.


D.3.1

Floating-Point Exceptions and Their Defaults

The x87 FPU can recognize six classes of floating-point exception conditions while executing floating-point instructions:

1. #I: Invalid operation
   #IS: Stack fault
   #IA: IEEE standard invalid operation
2. #Z: Divide-by-zero
3. #D: Denormalized operand
4. #O: Numeric overflow
5. #U: Numeric underflow
6. #P: Inexact result (precision)

For complete details on these exceptions and their defaults, see Section 8.4, “x87 FPU Floating-Point Exception Handling,” and Section 8.5, “x87 FPU Floating-Point Exception Conditions.”
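The six classes listed above map onto individual flag bits in the low byte of the x87 FPU status word. The following C sketch decodes a status word value into the mnemonics used in this section. The helper name `x87_decode_exceptions` and the enum names are hypothetical; the bit positions (IE = bit 0 through PE = bit 5, SF = bit 6) are assumed from the status word layout described in Chapter 8 and are not restated in this appendix.

```c
#include <stddef.h>
#include <stdint.h>

/* Exception flag bits in the low byte of the x87 FPU status word
   (assumed layout; see the lead-in). */
enum {
    X87_IE = 1u << 0,  /* #I  invalid operation          */
    X87_DE = 1u << 1,  /* #D  denormalized operand       */
    X87_ZE = 1u << 2,  /* #Z  divide-by-zero             */
    X87_OE = 1u << 3,  /* #O  numeric overflow           */
    X87_UE = 1u << 4,  /* #U  numeric underflow          */
    X87_PE = 1u << 5,  /* #P  inexact result (precision) */
    X87_SF = 1u << 6   /* stack fault: separates #IS from #IA */
};

/* Hypothetical helper: stores the mnemonic of every flagged exception
   in out[] (room for 6 entries) and returns the number stored. */
size_t x87_decode_exceptions(uint16_t status_word, const char *out[6])
{
    size_t n = 0;

    if (status_word & X87_IE)  /* SF tells the two invalid-operation kinds apart */
        out[n++] = (status_word & X87_SF) ? "#IS" : "#IA";
    if (status_word & X87_ZE) out[n++] = "#Z";
    if (status_word & X87_DE) out[n++] = "#D";
    if (status_word & X87_OE) out[n++] = "#O";
    if (status_word & X87_UE) out[n++] = "#U";
    if (status_word & X87_PE) out[n++] = "#P";
    return n;
}
```

A handler body could call such a helper on the status word taken from a saved FNSTENV or FNSAVE image to decide which recovery path to follow.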

D.3.2

Two Options for Handling Numeric Exceptions

Depending on options determined by the software system designer, the processor takes one of two possible courses of action when a numeric exception occurs:

1. The x87 FPU can handle selected exceptions itself, producing a default fix-up that is reasonable in most situations. This allows the numeric program execution to continue undisturbed. Programs can mask individual exception types to indicate that the x87 FPU should generate this safe, reasonable result whenever the exception occurs. The default exception fix-up activity is treated by the x87 FPU as part of the instruction causing the exception; no external indication of the exception is given (except that the instruction takes longer to execute when it handles a masked exception). When masked exceptions are detected, a flag is set in the numeric status register, but no information is preserved regarding where or when it was set.

2. A software exception handler can be invoked to handle the exception. When a numeric exception is unmasked and the exception occurs, the x87 FPU stops further execution of the numeric instruction and causes a branch to a software exception handler. The exception handler can then implement any sort of recovery procedures desired for any numeric exception detectable by the x87 FPU.

D.3.2.1

Automatic Exception Handling: Using Masked Exceptions

Each of the six exception conditions described above has a corresponding flag bit in the x87 FPU status word and a mask bit in the x87 FPU control word. If an exception is masked (the corresponding mask bit in the control word = 1), the processor takes an appropriate default action and continues with the computation. The processor has a default fix-up activity for every possible exception condition it may encounter. These masked-exception responses are designed to be safe and are generally acceptable for most numeric applications.

For example, if the Inexact result (Precision) exception is masked, the system can specify whether the x87 FPU should handle a result that cannot be represented exactly by one of four modes of rounding: rounding it normally, chopping it toward zero, always rounding it up, or always down. If the Underflow exception is masked, the x87 FPU will store a number that is too small to be represented in normalized form as a denormal (or zero if it’s smaller than the smallest denormal). Note that when exceptions are masked, the x87 FPU may detect multiple exceptions in a single instruction, because it continues executing the instruction after performing its masked response. For example, the x87 FPU could detect a denormalized operand, perform its masked response to this exception, and then detect an underflow.

As an example of how even severe exceptions can be handled safely and automatically using the default exception responses, consider a calculation of the parallel resistance of several values using only the standard formula (see Figure D-4). If R1 becomes zero, the circuit resistance becomes zero. With the divide-by-zero and precision exceptions masked, the processor will produce the correct result. FDIV of R1 into 1 gives infinity, and then FDIV of (infinity + 1/R2 + 1/R3) into 1 gives zero.
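The parallel-resistance calculation described above can be sketched in C. This is a minimal illustration, assuming the C implementation follows IEEE 754 arithmetic with the divide-by-zero exception masked (the usual default), so that 1/0 yields +infinity and 1/infinity yields 0; the function name `parallel_resistance` is hypothetical, and whether the compiler emits x87 or SSE instructions does not change the result.

```c
/* Equivalent resistance of three resistors in parallel, using only the
   standard formula. No special case is needed for a zero resistance:
   with divide-by-zero masked, 1.0 / 0.0 is +infinity, the sum stays
   +infinity, and 1.0 / +infinity is 0.0. */
double parallel_resistance(double r1, double r2, double r3)
{
    double conductance = 1.0 / r1 + 1.0 / r2 + 1.0 / r3;
    return 1.0 / conductance;
}
```

For instance, `parallel_resistance(0.0, 100.0, 200.0)` evaluates to zero under these assumptions, matching the short-circuit case in the text, while `parallel_resistance(2.0, 2.0, 1.0)` evaluates to 0.5.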

[Circuit diagram not reproduced: resistors R1, R2, and R3 connected in parallel.]

$$\text{Equivalent Resistance} = \frac{1}{\dfrac{1}{R1} + \dfrac{1}{R2} + \dfrac{1}{R3}}$$

Figure D-4. Arithmetic Example Using Infinity

By masking or unmasking specific numeric exceptions in the x87 FPU control word, programmers can delegate responsibility for most exceptions to the processor, reserving the most severe exceptions for programmed exception handlers. Exception-handling software is often difficult to write, and the masked responses have been tailored to deliver the most reasonable result for each condition. For the majority of applications, masking all exceptions yields satisfactory results with the least programming effort. Certain exceptions can usefully be left unmasked during the debugging phase of software development, and then masked when the clean software is actually run. An invalid-operation exception, for example, typically indicates a program error that must be corrected.

The exception flags in the x87 FPU status word provide a cumulative record of exceptions that have occurred since these flags were last cleared. Once set, these flags can be cleared only by executing the FCLEX/FNCLEX (clear exceptions) instruction, by reinitializing the x87 FPU with FINIT/FNINIT or FSAVE/FNSAVE, or by overwriting the flags with an FRSTOR or FLDENV instruction. This allows a programmer to mask all exceptions, run a calculation, and then inspect the status word to see if any exceptions were detected at any point in the calculation.
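The "mask everything, compute, then inspect" pattern described above can be illustrated with a small software model. The sketch below does not touch real hardware: `fpu_model`, `model_divide`, `model_fnclex`, and `run_and_inspect` are hypothetical names, and the model tracks only the divide-by-zero flag. It is meant only to show that the flags are cumulative and persist until explicitly cleared.

```c
#include <stdint.h>

enum { EX_ZE = 1u << 2, EX_ALL = 0x3F };  /* assumed bit positions, bits 0-5 */

/* Software model of the two registers involved (not hardware access). */
typedef struct {
    uint16_t control;  /* bits 0-5: exception masks, 1 = masked */
    uint16_t status;   /* bits 0-5: sticky exception flags      */
} fpu_model;

/* Models a masked divide: records divide-by-zero and still returns the
   default result (infinity, under IEEE 754 arithmetic). */
double model_divide(fpu_model *fpu, double a, double b)
{
    if (b == 0.0 && a != 0.0)
        fpu->status |= EX_ZE;      /* stays set until explicitly cleared */
    return a / b;
}

/* Models FNCLEX: clears the exception flags only. */
void model_fnclex(fpu_model *fpu)
{
    fpu->status &= (uint16_t)~EX_ALL;
}

/* Mask all exceptions, run a calculation, then report which flags
   were raised at any point during it. */
uint16_t run_and_inspect(fpu_model *fpu, const double *divisors, int n)
{
    double acc = 0.0;

    fpu->control |= EX_ALL;        /* mask everything */
    model_fnclex(fpu);             /* start from a clean record */
    for (int i = 0; i < n; i++)
        acc += model_divide(fpu, 1.0, divisors[i]);
    (void)acc;                     /* result itself is not needed here */
    return fpu->status & EX_ALL;
}
```

A nonzero return value indicates that at least one exception occurred somewhere in the loop, but, as the text notes for masked exceptions, not where or when.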

D.3.2.2

Software Exception Handling

If the x87 FPU in or with an IA-32 processor (Intel 286 and onwards) encounters an unmasked exception condition, with the system operated in the MS-DOS compatibility mode and with IGNNE# not asserted, a software exception handler is invoked through a PIC and the processor’s INTR pin. The FERR# (or ERROR#) output from the x87 FPU that begins the process of invoking the exception handler may occur when the error condition is first detected, or when the processor encounters the next WAIT or x87 FPU instruction. Which of these two cases occurs depends on the processor generation and also on which exception and which x87 FPU instruction triggered it, as discussed earlier in Section D.1, “MS-DOS Compatibility Sub-mode for Handling x87 FPU Exceptions,” and Section D.2, “Implementation of the MS-DOS Compatibility Sub-mode in the Intel486, Pentium, and P6 Processor Family, and Pentium 4 Processors.”

The elapsed time between the initial error signal and the invocation of the x87 FPU exception handler depends on the external hardware interface, and also on whether the external interrupt for x87 FPU errors is enabled. But the architecture ensures that the handler will be invoked before execution of the next WAIT or floating-point instruction since an unmasked floating-point exception causes the processor to freeze just before executing such an instruction (unless the IGNNE# input is active, or it is a no-wait x87 FPU instruction).

The frozen processor waits for an external interrupt, which must be supplied by external hardware in response to the FERR# (or ERROR#) output of the processor (or coprocessor), usually through IRQ13 on the “slave” PIC, and then through INTR. Then the external interrupt invokes the exception handling routine. Note that if the external interrupt for x87 FPU errors is disabled when the processor executes an x87 FPU instruction, the processor will freeze until some other (enabled) interrupt occurs if an unmasked x87 FPU exception condition is in effect. If NE = 0 but the IGNNE# input is active, the processor disregards the exception and continues. Error reporting via an external interrupt is supported for MS-DOS compatibility. Chapter 17, “IA-32 Architecture Compatibility,” of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, contains further discussion of compatibility issues.

The references above to the ERROR# output from the x87 FPU apply to the Intel 387 and Intel 287 math coprocessors (NPX chips). If one of these coprocessors encounters an unmasked exception condition, it signals the exception to the Intel 286 or Intel386 processor using the ERROR# status line between the processor and the coprocessor. See Section D.1, “MS-DOS Compatibility Sub-mode for Handling x87 FPU Exceptions,” in this appendix, and Chapter 17, “IA-32 Architecture Compatibility,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, for differences in x87 FPU exception handling.

The exception-handling routine is normally a part of the systems software. The routine must clear (or disable) the active exception flags in the x87 FPU status word before executing any floating-point instructions that cannot complete execution when there is a pending floating-point exception. Otherwise, the floating-point instruction will trigger the x87 FPU interrupt again, and the system will be caught in an endless loop of nested floating-point exceptions, and hang. In any event, the routine must clear (or disable) the active exception flags in the x87 FPU status word after handling them, and before IRET(D). Typical exception responses may include:

• Incrementing an exception counter for later display or printing.

• Printing or displaying diagnostic information (e.g., the x87 FPU environment and registers).

• Aborting further execution, or using the exception pointers to build an instruction that will run without exception and executing it.

Applications programmers should consult their operating system’s reference manuals for the appropriate system response to numerical exceptions. For systems programmers, some details on writing software exception handlers are provided in Chapter 5, “Interrupt and Exception Handling,” in the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A, as well as in Section D.3.4, “x87 FPU Exception Handling Examples,” in this appendix.

As discussed in Section D.2.1.2, “Recommended External Hardware to Support the MS-DOS Compatibility Sub-mode,” some early FERR# to INTR hardware interface implementations are less robust than the recommended circuit. This is because they depended on the exception handler to clear the x87 FPU exception interrupt request to the PIC (by accessing port 0F0H) before the handler causes FERR# to be de-asserted by clearing the exception from the x87 FPU itself. To eliminate the chance of a problem with this early hardware, Intel recommends that x87 FPU exception handlers always access port 0F0H before clearing the error condition from the x87 FPU.

D.3.3

Synchronization Required for Use of x87 FPU Exception Handlers

Concurrency or synchronization management requires a check for exceptions before letting the processor change a value just used by the x87 FPU. It is important to remember that almost any numeric instruction can, under the wrong circumstances, produce a numeric exception.


D.3.3.1

Exception Synchronization: What, Why and When

Exception synchronization means that the exception handler inspects and deals with the exception in the context in which it occurred. If concurrent execution is allowed, the state of the processor when it recognizes the exception is often not in the context in which it occurred. The processor may have changed many of its internal registers and be executing a totally different program by the time the exception occurs. If the exception handler cannot recapture the original context, it cannot reliably determine the cause of the exception or recover successfully from the exception. To handle this situation, the x87 FPU has special registers updated at the start of each numeric instruction to describe the state of the numeric program when the failed instruction was attempted. This provides tools to help the exception handler recapture the original context, but the application code must also be written with synchronization in mind. Overall, exception synchronization must ensure that the x87 FPU and other relevant parts of the context are in a well defined state when the handler is invoked after an unmasked numeric exception occurs.

When the x87 FPU signals an unmasked exception condition, it is requesting help. The fact that the exception was unmasked indicates that further numeric program execution under the arithmetic and programming rules of the x87 FPU will probably yield invalid results. Thus the exception must be handled, and with proper synchronization, or the program will not operate reliably.

For programmers using higher-level languages, all required synchronization is automatically provided by the appropriate compiler. However, for assembly language programmers, exception synchronization remains the responsibility of the programmer. It is not uncommon for a programmer to expect that their numeric program will not cause numeric exceptions after it has been tested and debugged, but in a different system or numeric environment, exceptions may occur regularly nonetheless. An obvious example would be use of the program with some numbers beyond the range for which it was designed and tested. Example D-1 and Example D-2 in Section D.3.3.2, “Exception Synchronization Examples,” show a subtle way in which unexpected exceptions can occur.

As described in Section D.3.1, “Floating-Point Exceptions and Their Defaults,” depending on options determined by the software system designer, the processor can perform one of two possible courses of action when a numeric exception occurs.



• The x87 FPU can provide a default fix-up for selected numeric exceptions. If the x87 FPU performs its default action for all exceptions, then the need for exception synchronization is not manifest. However, code is often ported to contexts and operating systems for which it was not originally designed. Example D-1 and Example D-2, below, illustrate that it is safest to always consider exception synchronization when designing code that uses the x87 FPU.

• Alternatively, a software exception handler can be invoked to handle the exception. When a numeric exception is unmasked and the exception occurs, the x87 FPU stops further execution of the numeric instruction and causes a branch to a software exception handler. When an x87 FPU exception handler will be invoked, synchronization must always be considered to assure reliable performance. Example D-1 and Example D-2, below, illustrate the need to always consider exception synchronization when writing numeric code, even when the code is initially intended for execution with exceptions masked.

D.3.3.2

Exception Synchronization Examples

In the following examples, three instructions are shown to load an integer, calculate its square root, then increment the integer. The synchronous execution of the x87 FPU will allow both of these programs to execute correctly, with INC COUNT being executed in parallel in the processor, as long as no exceptions occur on the FILD instruction. However, if the code is later moved to an environment where exceptions are unmasked, the code in Example D-1 will not work correctly:

Example D-1. Incorrect Error Synchronization

    FILD COUNT      ;x87 FPU instruction
    INC COUNT       ;integer instruction alters operand
    FSQRT           ;subsequent x87 FPU instruction -- error
                    ;from previous x87 FPU instruction detected here

Example D-2. Proper Error Synchronization

    FILD COUNT      ;x87 FPU instruction
    FSQRT           ;subsequent x87 FPU instruction -- error from
                    ;previous x87 FPU instruction detected here
    INC COUNT       ;integer instruction alters operand

In some operating systems supporting the x87 FPU, the numeric register stack is extended to memory. To extend the x87 FPU stack to memory, the invalid exception is unmasked. A push to a full register or pop from an empty register sets SF (Stack Fault flag) and causes an invalid operation exception. The recovery routine for the exception must recognize this situation, fix up the stack, then perform the original operation. The recovery routine will not work correctly in Example D-1. The problem is that the value of COUNT increments before the exception handler is invoked, so that the recovery routine will load an incorrect value of COUNT, causing the program to fail or behave unreliably.
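The difference between the two orderings can be made concrete with a toy model in C. This is only a sketch under a simplifying assumption: an unmasked exception raised by one x87 FPU instruction is delivered to the handler when the next x87 FPU instruction starts (the deferred case), and the handler re-reads COUNT at that moment. The `sim` structure and all function names are hypothetical.

```c
#include <stdbool.h>

/* Toy model, not a description of processor internals. */
typedef struct {
    int  count;            /* the memory operand COUNT                   */
    bool pending;          /* unmasked exception waiting to be delivered */
    int  seen_by_handler;  /* COUNT as observed by the handler, or -1    */
} sim;

/* Every x87 instruction first checks for a pending exception; the
   handler runs at that point and reloads COUNT from memory. */
static void deliver_if_pending(sim *s)
{
    if (s->pending) {
        s->seen_by_handler = s->count;
        s->pending = false;
    }
}

static void fild_count(sim *s, bool faults)
{
    deliver_if_pending(s);
    s->pending = faults;   /* e.g. stack fault on a full register stack */
}

static void fsqrt(sim *s)     { deliver_if_pending(s); }
static void inc_count(sim *s) { s->count++; }  /* integer unit: no check */

/* Example D-1 ordering: FILD, INC, FSQRT. */
int handler_sees_d1(int count)
{
    sim s = { count, false, -1 };
    fild_count(&s, true);
    inc_count(&s);         /* COUNT changes before the handler runs */
    fsqrt(&s);
    return s.seen_by_handler;
}

/* Example D-2 ordering: FILD, FSQRT, INC. */
int handler_sees_d2(int count)
{
    sim s = { count, false, -1 };
    fild_count(&s, true);
    fsqrt(&s);             /* handler runs while COUNT is still intact */
    inc_count(&s);
    return s.seen_by_handler;
}
```

With COUNT initially 5, the model's handler observes 6 for the Example D-1 ordering and 5 for the Example D-2 ordering, which is the discrepancy the recovery routine would trip over.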

D.3.3.3

Proper Exception Synchronization

As explained in Section D.2.1.2, “Recommended External Hardware to Support the MS-DOS Compatibility Sub-mode,” if the x87 FPU encounters an unmasked exception condition a software exception handler is invoked before execution of the next WAIT or floating-point instruction. This is because an unmasked floating-point exception causes the processor to freeze immediately before executing such an instruction (unless the IGNNE# input is active, or it is a no-wait x87 FPU instruction). Exactly when the exception handler will be invoked (in the interval between when the exception is detected and the next WAIT or x87 FPU instruction) is dependent on the processor generation, the system, and which x87 FPU instruction and exception is involved.

To be safe in exception synchronization, one should assume the handler will be invoked at the end of the interval. Thus the program should not change any value that might be needed by the handler (such as COUNT in Example D-1 and Example D-2) until after the next x87 FPU instruction following an x87 FPU instruction that could cause an error. If the program needs to modify such a value before the next x87 FPU instruction (or if the next x87 FPU instruction could also cause an error), then a WAIT instruction should be inserted before the value is modified. This will force the handling of any exception before the value is modified. A WAIT instruction should also be placed after the last floating-point instruction in an application so that any unmasked exceptions will be serviced before the task completes.

D.3.4

x87 FPU Exception Handling Examples

There are m any approaches t o writ ing except ion handlers. One useful t echnique is t o consider t he except ion handler procedure as consist ing of “ prologue,” “ body,” and “ epilogue” sect ions of code. I n t he t ransfer of cont rol t o t he except ion handler due t o an I NTR, NMI , or SMI , ext ernal int errupt s have been disabled by hardware. The prologue perform s all funct ions t hat m ust be prot ect ed from possible int errupt ion by higher- priorit y sources. Typically, t his involves saving regist ers and t ransferring diagnost ic inform at ion from t he x87 FPU t o m em ory. When t he crit ical processing has been com plet ed, t he prologue m ay re- enable int errupt s t o allow higher- priorit y int errupt handlers t o preem pt t he except ion handler. The st andard “ prologue” not only saves t he regist ers and t ransfers diagnost ic inform at ion from t he x87 FPU t o m em ory but also clears t he float ing- point except ion flags in t he st at us word. Alt ernat ively, when it is not necessary for t he handler t o be re- ent rant , anot her t echnique m ay also be used. I n t his t echnique, t he except ion flags are not cleared in t he “ prologue” and t he body of t he handler m ust not cont ain any float ing- point inst ruct ions t hat cannot com plet e execut ion when t here is a pending float ing- point except ion. ( The no- wait inst ruct ions are discussed in Sect ion 8.3.12, “ Wait ing vs. Non- wait ing I nst ruct ions.” ) Not e t hat t he handler m ust st ill clear t he except ion flag( s) before execut ing t he I RET. I f t he except ion handler uses neit her of t hese t echniques, t he syst em will be caught in an endless loop of nest ed float ing- point except ions, and hang. The body of t he except ion handler exam ines t he diagnost ic inform at ion and m akes a response t hat is necessarily applicat ion- dependent . 
This response m ay range from halt ing execut ion, t o displaying a m essage, t o at t em pt ing t o repair t he problem and proceed wit h norm al execut ion. The epilogue essent ially reverses t he act ions of t he prologue, rest oring t he processor so t hat norm al execut ion can be resum ed. The

D-18 Vol. 1

GUIDELINES FOR WRITING X87 FPU EXCEPTION HANDLERS

epilogue m ust not load an unm asked except ion flag int o t he x87 FPU or anot her except ion will be request ed im m ediat ely. The following code exam ples show t he ASM386/ 486 coding of t hree skelet on except ion handlers, wit h t he save spaces given as correct for 32- bit prot ect ed m ode. They show how prologues and epilogues can be writ t en for various sit uat ions, but t he applicat ion- dependent except ion handling body is j ust indicat ed by com m ent s showing where it should be placed. The first t wo are very sim ilar; t heir only subst ant ial difference is t heir choice of inst ruct ions t o save and rest ore t he x87 FPU. The t rade- off here is bet ween t he increased diagnost ic inform at ion provided by FNSAVE and t he fast er execut ion of FNSTENV. ( Also, aft er saving t he original cont ent s, FNSAVE re- init ializes t he x87 FPU, while FNSTENV only m asks all x87 FPU except ions.) For applicat ions t hat are sensit ive t o int errupt lat ency or t hat do not need t o exam ine regist er cont ent s, FNSTENV reduces t he durat ion of t he “ crit ical region,” during which t he processor does not recognize anot her int errupt request . ( See t he Sect ion 8.1.10, “ Saving t he x87 FPU’s St at e wit h FSTENV/ FNSTENV and FSAVE/ FNSAVE,” for a com plet e descript ion of t he x87 FPU save im age.) I f t he processor support s St ream ing SI MD Ext ensions and t he operat ing syst em support s it , t he FXSAVE inst ruct ion should be used inst ead of FNSAVE. I f t he FXSAVE inst ruct ion is used, t he save area should be increased t o 512 byt es and aligned t o 16 byt es t o save t he ent ire st at e. These st eps will ensure t hat t he com plet e cont ext is saved. Aft er t he except ion handler body, t he epilogues prepare t he processor t o resum e execut ion from t he point of int errupt ion ( for exam ple, t he inst ruct ion following t he one t hat generat ed t he unm asked except ion) . 
Notice that the exception flags in the memory image that is loaded into the x87 FPU are cleared to zero prior to reloading (in these examples, the entire low byte of the status word image is cleared).

Example D-3 and Example D-4 assume that the exception handler itself will not cause an unmasked exception. Where this is a possibility, the general approach shown in Example D-5 can be employed. The basic technique is to save the full x87 FPU state and then to load a new control word in the prologue. Note that considerable care should be taken when designing an exception handler of this type to prevent the handler from being reentered endlessly.
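The save-area sizes quoted above (28 bytes for the FNSTENV environment, 108 bytes for the FNSAVE state, and 512 bytes aligned to 16 bytes for FXSAVE) can be written down as a minimal C sketch. The type and function names here (fnstenv_area, fnsave_area, fxsave_area, is_aligned) are illustrative assumptions, not names from this manual, and the sketch only reserves storage; it does not execute any save instruction.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes taken from the text; the first two are the 32-bit protected mode sizes. */
enum {
    X87_ENV_BYTES   = 28,   /* FNSTENV / FLDENV environment image */
    X87_STATE_BYTES = 108,  /* FNSAVE / FRSTOR full state image   */
    FXSAVE_BYTES    = 512,  /* FXSAVE image                       */
    FXSAVE_ALIGN    = 16    /* required alignment of FXSAVE image */
};

/* Hypothetical container types (names are assumptions for illustration). */
typedef struct { uint8_t bytes[X87_ENV_BYTES]; }   fnstenv_area;
typedef struct { uint8_t bytes[X87_STATE_BYTES]; } fnsave_area;
typedef struct { alignas(FXSAVE_ALIGN) uint8_t bytes[FXSAVE_BYTES]; } fxsave_area;

/* Returns nonzero if p is a multiple of align bytes. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p % align) == 0;
}
```

A handler that moves from FNSAVE to FXSAVE has to change both the size and the alignment of its save area; subtracting a larger constant from ESP is not enough unless the resulting address is also rounded down to a 16-byte boundary.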


Example D-3. Full-State Exception Handler

SAVE_ALL PROC
;
;SAVE REGISTERS, ALLOCATE STACK SPACE FOR x87 FPU STATE IMAGE
        PUSH    EBP
        .
        .
        MOV     EBP, ESP
        SUB     ESP, 108                  ;ALLOCATES 108 BYTES (32-bit PROTECTED MODE SIZE)
;SAVE FULL x87 FPU STATE, RESTORE INTERRUPT ENABLE FLAG (IF)
        FNSAVE  [EBP-108]
        PUSH    [EBP + OFFSET_TO_EFLAGS]  ;COPY OLD EFLAGS TO STACK TOP
        POPFD                             ;RESTORE IF TO VALUE BEFORE x87 FPU EXCEPTION
;
;APPLICATION-DEPENDENT EXCEPTION HANDLING CODE GOES HERE
;
;CLEAR EXCEPTION FLAGS IN STATUS WORD (WHICH IS IN MEMORY)
;RESTORE MODIFIED STATE IMAGE
        MOV     BYTE PTR [EBP-104], 0H
        FRSTOR  [EBP-108]
;DE-ALLOCATE STACK SPACE, RESTORE REGISTERS
        MOV     ESP, EBP
        .
        .
        POP     EBP
;
;RETURN TO INTERRUPTED CALCULATION
        IRETD
SAVE_ALL ENDP

Example D-4. Reduced-Latency Exception Handler

SAVE_ENVIRONMENT PROC
;
;SAVE REGISTERS, ALLOCATE STACK SPACE FOR x87 FPU ENVIRONMENT
        PUSH    EBP
        .
        .
        MOV     EBP, ESP
        SUB     ESP, 28                   ;ALLOCATES 28 BYTES (32-bit PROTECTED MODE SIZE)
;SAVE ENVIRONMENT, RESTORE INTERRUPT ENABLE FLAG (IF)
        FNSTENV [EBP-28]
        PUSH    [EBP + OFFSET_TO_EFLAGS]  ;COPY OLD EFLAGS TO STACK TOP
        POPFD                             ;RESTORE IF TO VALUE BEFORE x87 FPU EXCEPTION
;
;APPLICATION-DEPENDENT EXCEPTION HANDLING CODE GOES HERE
;
;CLEAR EXCEPTION FLAGS IN STATUS WORD (WHICH IS IN MEMORY)
;RESTORE MODIFIED ENVIRONMENT IMAGE
        MOV     BYTE PTR [EBP-24], 0H
        FLDENV  [EBP-28]
;DE-ALLOCATE STACK SPACE, RESTORE REGISTERS
        MOV     ESP, EBP
        .
        .
        POP     EBP
;
;RETURN TO INTERRUPTED CALCULATION
        IRETD
SAVE_ENVIRONMENT ENDP

Example D-5. Reentrant Exception Handler

        .
        .
LOCAL_CONTROL DW ?                        ;ASSUME INITIALIZED
        .
        .
REENTRANT PROC
;
;SAVE REGISTERS, ALLOCATE STACK SPACE FOR x87 FPU STATE IMAGE
        PUSH    EBP
        .
        .
        MOV     EBP, ESP
        SUB     ESP, 108                  ;ALLOCATES 108 BYTES (32-bit PROTECTED MODE SIZE)
;SAVE STATE, LOAD NEW CONTROL WORD, RESTORE INTERRUPT ENABLE FLAG (IF)
        FNSAVE  [EBP-108]
        FLDCW   LOCAL_CONTROL
        PUSH    [EBP + OFFSET_TO_EFLAGS]  ;COPY OLD EFLAGS TO STACK TOP
        POPFD                             ;RESTORE IF TO VALUE BEFORE x87 FPU EXCEPTION
        .
        .
;
;APPLICATION-DEPENDENT EXCEPTION HANDLING CODE
;GOES HERE - AN UNMASKED EXCEPTION
;GENERATED HERE WILL CAUSE THE EXCEPTION HANDLER TO BE REENTERED
;IF LOCAL STORAGE IS NEEDED, IT MUST BE ALLOCATED ON THE STACK
        .
;CLEAR EXCEPTION FLAGS IN STATUS WORD (WHICH IS IN MEMORY)
;RESTORE MODIFIED STATE IMAGE
        MOV     BYTE PTR [EBP-104], 0H
        FRSTOR  [EBP-108]
;DE-ALLOCATE STACK SPACE, RESTORE REGISTERS
        MOV     ESP, EBP
        .
        .
        POP     EBP
;
;RETURN TO POINT OF INTERRUPTION
        IRETD
REENTRANT ENDP
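The displacements used in the three examples (EBP-108 with EBP-104, and EBP-28 with EBP-24) follow from the status word sitting 4 bytes into the 32-bit protected mode save image. The C sketch below models that image to make the arithmetic explicit. It is an illustration under stated assumptions: the struct and field names are hypothetical, and the layout is the 32-bit protected mode format described in Section 8.1.10, written from that description rather than copied from it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed 28-byte environment image (32-bit protected mode).
   Each 16-bit word occupies a 32-bit slot; names are illustrative. */
typedef struct {
    uint16_t control_word;  uint16_t reserved0;  /* offset 0  */
    uint16_t status_word;   uint16_t reserved1;  /* offset 4  */
    uint16_t tag_word;      uint16_t reserved2;  /* offset 8  */
    uint32_t fpu_ip_offset;                      /* offset 12 */
    uint32_t fpu_ip_selector_and_opcode;         /* offset 16 */
    uint32_t fpu_dp_offset;                      /* offset 20 */
    uint32_t fpu_dp_selector;                    /* offset 24 */
} x87_env32;

/* Assumed 108-byte full state image: environment plus eight
   10-byte data registers. */
typedef struct {
    x87_env32 env;
    uint8_t   st[8][10];
} x87_state32;

/* Mirrors MOV BYTE PTR [EBP-104], 0H: clears only the low byte of the
   status word image, which holds the exception flags, and leaves the
   high byte untouched. */
static void clear_exception_flags(x87_env32 *env)
{
    env->status_word = (uint16_t)(env->status_word & 0xFF00u);
}

/* Displacement from EBP of the status word when the image is stored
   at [EBP - image_bytes], as in the examples. */
static long status_word_disp(size_t image_bytes)
{
    return (long)offsetof(x87_env32, status_word) - (long)image_bytes;
}
```

Clearing a single byte rather than the whole word matters for the full-state handlers: the high byte of the status word is reloaded by FRSTOR or FLDENV along with the rest of the image, so the sketch leaves it as it was saved.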

D.3.5 Need for Storing State of IGNNE# Circuit If Using x87 FPU and SMM

The recommended circuit (see Figure D-1) for MS-DOS compatibility x87 FPU exception handling for Intel486 processors and beyond contains two flip flops. When the x87 FPU exception handler accesses I/O port 0F0H it clears the IRQ13 interrupt request output from Flip Flop #1 and also clocks out the IGNNE# signal (active) from Flip Flop #2. The assertion of IGNNE# may be used by the handler if needed to execute any x87 FPU instruction while ignoring the pending x87 FPU errors. The problem here is that the state of Flip Flop #2 is effectively an additional (but hidden) status bit that can affect processor behavior, and so ideally should be saved upon entering SMM, and restored before resuming to normal operation. If this is not d