English Pages 690 Year 2016
Intel® 64 and IA-32 Architectures Optimization Reference Manual
Order Number: 248966-035 November 2016
Intel technologies features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses. You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or by visiting http://www.intel.com/design/literature.htm.
Intel, the Intel logo, Intel Atom, Intel Core, Intel SpeedStep, MMX, Pentium, VTune, and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.
Copyright © 1997-2016, Intel Corporation. All Rights Reserved.
CONTENTS
CHAPTER 1 INTRODUCTION 1.1 1.2 1.3

CHAPTER 2 INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES
2.1 THE SKYLAKE MICROARCHITECTURE
2.1.1 The Front End
2.1.2 The Out-of-Order Execution Engine
2.1.3 Cache and Memory Subsystem
2.2 THE HASWELL MICROARCHITECTURE
2.2.1 The Front End
2.2.2 The Out-of-Order Engine
2.2.3 Execution Engine
2.2.4 Cache and Memory Subsystem
2.2.4.1 Load and Store Operation Enhancements
2.2.5 The Haswell-E Microarchitecture
2.2.6 The Broadwell Microarchitecture
2.3 INTEL® MICROARCHITECTURE CODE NAME SANDY BRIDGE
2.3.1 Intel® Microarchitecture Code Name Sandy Bridge Pipeline Overview
2.3.2 The Front End
2.3.2.1 Legacy Decode Pipeline
2.3.2.2 Decoded ICache
2.3.2.3 Branch Prediction
2.3.2.4 Micro-op Queue and the Loop Stream Detector (LSD)
2.3.3 The Out-of-Order Engine
2.3.3.1 Renamer
2.3.3.2 Scheduler
2.3.4 The Execution Core
2.3.5 Cache Hierarchy
2.3.5.1 Load and Store Operation Overview
2.3.5.2 L1 DCache
2.3.5.3 Ring Interconnect and Last Level Cache
2.3.5.4 Data Prefetching
2.3.6 System Agent
2.3.7 Intel® Microarchitecture Code Name Ivy Bridge
2.4 INTEL® CORE™ MICROARCHITECTURE AND ENHANCED INTEL® CORE™ MICROARCHITECTURE
2.4.1 Intel® Core™ Microarchitecture Pipeline Overview
2.4.2 Front End
2.4.2.1 Branch Prediction Unit
2.4.2.2 Instruction Fetch Unit
2.4.2.3 Instruction Queue (IQ)
2.4.2.4 Instruction Decode
2.4.2.5 Stack Pointer Tracker
2.4.2.6 Micro-fusion
2.4.3 Execution Core
2.4.3.1 Issue Ports and Execution Units
2.4.4 Intel® Advanced Memory Access
2.4.4.1 Loads and Stores
2.4.4.2 Data Prefetch to L1 caches
2.4.4.3 Data Prefetch Logic
2.4.4.4 Store Forwarding
2.4.4.5 Memory Disambiguation
2.4.5 Intel® Advanced Smart Cache
2.4.5.1 Loads
2.4.5.2 Stores
2.5 INTEL® MICROARCHITECTURE CODE NAME NEHALEM
2.5.1 Microarchitecture Pipeline
2.5.2 Front End Overview
2.5.3 Execution Engine
2.5.3.1 Issue Ports and Execution Units
2.5.4 Cache and Memory Subsystem
2.5.5 Load and Store Operation Enhancements
2.5.5.1 Efficient Handling of Alignment Hazards
2.5.5.2 Store Forwarding Enhancement
2.5.6 REP String Enhancement
2.5.7 Enhancements for System Software
2.5.8 Efficiency Enhancements for Power Consumption
2.5.9 Hyper-Threading Technology Support in Intel® Microarchitecture Code Name Nehalem
2.6 INTEL® HYPER-THREADING TECHNOLOGY
2.6.1 Processor Resources and HT Technology
2.6.1.1 Replicated Resources
2.6.1.2 Partitioned Resources
2.6.1.3 Shared Resources
2.6.2 Microarchitecture Pipeline and HT Technology
2.6.3 Front End Pipeline
2.6.4 Execution Core
2.6.5 Retirement
2.7 INTEL® 64 ARCHITECTURE
2.8 SIMD TECHNOLOGY
2.9 SUMMARY OF SIMD TECHNOLOGIES AND APPLICATION LEVEL EXTENSIONS
2.9.1 MMX™ Technology
2.9.2 Streaming SIMD Extensions
2.9.3 Streaming SIMD Extensions 2
2.9.4 Streaming SIMD Extensions 3
2.9.5 Supplemental Streaming SIMD Extensions [...]
2.9.6 [...]
2.9.7 [...]
2.9.8 [...] and PCLMULQDQ
2.9.9 Intel® Advanced Vector Extensions
2.9.10 Half-Precision Floating-Point Conversion (F16C)
2.9.11 RDRAND
2.9.12 Fused-Multiply-ADD (FMA) Extensions
2.9.13 Intel AVX2
2.9.14 General-Purpose Bit-Processing Instructions
2.9.15 Intel® Transactional Synchronization Extensions
2.9.16 RDSEED
2.9.17 ADCX and ADOX Instructions
CHAPTER 3 GENERAL OPTIMIZATION GUIDELINES
3.1 PERFORMANCE TOOLS
3.1.1 Intel® C++ and Fortran Compilers
3.1.2 General Compiler Recommendations
3.1.3 VTune™ Performance Analyzer
3.2 PROCESSOR PERSPECTIVES
3.2.1 CPUID Dispatch Strategy and Compatible Code Strategy
3.2.2 Transparent Cache-Parameter Strategy
3.2.3 Threading Strategy and Hardware Multithreading Support
3.3 CODING RULES, SUGGESTIONS AND TUNING HINTS
3.4 OPTIMIZING THE FRONT END
3.4.1 Branch Prediction Optimization
3.4.1.1 Eliminating Branches
3.4.1.2 Spin-Wait and Idle Loops
3.4.1.3 Static Prediction
3.4.1.4 Inlining, Calls and Returns
3.4.1.5 Code Alignment
3.4.1.6 Branch Type Selection
3.4.1.7 Loop Unrolling
3.4.1.8 Compiler Support for Branch Prediction
3.4.2 Fetch and Decode Optimization
3.4.2.1 Optimizing for Micro-fusion
3.4.2.2 Optimizing for Macro-fusion
3.4.2.3 Length-Changing Prefixes (LCP)
3.4.2.4 Optimizing the Loop Stream Detector (LSD)
3.4.2.5 Exploit LSD Micro-op Emission Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge
3.4.2.6 Optimization for Decoded ICache
3.4.2.7 Other Decoding Guidelines
3.5 OPTIMIZING THE EXECUTION CORE
3.5.1 Instruction Selection
3.5.1.1 Use of the INC and DEC Instructions
3.5.1.2 Integer Divide
3.5.1.3 Using LEA
3.5.1.4 ADC and SBB in Intel® Microarchitecture Code Name Sandy Bridge
3.5.1.5 Bitwise Rotation
3.5.1.6 Variable Bit Count Rotation and Shift
3.5.1.7 Address Calculations
3.5.1.8 Clearing Registers and Dependency Breaking Idioms
3.5.1.9 Compares
3.5.1.10 Using NOPs
3.5.1.11 Mixing SIMD Data Types
3.5.1.12 Spill Scheduling
3.5.1.13 Zero-Latency MOV Instructions
3.5.2 Avoiding Stalls in Execution Core
3.5.2.1 ROB Read Port Stalls
3.5.2.2 Writeback Bus Conflicts
3.5.2.3 Bypass between Execution Domains
3.5.2.4 Partial Register Stalls
3.5.2.5 Partial XMM Register Stalls
3.5.2.6 Partial Flag Register Stalls
3.5.2.7 Floating-Point/SIMD Operands
3.5.3 Vectorization
3.5.4 Optimization of Partially Vectorizable Code
3.5.4.1 Alternate Packing Techniques
3.5.4.2 Simplifying Result Passing
3.5.4.3 Stack Optimization
3.5.4.4 Tuning Considerations
3.6 OPTIMIZING MEMORY ACCESSES
3.6.1 Load and Store Execution Bandwidth
3.6.1.1 Make Use of Load Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge
3.6.1.2 L1D Cache Latency in Intel® Microarchitecture Code Name Sandy Bridge
3.6.1.3 Handling L1D Cache Bank Conflict
3.6.2 Minimize Register Spills
3.6.3 Enhance Speculative Execution and Memory Disambiguation
3.6.4 Alignment
3.6.5 Store Forwarding
3.6.5.1 Store-to-Load-Forwarding Restriction on Size and Alignment
3.6.5.2 Store-forwarding Restriction on Data Availability
3.6.6 Data Layout Optimizations
3.6.7 Stack Alignment
3.6.8 Capacity Limits and Aliasing in Caches
3.6.8.1 Capacity Limits in Set-Associative Caches
3.6.8.2 Aliasing Cases in the Pentium® M, Intel® Core™ Solo, Intel® Core™ Duo and Intel® Core™ 2 Duo Processors
3.6.9 Mixing Code and Data
3.6.9.1 Self-modifying Code
3.6.9.2 Position Independent Code
3.6.10 Write Combining
3.6.11 Locality Enhancement
3.6.12 Minimizing Bus Latency
3.6.13 Non-Temporal Store Bus Traffic
CONTENTS PAGE
3.7 PREFETCHING . . . 3-63
3.7.1 Hardware Instruction Fetching and Software Prefetching . . . 3-63
3.7.2 Hardware Prefetching for First-Level Data Cache . . . 3-64
3.7.3 Hardware Prefetching for Second-Level Cache . . . 3-66
3.7.4 Cacheability Instructions . . . 3-66
3.7.5 REP Prefix and Data Movement . . . 3-66
3.7.6 Enhanced REP MOVSB and STOSB operation (ERMSB) . . . 3-69
3.7.6.1 Memcpy Considerations . . . 3-69
3.7.6.2 Memmove Considerations . . . 3-70
3.7.6.3 Memset Considerations . . . 3-71
3.8 FLOATING-POINT CONSIDERATIONS . . . 3-71
3.8.1 Guidelines for Optimizing Floating-point Code . . . 3-71
3.8.2 Microarchitecture Specific Considerations . . . 3-72
3.8.2.1 Long-Latency FP Instructions . . . 3-72
3.8.2.2 Miscellaneous Instructions . . . 3-72
3.8.3 Floating-point Modes and Exceptions . . . 3-72
3.8.3.1 Floating-point Exceptions . . . 3-72
3.8.3.2 Dealing with floating-point exceptions in x87 FPU code . . . 3-73
3.8.3.3 Floating-point Exceptions in SSE/SSE2/SSE3 Code . . . 3-73
3.8.4 Floating-point Modes . . . 3-73
3.8.4.1 Rounding Mode . . . 3-74
3.8.4.2 Precision . . . 3-76
3.8.5 x87 vs. Scalar SIMD Floating-point Trade-offs . . . 3-76
3.8.5.1 Scalar SSE/SSE2 . . . 3-76
3.8.5.2 Transcendental Functions . . . 3-77
3.9 MAXIMIZING PCIE PERFORMANCE . . . 3-77
CHAPTER 4 CODING FOR SIMD ARCHITECTURES
4.1 CHECKING FOR PROCESSOR SUPPORT OF SIMD TECHNOLOGIES . . . 4-1
4.1.1 Checking for MMX Technology Support . . . 4-2
4.1.2 Checking for Streaming SIMD Extensions Support . . . 4-2
4.1.3 Checking for Streaming SIMD Extensions 2 Support . . . 4-2
4.1.4 Checking for Streaming SIMD Extensions 3 Support . . . 4-3
4.1.5 Checking for Supplemental Streaming SIMD Extensions 3 Support . . . 4-3
4.1.6 Checking for SSE4.1 Support . . . 4-4
4.1.7 Checking for SSE4.2 Support . . . 4-4
4.1.8 Detection of PCLMULQDQ and AESNI Instructions . . . 4-4
4.1.9 Detection of AVX Instructions . . . 4-5
4.1.10 Detection of VEX-Encoded AES and VPCLMULQDQ . . . 4-7
4.1.11 Detection of F16C Instructions . . . 4-7
4.1.12 Detection of FMA . . . 4-8
4.1.13 Detection of AVX2 . . . 4-9
4.2 CONSIDERATIONS FOR CODE CONVERSION TO SIMD PROGRAMMING . . . 4-10
4.2.1 Identifying Hot Spots . . . 4-12
4.2.2 Determine If Code Benefits by Conversion to SIMD Execution . . . 4-12
4.3 CODING TECHNIQUES . . . 4-12
4.3.1 Coding Methodologies . . . 4-13
4.3.1.1 Assembly . . . 4-14
4.3.1.2 Intrinsics . . . 4-14
4.3.1.3 Classes . . . 4-15
4.3.1.4 Automatic Vectorization . . . 4-16
4.4 STACK AND DATA ALIGNMENT . . . 4-17
4.4.1 Alignment and Contiguity of Data Access Patterns . . . 4-17
4.4.1.1 Using Padding to Align Data . . . 4-17
4.4.1.2 Using Arrays to Make Data Contiguous . . . 4-17
4.4.2 Stack Alignment For 128-bit SIMD Technologies . . . 4-18
4.4.3 Data Alignment for MMX Technology . . . 4-18
4.4.4 Data Alignment for 128-bit data . . . 4-19
4.4.4.1 Compiler-Supported Alignment . . . 4-19
4.5 IMPROVING MEMORY UTILIZATION . . . 4-20
4.5.1 Data Structure Layout . . . 4-20
4.5.2 Strip-Mining . . . 4-23
4.5.3 Loop Blocking . . . 4-24
4.6 INSTRUCTION SELECTION . . . 4-26
4.6.1 SIMD Optimizations and Microarchitectures . . . 4-27
4.7 TUNING THE FINAL APPLICATION . . . 4-28
CHAPTER 5 OPTIMIZING FOR SIMD INTEGER APPLICATIONS
5.1 GENERAL RULES ON SIMD INTEGER CODE . . . 5-1
5.2 USING SIMD INTEGER WITH X87 FLOATING-POINT . . . 5-2
5.2.1 Using the EMMS Instruction . . . 5-2
5.2.2 Guidelines for Using EMMS Instruction . . . 5-2
5.3 DATA ALIGNMENT . . . 5-3
5.4 DATA MOVEMENT CODING TECHNIQUES . . . 5-5
5.4.1 Unsigned Unpack . . . 5-5
5.4.2 Signed Unpack . . . 5-5
5.4.3 Interleaved Pack with Saturation . . . 5-6
5.4.4 Interleaved Pack without Saturation . . . 5-7
5.4.5 Non-Interleaved Unpack . . . 5-8
5.4.6 Extract Data Element . . . 5-9
5.4.7 Insert Data Element . . . 5-10
5.4.8 Non-Unit Stride Data Movement . . . 5-11
5.4.9 Move Byte Mask to Integer . . . 5-12
5.4.10 Packed Shuffle Word for 64-bit Registers . . . 5-12
5.4.11 Packed Shuffle Word for 128-bit Registers . . . 5-13
5.4.12 Shuffle Bytes . . . 5-13
5.4.13 Conditional Data Movement . . . 5-14
5.4.14 Unpacking/interleaving 64-bit Data in 128-bit Registers . . . 5-14
5.4.15 Data Movement . . . 5-14
5.4.16 Conversion Instructions
5.5
5.6
5.6.1 Absolute Difference of Unsigned Numbers . . . 5-15
5.6.2 Absolute Difference of Signed Numbers . . . 5-16
5.6.3 Absolute Value . . . 5-16
5.6.4 Pixel Format Conversion . . . 5-17
5.6.5 Endian Conversion . . . 5-18
5.6.6 Clipping to an Arbitrary Range [High, Low] . . . 5-19
5.6.6.1 Highly Efficient Clipping . . . 5-19
5.6.6.2 Clipping to an Arbitrary Unsigned Range [High, Low] . . . 5-21
5.6.7 Packed Max/Min of Byte, Word and Dword . . . 5-21
5.6.8 Packed Multiply Integers . . . 5-21
5.6.9 Packed Sum of Absolute Differences . . . 5-22
5.6.10 MPSADBW and PHMINPOSUW . . . 5-22
5.6.11 Packed Average (Byte/Word) . . . 5-22
5.6.12 Complex Multiply by a Constant . . . 5-22
5.6.13 Packed 64-bit Add/Subtract . . . 5-23
5.6.14 128-bit Shifts . . . 5-23
5.6.15 PTEST and Conditional Branch . . . 5-23
5.6.16 Vectorization of Heterogeneous Computations across Loop Iterations . . . 5-24
5.6.17 Vectorization of Control Flows in Nested Loops . . . 5-25
5.7 MEMORY OPTIMIZATIONS . . . 5-27
5.7.1 Partial Memory Accesses . . . 5-28
5.7.1.1 Supplemental Techniques for Avoiding Cache Line Splits . . . 5-29
5.7.2 Increasing Bandwidth of Memory Fills and Video Fills . . . 5-30
5.7.2.1 Increasing Memory Bandwidth Using the MOVDQ Instruction . . . 5-30
5.7.2.2 Increasing Memory Bandwidth by Loading and Storing to and from the Same DRAM Page . . . 5-30
5.7.2.3 Increasing UC and WC Store Bandwidth by Using Aligned Stores . . . 5-31
5.7.3 Reverse Memory Copy . . . 5-31
5.8 CONVERTING FROM 64-BIT TO 128-BIT SIMD INTEGERS . . . 5-34
5.8.1 SIMD Optimizations and Microarchitectures . . . 5-34
5.8.1.1 Packed SSE2 Integer versus MMX Instructions . . . 5-34
5.8.1.2 Work-around for False Dependency Issue . . . 5-35
5.9 TUNING PARTIALLY VECTORIZABLE CODE . . . 5-35
5.10 PARALLEL MODE AES ENCRYPTION AND DECRYPTION . . . 5-38
5.10.1 AES Counter Mode of Operation . . . 5-38
5.10.2 AES Key Expansion Alternative . . . 5-46
5.10.3 Enhancement in Intel Microarchitecture Code Name Haswell . . . 5-48
5.10.3.1 AES and Multi-Buffer Cryptographic Throughput . . . 5-48
5.10.3.2 PCLMULQDQ Improvement . . . 5-48
5.11 LIGHT-WEIGHT DECOMPRESSION AND DATABASE PROCESSING . . . 5-48
5.11.1 Reduced Dynamic Range Datasets . . . 5-49
5.11.2 Compression and Decompression Using SIMD Instructions . . . 5-49
CHAPTER 6 OPTIMIZING FOR SIMD FLOATING-POINT APPLICATIONS
6.1
6.2
6.3
6.4
6.5
6.5.1 Data Arrangement . . . 6-2
6.5.1.1 Vertical versus Horizontal Computation . . . 6-3
6.5.1.2 Data Swizzling . . . 6-5
6.5.1.3 Data Deswizzling . . . 6-7
6.5.1.4 Horizontal ADD Using SSE . . . 6-8
6.5.2 Use of CVTTPS2PI/CVTTSS2SI Instructions . . . 6-10
6.5.3 Flush-to-Zero and Denormals-are-Zero Modes . . . 6-10
6.6 SIMD OPTIMIZATIONS AND MICROARCHITECTURES . . . 6-11
6.6.1 SIMD Floating-point Programming Using SSE3 . . . 6-11
6.6.1.1 SSE3 and Complex Arithmetics . . . 6-12
6.6.1.2 Packed Floating-Point Performance in Intel Core Duo Processor . . . 6-14
6.6.2 Dot Product and Horizontal SIMD Instructions . . . 6-14
6.6.3 Vector Normalization . . . 6-16
6.6.4 Using Horizontal SIMD Instruction Sets and Data Layout . . . 6-18
6.6.4.1 SOA and Vector Matrix Multiplication . . . 6-20
CHAPTER 7 OPTIMIZING CACHE USAGE
7.1
7.2
7.3
7.3.1 Software Data Prefetch . . . 7-3
7.3.2 Prefetch Instructions . . . 7-3
7.3.3 Prefetch and Load Instructions . . . 7-5
7.4 CACHEABILITY CONTROL . . . 7-5
7.4.1 The Non-temporal Store Instructions . . . 7-5
7.4.1.1 Fencing . . . 7-6
7.4.1.2 Streaming Non-temporal Stores . . . 7-6
7.4.1.3 Memory Type and Non-temporal Stores . . . 7-6
7.4.1.4 Write-Combining . . . 7-6
7.4.2 Streaming Store Usage Models . . . 7-7
7.4.2.1 Coherent Requests . . . 7-7
7.4.2.2 Non-coherent requests . . . 7-7
7.4.3 Streaming Store Instruction Descriptions . . . 7-8
7.4.4 The Streaming Load Instruction . . . 7-8
7.4.5 FENCE Instructions . . . 7-8
7.4.5.1 SFENCE Instruction . . . 7-8
7.4.5.2 LFENCE Instruction . . . 7-9
7.4.5.3 MFENCE Instruction . . . 7-9
7.4.6 CLFLUSH Instruction . . . 7-9
7.4.7 CLFLUSHOPT Instruction . . . 7-10
7.5 MEMORY OPTIMIZATION USING PREFETCH . . . 7-12
7.5.1 Software-Controlled Prefetch . . . 7-12
7.5.2 Hardware Prefetch . . . 7-12
7.5.3 Example of Effective Latency Reduction with Hardware Prefetch . . . 7-13
7.5.4 Example of Latency Hiding with S/W Prefetch Instruction . . . 7-14
7.5.5 Software Prefetching Usage Checklist . . . 7-15
7.5.6 Software Prefetch Scheduling Distance . . . 7-16
7.5.7 Software Prefetch Concatenation . . . 7-16
7.5.8 Minimize Number of Software Prefetches . . . 7-17
7.5.9 Mix Software Prefetch with Computation Instructions . . . 7-19
7.5.10 Software Prefetch and Cache Blocking Techniques . . . 7-19
7.5.11 Hardware Prefetching and Cache Blocking Techniques . . . 7-23
7.5.12 Single-pass versus Multi-pass Execution . . . 7-24
7.6 MEMORY OPTIMIZATION USING NON-TEMPORAL STORES . . . 7-25
7.6.1 Non-temporal Stores and Software Write-Combining . . . 7-25
7.6.2 Cache Management . . . 7-26
7.6.2.1 Video Encoder . . . 7-26
7.6.2.2 Video Decoder . . . 7-26
7.6.2.3 Conclusions from Video Encoder and Decoder Implementation . . . 7-27
7.6.2.4 Optimizing Memory Copy Routines . . . 7-27
7.6.2.5 TLB Priming . . . 7-28
7.6.2.6 Using the 8-byte Streaming Stores and Software Prefetch . . . 7-29
7.6.2.7 Using 16-byte Streaming Stores and Hardware Prefetch . . . 7-29
7.6.2.8 Performance Comparisons of Memory Copy Routines . . . 7-30
7.6.3 Deterministic Cache Parameters . . . 7-31
7.6.3.1 Cache Sharing Using Deterministic Cache Parameters . . . 7-32
7.6.3.2 Cache Sharing in Single-Core or Multicore . . . 7-32
7.6.3.3 Determine Prefetch Stride . . . 7-32
CHAPTER 8 MULTICORE AND HYPER-THREADING TECHNOLOGY
8.1 PERFORMANCE AND USAGE MODELS . . . 8-1
8.1.1 Multithreading . . . 8-1
8.1.2 Multitasking Environment . . . 8-2
8.2 PROGRAMMING MODELS AND MULTITHREADING . . . 8-3
8.2.1 Parallel Programming Models . . . 8-4
8.2.1.1 Domain Decomposition . . . 8-4
8.2.2 Functional Decomposition . . . 8-4
8.2.3 Specialized Programming Models . . . 8-4
8.2.3.1 Producer-Consumer Threading Models . . . 8-5
8.2.4 Tools for Creating Multithreaded Applications . . . 8-7
8.2.4.1 Programming with OpenMP Directives . . . 8-8
8.2.4.2 Automatic Parallelization of Code . . . 8-8
8.2.4.3 Supporting Development Tools . . . 8-8
8.3 OPTIMIZATION GUIDELINES . . . 8-8
8.3.1 Key Practices of Thread Synchronization . . . 8-8
8.3.2 Key Practices of System Bus Optimization . . . 8-9
8.3.3 Key Practices of Memory Optimization . . . 8-9
8.3.4 Key Practices of Execution Resource Optimization . . . 8-9
8.3.5 Generality and Performance Impact . . . 8-10
8.4 THREAD SYNCHRONIZATION . . . 8-10
8.4.1 Choice of Synchronization Primitives . . . 8-10
8.4.2 Synchronization for Short Periods . . . 8-11
8.4.3 Optimization with Spin-Locks . . . 8-13
8.4.4 Synchronization for Longer Periods . . . 8-13
8.4.4.1 Avoid Coding Pitfalls in Thread Synchronization . . . 8-14
8.4.5 Prevent Sharing of Modified Data and False-Sharing . . . 8-14
8.4.6 Placement of Shared Synchronization Variable . . . 8-15
8.4.7 Pause Latency in Skylake Microarchitecture . . . 8-16
8.5 SYSTEM BUS OPTIMIZATION . . . 8-17
8.5.1 Conserve Bus Bandwidth . . . 8-17
8.5.2 Understand the Bus and Cache Interactions . . . 8-18
8.5.3 Avoid Excessive Software Prefetches . . . 8-18
8.5.4 Improve Effective Latency of Cache Misses . . . 8-18
8.5.5 Use Full Write Transactions to Achieve Higher Data Rate . . . 8-19
8.6 MEMORY OPTIMIZATION . . . 8-19
8.6.1 Cache Blocking Technique . . . 8-19
8.6.2 Shared-Memory Optimization . . . 8-20
8.6.2.1 Minimize Sharing of Data between Physical Processors . . . 8-20
8.6.2.2 Batched Producer-Consumer Model . . . 8-20
8.6.3 Eliminate 64-KByte Aliased Data Accesses . . . 8-22
8.7 FRONT END OPTIMIZATION . . . 8-22
8.7.1 Avoid Excessive Loop Unrolling . . . 8-22
8.8 AFFINITIES AND MANAGING SHARED PLATFORM RESOURCES . . . 8-22
8.8.1 Topology Enumeration of Shared Resources . . . 8-24
8.8.2 Non-Uniform Memory Access . . . 8-24
8.9 OPTIMIZATION OF OTHER SHARED RESOURCES . . . 8-25
8.9.1 Expanded Opportunity for HT Optimization . . . 8-26
CHAPTER 9 64-BIT MODE CODING GUIDELINES
9.1 INTRODUCTION . . . 9-1
9.2 CODING RULES AFFECTING 64-BIT MODE . . . 9-1
9.2.1 Use Legacy 32-Bit Instructions When Data Size Is 32 Bits . . . 9-1
9.2.2 Use Extra Registers to Reduce Register Pressure . . . 9-1
9.2.3 Effective Use of 64-Bit by 64-Bit Multiplies . . . 9-2
9.2.4 Replace 128-bit Integer Division with 128-bit Multiplies . . . 9-2
9.2.5 Sign Extension to Full 64-Bits . . . 9-4
9.3 ALTERNATE CODING RULES FOR 64-BIT MODE . . . 9-5
9.3.1 Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic Result . . . 9-5
9.3.2 CVTSI2SS and CVTSI2SD . . . 9-6
9.3.3 Using Software Prefetch . . . 9-6
CHAPTER 10 SSE4.2 AND SIMD PROGRAMMING FOR TEXT-PROCESSING/LEXING/PARSING
10.1
10.1.1
10.2
10.2.1 Unaligned Memory Access and Buffer Size Management . . . 10-5
10.2.2 Unaligned Memory Access and String Library . . . 10-6
10.3 SSE4.2 APPLICATION CODING GUIDELINE AND EXAMPLES . . . 10-6
10.3.1 Null Character Identification (Strlen equivalent) . . . 10-6
10.3.2 White-Space-Like Character Identification . . . 10-9
10.3.3 Substring Searches . . . 10-11
10.3.4 String Token Extraction and Case Handling . . . 10-18
10.3.5 Unicode Processing and PCMPxSTRy . . . 10-22
10.3.6 Replacement String Library Function Using SSE4.2 . . . 10-26
10.4 SSE4.2 ENABLED NUMERICAL AND LEXICAL COMPUTATION . . . 10-28
10.5 NUMERICAL DATA CONVERSION TO ASCII FORMAT . . . 10-34
10.5.1 Large Integer Numeric Computation . . . 10-48
10.5.1.1 MULX Instruction and Large Integer Numeric Computation . . . 10-48
CHAPTER 11 OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2
11.1 INTEL® AVX INTRINSICS CODING . . . 11-2
11.1.1 Intel® AVX Assembly Coding…
11.2
11.3
11.3.1 …Mixing Intel® AVX and Intel SSE in Function Calls . . . 11-9
11.4 128-BIT LANE OPERATION AND AVX . . . 11-10
11.4.1 Programming With the Lane Concept . . . 11-11
11.4.2 Strided Load Technique . . . 11-11
11.4.3 The Register Overlap Technique . . . 11-14
11.5 DATA GATHER AND SCATTER . . . 11-15
11.5.1 Data Gather . . . 11-15
11.5.2 Data Scatter . . . 11-17
11.6 DATA ALIGNMENT FOR INTEL® AVX . . . 11-19
11.6.1 Align Data to 32 Bytes . . . 11-19
11.6.2 Consider 16-Byte Memory Access when Memory is Unaligned . . . 11-20
11.6.3 Prefer Aligned Stores Over Aligned Loads . . . 11-22
11.7 L1D CACHE LINE REPLACEMENTS . . . 11-22
11.8 4K ALIASING . . . 11-22
11.9 CONDITIONAL SIMD PACKED LOADS AND STORES . . . 11-23
11.9.1 Conditional Loops…
…Replace Shuffles with Blends . . . 11-28
11.11.2 Design Algorithm With Fewer Shuffles . . . 11-30
11.11.3 Perform Basic Shuffles on Load Ports . . . 11-32
11.12 DIVIDE AND SQUARE ROOT OPERATIONS . . . 11-34
11.12.1 Single-Precision Divide . . . 11-35
11.12.2 Single-Precision Reciprocal Square Root . . . 11-37
11.12.3 Single-Precision Square Root . . . 11-39
11.13 OPTIMIZATION OF ARRAY SUB SUM EXAMPLE . . . 11-41
11.14 HALF-PRECISION FLOATING-POINT CONVERSIONS . . . 11-43
11.14.1 Packed Single-Precision to Half-Precision Conversion . . . 11-43
11.14.2 Packed Half-Precision to Single-Precision Conversion . . . 11-44
11.14.3 Locality Consideration for using Half-Precision FP to Conserve Bandwidth . . . 11-45
11.15 FUSED MULTIPLY-ADD (FMA) INSTRUCTIONS GUIDELINES . . . 11-46
11.15.1 Optimizing Throughput with FMA and Floating-Point Add/MUL . . . 11-47
11.15.2 Optimizing Throughput with Vector Shifts . . . 11-48
11.16 AVX2 OPTIMIZATION GUIDELINES . . . 11-49
11.16.1 Multi-Buffering and AVX2 . . . 11-54
11.16.2 Modular Multiplication and AVX2 . . . 11-54
11.16.3 Data Movement Considerations . . . 11-54
11.16.3.1 SIMD Heuristics to implement Memcpy() . . . 11-55
11.16.3.2 Memcpy() Implementation Using Enhanced REP MOVSB . . . 11-55
11.16.3.3 Memset() Implementation Considerations . . . 11-56
11.16.3.4 Hoisting Memcpy/Memset Ahead of Consuming Code . . . 11-57
11.16.3.5 256-bit Fetch versus Two 128-bit Fetches . . . 11-57
11.16.3.6 Mixing MULX and AVX2 Instructions . . . 11-57
11.16.4 Considerations for Gather Instructions . . . 11-64
11.16.4.1 Strided Loads . . . 11-67
11.16.4.2 Adjacent Loads . . . 11-68
11.16.5 AVX2 Conversion Remedy to MMX Instruction Throughput Limitation . . . 11-69
CHAPTER 12 INTEL® TSX RECOMMENDATIONS
12.1 INTRODUCTION . . . 12-1
12.1.1 Optimization Outline . . . 12-2
12.2 APPLICATION-LEVEL TUNING AND OPTIMIZATIONS . . . 12-2
12.2.1 Existing TSX-enabled Locking Libraries . . . 12-3
12.2.1.1 Libraries allowing lock elision for unmodified programs . . . 12-3
12.2.1.2 Libraries requiring program modifications . . . 12-3
12.2.2 Initial Checks . . . 12-3
12.2.3 Run and Profile the Application . . . 12-3
12.2.4 Minimize Transactional Aborts . . . 12-4
12.2.4.1 Transactional Aborts due to Data Conflicts . . . 12-5
12.2.4.2 Transactional Aborts due to Limited Transactional Resources . . . 12-6
12.2.4.3 Lock Elision Specific Transactional Aborts . . . 12-7
12.2.4.4 HLE Specific Transactional Aborts . . . 12-7
12.2.4.5 Miscellaneous Transactional Aborts . . . 12-8
12.2.5 Using Transactional-Only Code Paths . . . 12-9
12.2.6 Dealing with Transactional Regions or Paths that Abort at a High Rate . . . 12-9
12.2.6.1 Transitioning to Non-Elided Execution without Aborting . . . 12-9
12.2.6.2 Forcing an Early Abort . . . 12-10
12.2.6.3 Not Eliding Selected Locks . . . 12-10
12.3 DEVELOPING AN INTEL TSX ENABLED SYNCHRONIZATION LIBRARY . . . 12-10
12.3.1 Adding HLE Prefixes . . . 12-10
12.3.2 Elision Friendly Critical Section Locks . . . 12-10
12.3.3 Using HLE or RTM for Lock Elision . . . 12-11
12.3.4 An example wrapper for lock elision using RTM . . . 12-11
12.3.5 Guidelines for the RTM fallback handler . . . 12-12
12.3.6 Implementing Elision-Friendly Locks using Intel TSX . . . 12-13
12.3.6.1 Implementing a Simple Spinlock using HLE . . . 12-13
12.3.6.2 Implementing Reader-Writer Locks using Intel TSX . . . 12-15
12.3.6.3 Implementing Ticket Locks using Intel TSX . . . 12-15
12.3.6.4 Implementing Queue-Based Locks using Intel TSX . . . 12-15
12.3.7 Eliding Application-Specific Meta-Locks using Intel TSX . . . 12-16
12.3.8 Avoiding Persistent Non-Elided Execution . . . 12-17
12.3.9 Reading the Value of an Elided Lock in RTM-based libraries . . . 12-19
12.3.10 Intermixing HLE and RTM . . . 12-19
12.4 USING THE PERFORMANCE MONITORING SUPPORT FOR INTEL TSX . . . 12-20
12.4.1 Measuring Transactional Success . . . 12-21
12.4.2 Finding locks to elide and verifying all locks are elided . . . 12-21
12.4.3 Sampling Transactional Aborts . . . 12-21
12.4.4 Classifying Aborts using a Profiling Tool . . . 12-21
12.4.5 XABORT Arguments for RTM fallback handlers . . . 12-22
12.4.6 Call Graphs for Transactional Aborts . . . 12-23
12.4.7 Last Branch Records and Transactional Aborts . . . 12-23
12.4.8 Profiling and Testing Intel TSX Software using the Intel® SDE . . . 12-23
12.4.9 HLE Specific Performance Monitoring Events . . . 12-24
12.4.10 Computing Useful Metrics for Intel…
12.5
12.6
12.7
12.7.1 …intrinsics . . . 12-26
12.7.1.1 Emulated RTM intrinsics on older gcc compatible compilers . . . 12-27
12.7.2 HLE intrinsics on gcc and other Linux compatible compilers . . . 12-28
12.7.2.1 Generating HLE intrinsics with gcc4.8 . . . 12-28
12.7.2.2 C++11 atomic support . . . 12-29
12.7.2.3 Emulating HLE intrinsics with older gcc-compatible compilers . . . 12-29
12.7.3 HLE intrinsics on Windows C/C++ compilers . . . 12-29
CHAPTER 13 POWER OPTIMIZATION FOR MOBILE USAGES 13.1 13.2 13.2.1 13.3 13.3.1 13.3.2 13.3.3 13.3.4 13.4 13.4.1 13.4.2 13.4.3 13.4.4 13.4.5 13.4.6 13.4.7 13.4.7.1 13.4.7.2 13.4.7.3 13.5 13.5.1 13.5.1.1 13.5.1.2 13.5.2 13.5.3 13.5.4 13.5.5 13.5.6 13.5.7 13.5.8 13.6 13.6.1 13.6.1.1
OVERVIEW . . . 13-1
MOBILE USAGE SCENARIOS . . . 13-1
Intelligent Energy Efficient Software . . . 13-2
ACPI C-STATES . . . 13-3
Processor-Specific C4 and Deep C4 States . . . 13-4
Processor-Specific Deep C-States and Intel® Turbo Boost Technology . . . 13-4
Processor-Specific Deep C-States for Intel® Microarchitecture Code Name Sandy Bridge . . . 13-5
Intel® Turbo Boost Technology 2.0 . . . 13-6
GUIDELINES FOR EXTENDING BATTERY LIFE . . . 13-6
Adjust Performance to Meet Quality of Features . . . 13-6
Reducing Amount of Work . . . 13-7
Platform-Level Optimizations . . . 13-7
Handling Sleep State Transitions . . . 13-8
Using Enhanced Intel SpeedStep® Technology . . . 13-8
Enabling Intel® Enhanced Deeper Sleep . . . 13-9
Multicore Considerations . . . 13-10
Enhanced Intel SpeedStep® Technology . . . 13-10
Thread Migration Considerations . . . 13-10
Multicore Considerations for C-States . . . 13-11
TUNING SOFTWARE FOR INTELLIGENT POWER CONSUMPTION . . . 13-12
Reduction of Active Cycles . . . 13-12
Multi-threading to reduce Active Cycles . . . 13-12
Vectorization . . . 13-13
PAUSE and Sleep(0) Loop Optimization . . . 13-14
Spin-Wait Loops . . . 13-15
Using Event Driven Service Instead of Polling in Code . . . 13-15
Reducing Interrupt Rate . . . 13-15
Reducing Privileged Time . . . 13-15
Setting Context Awareness in the Code . . . 13-16
Saving Energy by Optimizing for Performance . . . 13-17
PROCESSOR SPECIFIC POWER MANAGEMENT OPTIMIZATION FOR SYSTEM SOFTWARE . . . 13-17
Power Management Recommendation of Processor-Specific Inactive State Configurations . . . 13-17
Balancing Power Management and Responsiveness of Inactive To Active State Transitions . . . 13-19
CONTENTS PAGE
CHAPTER 14
SOFTWARE OPTIMIZATION FOR GOLDMONT AND SILVERMONT MICROARCHITECTURES
14.1 MICROARCHITECTURES OF RECENT INTEL ATOM PROCESSOR GENERATIONS . . . 14-1
14.1.1 Goldmont Microarchitecture . . . 14-1
14.1.2 Silvermont Microarchitecture . . . 14-4
14.1.2.1 Integer Pipeline . . . 14-7
14.1.2.2 Floating-Point Pipeline . . . 14-7
14.2 CODING RECOMMENDATIONS FOR GOLDMONT AND SILVERMONT MICROARCHITECTURES . . . 14-7
14.2.1 Optimizing The Front End . . . 14-7
14.2.1.1 Instruction Decoder . . . 14-7
14.2.1.2 Front End High IPC Considerations . . . 14-8
14.2.1.3 Branching Across 4GB Boundary . . . 14-10
14.2.1.4 Loop Unrolling and Loop Stream Detector . . . 14-10
14.2.1.5 Mixing Code and Data . . . 14-10
14.2.2 Optimizing The Execution Core . . . 14-10
14.2.2.1 Scheduling . . . 14-10
14.2.2.2 Address Generation . . . 14-11
14.2.2.3 FP Multiply-Accumulate-Store Execution . . . 14-11
14.2.2.4 Integer Multiply Execution . . . 14-12
14.2.2.5 Zeroing Idioms . . . 14-13
14.2.2.6 NOP Idioms . . . 14-13
14.2.2.7 Move Elimination and ESP Folding . . . 14-13
14.2.2.8 Stack Manipulation Instruction . . . 14-13
14.2.2.9 Flags usage . . . 14-13
14.2.2.10 SIMD Floating-Point and X87 Instructions . . . 14-14
14.2.2.11 SIMD Integer Instructions . . . 14-14
14.2.2.12 Vectorization Considerations . . . 14-14
14.2.2.13 Other SIMD Instructions . . . 14-14
14.2.2.14 Instruction Selection . . . 14-15
14.2.2.15 Integer Division . . . 14-16
14.2.2.16 Integer Shift . . . 14-17
14.2.2.17 Pause Instruction . . . 14-17
14.2.3 Optimizing Memory Accesses . . . 14-18
14.2.3.1 Reduce Unaligned Memory Access with PALIGNR . . . 14-18
14.2.3.2 Minimize Memory Execution Issues . . . 14-18
14.2.3.3 Store Forwarding . . . 14-18
14.2.3.4 PrefetchW Instruction . . . 14-19
14.2.3.5 Cache Line Splits and Alignment . . . 14-19
14.2.3.6 Segment Base . . . 14-20
14.2.3.7 Copy and String Copy . . . 14-20
14.3 INSTRUCTION LATENCY AND THROUGHPUT . . . 14-20
CHAPTER 15
KNIGHTS LANDING MICROARCHITECTURE AND SOFTWARE OPTIMIZATION
15.1 KNIGHTS LANDING MICROARCHITECTURE . . . 15-2
15.1.1 Front End . . . 15-3
15.1.2 Out-of-Order Engine . . . 15-3
15.1.3 UnTile . . . 15-6
15.2 INTEL® AVX-512 CODING RECOMMENDATIONS FOR KNIGHTS LANDING MICROARCHITECTURE . . . 15-7
15.2.1 Using Gather and Scatter Instructions . . . 15-8
15.2.2 Using Enhanced Reciprocal Instructions . . . 15-8
15.2.3 Using AVX-512CD Instructions . . . 15-9
15.2.4 Using Intel® Hyper-Threading Technology . . . 15-10
15.2.5 Front End Considerations . . . 15-11
15.2.5.1 Instruction Decoder . . . 15-11
15.2.5.2 Branching Indirectly Across a 4GB Boundary . . . 15-11
15.2.6 Integer Execution Considerations . . . 15-12
15.2.6.1 Flags usage . . . 15-12
15.2.6.2 Integer Division . . . 15-12
15.2.7 Optimizing FP and Vector Execution . . . 15-12
15.2.7.1 Instruction Selection Considerations . . . 15-12
15.2.7.2 Porting Intrinsic From Prior Generation . . . 15-14
15.2.7.3 Vectorization Trade-Off Estimation . . . 15-14
15.2.8 Memory Optimization . . . 15-17
15.2.8.1 Data Alignment . . . 15-17
15.2.8.2 Hardware Prefetcher . . . 15-18
15.2.8.3 Software Prefetch . . . 15-18
15.2.8.4 Memory Execution Cluster . . . 15-18
15.2.8.5 Store Forwarding . . . 15-19
15.2.8.6 Way, Set Conflicts . . . 15-19
15.2.8.7 Streaming Store Versus Regular Store . . . 15-20
15.2.8.8 Compiler Switches and Directives . . . 15-20
15.2.8.9 Direct Mapped MCDRAM Cache . . . 15-20
APPENDIX A
APPLICATION PERFORMANCE TOOLS
A.1 COMPILERS . . . A-2
A.1.1 Recommended Optimization Settings for Intel® 64 and IA-32 Processors . . . A-2
A.1.2 Vectorization and Loop Optimization . . . A-2
A.1.2.1 Multithreading with OpenMP* . . . A-3
A.1.2.2 Automatic Multithreading . . . A-3
A.1.3 Inline Expansion of Library Functions (/Oi, /Oi-) . . . A-3
A.1.4 Interprocedural and Profile-Guided Optimizations . . . A-3
A.1.4.1 Interprocedural Optimization (IPO) . . . A-3
A.1.4.2 Profile-Guided Optimization (PGO) . . . A-3
A.1.5 Intel® Cilk™ Plus . . . A-4
A.2 PERFORMANCE LIBRARIES . . . A-4
A.2.1 Intel® Integrated Performance Primitives (Intel® IPP) . . . A-4
A.2.2 Intel® Math Kernel Library (Intel® MKL) . . . A-5
A.2.3 Intel® Threading Building Blocks (Intel® TBB) . . . A-5
A.2.4 Benefits Summary . . . A-5
A.3 PERFORMANCE PROFILERS . . . A-5
A.3.1 Intel® VTune™ Amplifier XE . . . A-5
A.3.1.1 Hardware Event-Based Sampling Analysis . . . A-6
A.3.1.2 Algorithm Analysis . . . A-6
A.3.1.3 Platform Analysis . . . A-6
A.4 THREAD AND MEMORY CHECKERS . . . A-6
A.4.1 Intel® Inspector . . . A-6
A.5 VECTORIZATION ASSISTANT . . . A-7
A.5.1 Intel® Advisor . . . A-7
A.6 CLUSTER TOOLS . . . A-7
A.6.1 Intel® Trace Analyzer and Collector . . . A-7
A.6.1.1 MPI Performance Snapshot . . . A-7
A.6.2 Intel® MPI Library . . . A-7
A.6.3 Intel® MPI Benchmarks . . . A-8
A.7 INTEL® ACADEMIC COMMUNITY . . . A-8
APPENDIX B
USING PERFORMANCE MONITORING EVENTS
B.1 TOP-DOWN ANALYSIS METHOD . . . B-1
B.1.1 Top-Level . . . B-2
B.1.2 Front End Bound . . . B-3
B.1.3 Back End Bound . . . B-4
B.1.4 Memory Bound . . . B-4
B.1.5 Core Bound . . . B-5
B.1.6 Bad Speculation . . . B-5
B.1.7 Retiring . . . B-6
B.1.8 TMAM and Skylake Microarchitecture . . . B-6
B.1.8.1 TMAM Examples . . . B-6
B.2 PERFORMANCE MONITORING AND MICROARCHITECTURE . . . B-7
B.3 INTEL® XEON® PROCESSOR 5500 SERIES . . . B-13
B.4 PERFORMANCE ANALYSIS TECHNIQUES FOR INTEL® XEON® PROCESSOR 5500 SERIES . . . B-14
B.4.1 Cycle Accounting and Uop Flow Analysis . . . B-15
B.4.1.1 Cycle Drill Down and Branch Mispredictions . . . B-16
B.4.1.2 Basic Block Drill Down . . . B-19
B.4.2 Stall Cycle Decomposition and Core Memory Accesses
B.4.2.1 Measuring Costs of Microarchitectural Conditions
B.4.3 Core PMU Precise Events
B.4.3.1 Precise Memory Access Events
B.4.3.2 Load Latency Event
B.4.3.3 Precise Execution Events
B.4.3.4 Last Branch Record (LBR)
B.4.3.5 Measuring Core Memory Access Latency
B.4.3.6 Measuring Per-Core Bandwidth
B.4.3.7 Miscellaneous L1 and L2 Events for Cache Misses
B.4.3.8 TLB Misses
B.4.3.9 L1 Data Cache
B.4.4 Front End Monitoring Events
B.4.4.1 Branch Mispredictions
B.4.4.2 Front End Code Generation Metrics
B.4.5 Uncore Performance Monitoring Events
B.4.5.1 Global Queue Occupancy
B.4.5.2 Global Queue Port Events
B.4.5.3 Global Queue Snoop Events
B.4.5.4 L3 Events
B.4.6 Intel QuickPath Interconnect Home Logic (QHL)
B.4.7 Measuring Bandwidth From the Uncore
B.5 PERFORMANCE TUNING TECHNIQUES FOR INTEL® MICROARCHITECTURE CODE NAME SANDY BRIDGE
B.5.1 Correlating Performance Bottleneck to Source Location
B.5.2 Hierarchical Top-Down Performance Characterization Methodology and Locating Performance Bottlenecks
B.5.2.1 Back End Bound Characterization
B.5.2.2 Core Bound Characterization
B.5.2.3 Memory Bound Characterization
B.5.3 Back End Stalls
B.5.4 Memory Sub-System Stalls
B.5.4.1 Accounting for Load Latency
B.5.4.2 Cache-line Replacement Analysis
B.5.4.3 Lock Contention Analysis
B.5.4.4 Other Memory Access Issues
B.5.5 Execution Stalls
B.5.5.1 Longer Instruction Latencies
B.5.5.2 Assists
B.5.6 Bad Speculation
B.5.6.1 Branch Mispredicts
B.5.7 Front End Stalls
B.5.7.1 Understanding the Micro-op Delivery Rate
B.5.7.2 Understanding the Sources of the Micro-op Queue
B.5.7.3 The Decoded ICache
B.5.7.4 Issues in the Legacy Decode Pipeline
B.5.7.5 Instruction Cache
B.6 USING PERFORMANCE EVENTS OF INTEL® CORE™ SOLO AND INTEL® CORE™ DUO PROCESSORS
B.6.1 Understanding the Results in a Performance Counter
B.6.2 Ratio Interpretation
B.6.3 Notes on Selected Events
B.7 DRILL-DOWN TECHNIQUES FOR PERFORMANCE ANALYSIS
B.7.1 Cycle Composition at Issue Port
B.7.2 Cycle Composition of OOO Execution
B.7.3 Drill-Down on Performance Stalls
B.8 EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE
B.8.1 Clocks Per Instructions Retired Ratio (CPI)
B.8.2 Front End Ratios
B.8.2.1 Code Locality
B.8.2.2 Branching and Front End
B.8.2.3 Stack Pointer Tracker
B.8.2.4 B.8.2.5 B.8.2.6 B.8.3 B.8.3.1
. . . . . . . . . . . . . . . . . . . . . . . Macro-fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Length Changing Prefix (LCP) Stalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Self Modifying Code Detection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Branch Prediction Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Branch Mispredictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
B-20 B-20 B-21 B-22 B-23 B-25 B-26 B-28 B-30 B-31 B-31 B-32 B-32 B-32 B-32 B-33 B-33 B-35 B-35 B-36 B-36 B-41 B-42 B-42 B-43 B-44 B-44 B-44 B-45 B-46 B-47 B-48 B-49 B-49 B-52 B-52 B-52 B-53 B-53 B-53 B-53 B-55 B-56 B-57 B-57 B-58 B-58 B-58 B-59 B-59 B-61 B-61 B-62 B-63 B-63 B-64 B-64 B-64 B-64 B-64 B-65 B-65 B-65 B-65
CONTENTS PAGE
B.8.3.2 Virtual Tables and Indirect Calls . . . B-65
B.8.3.3 Mispredicted Returns . . . B-66
B.8.4 Execution Ratios . . . B-66
B.8.4.1 Resource Stalls . . . B-66
B.8.4.2 ROB Read Port Stalls . . . B-66
B.8.4.3 Partial Register Stalls . . . B-66
B.8.4.4 Partial Flag Stalls . . . B-66
B.8.4.5 Bypass Between Execution Domains . . . B-66
B.8.4.6 Floating-Point Performance Ratios . . . B-66
B.8.5 Memory Sub-System - Access Conflicts Ratios . . . B-67
B.8.5.1 Loads Blocked by the L1 Data Cache . . . B-67
B.8.5.2 4K Aliasing and Store Forwarding Block Detection . . . B-67
B.8.5.3 Load Block by Preceding Stores . . . B-67
B.8.5.4 Memory Disambiguation . . . B-68
B.8.5.5 Load Operation Address Translation . . . B-68
B.8.6 Memory Sub-System - Cache Misses Ratios . . . B-68
B.8.6.1 Locating Cache Misses in the Code . . . B-68
B.8.6.2 L1 Data Cache Misses . . . B-68
B.8.6.3 L2 Cache Misses . . . B-68
B.8.7 Memory Sub-system - Prefetching . . . B-69
B.8.7.1 L1 Data Prefetching . . . B-69
B.8.7.2 L2 Hardware Prefetching . . . B-69
B.8.7.3 Software Prefetching . . . B-69
B.8.8 Memory Sub-system - TLB Miss Ratios . . . B-69
B.8.9 Memory Sub-system - Core Interaction . . . B-70
B.8.9.1 Modified Data Sharing . . . B-70
B.8.9.2 Fast Synchronization Penalty . . . B-70
B.8.9.3 Simultaneous Extensive Stores and Load Misses . . . B-70
B.8.10 Memory Sub-system - Bus Characterization . . . B-70
B.8.10.1 Bus Utilization . . . B-70
B.8.10.2 Modified Cache Lines Eviction . . . B-71
APPENDIX C INSTRUCTION LATENCY AND THROUGHPUT
C.1 C.2 C.3
C.3.1 Latency and Throughput with Register Operands . . . C-3
C.3.2 Table Footnotes . . . C-18
C.3.3 Instructions with Memory Operands . . . C-19
C.3.3.1 Software Observable Latency of Memory References . . . C-20
APPENDIX D INTEL® ATOM™ MICROARCHITECTURE AND SOFTWARE OPTIMIZATION
D.1 OVERVIEW . . . D-1
D.2 INTEL® ATOM™ MICROARCHITECTURE . . . D-1
D.2.1 Hyper-Threading Technology Support in Intel® Atom™ Microarchitecture . . . D-3
D.3 CODING RECOMMENDATIONS FOR INTEL® ATOM™ MICROARCHITECTURE . . . D-3
D.3.1 Optimization for Front End of Intel® Atom™ Microarchitecture . . . D-3
D.3.2 Optimizing the Execution Core . . . D-5
D.3.2.1 Integer Instruction Selection . . . D-5
D.3.2.2 Address Generation . . . D-6
D.3.2.3 Integer Multiply . . . D-6
D.3.2.4 Integer Shift Instructions . . . D-7
D.3.2.5 Partial Register Access . . . D-7
D.3.2.6 FP/SIMD Instruction Selection . . . D-7
D.3.3 Optimizing Memory Access . . . D-9
D.3.3.1 Store Forwarding . . . D-9
D.3.3.2 First-level Data Cache . . . D-9
D.3.3.3 Segment Base . . . D-10
D.3.3.4 String Moves . . . D-10
D.3.3.5 Parameter Passing . . . D-11
D.3.3.6 Function Calls . . . D-11
D.3.3.7 Optimization of Multiply/Add Dependent Chains . . . D-11
D.3.3.8 Position Independent Code . . . D-13
D.4 INSTRUCTION LATENCY . . . D-13
EXAMPLES
Example 3-1. Assembly Code with an Unpredictable Branch . . . 3-5
Example 3-2. Code Optimization to Eliminate Branches . . . 3-5
Example 3-3. Eliminating Branch with CMOV Instruction . . . 3-6
Example 3-4. Use of PAUSE Instruction . . . 3-6
Example 3-5. Static Branch Prediction Algorithm . . . 3-7
Example 3-6. Static Taken Prediction . . . 3-7
Example 3-7. Static Not-Taken Prediction . . . 3-7
Example 3-8. Indirect Branch With Two Favored Targets . . . 3-10
Example 3-9. A Peeling Technique to Reduce Indirect Branch Misprediction . . . 3-10
Example 3-10. Loop Unrolling . . . 3-11
Example 3-11. Macro-fusion, Unsigned Iteration Count . . . 3-14
Example 3-12. Macro-fusion, If Statement . . . 3-14
Example 3-13. Macro-fusion, Signed Variable . . . 3-15
Example 3-14. Macro-fusion, Signed Comparison . . . 3-15
Example 3-15. Additional Macro-fusion Benefit in Intel Microarchitecture Code Name Sandy Bridge . . . 3-16
Example 3-16. Avoiding False LCP Delays with 0xF7 Group Instructions . . . 3-17
Example 3-17. Unrolling Loops in LSD to Optimize Emission Bandwidth . . . 3-18
Example 3-18. Independent Two-Operand LEA Example . . . 3-22
Example 3-19. Alternative to Three-Operand LEA . . . 3-23
Example 3-20. Examples of 512-bit Additions . . . 3-24
Example 3-21. Clearing Register to Break Dependency While Negating Array Elements . . . 3-27
Example 3-22. Spill Scheduling Code . . . 3-29
Example 3-23. Zero-Latency MOV Instructions . . . 3-30
Example 3-24. Byte-Granular Data Computation Technique . . . 3-30
Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions . . . 3-31
Example 3-26. Avoiding Partial Register Stalls in Integer Code . . . 3-33
Example 3-27. Avoiding Partial Register Stalls in SIMD Code . . . 3-34
Example 3-28. Avoiding Partial Flag Register Stalls . . . 3-35
Example 3-29. Partial Flag Register Accesses in Intel Microarchitecture Code Name Sandy Bridge . . . 3-35
Example 3-30. Reference Code Template for Partially Vectorizable Program . . . 3-38
Example 3-31. Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty . . . 3-39
Example 3-32. Using Four Registers to Reduce Memory Spills and Simplify Result Passing . . . 3-39
Example 3-33. Stack Optimization Technique to Simplify Parameter Passing . . . 3-40
Example 3-34. Base Line Code Sequence to Estimate Loop Overhead . . . 3-41
Example 3-35. Optimize for Load Port Bandwidth in Intel Microarchitecture Code Name Sandy Bridge . . . 3-43
Example 3-36. Index versus Pointers in Pointer-Chasing Code . . . 3-44
Example 3-37. Example of Bank Conflicts in L1D Cache and Remedy . . . 3-45
Example 3-38. Using XMM Register in Lieu of Memory for Register Spills . . . 3-46
Example 3-39. Loads Blocked by Stores of Unknown Address . . . 3-47
Example 3-40. Code That Causes Cache Line Split . . . 3-48
Example 3-41. Situations Showing Small Loads After Large Store . . . 3-51
Example 3-42. Non-forwarding Example of Large Load After Small Store . . . 3-51
Example 3-43. A Non-forwarding Situation in Compiler Generated Code . . . 3-51
Example 3-44. Two Ways to Avoid Non-forwarding Situation in Example 3-43 . . . 3-52
Example 3-45. Large and Small Load Stalls . . . 3-52
Example 3-46. Loop-carried Dependence Chain . . . 3-54
Example 3-47. Rearranging a Data Structure . . . 3-55
Example 3-48. Decomposing an Array . . . 3-55
Example 3-49. Examples of Dynamical Stack Alignment . . . 3-57
Example 3-50. Aliasing Between Loads and Stores Across Loop Iterations . . . 3-59
Example 3-51. Instruction Pointer Query Techniques . . . 3-60
Example 3-52. Using Non-temporal Stores and 64-byte Bus Write Transactions . . . 3-63
Example 3-53. Non-temporal Stores and Partial Bus Write Transactions . . . 3-63
Example 3-54. Using DCU Hardware Prefetch . . . 3-64
Example 3-55. Avoid Causing DCU Hardware Prefetch to Fetch Un-needed Lines . . . 3-65
Example 3-56. Technique For Using L1 Hardware Prefetch . . . 3-66
Example 3-57. REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination . . . 3-68
Example 3-58. Algorithm to Avoid Changing Rounding Mode . . . 3-75
Example 4-1. Identification of MMX Technology with CPUID . . . 4-2
Example 4-2. Identification of SSE with CPUID . . . 4-2
Example 4-3. Identification of SSE2 with cpuid . . . 4-3
Example 4-4. Identification of SSE3 with CPUID . . . 4-3
Example 4-5. Identification of SSSE3 with cpuid . . . 4-3
Example 4-6. Identification of SSE4.1 with cpuid . . . 4-4
Example 4-7. Identification of SSE4.2 with cpuid . . . 4-4
Example 4-8. Detection of AESNI Instructions . . . 4-5
Example 4-9. Detection of PCLMULQDQ Instruction . . . 4-5
Example 4-10. Detection of AVX Instruction . . . 4-6
Example 4-11. Detection of VEX-Encoded AESNI Instructions . . . 4-7
Example 4-12. Detection of VEX-Encoded AESNI Instructions . . . 4-7
Example 4-13. Simple Four-Iteration Loop . . . 4-14
Example 4-14. Streaming SIMD Extensions Using Inlined Assembly Encoding . . . 4-14
Example 4-15. Simple Four-Iteration Loop Coded with Intrinsics . . . 4-15
Example 4-16. C++ Code Using the Vector Classes . . . 4-16
Example 4-17. Automatic Vectorization for a Simple Loop . . . 4-16
Example 4-18. C Algorithm for 64-bit Data Alignment . . . 4-18
Example 4-19. AoS Data Structure . . . 4-21
Example 4-20. SoA Data Structure . . . 4-21
Example 4-21. AoS and SoA Code Samples . . . 4-21
Example 4-22. Hybrid SoA Data Structure . . . 4-22
Example 4-23. Pseudo-code Before Strip Mining . . . 4-23
Example 4-24. Strip Mined Code . . . 4-24
Example 4-25. Loop Blocking . . . 4-24
Example 4-26. Emulation of Conditional Moves . . . 4-26
Example 5-1. Resetting Register Between __m64 and FP Data Types Code . . . 5-3
Example 5-2. FIR Processing Example in C language Code . . . 5-4
Example 5-3. SSE2 and SSSE3 Implementation of FIR Processing Code . . . 5-4
Example 5-4. Zero Extend 16-bit Values into 32 Bits Using Unsigned Unpack Instructions Code . . . 5-5
Example 5-5. Signed Unpack Code . . . 5-5
Example 5-6. Interleaved Pack with Saturation Code . . . 5-7
Example 5-7. Interleaved Pack without Saturation Code . . . 5-7
Example 5-8. Unpacking Two Packed-word Sources in Non-interleaved Way Code . . . 5-9
Example 5-9. PEXTRW Instruction Code . . . 5-10
Example 5-10. PINSRW Instruction Code . . . 5-10
Example 5-11. Repeated PINSRW Instruction Code . . . 5-11
Example 5-12. Non-Unit Stride Load/Store Using SSE4.1 Instructions . . . 5-11
Example 5-13. Scatter and Gather Operations Using SSE4.1 Instructions . . . 5-11
Example 5-14. PMOVMSKB Instruction Code . . . 5-12
Example 5-15. Broadcast a Word Across XMM, Using 2 SSE2 Instructions . . . 5-13
Example 5-16. Swap/Reverse words in an XMM, Using 3 SSE2 Instructions . . . 5-13
Example 5-17. Generating Constants . . . 5-15
Example 5-18. Absolute Difference of Two Unsigned Numbers . . . 5-15
Example 5-19. Absolute Difference of Signed Numbers . . . 5-16
Example 5-20. Computing Absolute Value . . . 5-16
Example 5-21. Basic C Implementation of RGBA to BGRA Conversion . . . 5-17
Example 5-22. Color Pixel Format Conversion Using SSE2 . . . 5-17
Example 5-23. Color Pixel Format Conversion Using SSSE3 . . . 5-18
Example 5-24. Big-Endian to Little-Endian Conversion . . . 5-19
Example 5-25. Clipping to a Signed Range of Words [High, Low] . . . 5-20
Example 5-26. Clipping to an Arbitrary Signed Range [High, Low] . . . 5-20
Example 5-27. Simplified Clipping to an Arbitrary Signed Range
Example 5-28. Clipping to an Arbitrary Unsigned Range [High, Low]
Example 5-29. Complex Multiply by a Constant
Example 5-30. Using PTEST to Separate Vectorizable and non-Vectorizable Loop Iterations
Example 5-31. Using PTEST and Variable BLEND to Vectorize Heterogeneous Loops
Example 5-32. Baseline C Code for Mandelbrot Set Map Evaluation
Example 5-33. Vectorized Mandelbrot Set Map Evaluation Using SSE4.1 Intrinsics
Example 5-34. A Large Load after a Series of Small Stores (Penalty)
Example 5-35. Accessing Data Without Delay
Example 5-36. A Series of Small Loads After a Large Store
Example 5-37. Eliminating Delay for a Series of Small Loads after a Large Store
Example 5-38. An Example of Video Processing with Cache Line Splits
Example 5-39. Video Processing Using LDDQU to Avoid Cache Line Splits
Example 5-40. Un-optimized Reverse Memory Copy in C
Example 5-41. Using PSHUFB to Reverse Byte Ordering 16 Bytes at a Time
Example 5-42. PMOVSX/PMOVZX Work-around to Avoid False Dependency
Example 5-43. Table Look-up Operations in C Code
Example 5-44. Shift Techniques on Non-Vectorizable Table Look-up
Example 5-45. PEXTRD Techniques on Non-Vectorizable Table Look-up
Example 5-46. Pseudo-Code Flow of AES Counter Mode Operation
Example 5-47. AES128-CTR Implementation with Eight Block in Parallel
Example 5-48. AES128 Key Expansion
Example 5-49. Compress 32-bit Integers into 5-bit Buckets
Example 5-50. Decompression of a Stream of 5-bit Integers into 32-bit Elements
Example 6-1. Pseudocode for Horizontal (xyz, AoS) Computation
Example 6-2. Pseudocode for Vertical (xxxx, yyyy, zzzz, SoA) Computation
Example 6-3. Swizzling Data Using SHUFPS, MOVLHPS, MOVHLPS
Example 6-4. Swizzling Data Using UNPCKxxx Instructions
Example 6-5. Deswizzling Single-Precision SIMD Data
Example 6-6. Deswizzling Data Using SIMD Integer Instructions
Example 6-7. Horizontal Add Using MOVHLPS/MOVLHPS
Example 6-8. Horizontal Add Using Intrinsics with MOVHLPS/MOVLHPS
Example 6-9. Multiplication of Two Pair of Single-precision Complex Number
Example 6-10. Division of Two Pair of Single-precision Complex Numbers
Example 6-11. Double-Precision Complex Multiplication of Two Pairs
Example 6-12. Double-Precision Complex Multiplication Using Scalar SSE2
Example 6-13. Dot Product of Vector Length 4 Using SSE/SSE2
Example 6-14. Dot Product of Vector Length 4 Using SSE3
Example 6-15. Dot Product of Vector Length 4 Using SSE4.1
Example 6-16. Unrolled Implementation of Four Dot Products
Example 6-17. Normalization of an Array of Vectors
Example 6-18. Normalize (x, y, z) Components of an Array of Vectors Using SSE2
Example 6-19. Normalize (x, y, z) Components of an Array of Vectors Using SSE4.1
Example 6-20. Data Organization in Memory for AOS Vector-Matrix Multiplication
Example 6-21. AOS Vector-Matrix Multiplication with HADDPS
Example 6-22. AOS Vector-Matrix Multiplication with DPPS
Example 6-23. Data Organization in Memory for SOA Vector-Matrix Multiplication
Example 6-24. Vector-Matrix Multiplication with Native SOA Data Layout
Example 7-1. Pseudo-code Using CLFLUSH
Example 7-2. Flushing Cache Lines Using CLFLUSH or CLFLUSHOPT
Example 7-3. Populating an Array for Circular Pointer Chasing with Constant Stride
Example 7-4. Prefetch Scheduling Distance
Example 7-5. Using Prefetch Concatenation
Example 7-6. Concatenation and Unrolling the Last Iteration of Inner Loop
Example 7-7. Data Access of a 3D Geometry Engine without Strip-mining
Example 7-8. Data Access of a 3D Geometry Engine with Strip-mining
Example 7-9. Using HW Prefetch to Improve Read-Once Memory Traffic
Example 7-10. Basic Algorithm of a Simple Memory Copy
Example 7-11. A Memory Copy Routine Using Software Prefetch
Example 7-12. Memory Copy Using Hardware Prefetch and Bus Segmentation
Example 8-1. Serial Execution of Producer and Consumer Work Items
Example 8-2. Basic Structure of Implementing Producer Consumer Threads
Example 8-3. Thread Function for an Interlaced Producer Consumer Model
Example 8-4. Spin-wait Loop and PAUSE Instructions
Example 8-5. Coding Pitfall using Spin Wait Loop
Example 8-6. Placement of Synchronization and Regular Variables
Example 8-7. Declaring Synchronization Variables without Sharing a Cache Line
Example 8-8. Batched Implementation of the Producer Consumer Threads
Example 8-9. Parallel Memory Initialization Technique Using OpenMP and NUMA
Example 9-1. Compute 64-bit Quotient and Remainder with 64-bit Divisor
Example 9-2. Quotient and Remainder of 128-bit Dividend with 64-bit Divisor
Example 10-1. A Hash Function Examples
Example 10-2. Hash Function Using CRC32
Example 10-3. Strlen() Using General-Purpose Instructions
Example 10-4. Sub-optimal PCMPISTRI Implementation of EOS handling
Example 10-5. Strlen() Using PCMPISTRI without Loop-Carry Dependency
Example 10-6. WordCnt() Using C and Byte-Scanning Technique
Example 10-7. WordCnt() Using PCMPISTRM
Example 10-8. KMP Substring Search in C
Example 10-9. Brute-Force Substring Search Using PCMPISTRI Intrinsic
Example 10-10. Substring Search Using PCMPISTRI and KMP Overlap Table
Example 10-11. I Equivalent Strtok_s() Using PCMPISTRI Intrinsic
Example 10-12. I Equivalent Strupr() Using PCMPISTRM Intrinsic
Example 10-13. UTF16 VerStrlen() Using C and Table Lookup Technique
Example 10-14. Assembly Listings of UTF16 VerStrlen() Using PCMPISTRI
Example 10-15. Intrinsic Listings of UTF16 VerStrlen() Using PCMPISTRI
Example 10-16. Replacement String Library Strcmp Using SSE4.2
Example 10-17. High-level flow of Character Subset Validation for String Conversion
Example 10-18. Intrinsic Listings of atol() Replacement Using PCMPISTRI
Example 10-19. Auxiliary Routines and Data Constants Used in sse4i_atol() listing
Example 10-20. Conversion of 64-bit Integer to ASCII
Example 10-21. Conversion of 64-bit Integer to ASCII without Integer Division
Example 10-22. Conversion of 64-bit Integer to ASCII Using SSE4
Example 10-23. Conversion of 64-bit Integer to Wide Character String Using SSE4
Example 10-24. MULX and Carry Chain in Large Integer Numeric
Example 10-25. Building-block Macro Used in Binary Decimal Floating-point Operations
Example 11-1. Cartesian Coordinate Transformation with Intrinsics
Example 11-2. Cartesian Coordinate Transformation with Assembly
Example 11-3. Direct Polynomial Calculation
Example 11-4. Function Calls and AVX/SSE transitions
Example 11-5. AoS to SoA Conversion of Complex Numbers in C Code
Example 11-6. AoS to SoA Conversion of Complex Numbers Using AVX
Example 11-7. Register Overlap Method for Median of 3 Numbers
Example 11-8. Data Gather - AVX versus Scalar Code
Example 11-9. Scatter Operation Using AVX
Example 11-10. SAXPY using Intel AVX
Example 11-11. Using 16-Byte Memory Operations for Unaligned 32-Byte Memory Operation
Example 11-12. SAXPY Implementations for Unaligned Data Addresses
Example 11-13. Loop with Conditional Expression
Example 11-14. Handling Loop Conditional with VMASKMOV
Example 11-15. Three-Tap Filter in C Code
Example 11-16. Three-Tap Filter with 128-bit Mixed Integer and FP SIMD
Example 11-17. 256-bit AVX Three-Tap Filter Code with VSHUFPS
Example 11-18. Three-Tap Filter Code with Mixed 256-bit AVX and 128-bit AVX Code
Example 11-19. 8x8 Matrix Transpose - Replace Shuffles with Blends
Example 11-20. 8x8 Matrix Transpose Using VINSERTPS
Example 11-21. Port 5 versus Load Port Shuffles
Example 11-22. Divide Using DIVPS for 24-bit Accuracy
Example 11-23. Divide Using RCPPS 11-bit Approximation
Example 11-24. Divide Using RCPPS and Newton-Raphson Iteration
Example 11-25. Reciprocal Square Root Using DIVPS+SQRTPS for 24-bit Accuracy
Example 11-26. Reciprocal Square Root Using RCPPS 11-bit Approximation
Example 11-27. Reciprocal Square Root Using RCPPS and Newton-Raphson Iteration
Example 11-28. Square Root Using SQRTPS for 24-bit Accuracy
Example 11-29. Square Root Using RCPPS 11-bit Approximation
Example 11-30. Square Root Using RCPPS and One Taylor Series Expansion
Example 11-31. Array Sub Sums Algorithm
Example 11-32. Single-Precision to Half-Precision Conversion
Example 11-33. Half-Precision to Single-Precision Conversion
Example 11-34. Performance Comparison of Median3 using Half-Precision vs. Single-Precision
Example 11-35. FP Mul/FP Add Versus FMA
Example 11-36. Unrolling to Hide Dependent FP Add Latency
Example 11-37. FP Mul/FP Add Versus FMA
Example 11-38. Macros for Separable KLT Intra-block Transformation Using AVX2
Example 11-39. Separable KLT Intra-block Transformation Using AVX2
Example 11-40. Macros for Parallel Moduli/Remainder Calculation
Example 11-41. Signed 64-bit Integer Conversion Utility
Example 11-42. Unsigned 63-bit Integer Conversion Utility
Example 11-43. Access Patterns Favoring Non-VGATHER Techniques
Example 11-44. Access Patterns Likely to Favor VGATHER Techniques
Example 11-45. Software AVX Sequence Equivalent to Full-Mask VPGATHERD
Example 11-46. AOS to SOA Transformation Alternatives
Example 11-47. Non-Strided AOS to SOA
Example 11-48. Conversion to Throughput-Reduced MMX sequence to AVX2 Alternative
Example 12-1. Reduce Data Conflict with Conditional Updates
Example 12-2. Transition from Non-Elided Execution without Aborting
Example 12-3. Exemplary Wrapper Using RTM for Lock/Unlock Primitives
Example 12-4. Spin Lock Example Using HLE in GCC 4.8 and Later
Example 12-5. Spin Lock Example Using HLE in Intel and Microsoft Compiler Intrinsic
Example 12-6. A Meta Lock Example
Example 12-7. A Meta Lock Example Using RTM
Example 12-8. HLE-enabled Lock-Acquire/Lock-Release Sequence
Example 12-9. A Spin Wait Example Using HLE
Example 12-10. A Conceptual Example of Intermixed HLE and RTM
Example 12-11. Emulated RTM intrinsic for Older GCC compilers
Example 12-12. C++ Example of HLE Intrinsic
Example 12-13. Emulated HLE Intrinsic with Older GCC compiler
Example 12-14. HLE Intrinsic Supported by Intel and Microsoft Compilers
Example 13-1. Unoptimized Sleep Loop
Example 13-2. Power Consumption Friendly Sleep Loop Using PAUSE
Example 14-1. Unrolled Loop Executes In-Order Due to Multiply-Store Port Conflict
Example 14-2. Grouping Store Instructions Eliminates Bubbles and Improves IPC
Example 15-1. Gather Comparison Between AVX-512F and AVX2
Example 15-2. Gather Comparison Between AVX-512F and KNC Equivalent
Example 15-3. Using VRCP28SS for 32-bit Floating-Point Division
Example 15-4. Vectorized Histogram Update Using AVX-512CD
Example 15-5. Replace VCOMIS* with VCMPSS/KORTEST
Example 15-6. Using Software Sequence for Horizontal Reduction
Example 15-7. Optimized Inner Loop of DGEMM for Knights Landing Microarchitecture
Example 15-8. Ordering of Memory Instruction for MEC
Example D-1. Instruction Pairing and Alignment to Optimize Decode Throughput on Intel® Atom™ Microarchitecture
Example D-2. Alternative to Prevent AGU and Execution Unit Dependency
Example D-3. Pipelining Instruction Execution in Integer Computation
Example D-4. Memory Copy of 64-byte
Example D-5. Examples of Dependent Multiply and Add Computation
Example D-6. Instruction Pointer Query Techniques
FIGURES
Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture
Figure 2-2. CPU Core Pipeline Functionality of the Haswell Microarchitecture
Figure 2-3. Four Core System Integration of the Haswell Microarchitecture
Figure 2-4. An Example of the Haswell-E Microarchitecture Supporting 12 Processor Cores
Figure 2-5. Intel Microarchitecture Code Name Sandy Bridge Pipeline Functionality
Figure 2-6. Intel Core Microarchitecture Pipeline Functionality
Figure 2-7. Execution Core of Intel Core Microarchitecture
Figure 2-8. Store-Forwarding Enhancements in Enhanced Intel Core Microarchitecture
Figure 2-9. Intel Advanced Smart Cache Architecture
Figure 2-10. Intel Microarchitecture Code Name Nehalem Pipeline Functionality
Figure 2-11. Front End of Intel Microarchitecture Code Name Nehalem
Figure 2-12. Store-Forwarding Scenarios of 16-Byte Store Operations
Figure 2-13. Store-Forwarding Enhancement in Intel Microarchitecture Code Name Nehalem
Figure 2-14. Hyper-Threading Technology on an SMP
Figure 2-15. Typical SIMD Operations
Figure 2-16. SIMD Instruction Register Usage
Figure 3-1. Generic Program Flow of Partially Vectorized Code
Figure 3-2. Cache Line Split in Accessing Elements in an Array
Figure 3-3. Size and Alignment Restrictions in Store Forwarding
Figure 3-4. Memcpy Performance Comparison for Lengths up to 2KB
Figure 4-1. General Procedural Flow of Application Detection of AVX
Figure 4-2. General Procedural Flow of Application Detection of Float-16
Figure 4-3. Converting to Streaming SIMD Extensions Chart
Figure 4-4. Hand-Coded Assembly and High-Level Compiler Performance Trade-offs
Figure 4-5. Loop Blocking Access Pattern
Figure 5-1. PACKSSDW mm, mm/mm64 Instruction
Figure 5-2. Interleaved Pack with Saturation
Figure 5-3. Result of Non-Interleaved Unpack Low in MM0
Figure 5-4. Result of Non-Interleaved Unpack High in MM1
Figure 5-5. PEXTRW Instruction
Figure 5-6. PINSRW Instruction
Figure 5-7. PMOVMSKB Instruction
Figure 5-8. Data Alignment of Loads and Stores in Reverse Memory Copy
Figure 5-9. A Technique to Avoid Cacheline Split Loads in Reverse Memory Copy Using Two Aligned Loads
Figure 6-1. Homogeneous Operation on Parallel Data Elements
Figure 6-2. Horizontal Computation Model
Figure 6-3. Dot Product Operation
Figure 6-4. Horizontal Add Using MOVHLPS/MOVLHPS
Figure 6-5. Asymmetric Arithmetic Operation of the SSE3 Instruction
Figure 6-6. Horizontal Arithmetic Operation of the SSE3 Instruction HADDPD
Figure 7-1. CLFLUSHOPT versus CLFLUSH In SkyLake Microarchitecture
Figure 7-2. Figure 7-3. Figure 7-4. Figure 7-5. Figure 7-6. Figure 7-7. Figure 7-8. Figure 7-9. Figure 7-10. Figure 8-1. Figure 8-2. Figure 8-3. Figure 8-4. Figure 8-5. Figure 10-1. Figure 10-2.
. . . . . . . . . . . . . . . . . . . . . . . . . . 7-11 Effective Latency Reduction as a Function of Access Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14 Memory Access Latency and Execution Without Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14 Memory Access Latency and Execution With Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-15 Prefetch and Loop Unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-18 Memory Access Latency and Execution With Prefetch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-18 Spread Prefetch Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-19 Cache Blocking – Temporally Adjacent and Non-adjacent Passes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-20 Examples of Prefetch and Strip-mining for Temporally Adjacent and Non-Adjacent Passes Loops . . . 7-21 Single-Pass Vs. Multi-Pass 3D Geometry Engines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-25 Amdahl’s Law and MP Speed-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-2 Single-threaded Execution of Producer-consumer Threading Model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-5 Execution of Producer-consumer Threading Model on a Multicore Processor . . . . . . . . . . . . . . . . . . . . . . . . .8-5 Interlaced Variation of the Producer Consumer Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8-6 Batched Approach of Producer Consumer Model. . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-21 SSE4.2 String/Text Instruction Immediate Operand Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-2 Retrace Inefficiency of Byte-Granular, Brute-Force Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10-12 xxiii
Figure 10-3. SSE4.2 Speedup of SubString Searches
Figure 10-4. Compute Four Remainders of Unsigned Short Integer in Parallel
Figure 11-1. AVX-SSE Transitions in the Broadwell, and Prior Generation Microarchitectures
Figure 11-2. AVX-SSE Transitions in the Skylake Microarchitecture
Figure 11-3. 4x4 Image Block Transformation
Figure 11-4. Throughput Comparison of Gather Instructions
Figure 11-5. Comparison of HW GATHER Versus Software Sequence in Skylake Microarchitecture
Figure 13-1. Performance History and State Transitions
Figure 13-2. Active Time Versus Halted Time of a Processor
Figure 13-3. Application of C-states to Idle Time
Figure 13-4. Profiles of Coarse Task Scheduling and Power Consumption
Figure 13-5. Thread Migration in a Multicore Processor
Figure 13-6. Progression to Deeper Sleep
Figure 13-7. Energy Saving due to Performance Optimization
Figure 13-8. Energy Saving due to Vectorization
Figure 13-9. Energy Saving Comparison of Synchronization Primitives
Figure 13-10. Power Saving Comparison of Power-Source-Aware Frame Rate Configurations
Figure 14-1. CPU Core Pipeline Functionality of the Goldmont Microarchitecture
Figure 14-2. Silvermont Microarchitecture Pipeline
Figure 15-1. Tile-Mesh Topology of the Knights Landing Microarchitecture
Figure 15-2. Processor Core Pipeline Functionality of the Knights Landing Microarchitecture
Figure B-1. General TMAM Hierarchy for Out-of-Order Microarchitectures
Figure B-2. TMAM’s Top Level Drill Down Flowchart
Figure B-3. TMAM Hierarchy Supported by Skylake Microarchitecture
Figure B-4. System Topology Supported by Intel® Xeon® Processor 5500 Series
Figure B-5. PMU Specific Event Logic Within the Pipeline
Figure B-6. LBR Records and Basic Blocks
Figure B-7. Using LBR Records to Rectify Skewed Sample Distribution
Figure B-8. RdData Request after LLC Miss to Local Home (Clean Rsp)
Figure B-9. RdData Request after LLC Miss to Remote Home (Clean Rsp)
Figure B-11. RdData Request after LLC Miss to Local Home (Hitm Response)
Figure B-10. RdData Request after LLC Miss to Remote Home (Hitm Response)
Figure B-12. RdData Request after LLC Miss to Local Home (Hit Response)
Figure B-13. RdInvOwn Request after LLC Miss to Remote Home (Clean Res)
Figure B-15. RdInvOwn Request after LLC Miss to Local Home (Hit Res)
Figure B-14. RdInvOwn Request after LLC Miss to Remote Home (Hitm Res)
Figure B-16. Performance Events Drill-Down and Software Tuning Feedback Loop
Figure D-1. Intel Atom Microarchitecture Pipeline
TABLES
Table 2-1. Dispatch Port and Execution Stacks of the Skylake Microarchitecture
Table 2-2. Skylake Microarchitecture Execution Units and Representative Instructions
Table 2-3. Bypass Delay Between Producer and Consumer Micro-ops
Table 2-4. Cache Parameters of the Skylake Microarchitecture
Table 2-5. TLB Parameters of the Skylake Microarchitecture
Table 2-6. Dispatch Port and Execution Stacks of the Haswell Microarchitecture
Table 2-7. Haswell Microarchitecture Execution Units and Representative Instructions
Table 2-8. Bypass Delay Between Producer and Consumer Micro-ops (cycles)
Table 2-9. Cache Parameters of the Haswell Microarchitecture
Table 2-10. TLB Parameters of the Haswell Microarchitecture
Table 2-11. TLB Parameters of the Broadwell Microarchitecture
Table 2-12. Components of the Front End of Intel Microarchitecture Code Name Sandy Bridge
Table 2-13. ICache and ITLB of Intel Microarchitecture Code Name Sandy Bridge
Table 2-14. Dispatch Port and Execution Stacks
Table 2-15. Execution Core Writeback Latency (cycles)
Table 2-16. Cache Parameters
Table 2-17. Lookup Order and Load Latency
Table 2-18. L1 Data Cache Components
Table 2-19. Effect of Addressing Modes on Load Latency
Table 2-20. DTLB and STLB Parameters
Table 2-21. Store Forwarding Conditions (1 and 2 byte stores)
Table 2-22. Store Forwarding Conditions (4-16 byte stores)
Table 2-23. 32-byte Store Forwarding Conditions (0-15 byte alignment)
Table 2-24. 32-byte Store Forwarding Conditions (16-31 byte alignment)
Table 2-25. Components of the Front End
Table 2-26. Issue Ports of Intel Core Microarchitecture and Enhanced Intel Core Microarchitecture
Table 2-27. Cache Parameters of Processors based on Intel Core Microarchitecture
Table 2-28. Characteristics of Load and Store Operations in Intel Core Microarchitecture
Table 2-29. Bypass Delay Between Producer and Consumer Micro-ops (cycles)
Table 2-30. Issue Ports of Intel Microarchitecture Code Name Nehalem
Table 2-31. Cache Parameters of Intel Core i7 Processors
Table 2-32. Performance Impact of Address Alignments of MOVDQU from L1
Table 3-1. Macro-Fusible Instructions in Intel Microarchitecture Code Name Sandy Bridge
Table 3-2. Small Loop Criteria Detected by Sandy Bridge and Haswell Microarchitectures
Table 3-3. Store Forwarding Restrictions of Processors Based on Intel Core Microarchitecture
Table 3-4. Relative Performance of Memcpy() Using ERMSB Vs. 128-bit AVX
Table 3-5. Effect of Address Misalignment on Memcpy() Performance
Table 5-1. PSHUF Encoding
Table 6-1. SoA Form of Representing Vertices Data
Table 7-1. Software Prefetching Considerations into Strip-mining Code
Table 7-2. Relative Performance of Memory Copy Routines
Table 7-3. Deterministic Cache Parameters Leaf
Table 8-1. Properties of Synchronization Objects
Table 8-2. Design-Time Resource Management Choices
Table 8-3. Microarchitectural Resources Comparisons of HT Implementations
Table 10-1. SSE4.2 String/Text Instructions Compare Operation on N-elements
Table 10-2. SSE4.2 String/Text Instructions Unary Transformation on IntRes1
Table 10-3. SSE4.2 String/Text Instructions Output Selection Imm[6]
Table 10-4. SSE4.2 String/Text Instructions Element-Pair Comparison Definition
Table 10-5. SSE4.2 String/Text Instructions Eflags Behavior
Table 11-1. Features between 256-bit AVX, 128-bit AVX and Legacy SSE Extensions
Table 11-2. State Transitions of Mixing AVX and SSE Code
Table 11-3. Approximate Magnitude of AVX-SSE Transition Penalties in Different Microarchitectures
Table 11-4. Effect of VZEROUPPER with Inter-Function Calls Between AVX and SSE Code
Table 11-5. Comparison of Numeric Alternatives of Selected Linear Algebra in Skylake Microarchitecture
Table 11-6. Single-Precision Divide and Square Root Alternatives
Table 11-7. Comparison of Single-Precision Divide Alternatives
Table 11-8. Comparison of Single-Precision Reciprocal Square Root Operation
Table 11-9. Comparison of Single-Precision Square Root Operation
Table 11-10. Comparison of AOS to SOA with Strided Access Pattern
Table 11-11. Comparison of Indexed AOS to SOA Transformation
Table 12-1. RTM Abort Status Definition
Table 13-1. ACPI C-State Type Mappings to Processor Specific C-State for Mobile Processors Based on Intel Microarchitecture Code Name Nehalem
Table 13-2. ACPI C-State Type Mappings to Processor Specific C-State of Intel Microarchitecture Code Name Sandy Bridge
Table 13-3. C-State Total Processor Exit Latency for Client Systems (Core+ Package Exit Latency) with Slow VR
Table 13-4. C-State Total Processor Exit Latency for Client Systems (Core+ Package Exit Latency) with Fast VR
Table 13-5. C-State Core-Only Exit Latency for Client Systems with Slow VR
Table 13-6. POWER_CTL MSR in Next Generation Intel Processor (Intel® Microarchitecture Code Name Sandy Bridge)
Table 14-1. Comparison of Front End Cluster Features
Table 14-2. Comparison of Distributed Reservation Stations on Scheduling Uops
Table 14-3. Function Unit Mapping of the Goldmont Microarchitecture
Table 14-4. Comparison of MEC Resources
Table 14-5. Function Unit Mapping of the Silvermont Microarchitecture
Table 14-6. Alternatives to MSROM Instructions
Table 14-7. Comparison of Decoder Capabilities
Table 14-8. Integer Multiply Operation Latency
Table 14-9. Floating-Point and SIMD Integer Latency
Table 14-10. Unsigned Integer Division Operation Latency
Table 14-11. Signed Integer Division Operation Latency
Table 14-12. Store Forwarding Conditions (1 and 2 Byte Stores)
Table 14-13. Store Forwarding Conditions (4-16 Byte Stores)
Table 14-14. Instructions Latency and Throughput Recent Microarchitectures for Intel Atom Processors
Table 15-1. Integer Pipeline Characteristics of the Knights Landing Microarchitecture
Table 15-2. Vector Pipeline Characteristics of the Knights Landing Microarchitecture
Table 15-3. Characteristics of Caching Resources
Table 15-4. Alternatives to MSROM Instructions
Table 15-5. Cycle Cost Building Blocks for Vectorization Estimate for Knights Landing Microarchitecture
Table A-1. Recommended Processor Optimization Options
Table B-1. Performance Monitoring Taxonomy
Table B-2. Cycle Accounting and Micro-ops Flow Recipe
Table B-3. CMask/Inv/Edge/Thread Granularity of Events for Micro-op Flow
Table B-4. Cycle Accounting of Wasted Work Due to Misprediction
Table B-5. Cycle Accounting of Instruction Starvation
Table B-6. CMask/Inv/Edge/Thread Granularity of Events for Micro-op Flow
Table B-7. Approximate Latency of L2 Misses of Intel Xeon Processor 5500
Table B-8. Load Latency Event Programming
Table B-9. Data Source Encoding for Load Latency PEBS Record
Table B-10. Core PMU Events to Drill Down L2 Misses
Table B-11. Core PMU Events for Super Queue Operation
Table B-12. Core PMU Event to Drill Down OFFCore Responses
Table B-13. OFFCORE_RSP_0 MSR Programming
Table B-14. Common Request and Response Types for OFFCORE_RSP_0 MSR
Table B-15. Uncore PMU Events for Occupancy Cycles
Table B-16. Common QHL Opcode Matching Facility Programming
Table C-1. CPUID Signature Values of Recent Intel Microarchitectures
Table C-2. Instruction Extensions Introduction by Microarchitectures (CPUID Signature)
Table C-3. BMI1, BMI2 and General Purpose Instructions
Table C-4. 256-bit AVX2 Instructions
Table C-5. Gather Timing Data from L1D*
Table C-6. BMI1, BMI2 and General Purpose Instructions
Table C-7. F16C, RDRAND Instructions
Table C-8. 256-bit AVX Instructions
Table C-9. AESNI and PCLMULQDQ Instructions
Table C-10. SSE4.2 Instructions
Table C-11. SSE4.1 Instructions
Table C-12. Supplemental Streaming SIMD Extension 3 Instructions
Table C-13. Streaming SIMD Extension 3 SIMD Floating-point Instructions
Table C-14. Streaming SIMD Extension 2 128-bit Integer Instructions
Table C-15. Streaming SIMD Extension 2 Double-precision Floating-point Instructions
Table C-16. Streaming SIMD Extension Single-precision Floating-point Instructions
Table C-17. General Purpose Instructions
Table C-18. Pointer-Chasing Variability of Software Measurable Latency of L1 Data Cache Latency
Table D-1. Instruction Latency/Throughput Summary of Intel® Atom™ Microarchitecture
Table D-2. Intel® Atom™ Microarchitecture Instructions Latency Data
CHAPTER 1
INTRODUCTION
The Intel® 64 and IA-32 Architectures Optimization Reference Manual describes how to optimize software to take advantage of the performance characteristics of IA-32 and Intel 64 architecture processors. Optimizations described in this manual apply to processors based on the Intel® Core™ microarchitecture, Enhanced Intel® Core™ microarchitecture, Intel® microarchitecture code name Nehalem, Intel® microarchitecture code name Westmere, Intel® microarchitecture code name Sandy Bridge, Intel® microarchitecture code name Ivy Bridge, Intel® microarchitecture code name Haswell, and Intel NetBurst® microarchitecture, and to the Intel® Core™ Duo, Intel® Core™ Solo, and Pentium® M processor families.
The target audience for this manual includes software programmers and compiler writers. This manual assumes that the reader is familiar with the basics of the IA-32 architecture and has access to the Intel® 64 and IA-32 Architectures Software Developer’s Manual (five volumes). A detailed understanding of Intel 64 and IA-32 processors is often required. In many cases, knowledge of the underlying microarchitectures is required.
The design guidelines that are discussed in this manual for developing high-performance software generally apply to current as well as to future IA-32 and Intel 64 processors. The coding rules and code optimization techniques listed target the Intel Core microarchitecture, the Intel NetBurst microarchitecture and the Pentium M processor microarchitecture. In most cases, coding rules apply to software running in 64-bit mode of Intel 64 architecture, compatibility mode of Intel 64 architecture, and IA-32 modes (IA-32 modes are supported in IA-32 and Intel 64 architectures). Coding rules specific to 64-bit modes are noted separately.
1.1 TUNING YOUR APPLICATION
Tuning an application for high performance on any Intel 64 or IA-32 processor requires understanding and basic skills in:
• Intel 64 and IA-32 architecture.
• C and Assembly language.
• Hot-spot regions in the application that have impact on performance.
• Optimization capabilities of the compiler.
• Techniques used to evaluate application performance.
The Intel® VTune™ Performance Analyzer can help you analyze and locate hot-spot regions in your applications. On the Intel® Core™ i7, Intel® Core™2 Duo, Intel® Core™ Duo, Intel® Core™ Solo, Pentium® 4, Intel® Xeon® and Pentium® M processors, this tool can monitor an application through a selection of performance monitoring events and analyze the performance event data that is gathered during code execution.
This manual also describes information that can be gathered using the performance counters through the Pentium 4 processor's performance monitoring events.
1.2 ABOUT THIS MANUAL
The Intel® Xeon® processor 3000, 3200, 5100, 5300, 7200 and 7300 series, Intel® Pentium® dual-core, Intel® Core™2 Duo, Intel® Core™2 Quad, and Intel® Core™2 Extreme processors are based on Intel® Core™ microarchitecture. In this document, references to the Core 2 Duo processor refer to processors based on the Intel® Core™ microarchitecture.
The Intel® Xeon® processor 3100, 3300, 5200, 5400, 7400 series, Intel® Core™2 Quad processor Q8000 series, and Intel® Core™2 Extreme processors QX9000 series are based on 45 nm Enhanced Intel® Core™ microarchitecture.
The Intel® Core™ i7 processor and Intel® Xeon® processor 3400, 5500, 7500 series are based on 45 nm Intel® microarchitecture code name Nehalem. Intel® microarchitecture code name Westmere is a 32 nm version of Intel® microarchitecture code name Nehalem. Intel® Xeon® processor 5600 series, Intel Xeon processor E7 and various Intel Core i7, i5, i3 processors are based on Intel® microarchitecture code name Westmere.
The Intel® Xeon® processor E5 family, Intel® Xeon® processor E3-1200 family, Intel® Xeon® processor E7-8800/4800/2800 product families, Intel® Core™ i7-3930K processor, and 2nd generation Intel® Core™ i7-2xxx, Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx processor series are based on the Intel® microarchitecture code name Sandy Bridge.
The 3rd generation Intel® Core™ processors and the Intel Xeon processor E3-1200 v2 product family are based on Intel® microarchitecture code name Ivy Bridge. The Intel® Xeon® processor E5 v2 and E7 v2 families are based on the Ivy Bridge-E microarchitecture and support Intel 64 architecture and multiple physical processor packages in a platform.
The 4th generation Intel® Core™ processors and the Intel® Xeon® processor E3-1200 v3 product family are based on Intel® microarchitecture code name Haswell. The Intel® Xeon® processor E5 26xx v3 family is based on the Haswell-E microarchitecture and supports Intel 64 architecture and multiple physical processor packages in a platform.
The Intel® Core™ M processor family and 5th generation Intel® Core™ processors are based on the Intel® microarchitecture code name Broadwell and support Intel 64 architecture. The 6th generation Intel® Core™ processors are based on the Intel® microarchitecture code name Skylake and support Intel 64 architecture.
In this document, references to the Pentium 4 processor refer to processors based on the Intel NetBurst® microarchitecture. This includes the Intel Pentium 4 processor and many Intel Xeon processors based on Intel NetBurst microarchitecture. Where appropriate, differences are noted (for example, some Intel Xeon processors have third level cache).
The Dual-core Intel® Xeon® processor LV is based on the same architecture as Intel® Core™ Duo and Intel® Core™ Solo processors. The Intel® Atom™ processor is based on Intel® Atom™ microarchitecture.
The following bullets summarize chapters in this manual.
• Chapter 1: Introduction - Defines the purpose and outlines the contents of this manual.
• Chapter 2: Intel® 64 and IA-32 Processor Architectures - Describes the microarchitecture of recent IA-32 and Intel 64 processor families, and other features relevant to software optimization.
• Chapter 3: General Optimization Guidelines - Describes general code development and optimization techniques that apply to all applications designed to take advantage of the common features of the Intel Core microarchitecture, Enhanced Intel Core microarchitecture, Intel NetBurst microarchitecture and Pentium M processor microarchitecture.
• Chapter 4: Coding for SIMD Architectures - Describes techniques and concepts for using the SIMD integer and SIMD floating-point instructions provided by the MMX™ technology, Streaming SIMD Extensions, Streaming SIMD Extensions 2, Streaming SIMD Extensions 3, SSSE3, and SSE4.1.
• Chapter 5: Optimizing for SIMD Integer Applications - Provides optimization suggestions and common building blocks for applications that use the 128-bit SIMD integer instructions.
• Chapter 6: Optimizing for SIMD Floating-point Applications - Provides optimization suggestions and common building blocks for applications that use the single-precision and double-precision SIMD floating-point instructions.
• Chapter 7: Optimizing Cache Usage - Describes how to use the PREFETCH instruction, cache control management instructions to optimize cache usage, and the deterministic cache parameters.
• Chapter 8: Multicore and Hyper-Threading Technology - Describes guidelines and techniques for optimizing multithreaded applications to achieve optimal performance scaling. Use these when targeting multicore processors, processors supporting Hyper-Threading Technology, or multiprocessor (MP) systems.
• Chapter 9: 64-Bit Mode Coding Guidelines - Describes a set of additional coding guidelines for application software written to run in 64-bit mode.
• Chapter 10: SSE4.2 and SIMD Programming for Text-Processing/Lexing/Parsing - Describes SIMD techniques of using SSE4.2 along with other instruction extensions to improve text/string processing and lexing/parsing applications.
• Chapter 11: Optimizations for Intel® AVX, FMA and AVX2 - Provides optimization suggestions and common building blocks for applications that use Intel® Advanced Vector Extensions, FMA, and AVX2.
• Chapter 12: Intel Transactional Synchronization Extensions - Tuning recommendations to use lock elision techniques with Intel Transactional Synchronization Extensions to optimize multithreaded software with contended locks.
• Chapter 13: Power Optimization for Mobile Usages - Provides background on power saving techniques in mobile processors and makes recommendations that developers can leverage to provide longer battery life.
• Chapter 14: Intel® Atom™ Microarchitecture and Software Optimization - Describes the microarchitecture of processor families based on Intel Atom microarchitecture, and software optimization techniques targeting Intel Atom microarchitecture.
• Chapter 15: Silvermont Microarchitecture and Software Optimization - Describes the microarchitecture of processor families based on the Silvermont microarchitecture, and software optimization techniques targeting Intel processors based on the Silvermont microarchitecture.
• Appendix A: Application Performance Tools - Introduces tools for analyzing and enhancing application performance without having to write assembly code.
• Appendix B: Using Performance Monitoring Events - Provides information on the Top-Down Analysis Method and information on how to use performance events specific to the Intel Xeon processor 5500 series, processors based on Intel microarchitecture code name Sandy Bridge, and Intel Core Solo and Intel Core Duo processors.
• Appendix C: IA-32 Instruction Latency and Throughput - Provides latency and throughput data for the IA-32 instructions. Instruction timing data specific to recent processor families are provided.
1.3 RELATED INFORMATION
For more information on the Intel® architecture, techniques, and the processor architecture terminology, the following are of particular interest:
• Intel® 64 and IA-32 Architectures Software Developer's Manual
• Developing Multi-threaded Applications: A Platform Consistent Approach
• Intel® C++ Compiler documentation and online help
• Intel® Fortran Compiler documentation and online help
• Intel® VTune™ Amplifier documentation and online help
• Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor MP
More relevant links are:
• Developer Zone: https://software.intel.com/en-us/all-dev-areas
• Processor support general link: http://www.intel.com/support/processors/
• Intel Multi-Core Technology: https://software.intel.com/en-us/articles/multi-core-introduction
• Hyper-Threading Technology (HT Technology): http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html
• SSE4.1 Application Note: Motion Estimation with Intel® Streaming SIMD Extensions 4: https://software.intel.com/en-us/articles/motion-estimation-with-intel-streaming-simd-extensions-4-intel-sse4
• Intel® SSE4 Programming Reference: https://software.intel.com/sites/default/files/m/8/b/8/D9156103.pdf
• Intel® 64 Architecture Processor Topology Enumeration: https://software.intel.com/en-us/articles/intel-64-architecture-processor-topology-enumeration
• Multi-buffering techniques using SIMD extensions: http://www.intel.com/content/dam/www/public/us/en/documents/white-papers/communications-ia-multi-buffer-paper.pdf
• Parallel hashing using Multi-buffering techniques: http://www.scirp.org/journal/PaperInformation.aspx?paperID=23995 and http://eprint.iacr.org/2012/476.pdf
• AES Library of sample code: http://software.intel.com/en-us/articles/download-the-intel-aesni-sample-library/
• PCLMULQDQ resources: https://software.intel.com/en-us/articles/intel-carry-less-multiplication-instruction-and-its-usage-for-computing-the-gcm-mode
• Modular exponentiation using redundant representation and AVX2: http://rd.springer.com/chapter/10.1007%2F978-3-642-31662-3_9?LI=true
CHAPTER 2 INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES
This chapter gives an overview of features relevant to software optimization for current generations of Intel 64 and IA-32 processors (processors based on Intel® microarchitecture code name Skylake, Intel® microarchitecture code name Broadwell, Intel® microarchitecture code name Haswell, Intel microarchitecture code name Ivy Bridge, Intel microarchitecture code name Sandy Bridge, processors based on the Intel Core microarchitecture, Enhanced Intel Core microarchitecture, Intel microarchitecture code name Nehalem). These features are:
• Microarchitectures that enable executing instructions with high throughput at high clock rates, a high speed cache hierarchy and high speed system bus.
• Multicore architecture available across Intel Core processor and Intel Xeon processor families.
• Hyper-Threading Technology¹ (HT Technology) support.
• Intel 64 architecture on Intel 64 processors.
• SIMD instruction extensions: MMX technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions 3 (SSE3), Supplemental Streaming SIMD Extensions 3 (SSSE3), SSE4.1, and SSE4.2.
• Intel® Advanced Vector Extensions (Intel® AVX).
• Half-precision floating-point conversion and RDRAND.
• Fused Multiply Add Extensions.
• Intel® Advanced Vector Extensions 2 (Intel® AVX2).
• ADX and RDSEED.
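Software usually has to discover at run time which of the instruction extensions listed above a given processor provides. The following is a minimal sketch of such a check using CPUID, assuming a GCC- or Clang-compatible compiler that ships the `<cpuid.h>` helper header; the struct and function names are illustrative and not part of any Intel API. It reports only what the processor advertises: before using AVX, FMA or AVX2, software must also confirm that the operating system saves the YMM state (the OSXSAVE/XGETBV check), which is omitted here for brevity.

```c
#include <cpuid.h>   /* GCC/Clang helper macros; an assumption about the toolchain */

/* Feature bits as advertised by CPUID (1 = present). Names are illustrative. */
struct simd_features {
    int fma, sse4_2, avx, f16c, rdrand;   /* CPUID.01H:ECX          */
    int avx2, rdseed, adx;                /* CPUID.(07H, ECX=0):EBX */
};

static struct simd_features query_simd_features(void)
{
    struct simd_features f = {0};
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    unsigned int max_leaf = __get_cpuid_max(0, 0);   /* highest basic leaf */

    if (max_leaf >= 1) {
        __cpuid(1, eax, ebx, ecx, edx);
        f.fma    = (ecx >> 12) & 1;
        f.sse4_2 = (ecx >> 20) & 1;
        f.avx    = (ecx >> 28) & 1;
        f.f16c   = (ecx >> 29) & 1;   /* half-precision conversion */
        f.rdrand = (ecx >> 30) & 1;
    }
    if (max_leaf >= 7) {
        __cpuid_count(7, 0, eax, ebx, ecx, edx);
        f.avx2   = (ebx >> 5) & 1;
        f.rdseed = (ebx >> 18) & 1;
        f.adx    = (ebx >> 19) & 1;
    }
    (void)eax; (void)edx;   /* unused outputs */
    return f;
}
```

A caller would typically run this once at start-up and select a code path (for example, an AVX2 kernel versus an SSE4.2 fallback) from the returned flags.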
The Intel Core 2, Intel Core 2 Extreme, Intel Core 2 Quad processor family, and Intel Xeon processor 3000, 3200, 5100, 5300, 7300 series are based on the high-performance and power-efficient Intel Core microarchitecture. Intel Xeon processor 3100, 3300, 5200, 5400, 7400 series, Intel Core 2 Extreme processor QX9600, QX9700 series, and Intel Core 2 Quad Q9000 series, Q8000 series are based on the Enhanced Intel Core microarchitecture. The Intel Core i7 processor is based on Intel microarchitecture code name Nehalem.
Intel® Xeon® processor 5600 series, Intel Xeon processor E7 and Intel Core i7, i5, i3 processors are based on Intel microarchitecture code name Westmere.
The Intel® Xeon® processor E5 family, Intel® Xeon® processor E3-1200 family, Intel® Xeon® processor E7-8800/4800/2800 product families, Intel® Core™ i7-3930K processor, and 2nd generation Intel® Core™ i7-2xxx, Intel® Core™ i5-2xxx, Intel® Core™ i3-2xxx processor series are based on the Intel® microarchitecture code name Sandy Bridge.
The Intel® Xeon® processor E3-1200 v2 product family and the 3rd generation Intel® Core™ processors are based on the Ivy Bridge microarchitecture and support Intel 64 architecture. The Intel® Xeon® processor E5 v2 and E7 v2 families are based on the Ivy Bridge-E microarchitecture and support Intel 64 architecture and multiple physical processor packages in a platform.
The Intel® Xeon® processor E3-1200 v3 product family and 4th generation Intel® Core™ processors are based on the Haswell microarchitecture and support Intel 64 architecture. The Intel® Xeon® processor E5 26xx v3 family is based on the Haswell-E microarchitecture and supports Intel 64 architecture and multiple physical processor packages in a platform.
Intel® Core™ M processors, 5th generation Intel Core processors and Intel Xeon processor E3-1200 v4 series are based on the Broadwell microarchitecture and support Intel 64 architecture.
The 6th generation Intel Core processors and the Intel Xeon processor E3-1500m v5 are based on the Skylake microarchitecture and support Intel 64 architecture.
1. Hyper-Threading Technology requires a computer system with an Intel processor supporting HT Technology and an HT Technology enabled chipset, BIOS and operating system. Performance varies depending on the hardware and software used.
2.1 THE SKYLAKE MICROARCHITECTURE
The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.
[Figure 2-1: block diagram. Front end: L1 Instruction Cache, BPU, Decoded ICache (DSB), MSROM and Legacy Decode Pipeline feeding the Instruction Decode Queue (IDQ, or micro-op queue). Back end: Allocate/Rename/Retire/Move Elimination/Zero Idiom, Scheduler, dispatch ports 0 through 7 (see Table 2-1), L1 Data Cache and unified L2 Cache.]
Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture
The Skylake microarchitecture offers the following enhancements:
• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.
• Improved front end throughput.
• Improved branch predictor.
• Improved divider throughput and latency.
• Lower power consumption.
• Improved SMT performance with Hyper-Threading Technology.
• Balanced floating-point ADD, MUL, FMA throughput and latency.
The microarchitecture supports flexible integration of multiple processor cores with a shared uncore subsystem consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.
2.1.1 The Front End
The front end in the Skylake microarchitecture provides the following improvements over previous generation microarchitectures:
• Legacy Decode Pipeline delivery of 5 uops per cycle to the IDQ compared to 4 uops in previous generations.
• The DSB delivers 6 uops per cycle to the IDQ compared to 4 uops in previous generations.
• The IDQ can hold 64 uops per logical processor vs. 28 uops per logical processor in previous generations when two sibling logical processors in the same core are active (2x64 vs. 2x28 per core). If only one logical processor is active in the core, the IDQ can hold 64 uops (64 vs. 56 uops in ST operation).
• The LSD in the IDQ can detect loops up to 64 uops per logical processor, irrespective of ST or SMT operation.
• Improved Branch Predictor.
2.1.2 The Out-of-Order Execution Engine
The out-of-order and execution engine changes in the Skylake microarchitecture include:
• Larger buffers enable deeper OOO execution compared to previous generations.
• Improved throughput and latency for divide/sqrt and approximate reciprocals.
• Identical latency and throughput for all operations running on FMA units.
• Longer PAUSE latency enables better power efficiency and better SMT performance and resource utilization.
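The PAUSE item above matters mainly to spin-wait loops. As a hedged illustration (not code from this manual), the sketch below polls a flag and issues PAUSE between polls through the `_mm_pause()` intrinsic, assuming a C11 compiler with `<stdatomic.h>` and `<immintrin.h>`. The function name is hypothetical. Because the PAUSE latency differs between microarchitectures, a loop like this should not be used as a timing source, and production code would normally add a bounded back-off or yield to the operating system after some number of polls.

```c
#include <immintrin.h>   /* _mm_pause */
#include <stdatomic.h>
#include <stdbool.h>

/* Spin until *flag becomes true, executing PAUSE between polls.
 * Returns how many polls observed the flag still clear. */
static unsigned long spin_until_set(atomic_bool *flag)
{
    unsigned long polls = 0;

    while (!atomic_load_explicit(flag, memory_order_acquire)) {
        _mm_pause();   /* hint to the core that this is a spin-wait loop */
        ++polls;
    }
    return polls;
}
```

When the sibling logical processor is doing useful work, each PAUSE gives it more of the shared execution resources, which is the SMT benefit the bullet refers to.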
Table 2-1 summarizes the OOO engine's capability to dispatch different types of operations to various ports.
Table 2-1. Dispatch Port and Execution Stacks of the Skylake Microarchitecture
Port | Execution resources
Port 0 | ALU, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, DIV, Branch2
Port 1 | ALU, Fast LEA, Vec ALU, Vec Shft, Vec Add, Vec Mul, FMA, Slow Int, Slow LEA
Port 2, 3 | LD, STA
Port 4 | STD
Port 5 | ALU, Fast LEA, Vec ALU, Vec Shuffle
Port 6 | ALU, Shft, Branch1
Port 7 | STA
Table 2-2 lists execution units and common representative instructions that rely on these units. Throughput improvements across the SSE, AVX and general-purpose instruction sets are related to the number of units for the respective operations, and the varieties of instructions that execute using a particular unit.
Table 2-2. Skylake Microarchitecture Execution Units and Representative Instructions (see note 1)
Execution Unit | # of Units | Instructions
ALU | 4 | add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa, (v)movap*, (v)movup*
SHFT | 2 | sal, shl, rol, adc, sarx, adcx, adox, etc.
Slow Int | 1 | mul, imul, bsr, rcl, shld, mulx, pdep, etc.
BM | 2 | andn, bextr, blsi, blsmsk, bzhi, etc.
Vec ALU | 3 | (v)pand, (v)por, (v)pxor, (v)movq, (v)movap*, (v)movup*, (v)andp*, (v)orp*, (v)paddb/w/d/q, (v)blendv*, (v)blendp*, (v)pblendd
Vec_Shft | 2 | (v)psllv*, (v)psrlv*, vector shift count in imm8
Vec Add | 2 | (v)addp*, (v)cmpp*, (v)max*, (v)min*, (v)padds*, (v)paddus*, (v)psign, (v)pabs, (v)pavgb, (v)pcmpeq*, (v)pmax, (v)cvtps2dq, (v)cvtdq2ps, (v)cvtsd2si, (v)cvtss2si
Shuffle | 1 | (v)shufp*, vperm*, (v)pack*, (v)unpck*, (v)punpck*, (v)pshuf*, (v)pslldq, (v)alignr, (v)pmovzx*, vbroadcast*, (v)psrldq, (v)pblendw
Vec Mul | 2 | (v)mul*, (v)pmul*, (v)pmadd*
SIMD Misc | 1 | STTNI, (v)pclmulqdq, (v)psadbw, vector shift count in xmm
FP Mov | 1 | (v)movsd/ss, (v)movd gpr
DIVIDE | 1 | divp*, divs*, vdiv*, sqrt*, vsqrt*, rcp*, vrcp*, rsqrt*, idiv
NOTES:
1. Execution unit mapping to MMX instructions is not covered in this table. See Section 11.16.5 on MMX instruction throughput remedy.
A significant portion of the SSE, AVX and general-purpose instructions also have latency improvements. Appendix C lists the specific details. Software-visible latency exposure of an instruction sometimes may include additional contributions that depend on the relationship between the micro-op flows of the producer instruction and the micro-op flows of the ensuing consumer instruction. For example, a two-uop instruction like VPMULLD may experience two cumulative bypass delays of 1 cycle each from each of the two micro-ops of VPMULLD.
Table 2-3 describes the bypass delay in cycles between a producer uop and the consumer uop. The leftmost column lists a variety of situations characteristic of the producer micro-op. The top row lists a variety of situations characteristic of the consumer micro-op.
Table 2-3. Bypass Delay Between Producer and Consumer Micro-ops
Producer (rows) / Consumer (columns) | SIMD/0,1/1 | FMA/0,1/4 | VIMUL/0,1/4 | SIMD/5/1,3 | SHUF/5/1,3 | V2I/0/3 | I2V/5/1
SIMD/0,1/1 | 0 | 1 | 1 | 0 | 0 | 0 | NA
FMA/0,1/4 | 1 | 0 | 1 | 0 | 0 | 0 | NA
VIMUL/0,1/4 | 1 | 0 | 1 | 0 | 0 | 0 | NA
SIMD/5/1,3 | 0 | 1 | 1 | 0 | 0 | 0 | NA
SHUF/5/1,3 | 0 | 0 | 1 | 0 | 0 | 0 | NA
V2I/0/3 | NA | NA | NA | NA | NA | NA | NA
I2V/5/1 | 0 | 0 | 1 | 0 | 0 | 0 | NA
Each producer or consumer micro-op in the table is identified by a triplet: abbreviation / one or more port numbers / latency of the uop in cycles. For example:
• "SIMD/0,1/1" applies to a 1-cycle vector SIMD uop dispatched to either port 0 or port 1.
• "VIMUL/0,1/4" applies to a 4-cycle vector integer multiply uop dispatched to either port 0 or port 1.
• "SIMD/5/1,3" applies to either a 1-cycle or a 3-cycle non-shuffle uop dispatched to port 5.
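One way to work with Table 2-3 when estimating the latency of a dependency chain is to encode it as a lookup table. The sketch below is illustrative only: the enum and function names are invented for this example, the row and column order follows the table, and NA entries are represented as -1.

```c
/* Producer/consumer classes in the row/column order of Table 2-3.
 * Identifier names are illustrative, not Intel terminology. */
enum uop_class {
    SIMD_P01_1,    /* SIMD/0,1/1  */
    FMA_P01_4,     /* FMA/0,1/4   */
    VIMUL_P01_4,   /* VIMUL/0,1/4 */
    SIMD_P5_1_3,   /* SIMD/5/1,3  */
    SHUF_P5_1_3,   /* SHUF/5/1,3  */
    V2I_P0_3,      /* V2I/0/3     */
    I2V_P5_1,      /* I2V/5/1     */
    UOP_CLASS_COUNT
};

#define BYPASS_NA (-1)   /* table entry is "NA" */

/* bypass_delay[producer][consumer], in cycles. */
static const int bypass_delay[UOP_CLASS_COUNT][UOP_CLASS_COUNT] = {
    /* SIMD/0,1/1  */ { 0, 1, 1, 0, 0, 0, BYPASS_NA },
    /* FMA/0,1/4   */ { 1, 0, 1, 0, 0, 0, BYPASS_NA },
    /* VIMUL/0,1/4 */ { 1, 0, 1, 0, 0, 0, BYPASS_NA },
    /* SIMD/5/1,3  */ { 0, 1, 1, 0, 0, 0, BYPASS_NA },
    /* SHUF/5/1,3  */ { 0, 0, 1, 0, 0, 0, BYPASS_NA },
    /* V2I/0/3     */ { BYPASS_NA, BYPASS_NA, BYPASS_NA, BYPASS_NA,
                        BYPASS_NA, BYPASS_NA, BYPASS_NA },
    /* I2V/5/1     */ { 0, 0, 1, 0, 0, 0, BYPASS_NA },
};

/* Producer latency plus bypass delay into the consumer, or -1 for NA. */
static int exposed_latency(int producer_latency_cycles,
                           enum uop_class producer, enum uop_class consumer)
{
    int delay = bypass_delay[producer][consumer];
    return delay == BYPASS_NA ? -1 : producer_latency_cycles + delay;
}
```

For instance, a 4-cycle FMA uop feeding another FMA uop sees no added delay, while a 1-cycle SIMD uop on port 0 or 1 feeding an FMA uop is exposed to one extra cycle.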
2.1.3 Cache and Memory Subsystem
The cache hierarchy of the Skylake microarchitecture has the following enhancements:
• Higher cache bandwidth compared to previous generations.
• Simultaneous handling of more loads and stores enabled by enlarged buffers.
• Processor can do two page walks in parallel compared to one in Haswell microarchitecture and earlier generations.
• Page split load penalty down from 100 cycles in previous generation to 5 cycles.
• L3 write bandwidth increased from 4 cycles per line in previous generation to 2 per line.
• Support for the CLFLUSHOPT instruction to flush cache lines and manage memory ordering of flushed data using SFENCE.
• Reduced performance penalty for a software prefetch that specifies a NULL pointer.
• L2 associativity changed from 8 ways to 4 ways.
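The CLFLUSHOPT item above can be sketched as follows. This is a minimal example under stated assumptions, not a reference implementation: it assumes a GCC- or Clang-compatible compiler (for `<cpuid.h>` and the `target` function attribute), a 64-byte cache line as listed in Table 2-4, and hypothetical function names. It checks the CLFLUSHOPT feature bit (CPUID leaf 07H, EBX bit 23), falls back to CLFLUSH when the bit is clear, and issues SFENCE after the flushes to order them with later stores.

```c
#include <cpuid.h>       /* GCC/Clang helper macros (toolchain assumption) */
#include <immintrin.h>   /* _mm_clflushopt, _mm_clflush, _mm_sfence */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* line size from Table 2-4 */

static int cpu_has_clflushopt(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (__get_cpuid_max(0, 0) < 7)
        return 0;
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    (void)eax; (void)ecx; (void)edx;
    return (ebx >> 23) & 1;   /* CPUID.(EAX=07H,ECX=0):EBX[23] */
}

/* Compiled with CLFLUSHOPT enabled for this function only. */
__attribute__((target("clflushopt")))
static void flush_lines_clflushopt(uintptr_t first, uintptr_t last)
{
    for (uintptr_t line = first; line <= last; line += CACHE_LINE_BYTES)
        _mm_clflushopt((void *)line);
}

static void flush_lines_clflush(uintptr_t first, uintptr_t last)
{
    for (uintptr_t line = first; line <= last; line += CACHE_LINE_BYTES)
        _mm_clflush((const void *)line);
}

/* Flush every cache line overlapping [addr, addr + len). */
static void flush_range(const void *addr, size_t len)
{
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_BYTES - 1);
    uintptr_t first, last;

    if (len == 0)
        return;
    first = (uintptr_t)addr & mask;
    last = ((uintptr_t)addr + len - 1) & mask;

    if (cpu_has_clflushopt())
        flush_lines_clflushopt(first, last);
    else
        flush_lines_clflush(first, last);
    _mm_sfence();   /* order the flushes before subsequent stores */
}
```

Flushing writes modified lines back to memory and invalidates them; the data itself remains readable afterwards. In real code the CPUID result would be cached rather than queried on every call.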
Table 2-4. Cache Parameters of the Skylake Microarchitecture
Level | Capacity / Associativity | Line Size (bytes) | Fastest Latency (note 1) | Peak Bandwidth (bytes/cyc) | Sustained Bandwidth (bytes/cyc) | Update Policy
First Level Data | 32 KB / 8 | 64 | 4 cycle | 96 (2x32B Load + 1x32B Store) | ~81 | Writeback
Instruction | 32 KB / 8 | 64 | N/A | N/A | N/A | N/A
Second Level | 256 KB / 4 | 64 | 12 cycle | 64 | ~29 | Writeback
Third Level (Shared L3) | Up to 2 MB per core / Up to 16 ways | 64 | 44 | 32 | ~18 | Writeback
NOTES:
1. Software-visible latency will vary depending on access patterns and other factors.
The TLB hierarchy consists of a dedicated level one TLB for the instruction cache, a TLB for L1D, plus a unified TLB for L2. The partition column of Table 2-5 indicates the resource sharing policy when Hyper-Threading Technology is active.
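As a rough illustration of what the entry counts in Table 2-5 imply, the sketch below computes "TLB reach", the amount of memory a TLB can map at once, as entries multiplied by page size. The helper name and constants are hypothetical; the entry counts in the usage note are the ones listed in Table 2-5.

```c
#include <stdint.h>

#define PAGE_4KB (UINT64_C(4) << 10)
#define PAGE_2MB (UINT64_C(2) << 20)
#define PAGE_1GB (UINT64_C(1) << 30)

/* Bytes of address space covered when every entry maps one page. */
static uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_bytes)
{
    return entries * page_bytes;
}
```

With the Skylake figures, the 64-entry first level data TLB covers 64 x 4 KB = 256 KB with 4 KB pages, and the 1536-entry second level TLB covers 6 MB with 4 KB pages. Working sets larger than this reach tend to incur page walks, which is one reason large pages can help data-intensive code.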
Table 2-5. TLB Parameters of the Skylake Microarchitecture
Level | Page Size | Entries | Associativity | Partition
Instruction | 4KB | 128 | 8 ways | dynamic
Instruction | 2MB/4MB | 8 per thread | | fixed
First Level Data | 4KB | 64 | 4 | fixed
First Level Data | 2MB/4MB | 32 | 4 | fixed
First Level Data | 1GB | 4 | 4 | fixed
Second Level | Shared by 4KB and 2/4MB pages | 1536 | 12 | fixed
Second Level | 1GB | 16 | 4 | fixed
2.2 THE HASWELL MICROARCHITECTURE
The Haswell microarchitecture builds on the successes of the Sandy Bridge and Ivy Bridge microarchitectures. The basic pipeline functionality of the Haswell microarchitecture is depicted in Figure 2-2. In general, most of the features described in Section 2.2.1 - Section 2.2.4 also apply to the Broadwell microarchitecture. Enhancements of the Broadwell microarchitecture are summarized in Section 2.2.6.
[Figure 2-2: block diagram. Front end: L1 Instruction Cache, Pre-Decode, Instruction Queue, Decoder, MSROM, BPU and Uop Cache (DSB) feeding the IDQ. Back end: Load Buffers, Store Buffers and Reorder Buffers; Allocate/Rename/Retire/Move Elimination/Zero Idiom; Scheduler; dispatch ports 0 through 7 (see Table 2-6); Memory Control, L1 Data Cache, Line Fill Buffers and unified L2 Cache.]
Figure 2-2. CPU Core Pipeline Functionality of the Haswell Microarchitecture
The Haswell microarchitecture offers the following innovative features:
• Support for Intel Advanced Vector Extensions 2 (Intel AVX2), FMA.
• Support for general-purpose, new instructions to accelerate integer numeric encryption.
• Support for Intel® Transactional Synchronization Extensions (Intel® TSX).
• Each core can dispatch up to 8 micro-ops per cycle.
• 256-bit data path for memory operation, FMA, AVX floating-point and AVX2 integer execution units.
• Improved L1D and L2 cache bandwidth.
• Two FMA execution pipelines.
• Four arithmetic logical units (ALUs).
• Three store address ports.
• Two branch execution units.
• Advanced power management features for IA processor core and uncore sub-systems.
• Support for optional fourth level cache.
The microarchitecture supports flexible integration of multiple processor cores with a shared uncore subsystem consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. An example of the system integration view of four CPU cores with uncore components is illustrated in Figure 2-3.
[Figure 2-3: block diagram. A System Agent (display engine, PCIe, DMI, PEG, PCIe bridge, and IMC connected to DRAM) sits alongside four CPU cores, each paired with an L3 slice, and a Processor Graphics/Media Engine block. The legend distinguishes uncore components from CPU cores.]
Figure 2-3. Four Core System Integration of the Haswell Microarchitecture
2.2.1 The Front End
The front end of Intel microarchitecture code name Haswell builds on that of Intel microarchitecture code name Sandy Bridge and Intel microarchitecture code name Ivy Bridge; see Section 2.3.2 and Section 2.3.7. Additional enhancements in the front end include:
• The uop cache (or decoded ICache) is partitioned equally between two logical processors.
• The instruction decoders will alternate between each active logical processor. If one sibling logical processor is idle, the active logical processor will use the decoders continuously.
• The LSD in the micro-op queue (or IDQ) can detect small loops up to 56 micro-ops. The 56-entry micro-op queue is shared by two logical processors if Hyper-Threading Technology is active (Intel microarchitecture Sandy Bridge provides a duplicated 28-entry micro-op queue in each core).
2.2.2 The Out-of-Order Engine
The key components and significant improvements to the out-of-order engine are summarized below:
Renamer: The Renamer moves micro-ops from the micro-op queue to bind to the dispatch ports in the Scheduler with execution resources. Zero-idiom, one-idiom and zero-latency register move operations are performed by the Renamer to free up the Scheduler and execution core for improved performance.
Scheduler: The Scheduler controls the dispatch of micro-ops onto the dispatch ports. There are eight dispatch ports to support the out-of-order execution core. Four of the eight ports provide execution resources for computational operations. The other four ports support memory operations of up to two 256-bit loads and one 256-bit store in a cycle.
Execution Core: The scheduler can dispatch up to eight micro-ops every cycle, one on each port. Each of the four ports providing computational resources provides an ALU, and two of these execution pipes provide dedicated FMA units. With the exception of division/square-root and STTNI/AESNI units, most floating-point and integer SIMD execution units are 256 bits wide. The four dispatch ports servicing memory operations consist of two dual-use ports for load and store-address operations, plus a dedicated third store-address port and one dedicated store-data port. All memory ports can handle 256-bit memory micro-ops. Peak floating-point throughput, at 32 single-precision operations per cycle and 16 double-precision operations per cycle using FMA, is twice that of Intel microarchitecture code name Sandy Bridge.
The out-of-order engine can handle 192 uops in flight compared to 168 in Intel microarchitecture code name Sandy Bridge.
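The peak figures quoted above follow from simple arithmetic: two FMA units, each operating on a 256-bit vector, with every lane performing one multiply and one add per FMA. The helper below is a hypothetical sketch of that calculation, not part of any Intel tool.

```c
/* Peak floating-point operations per cycle for a core whose FMA units
 * each process one full vector per cycle. An FMA counts as two
 * operations (multiply and add) per lane. */
static unsigned peak_flops_per_cycle(unsigned fma_units,
                                     unsigned vector_bits,
                                     unsigned element_bits)
{
    unsigned lanes = vector_bits / element_bits;
    return fma_units * lanes * 2u;
}
```

For the Haswell microarchitecture this gives 2 x (256 / 32) x 2 = 32 single-precision and 2 x (256 / 64) x 2 = 16 double-precision operations per cycle, matching the text. These are upper bounds: reaching them requires enough independent FMA work to keep both pipelines busy every cycle.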
2.2.3 Execution Engine
Table 2-6 summarizes which operations can be dispatched on which port.
Table 2-6. Dispatch Port and Execution Stacks of the Haswell Microarchitecture
Port | Execution resources
Port 0 | ALU, Shift, SIMD_Log, SIMD misc, SIMD_Shifts, FMA/FP_mul, Divide, 2nd_Jeu
Port 1 | ALU, Fast LEA, BM, SIMD_ALU, SIMD_Log, FMA/FP_mul, FP_add, slow_int
Port 2, 3 | Load_Addr, Store_addr
Port 4 | Store_data
Port 5 | ALU, Fast LEA, BM, SIMD_ALU, SIMD_Log, Shuffle, FP mov, AES
Port 6 | ALU, Shift, JEU
Port 7 | Store_addr, Simple_AGU
Table 2-7 lists execution units and common representative instructions that rely on these units. Table 2-7 also includes some instructions that are available only on processors based on the Broadwell microarchitecture.
Table 2-7. Haswell Microarchitecture Execution Units and Representative Instructions
Execution Unit | # of Ports | Instructions
ALU | 4 | add, and, cmp, or, test, xor, movzx, movsx, mov, (v)movdqu, (v)movdqa
SHFT | 2 | sal, shl, rol, adc, sarx, (adcx, adox) (see note 1), etc.
Slow Int | 1 | mul, imul, bsr, rcl, shld, mulx, pdep, etc.
BM | 2 | andn, bextr, blsi, blsmsk, bzhi, etc.
SIMD Log | 3 | (v)pand, (v)por, (v)pxor, (v)movq, (v)blendp*, vpblendd
SIMD_Shft | 1 | (v)psl*, (v)psr*
SIMD ALU | 2 | (v)padd*, (v)psign, (v)pabs, (v)pavgb, (v)pcmpeq*, (v)pmax, (v)pcmpgt*
Shuffle | 1 | (v)shufp*, vperm*, (v)pack*, (v)unpck*, (v)punpck*, (v)pshuf*, (v)pslldq, (v)alignr, (v)pmovzx*, vbroadcast*, (v)pblendw
SIMD Misc | 1 | (v)pmul*, (v)pmadd*, STTNI, (v)pclmulqdq, (v)psadbw, (v)pcmpgtq, vpsllvd, (v)blendv*, (v)pblendw
FP Add | 1 | (v)addp*, (v)cmpp*, (v)max*, (v)min*
FP Mov | 1 | (v)movap*, (v)movup*, (v)movsd/ss, (v)movd gpr, (v)andp*, (v)orp*
DIVIDE | 1 | divp*, divs*, vdiv*, sqrt*, vsqrt*, rcp*, vrcp*, rsqrt*, idiv
NOTES:
1. Only available in processors that are based on the Broadwell microarchitecture and support the CPUID ADX feature flag.
The reservation station (RS) is expanded to 60 entries deep (compared to 54 entries in Intel microarchitecture code name Sandy Bridge). It can dispatch up to eight micro-ops in one cycle if the micro-ops are ready to execute. The RS dispatches a micro-op through an issue port to a specific execution cluster, arranged in several stacks to handle specific data types or granularity of data.
When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases the data transition is done using a micro-op that is added to the instruction flow. Table 2-29 describes how data, written back after execution, can bypass to micro-op execution in the following cycles.
Table 2-8. Bypass Delay Between Producer and Consumer Micro-ops (cycles)

| From/To | INT | SSE-INT/AVX-INT | SSE-FP/AVX-FP_LOW | X87/AVX-FP_High |
|---|---|---|---|---|
| INT | | micro-op (port 1) | micro-op (port 1) | micro-op (port 1) + 3 cycle delay |
| SSE-INT/AVX-INT | micro-op (port 5), micro-op (port 6) + 1 cycle | | 1 cycle delay | |
| SSE-FP/AVX-FP_LOW | micro-op (port 5), micro-op (port 6) + 1 cycle | 1 cycle delay | | micro-op (port 5) + 1 cycle delay |
| X87/AVX-FP_High | micro-op (port 5) + 3 cycle delay | | micro-op (port 5) + 1 cycle delay | |
| Load | | 1 cycle delay | 1 cycle delay | 2 cycle delay |

2.2.4
Cache and Memory Subsystem
The cache hierarchy is similar to prior generations, including an instruction cache, a first-level data cache and a second-level unified cache in each core, and a third-level unified cache with size dependent on specific product configuration. The third-level cache is organized as multiple cache slices connected by a ring interconnect; the size of each slice may depend on product configuration. The exact details of the cache topology are reported by CPUID leaf 4. The third-level cache resides in the "uncore" sub-system that is shared by all the processor cores. In some product configurations, a fourth-level cache is also supported. Table 2-9 provides more details of the cache hierarchy.
Table 2-9. Cache Parameters of the Haswell Microarchitecture

| Level | Capacity/Associativity | Line Size (bytes) | Fastest Latency¹ | Throughput (clocks) | Peak Bandwidth (bytes/cyc) | Update Policy |
|---|---|---|---|---|---|---|
| First Level Data | 32 KB/8 | 64 | 4 cycle | 0.5² | 64 (Load) + 32 (Store) | Writeback |
| Instruction | 32 KB/8 | 64 | N/A | N/A | N/A | N/A |
| Second Level | 256 KB/8 | 64 | 11 cycle | Varies | 64 | Writeback |
| Third Level (Shared L3) | Varies | 64 | ~34 | Varies | | Writeback |

NOTES:
1. Software-visible latency will vary depending on access patterns and other factors. L3 latency can vary due to clock ratios between the processor core and uncore.
2. First level data cache supports two load micro-ops each cycle; each micro-op can fetch up to 32 bytes of data.
The TLB hierarchy consists of a dedicated level-one TLB for the instruction cache, a TLB for the L1D, plus a unified TLB for the L2.
Table 2-10. TLB Parameters of the Haswell Microarchitecture

| Level | Page Size | Entries | Associativity | Partition |
|---|---|---|---|---|
| Instruction | 4KB | 128 | 4 ways | dynamic |
| Instruction | 2MB/4MB | 8 per thread | | fixed |
| First Level Data | 4KB | 64 | 4 | fixed |
| First Level Data | 2MB/4MB | 32 | 4 | fixed |
| First Level Data | 1GB | 4 | 4 | fixed |
| Second Level | Shared by 4KB and 2/4MB pages | 1024 | 8 | fixed |

2.2.4.1
Load and Store Operation Enhancements
The L1 data cache can handle two 256-bit load operations and one 256-bit store operation each cycle. The unified L2 can service one cache line (64 bytes) each cycle. Additionally, there are 72 load buffers and 42 store buffers available to support micro-op execution in flight.
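The peak-bandwidth figures in Table 2-9 follow directly from these port counts and access widths. The sketch below only reproduces that arithmetic; the function and constant names are illustrative assumptions and do not correspond to any Intel tool or API.

```python
# Figures taken from the text: two 256-bit loads and one 256-bit store per cycle.
LOAD_OPS_PER_CYCLE = 2
STORE_OPS_PER_CYCLE = 1
ACCESS_WIDTH_BITS = 256


def l1d_peak_bytes_per_cycle(load_ops=LOAD_OPS_PER_CYCLE,
                             store_ops=STORE_OPS_PER_CYCLE,
                             width_bits=ACCESS_WIDTH_BITS):
    """Return (load_bytes, store_bytes) the L1D can move per cycle at peak."""
    bytes_per_access = width_bits // 8
    return load_ops * bytes_per_access, store_ops * bytes_per_access


load_bw, store_bw = l1d_peak_bytes_per_cycle()
print(f"L1D peak: {load_bw} bytes/cycle load + {store_bw} bytes/cycle store")
```

These are upper bounds; sustained bandwidth depends on access patterns, alignment, and buffer occupancy.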
2.2.5
The Haswell-E Microarchitecture
Intel processors based on the Haswell-E microarchitecture comprise the same processor cores as described in the Haswell microarchitecture, but provide more advanced uncore and integrated I/O capabilities. Processors based on the Haswell-E microarchitecture support platforms with multiple sockets. The Haswell-E microarchitecture supports versatile processor architectures and platform configurations for scalability and high performance. Some of the capabilities provided by the uncore and integrated I/O subsystem of the Haswell-E microarchitecture include:
• Support for multiple Intel QPI interconnects in multi-socket configurations.
• Up to two integrated memory controllers per physical processor.
• Up to 40 lanes of PCI Express* 3.0 links per physical processor.
• Up to 18 processor cores connected by two ring interconnects to the L3 in each physical processor.
An example of a possible 12-core processor implementation using the Haswell-E microarchitecture is illustrated in Figure 2-4. The capabilities of the uncore and integrated I/O sub-system vary across the processor family implementing the Haswell-E microarchitecture. For details, please consult the data sheets of respective Intel Xeon E5 v3 processors.
[Figure: block diagram. Legend: Uncore, CPU Core. Components shown: Integrated I/O (PCIe, QPI links), two Sboxes, twelve Core/L3 Slice pairs, and two Home Agent/Memory Controller blocks, each attached to DRAM.]
Figure 2-4. An Example of the Haswell-E Microarchitecture Supporting 12 Processor Cores
2.2.6
The Broadwell Microarchitecture
Intel Core M processors are based on the Broadwell microarchitecture. The Broadwell microarchitecture builds from the Haswell microarchitecture and provides several enhancements. This section covers enhanced features of the Broadwell microarchitecture.
• Floating-point multiply instruction latency is improved from 5 cycles in the prior generation to 3 cycles in the Broadwell microarchitecture. This applies to the AVX, SSE and FP instruction sets.
• The throughput of gather instructions has been improved significantly; see Table C-5.
• The PCLMULQDQ instruction implementation is a single uop in the Broadwell microarchitecture, with improved latency and throughput.
The TLB hierarchy consists of a dedicated level-one TLB for the instruction cache, a TLB for the L1D, plus a unified TLB for the L2.
Table 2-11. TLB Parameters of the Broadwell Microarchitecture

| Level | Page Size | Entries | Associativity | Partition |
|---|---|---|---|---|
| Instruction | 4KB | 128 | 4 ways | dynamic |
| Instruction | 2MB/4MB | 8 per thread | | fixed |
| First Level Data | 4KB | 64 | 4 | fixed |
| First Level Data | 2MB/4MB | 32 | 4 | fixed |
| First Level Data | 1GB | 4 | 4 | fixed |
| Second Level | Shared by 4KB and 2MB pages | 1536 | 6 | fixed |
| Second Level | 1GB pages | 16 | 4 | fixed |
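One way to read a TLB parameter table is in terms of "reach": the amount of address space that can be mapped without a TLB miss, computed as entries times page size. The sketch below applies that product to a few Broadwell entries from Table 2-11. The helper name `tlb_reach` is an illustrative assumption, and because the second-level entries are shared between 4KB and 2MB pages, the two second-level figures are alternative upper bounds rather than simultaneous capacities.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30


def tlb_reach(entries, page_size_bytes):
    """Upper bound on the address range covered when every entry is in use."""
    return entries * page_size_bytes


# Entry counts taken from Table 2-11 (Broadwell).
l1d_4k_reach = tlb_reach(64, 4 * KB)      # first-level data TLB, 4KB pages
stlb_4k_reach = tlb_reach(1536, 4 * KB)   # second level, all entries hold 4KB pages
stlb_2m_reach = tlb_reach(1536, 2 * MB)   # second level, all entries hold 2MB pages
stlb_1g_reach = tlb_reach(16, 1 * GB)     # second level, dedicated 1GB entries

print(l1d_4k_reach // KB, "KB;", stlb_4k_reach // MB, "MB;",
      stlb_2m_reach // GB, "GB;", stlb_1g_reach // GB, "GB")
```

A working set larger than the relevant reach will incur TLB misses regardless of cache behavior, which is one reason large pages can help data-intensive code.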
2.3
INTEL® MICROARCHITECTURE CODE NAME SANDY BRIDGE
Intel® microarchitecture code name Sandy Bridge builds on the successes of Intel® Core™ microarchitecture and Intel microarchitecture code name Nehalem. It offers the following innovative features:
• Intel Advanced Vector Extensions (Intel AVX)
  - 256-bit floating-point instruction set extensions to the 128-bit Intel Streaming SIMD Extensions, providing up to 2X performance benefits relative to 128-bit code.
  - Non-destructive destination encoding offers more flexible coding techniques.
  - Supports flexible migration and co-existence between 256-bit AVX code, 128-bit AVX code and legacy 128-bit SSE code.
• Enhanced front end and execution engine
  - New decoded ICache component that improves front end bandwidth and reduces branch misprediction penalty.
  - Advanced branch prediction.
  - Additional macro-fusion support.
  - Larger dynamic execution window.
  - Multi-precision integer arithmetic enhancements (ADC/SBB, MUL/IMUL).
  - LEA bandwidth improvement.
  - Reduction of general execution stalls (read ports, writeback conflicts, bypass latency, partial stalls).
  - Fast floating-point exception handling.
  - XSAVE/XRSTOR performance improvements and the new XSAVEOPT instruction.
• Cache hierarchy improvements for wider data path
  - Doubling of bandwidth enabled by two symmetric ports for memory operation.
  - Simultaneous handling of more in-flight loads and stores enabled by increased buffers.
  - Internal bandwidth of two loads and one store each cycle.
  - Improved prefetching.
  - High bandwidth low latency LLC architecture.
  - High bandwidth ring architecture of on-die interconnect.
• System-on-a-chip support
  - Integrated graphics and media engine in second generation Intel Core processors.
  - Integrated PCIE controller.
  - Integrated memory controller.
• Next generation Intel Turbo Boost Technology
  - Leverage TDP headroom to boost performance of CPU cores and integrated graphic unit.
2.3.1
Intel® Microarchitecture Code Name Sandy Bridge Pipeline Overview
Figure 2-5 depicts the pipeline and major components of a processor core that is based on Intel microarchitecture code name Sandy Bridge. The pipeline consists of:

• An in-order issue front end that fetches instructions and decodes them into micro-ops (micro-operations). The front end feeds the next pipeline stages with a continuous stream of micro-ops from the most likely path that the program will execute.
• An out-of-order, superscalar execution engine that dispatches up to six micro-ops to execution, per cycle. The allocate/rename block reorders micro-ops to "dataflow" order so they can execute as soon as their sources are ready and execution resources are available.
• An in-order retirement unit that ensures that the results of execution of the micro-ops, including any exceptions they may have encountered, are visible according to the original program order.
The flow of an instruction in the pipeline can be summarized in the following progression:

1. The Branch Prediction Unit chooses the next block of code to execute from the program. The processor searches for the code in the following resources, in this order:
   a. Decoded ICache.
   b. Instruction Cache, via activating the legacy decode pipeline.
   c. L2 cache, last level cache (LLC) and memory, as necessary.
[Figure: pipeline block diagram. Front end: 32K L1 Instruction Cache, Pre-decode, Instruction Queue, Decoders, Branch Predictor, 1.5K uOP Cache. Out-of-order engine: Load Buffers, Store Buffers, Reorder Buffers, Allocate/Rename/Retire (in-order to out-of-order boundary), Scheduler. Execution ports: Port 0 (ALU, V-Mul, V-Shuffle, Fdiv, 256-FP MUL, 256-FP Blend), Port 1 (ALU, V-Add, V-Shuffle, 256-FP Add), Port 5 (ALU, JMP, 256-FP Shuf, 256-FP Bool, 256-FP Blend), Ports 2 and 3 (Load, StAddr), Port 4 (STD). Memory: Memory Control (48 bytes/cycle), Line Fill Buffers, 32K L1 Data Cache, 256K L2 Cache (Unified).]
Figure 2-5. Intel Microarchitecture Code Name Sandy Bridge Pipeline Functionality
2. The micro-ops corresponding to this code are sent to the Rename/retirement block. They enter into the scheduler in program order, but execute and are de-allocated from the scheduler according to data-flow order. For simultaneously ready micro-ops, FIFO ordering is nearly always maintained. Micro-ops are executed using execution resources arranged in three stacks. The execution units in each stack are associated with the data type of the instruction. Branch mispredictions are signaled at branch execution. A misprediction re-steers the front end, which then delivers micro-ops from the correct path. The processor can overlap work preceding the branch misprediction with work from the following corrected path.
3. Memory operations are managed and reordered to achieve parallelism and maximum performance. Misses to the L1 data cache go to the L2 cache. The data cache is non-blocking and can handle multiple simultaneous misses.
4. Exceptions (Faults, Traps) are signaled at retirement (or attempted retirement) of the faulting instruction.

Each processor core based on Intel microarchitecture code name Sandy Bridge can support two logical processors if Intel Hyper-Threading Technology is enabled.
2.3.2
The Front End
This section describes the key characteristics of the front end. Table 2-12 lists the components of the front end, their functions, and the problems they address.
Table 2-12. Components of the Front End of Intel Microarchitecture Code Name Sandy Bridge

| Component | Functions | Performance Challenges |
|---|---|---|
| Instruction Cache | 32-Kbyte backing store of instruction bytes | Fast access to hot code instruction bytes |
| Legacy Decode Pipeline | Decode instructions to micro-ops, delivered to the micro-op queue and the Decoded ICache. | Provides the same decode latency and bandwidth as prior Intel processors. Decoded ICache warm-up |
| Decoded ICache | Provide stream of micro-ops to the micro-op queue. | Provides higher micro-op bandwidth at lower latency and lower power than the legacy decode pipeline |
| MSROM | Complex instruction micro-op flow store, accessible from both Legacy Decode Pipeline and Decoded ICache | |
| Branch Prediction Unit (BPU) | Determine next block of code to be executed and drive lookup of Decoded ICache and legacy decode pipelines. | Improves performance and energy efficiency through reduced branch mispredictions. |
| Micro-op queue | Queues micro-ops from the Decoded ICache and the legacy decode pipeline. | Hide front end bubbles; provide execution micro-ops at a constant rate. |

2.3.2.1
Legacy Decode Pipeline
The Legacy Decode Pipeline comprises the instruction translation lookaside buffer (ITLB), the instruction cache (ICache), instruction predecode, and instruction decode units.

Instruction Cache and ITLB

An instruction fetch is a 16-byte aligned lookup through the ITLB and into the instruction cache. The instruction cache can deliver 16 bytes every cycle to the instruction pre-decoder. Table 2-13 compares the ICache and ITLB with the prior generation.
Table 2-13. ICache and ITLB of Intel Microarchitecture Code Name Sandy Bridge

| Component | Intel microarchitecture code name Sandy Bridge | Intel microarchitecture code name Nehalem |
|---|---|---|
| ICache Size | 32-Kbyte | 32-Kbyte |
| ICache Ways | 8 | 4 |
| ITLB 4K page entries | 128 | 128 |
| ITLB large page (2M or 4M) entries | 8 | 7 |
Upon an ITLB miss there is a lookup to the second-level TLB (STLB) that is common to the DTLB and the ITLB. The penalty of an ITLB miss and an STLB hit is seven cycles.

Instruction PreDecode

The predecode unit accepts the 16 bytes from the instruction cache and determines the length of the instructions.

The following length changing prefixes (LCPs) imply an instruction length that is different from the default length of instructions. Therefore they cause an additional penalty of three cycles per LCP during length decoding. Previous processors incur a six-cycle penalty for each 16-byte chunk that has one or more LCPs in it. Since usually there is no more than one LCP in a 16-byte chunk, in most cases, Intel microarchitecture code name Sandy Bridge introduces an improvement over previous processors.

• Operand Size Override (66H) preceding an instruction with a word/double immediate data. This prefix might appear when the code uses 16-bit data types, unicode processing, and image processing.
• Address Size Override (67H) preceding an instruction with a modr/m in real, big real, 16-bit protected or 32-bit protected modes. This prefix may appear in boot code sequences.

The REX prefix (4xh) in the Intel® 64 instruction set can change the size of two classes of instructions: MOV offset and MOV immediate. Despite this capability, it does not cause an LCP penalty and hence is not considered an LCP.
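The two LCP penalty rules above can be compared with a small model. This is a minimal sketch under stated assumptions: the input is a list giving the number of LCP-carrying instructions in each 16-byte fetch chunk, the function names are hypothetical, and only the penalty cycles quoted in the text are counted (no other front-end effects).

```python
def lcp_penalty_sandy_bridge(lcps_per_chunk):
    """Three penalty cycles for every LCP, as described for Sandy Bridge."""
    return 3 * sum(lcps_per_chunk)


def lcp_penalty_previous(lcps_per_chunk):
    """Six penalty cycles for each 16-byte chunk holding one or more LCPs."""
    return 6 * sum(1 for count in lcps_per_chunk if count > 0)


# Common case: at most one LCP per chunk, so the newer rule costs half as much.
typical = [1, 0, 1, 0]
print(lcp_penalty_sandy_bridge(typical), lcp_penalty_previous(typical))

# Uncommon case: three LCPs packed into one chunk favor the older rule.
dense = [3]
print(lcp_penalty_sandy_bridge(dense), lcp_penalty_previous(dense))
```

The second case shows why the text says "in most cases": the per-LCP rule only loses when three or more LCPs share a single 16-byte chunk.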
Instruction Decode

There are four decoding units that decode instructions into micro-ops. The first can decode all IA-32 and Intel 64 instructions up to four micro-ops in size. The remaining three decoding units handle single-micro-op instructions. All four decoding units support the common cases of single micro-op flows, including micro-fusion and macro-fusion.

Micro-ops emitted by the decoders are directed to the micro-op queue and to the Decoded ICache. Instructions longer than four micro-ops generate their micro-ops from the MSROM. The MSROM bandwidth is four micro-ops per cycle. Instructions whose micro-ops come from the MSROM can start from either the legacy decode pipeline or from the Decoded ICache.

Micro-Fusion

Micro-fusion fuses multiple micro-ops from the same instruction into a single complex micro-op. The complex micro-op is dispatched in the out-of-order execution core as many times as it would be if it were not micro-fused.

Micro-fusion enables you to use memory-to-register operations, also known as the complex instruction set computer (CISC) instruction set, to express the actual program operation without worrying about a loss of decode bandwidth. Micro-fusion improves instruction bandwidth delivered from decode to retirement and saves power.

Coding an instruction sequence by using single-uop instructions increases the code size, which can decrease fetch bandwidth from the legacy pipeline.

The following are examples of micro-fused micro-ops that can be handled by all decoders.
• All stores to memory, including store immediate. Stores execute internally as two separate functions, store-address and store-data.
• All instructions that combine load and computation operations (load+op), for example:
  • ADDPS XMM9, OWORD PTR [RSP+40]
  • FADD DOUBLE PTR [RDI+RSI*8]
  • XOR RAX, QWORD PTR [RBP+32]
• All instructions of the form "load and jump," for example:
  • JMP [RDI+200]
  • RET
• CMP and TEST with immediate operand and memory
An instruction with RIP-relative addressing is not micro-fused in the following cases:

• An additional immediate is needed, for example:
  • CMP [RIP+400], 27
  • MOV [RIP+3000], 142
• The instruction is a control flow instruction with an indirect target specified using RIP-relative addressing, for example:
  • JMP [RIP+5000000]
In these cases, an instruction that cannot be micro-fused will require decoder 0 to issue two micro-ops, resulting in a slight loss of decode bandwidth.

In 64-bit code, the usage of RIP-relative addressing is common for global data. Since there is no micro-fusion in these cases, performance may be reduced when porting 32-bit code to 64-bit code.

Macro-Fusion

Macro-fusion merges two instructions into a single micro-op. In Intel Core microarchitecture, this hardware optimization is limited to specific conditions on the first and second instructions of the macro-fusable pair:
• The first instruction of the macro-fused pair modifies the flags. The following instructions can be macro-fused:
  - In Intel microarchitecture code name Nehalem: CMP, TEST.
  - In Intel microarchitecture code name Sandy Bridge: CMP, TEST, ADD, SUB, AND, INC, DEC.
  - These instructions can fuse if:
    • The first source/destination operand is a register.
    • The second source operand (if it exists) is one of: immediate, register, or non RIP-relative memory.
• The second instruction of the macro-fusable pair is a conditional branch. Table 3-1 describes, for each instruction, what branches it can fuse with.
Macro-fusion does not happen if the first instruction ends on byte 63 of a cache line, and the second instruction is a conditional branch that starts at byte 0 of the next cache line.

Since these pairs are common in many types of applications, macro-fusion improves performance even on non-recompiled binaries. Each macro-fused instruction executes with a single dispatch. This reduces latency and frees execution resources. You also gain increased rename and retire bandwidth, increased virtual storage, and power savings from representing more work in fewer bits.
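The macro-fusion conditions listed above can be collected into a single predicate. The sketch below is a simplified model, not a description of the hardware: the `FirstInstruction` record, the operand-kind strings, and the function name are all assumptions made for illustration, and the per-instruction branch restrictions of Table 3-1 are deliberately not modelled, so a `True` result means only that the necessary conditions stated in this section hold.

```python
from dataclasses import dataclass
from typing import Optional

# Flag-modifying first instructions that may fuse, per microarchitecture (from the text).
FUSIBLE_FIRST = {
    "nehalem": {"CMP", "TEST"},
    "sandy_bridge": {"CMP", "TEST", "ADD", "SUB", "AND", "INC", "DEC"},
}

# Allowed kinds for the second source operand; "mem_rip" (RIP-relative) is excluded.
SECOND_OPERAND_OK = {None, "imm", "reg", "mem"}


@dataclass
class FirstInstruction:
    mnemonic: str
    first_operand: str             # "reg" or "mem"
    second_operand: Optional[str]  # None, "imm", "reg", "mem" or "mem_rip"
    last_byte_offset: int          # offset of its last byte within the 64-byte cache line


def may_macro_fuse(first, next_is_conditional_branch, uarch="sandy_bridge"):
    """Check the necessary macro-fusion conditions named in this section."""
    if not next_is_conditional_branch:
        return False
    if first.mnemonic.upper() not in FUSIBLE_FIRST[uarch]:
        return False
    if first.first_operand != "reg":
        return False
    if first.second_operand not in SECOND_OPERAND_OK:
        return False
    # If the first instruction ends on byte 63, the branch starts at byte 0
    # of the next cache line and the pair is not fused.
    if first.last_byte_offset == 63:
        return False
    return True


print(may_macro_fuse(FirstInstruction("CMP", "reg", "imm", 10), True))
```

For example, `ADD reg, reg` followed by a conditional branch passes this check for Sandy Bridge but not for Nehalem, matching the two instruction lists above.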
2.3.2.2
Decoded ICache
The Decoded ICache is essentially an accelerator of the legacy decode pipeline. By storing decoded instructions, the Decoded ICache enables the following features:
• Reduced latency on branch mispredictions.
• Increased micro-op delivery bandwidth to the out-of-order engine.
• Reduced front end power consumption.
The Decoded ICache caches the output of the instruction decoder. The next time the micro-ops are consumed for execution the decoded micro-ops are taken from the Decoded ICache. This enables skipping the fetch and decode stages for these micro-ops and reduces power and latency of the front end. The Decoded ICache provides average hit rates of above 80% of the micro-ops; furthermore, "hot spots" typically have hit rates close to 100%.

Typical integer programs average less than four bytes per instruction, and the front end is able to race ahead of the back end, filling in a large window for the scheduler to find instruction level parallelism. However, for high performance code with a basic block consisting of many instructions, for example, Intel SSE media algorithms or excessively unrolled loops, the 16 instruction bytes per cycle is occasionally a limitation. The 32-byte orientation of the Decoded ICache helps such code to avoid this limitation.

The Decoded ICache automatically improves performance of programs with temporal and spatial locality. However, to fully utilize the Decoded ICache potential, you might need to understand its internal organization.

The Decoded ICache consists of 32 sets. Each set contains eight Ways. Each Way can hold up to six micro-ops. The Decoded ICache can ideally hold up to 1536 micro-ops. The following are some of the rules governing how the Decoded ICache is filled with micro-ops:
• All micro-ops in a Way represent instructions which are statically contiguous in the code and have their EIPs within the same aligned 32-byte region.
• Up to three Ways may be dedicated to the same 32-byte aligned chunk, allowing a total of 18 micro-ops to be cached per 32-byte region of the original IA program.
• A multi micro-op instruction cannot be split across Ways.
• Up to two branches are allowed per Way.
• An instruction which turns on the MSROM consumes an entire Way.
• A non-conditional branch is the last micro-op in a Way.
• Micro-fused micro-ops (load+op and stores) are kept as one micro-op.
• A pair of macro-fused instructions is kept as one micro-op.
• Instructions with 64-bit immediate require two slots to hold the immediate.
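The capacity figures and two of the fill rules above (six micro-ops per Way, no instruction split across Ways, at most three Ways per 32-byte region) can be explored with a short model. This is a sketch under explicit assumptions: instructions are packed greedily in program order, which is a guess at the fill policy rather than a documented behavior, and the branch, MSROM, and 64-bit-immediate rules are ignored. All names are illustrative.

```python
SETS = 32
WAYS_PER_SET = 8
UOPS_PER_WAY = 6
MAX_WAYS_PER_REGION = 3  # per aligned 32-byte region of code


def ideal_capacity():
    """Total micro-op slots: 32 sets x 8 Ways x 6 micro-ops."""
    return SETS * WAYS_PER_SET * UOPS_PER_WAY


def ways_needed(uops_per_instruction):
    """Greedy in-order packing; an instruction never straddles two Ways."""
    ways = 0
    used = UOPS_PER_WAY  # forces a new Way to be opened for the first instruction
    for uops in uops_per_instruction:
        if uops > UOPS_PER_WAY:
            raise ValueError("instruction does not fit in a single Way")
        if used + uops > UOPS_PER_WAY:
            ways += 1
            used = 0
        used += uops
    return ways


def region_fits(uops_per_instruction):
    """True if one 32-byte region can be cached within three Ways."""
    return ways_needed(uops_per_instruction) <= MAX_WAYS_PER_REGION


print(ideal_capacity(), region_fits([1] * 18), region_fits([4, 4, 4, 4]))
```

The last call illustrates fragmentation: four 4-micro-op instructions total only 16 micro-ops, yet under this model each needs its own Way, so the region exceeds the three-Way limit and would be delivered from the legacy decode pipeline.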
When micro-ops cannot be stored in the Decoded ICache due to these restrictions, they are delivered from the legacy decode pipeline. Once micro-ops are delivered from the legacy pipeline, fetching micro-ops from the Decoded ICache can resume only after the next branch micro-op. Frequent switches can incur a penalty.

The Decoded ICache is virtually included in the instruction cache and ITLB. That is, any instruction with micro-ops in the Decoded ICache has its original instruction bytes present in the instruction cache. Lines evicted from the instruction cache must also be evicted from the Decoded ICache, which evicts only the necessary lines.

There are cases where the entire Decoded ICache is flushed. One reason for this can be an ITLB entry eviction. Other reasons are not usually visible to the application programmer, as they occur when important controls are changed, for example, mapping in CR3, or feature and mode enabling in CR0 and CR4. There are also cases where the Decoded ICache is disabled, for instance, when the CS base address is NOT set to zero.
2.3.2.3
Branch Prediction
Branch prediction predicts the branch target and enables the processor to begin executing instructions long before the branch true execution path is known. All branches utilize the branch prediction unit (BPU) for prediction. This unit predicts the target address not only based on the EIP of the branch but also based on the execution path through which execution reached this EIP. The BPU can efficiently predict the following branch types:
• Conditional branches.
• Direct calls and jumps.
• Indirect calls and jumps.
• Returns.
2.3.2.4
Micro-op Queue and the Loop Stream Detector (LSD)
The micro-op queue decouples the front end and the out-of-order engine. It sits between the micro-op generation and the renamer, as shown in Figure 2-5. This queue helps to hide bubbles which are introduced between the various sources of micro-ops in the front end and ensures that four micro-ops are delivered for execution each cycle.
The micro-op queue provides post-decode functionality for certain instruction types. In particular, loads combined with computational operations and all stores, when used with indexed addressing, are represented as a single micro-op in the decoder or Decoded ICache. In the micro-op queue they are fragmented into two micro-ops through a process called un-lamination: one does the load and the other does the operation. A typical example is the following "load plus operation" instruction:

    ADD RAX, [RBP+RSI] ; rax := rax + LD(RBP+RSI)

Similarly, the following store instruction has three register sources and is broken into "generate store address" and "generate store data" sub-components.

    MOV [ESP+ECX*4+12345678], AL
The additional micro-ops generated by un-lamination use the rename and retirement bandwidth. However, it has an overall power benefit. For code that is dominated by indexed addressing (as often happens with array processing), recoding algorithms to use base (or base+displacement) addressing can sometimes improve performance by keeping the load plus operation and store instructions fused.

The Loop Stream Detector (LSD)

The Loop Stream Detector was introduced in Intel® Core microarchitectures. The LSD detects small loops that fit in the micro-op queue and locks them down. The loop streams from the micro-op queue, with no more fetching, decoding, or reading micro-ops from any of the caches, until a branch misprediction inevitably ends it.

Loops with the following attributes qualify for LSD/micro-op queue replay:
• Up to eight chunk fetches of 32 instruction bytes.
• Up to 28 micro-ops (~28 instructions).
• All micro-ops are also resident in the Decoded ICache.
• Can contain no more than eight taken branches, and none of them can be a CALL or RET.
• Cannot have mismatched stack operations. For example, more PUSH than POP instructions.
Many calculation-intensive loops, searches and software string moves match these characteristics.

Use the loop cache functionality opportunistically. For high performance code, loop unrolling is generally preferable for performance even when it overflows the LSD capability.
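The LSD qualification attributes listed above can be expressed as a simple checklist. The following is a minimal sketch: the `LoopProfile` record and its field names are hypothetical, the values would have to come from your own analysis of a loop, and the check covers only the attributes named in this section.

```python
from dataclasses import dataclass


@dataclass
class LoopProfile:
    fetch_chunks_32b: int          # number of 32-byte fetch chunks the loop body spans
    micro_ops: int                 # micro-ops in the loop body
    all_in_decoded_icache: bool    # every micro-op is resident in the Decoded ICache
    taken_branches: int
    has_call_or_ret: bool
    pushes: int
    pops: int


def lsd_qualifies(loop):
    """Apply the LSD attributes from the list above to one loop profile."""
    return (
        loop.fetch_chunks_32b <= 8
        and loop.micro_ops <= 28
        and loop.all_in_decoded_icache
        and loop.taken_branches <= 8
        and not loop.has_call_or_ret
        and loop.pushes == loop.pops  # no mismatched stack operations
    )


small_loop = LoopProfile(2, 12, True, 1, False, 0, 0)
unrolled_loop = LoopProfile(6, 40, True, 1, False, 0, 0)
print(lsd_qualifies(small_loop), lsd_qualifies(unrolled_loop))
```

As the text notes, failing this check is not necessarily a problem: an unrolled loop that overflows the LSD may still run faster overall.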
2.3.3
The Out-of-Order Engine
The out-of-order engine provides improved performance over prior generations with excellent power characteristics. It detects dependency chains and sends them to execution out of order while maintaining the correct data flow. When a dependency chain is waiting for a resource, such as a second-level data cache line, it sends micro-ops from another chain to the execution core. This increases the overall rate of instructions executed per cycle (IPC).

The out-of-order engine consists of two blocks, shown in Figure 2-5: the Rename/retirement block and the Scheduler.

The out-of-order engine contains the following major components:

Renamer. The Renamer component moves micro-ops from the front end to the execution core. It eliminates false dependencies among micro-ops, thereby enabling out-of-order execution of micro-ops.

Scheduler. The Scheduler component queues micro-ops until all source operands are ready. It schedules and dispatches ready micro-ops to the available execution units in as close to a first in first out (FIFO) order as possible.

Retirement. The Retirement component retires instructions and micro-ops in order and handles faults and exceptions.
2.3.3.1
Renamer
The Renamer is the bridge between the in-order part in Figure 2-5 and the dataflow world of the Scheduler. It moves up to four micro-ops every cycle from the micro-op queue to the out-of-order engine. Although the renamer can send only up to four micro-ops (unfused, micro-fused, or macro-fused) per cycle, this is equivalent to the six micro-ops per cycle that the issue ports can dispatch. In this process, the out-of-order core carries out the following steps:
• Renames architectural sources and destinations of the micro-ops to micro-architectural sources and destinations.
• Allocates resources to the micro-ops. For example, load or store buffers.
• Binds the micro-op to an appropriate dispatch port.
Some micro-ops can execute to completion during rename and are removed from the pipeline at that point, effectively costing no execution bandwidth. These include:
• Zero idioms (dependency breaking idioms).
• NOP.
• VZEROUPPER.
• FXCH.
The renamer can allocate two branches each cycle, compared to one branch each cycle in the previous microarchitecture. This can eliminate some bubbles in execution.

Micro-fused load and store operations that use an index register are decomposed to two micro-ops, and hence consume two out of the four slots the Renamer can use every cycle.

Dependency Breaking Idioms

Instruction parallelism can be improved by using common instructions to clear register contents to zero. The renamer can detect them on the zero evaluation of the destination register. Use one of these dependency breaking idioms to clear a register when possible.
• XOR REG,REG
• SUB REG,REG
• PXOR/VPXOR XMMREG,XMMREG
• PSUBB/W/D/Q XMMREG,XMMREG
• VPSUBB/W/D/Q XMMREG,XMMREG
• XORPS/PD XMMREG,XMMREG
• VXORPS/PD YMMREG,YMMREG
Since zero idioms are detected and removed by the renamer, they have no execution latency. There is another dependency breaking idiom, the "ones idiom".
• CMPEQ XMM1, XMM1 ; "ones idiom" sets all elements to all "ones"
In this case, the micro-op must execute. However, since it is known that regardless of the input data the output data is always "all ones", the micro-op has no dependency upon its sources (as with the zero idiom), and it can execute as soon as it finds a free execution port.
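A tool that scans assembly text for the zero idioms listed above might look like the sketch below. It is only an illustration of the pattern (a listed mnemonic whose source operands name the same register); the function name and the operand-list representation are assumptions, and it says nothing about how the renamer itself performs the detection.

```python
# Mnemonics from the zero-idiom list in this section, with size suffixes expanded.
ZERO_IDIOM_MNEMONICS = (
    {"XOR", "SUB", "PXOR", "VPXOR", "XORPS", "XORPD", "VXORPS", "VXORPD"}
    | {prefix + "PSUB" + size for prefix in ("", "V") for size in "BWDQ"}
)


def is_zero_idiom(mnemonic, operands):
    """True if the instruction matches a listed zero idiom.

    `operands` is a list of register names. For two-operand forms the
    destination is also a source; for three-operand VEX forms the last
    two operands are the sources. In both cases the last two must match.
    """
    if mnemonic.upper() not in ZERO_IDIOM_MNEMONICS or len(operands) < 2:
        return False
    src_a, src_b = (op.strip().upper() for op in operands[-2:])
    return src_a == src_b


print(is_zero_idiom("xor", ["eax", "eax"]), is_zero_idiom("xor", ["eax", "ebx"]))
```

A check like this can help flag places where a register is cleared with, for example, a move of an immediate zero instead of one of the recognized idioms.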
2.3.3.2
Scheduler
The scheduler controls the dispatch of micro-ops onto their execution ports. In order to do this, it must identify which micro-ops are ready and where their sources come from: a register file entry, or a bypass directly from an execution unit. Depending on the availability of dispatch ports and writeback buses, and the priority of ready micro-ops, the scheduler selects which micro-ops are dispatched every cycle.
2.3.4
The Execution Core
The execution core is superscalar and can process instructions out of order. The execution core optimizes overall performance by handling the most common operations efficiently, while minimizing potential delays. The out-of-order execution core improves execution unit organization over the prior generation in the following ways:
• Reduction in read port stalls.
• Reduction in writeback conflicts and delays.
• Reduction in power.
• Reduction of SIMD FP assists dealing with denormal inputs and underflow outputs.
Some high precision FP algorithms need to operate with FTZ=0 and DAZ=0, that is, permitting underflowed intermediate results and denormal inputs in order to achieve higher numerical precision. On prior generation microarchitectures this came at the expense of reduced performance due to SIMD FP assists. The reduction of SIMD FP assists in Intel microarchitecture code name Sandy Bridge applies to the following SSE instructions (and AVX variants): ADDPD/ADDPS, MULPD/MULPS, DIVPD/DIVPS, and CVTPD2PS.
The out-of-order core consists of three execution stacks, where each stack encapsulates a certain type of data. The execution core contains the following execution stacks:
• General purpose integer.
• SIMD integer and floating-point.
• X87.
The execution core also contains connections to and from the cache hierarchy. The loaded data is fetched from the caches and written back into one of the stacks.
The scheduler can dispatch up to six micro-ops every cycle, one on each port. The following table summarizes which operations can be dispatched on which port.
Table 2-14. Dispatch Port and Execution Stacks
Stack | Port 0 | Port 1 | Port 2 | Port 3 | Port 4 | Port 5
Integer | ALU, Shift | ALU, Fast LEA, Slow LEA, MUL | Load_Addr, Store_addr | Load_Addr, Store_addr | Store_data | ALU, Shift, Branch, Fast LEA
SSE-Int, AVX-Int, MMX | Mul, Shift, STTNI, Int-Div, 128b-Mov | ALU, Shuf, Blend, 128b-Mov | - | - | Store_data | ALU, Shuf, Shift, Blend, 128b-Mov
SSE-FP, AVX-FP_low | Mul, Div, Blend, 256b-Mov | Add, CVT | - | - | Store_data | Shuf, Blend, 256b-Mov
X87, AVX-FP_High | Mul, Div, Blend, 256b-Mov | Add, CVT | - | - | Store_data | Shuf, Blend, 256b-Mov
After execution, the data is written back on a writeback bus corresponding to the dispatch port and the data type of the result. Micro-ops that are dispatched on the same port but have different latencies may need the writeback bus at the same cycle. In these cases the execution of one of the micro-ops is delayed until the writeback bus is available. For example, MULPS (five cycles) and BLENDPS (one cycle) may collide if both are ready for execution on port 0: first the MULPS and four cycles later the BLENDPS. Intel microarchitecture code name Sandy Bridge eliminates such collisions as long as the micro-ops write the results to different stacks. For example, integer ADD (one cycle) can be dispatched four cycles after MULPS (five cycles) since the integer ADD uses the integer stack while the MULPS uses the FP stack.
When a source of a micro-op executed in one stack comes from a micro-op executed in another stack, a one- or two-cycle delay can occur. The delay occurs also for transitions between Intel SSE integer and Intel SSE floating-point operations. In some of the cases the data transition is done using a micro-op that is added to the instruction flow. The following table describes how data, written back after execution, can bypass to micro-op execution in the following cycles.
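The writeback collision rule above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration of that rule: the `MicroOp` record, the stack labels, and the `collides` helper are hypothetical names, and the model assumes one writeback bus per (port, stack) pair, which is a simplification of the text rather than a description of the hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroOp:
    name: str
    port: int      # dispatch port
    stack: str     # execution stack that receives the result (label is an assumption)
    latency: int   # cycles from dispatch to writeback


def writeback_cycle(uop: MicroOp, dispatch_cycle: int) -> int:
    """Cycle in which the result needs the writeback bus."""
    return dispatch_cycle + uop.latency


def collides(first: MicroOp, first_cycle: int,
             second: MicroOp, second_cycle: int) -> bool:
    """True if both results need the same (port, stack) bus in the same cycle."""
    same_bus = first.port == second.port and first.stack == second.stack
    same_time = (writeback_cycle(first, first_cycle)
                 == writeback_cycle(second, second_cycle))
    return same_bus and same_time


# Latencies and port taken from the example in the text.
MULPS = MicroOp("MULPS", port=0, stack="fp", latency=5)
BLENDPS = MicroOp("BLENDPS", port=0, stack="fp", latency=1)
ADD = MicroOp("ADD", port=0, stack="int", latency=1)

print(collides(MULPS, 0, BLENDPS, 4))  # same port, same stack, same cycle
print(collides(MULPS, 0, ADD, 4))      # different stacks, no collision
```

With MULPS dispatched at cycle 0, a BLENDPS dispatched at cycle 4 needs the FP writeback bus in cycle 5 as well, while an integer ADD dispatched at cycle 4 writes to a different stack.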
Table 2-15. Execution Core Writeback Latency (cycles)
 | Integer | SSE-Int, AVX-Int, MMX | SSE-FP, AVX-FP_low | X87, AVX-FP_High
Integer | 0 | micro-op (port 0) | micro-op (port 0) | micro-op (port 0) + 1 cycle
SSE-Int, AVX-Int, MMX | micro-op (port 5) or micro-op (port 5) + 1 cycle | 0 | 1 cycle delay | 0
SSE-FP, AVX-FP_low | micro-op (port 5) or micro-op (port 5) + 1 cycle | 1 cycle delay | 0 | micro-op (port 5) + 1 cycle
X87, AVX-FP_High | micro-op (port 5) + 1 cycle | 0 | micro-op (port 5) + 1 cycle | 0
Load | 0 | 1 cycle delay | 1 cycle delay | 2 cycle delay
2.3.5
Cache Hierarchy
The cache hierarchy contains a first level instruction cache, a first level data cache (L1 DCache) and a second level (L2) cache, in each core. The L1D cache may be shared by two logical processors if the processor supports Intel Hyper-Threading Technology. The L2 cache is shared by instructions and data. All cores in a physical processor package connect to a shared last level cache (LLC) via a ring connection.
The caches use the services of the Instruction Translation Lookaside Buffer (ITLB), Data Translation Lookaside Buffer (DTLB) and Shared Translation Lookaside Buffer (STLB) to translate linear addresses to physical addresses. Data coherency in all cache levels is maintained using the MESI protocol. For more information, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3. Cache hierarchy details can be obtained at run-time using the CPUID instruction; see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A.
Table 2-16. Cache Parameters
Level | Capacity | Associativity (ways) | Line Size (bytes) | Write Update Policy | Inclusive
L1 Data | 32 KB | 8 | 64 | Writeback | -
Instruction | 32 KB | 8 | N/A | N/A | -
L2 (Unified) | 256 KB | 8 | 64 | Writeback | No
Third Level (LLC) | Varies, query CPUID leaf 4 | Varies with cache size | 64 | Writeback | Yes
2.3.5.1
Load and Store Operation Overview
This section provides an overview of the load and store operations.
Loads
When an instruction reads data from a memory location that has write-back (WB) type, the processor looks for it in the caches and memory. Table 2-17 shows the access lookup order and best case latency. The actual latency can vary depending on the cache queue occupancy, LLC ring occupancy, memory components, and their parameters.
Table 2-17. Lookup Order and Load Latency
Level | Latency (cycles) | Bandwidth (per core per cycle)
L1 Data | 4 (note 1) | 2 x 16 bytes
L2 (Unified) | 12 | 1 x 32 bytes
Third Level (LLC) | 26-31 (note 2) | 1 x 32 bytes
L2 and L1 DCache in other cores if applicable | 43 - clean hit; 60 - dirty hit | -
NOTES:
1. Subject to execution core bypass restriction shown in Table 2-15.
2. Latency of L3 varies with product segment and SKU. The values apply to second generation Intel Core processor families.
The LLC is inclusive of all cache levels above it: data contained in the core caches must also reside in the LLC. Each cache line in the LLC holds an indication of the cores that may have this line in their L2 and L1 caches. If there is an indication in the LLC that other cores may hold the line of interest and its state might have to be modified, there is a lookup into the L1 DCache and L2 of these cores too. The lookup is called "clean" if it does not require fetching data from the other core caches. The lookup is called "dirty" if modified data has to be fetched from the other core caches and transferred to the loading core.
The latencies shown above are the best-case scenarios. Sometimes a modified cache line has to be evicted to make space for a new cache line. The modified cache line is evicted in parallel to bringing the new data and does not require additional latency. However, when data is written back to memory, the eviction uses cache bandwidth and possibly memory bandwidth as well. Therefore, when multiple cache misses require the eviction of modified lines within a short time, there is an overall degradation in cache response time.
Memory access latencies vary based on occupancy of the memory controller queues, DRAM configuration, DDR parameters, and DDR paging behavior (whether the requested page is a page-hit, page-miss or page-empty).
Stores
When an instruction writes data to a memory location that has a write-back memory type, the processor first ensures that it has the line containing this memory location in its L1 DCache, in Exclusive or Modified MESI state.
If the cache line is not there in the right state, the processor fetches it from the next levels of the memory hierarchy using a Read for Ownership request. The processor looks for the cache line in the following locations, in the specified order:
1. L1 DCache
2. L2
3. Last Level Cache
4. L2 and L1 DCache in other cores, if applicable
5. Memory
Once the cache line is in the L1 DCache, the new data is written to it, and the line is marked as Modified.
Reading for ownership and storing the data happens after instruction retirement and follows the order of store instruction retirement. Therefore, the store latency usually does not affect the store instruction itself. However, several sequential stores that miss the L1 DCache may have cumulative latency that can affect performance. As long as the store does not complete, its entry remains occupied in the store buffer. When the store buffer becomes full, new micro-ops cannot enter the execution pipe and execution might stall.
2.3.5.2
L1 DCache
The L1 DCache is the first level data cache. It manages all load and store requests from all types through its internal data structures. The L1 DCache:
• Enables loads and stores to issue speculatively and out of order.
• Ensures that retired loads and stores have the correct data upon retirement.
• Ensures that loads and stores follow the memory ordering rules of the IA-32 and Intel 64 instruction set architecture.
Table 2-18. L1 Data Cache Components
Component | Intel microarchitecture code name Sandy Bridge | Intel microarchitecture code name Nehalem
Data Cache Unit (DCU) | 32KB, 8 ways | 32KB, 8 ways
Load buffers | 64 entries | 48 entries
Store buffers | 36 entries | 32 entries
Line fill buffers (LFB) | 10 entries | 10 entries
The DCU is organized as 32 KBytes, eight-way set associative. Cache line size is 64 bytes, arranged in eight banks. Internally, accesses are up to 16 bytes, with 256-bit Intel AVX instructions utilizing two 16-byte accesses. Two load operations and one store operation can be handled each cycle.
The L1 DCache maintains requests which cannot be serviced immediately to completion. Some reasons for requests that are delayed: cache misses, unaligned access that splits across cache lines, data not ready to be forwarded from a preceding store, loads experiencing bank collisions, and load block due to cache line replacement.
The L1 DCache can maintain up to 64 load micro-ops from allocation until retirement. It can maintain up to 36 store operations from allocation until the store value is committed to the cache, or written to the line fill buffers (LFB) in the case of non-temporal stores.
The L1 DCache can handle multiple outstanding cache misses and continue to service incoming stores and loads. Up to 10 requests of missing cache lines can be managed simultaneously using the LFB.
The L1 DCache is a write-back write-allocate cache. Stores that hit in the DCU do not update the lower levels of the memory hierarchy. Stores that miss the DCU allocate a cache line.
Loads
The L1 DCache architecture can service two loads per cycle, each of which can be up to 16 bytes. Up to 32 loads can be maintained at different stages of progress, from their allocation in the out of order engine until the loaded value is returned to the execution core. Loads can:
• Read data before preceding stores when the load address and store address ranges are known not to conflict.
• Be carried out speculatively, before preceding branches are resolved.
• Take cache misses out of order and in an overlapped manner.
Loads cannot:
• Speculatively take any sort of fault or trap.
• Speculatively access uncacheable memory.
The common load latency is five cycles. When using a simple addressing mode, base plus offset that is smaller than 2048, the load latency can be four cycles. This technique is especially useful for pointer-chasing code. However, overall latency varies depending on the target register data type due to stack bypass. See Section 2.3.4 for more information.
The following table lists overall load latencies. These latencies assume the common case of a flat segment, that is, the segment base address is zero. If the segment base is not zero, load latency increases.
Table 2-19. Effect of Addressing Modes on Load Latency
Data Type / Addressing Mode | Base + Offset > 2048; Base + Index [+ Offset] | Base + Offset < 2048
Integer | 5 | 4
MMX, SSE, 128-bit AVX | 6 | 5
X87 | 7 | 6
256-bit AVX | 7 | 7
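Table 2-19 can be read as a small lookup keyed by data type and addressing mode. The Python sketch below encodes the table for quick estimates; the dictionary keys and the `load_latency` helper are hypothetical names. It assumes a flat segment, as the table does, and it treats an offset of exactly 2048 and negative offsets as the slower case, which the table leaves unspecified.

```python
# (fast, slow) latencies in cycles, copied from Table 2-19.
# fast: Base + Offset with offset < 2048
# slow: Base + Offset with larger offset, or Base + Index [+ Offset]
LOAD_LATENCY = {
    "integer": (4, 5),
    "mmx_sse_avx128": (5, 6),
    "x87": (6, 7),
    "avx256": (7, 7),
}


def load_latency(data_type: str, offset: int = 0, has_index: bool = False) -> int:
    """Best-case L1 hit load latency in cycles for a flat segment."""
    fast, slow = LOAD_LATENCY[data_type]
    # Assumption: only offsets in [0, 2048) qualify for the fast path.
    simple_mode = (not has_index) and 0 <= offset < 2048
    return fast if simple_mode else slow


print(load_latency("integer", offset=8))                  # pointer chasing: [base + 8]
print(load_latency("integer", offset=8, has_index=True))  # [base + index + 8]
print(load_latency("avx256", offset=64))
```

For pointer-chasing code, keeping the next-pointer field at a small offset and avoiding an index register is what keeps the integer load on the four-cycle path.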
Stores
Stores to memory are executed in two phases:
• Execution phase. Fills the store buffers with linear and physical address and data. Once store address and data are known, the store data can be forwarded to the following load operations that need it.
• Completion phase. After the store retires, the L1 DCache moves its data from the store buffers to the DCU, up to 16 bytes per cycle.
Address Translation
The DTLB can perform three linear to physical address translations every cycle, two for load addresses and one for a store address. If the address is missing in the DTLB, the processor looks for it in the STLB, which holds data and instruction address translations. The penalty of a DTLB miss that hits the STLB is seven cycles. Large page support includes 1-GByte pages, in addition to 4K and 2M/4M pages.
The DTLB and STLB are four-way set associative. The following table specifies the number of entries in the DTLB and STLB.
Table 2-20. DTLB and STLB Parameters
TLB | Page Size | Entries
DTLB | 4KB | 64
DTLB | 2MB/4MB | 32
DTLB | 1GB | 4
STLB | 4KB | 512
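The entry counts in Table 2-20 translate directly into the amount of memory each TLB level can map without a miss (entries times page size). The short Python calculation below works this out; the `tlb_reach` helper and the dictionary layout are illustrative names, and the 2MB/4MB row is evaluated at 2 MB pages as an assumption.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

# (TLB, page size in bytes) -> number of entries, from Table 2-20.
TLB_ENTRIES = {
    ("DTLB", 4 * KB): 64,
    ("DTLB", 2 * MB): 32,   # table row is "2MB/4MB"; 2 MB assumed here
    ("DTLB", 1 * GB): 4,
    ("STLB", 4 * KB): 512,
}


def tlb_reach(tlb: str, page_size: int) -> int:
    """Bytes of address space covered when every entry maps a distinct page."""
    return TLB_ENTRIES[(tlb, page_size)] * page_size


print(tlb_reach("DTLB", 4 * KB) // KB, "KB")  # 64 entries * 4 KB
print(tlb_reach("STLB", 4 * KB) // MB, "MB")  # 512 entries * 4 KB
print(tlb_reach("DTLB", 2 * MB) // MB, "MB")  # 32 entries * 2 MB
print(tlb_reach("DTLB", 1 * GB) // GB, "GB")  # 4 entries * 1 GB
```

A working set larger than the 4 KB DTLB reach but within the STLB reach pays the seven-cycle DTLB-miss penalty quoted above rather than a full page walk.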
Store Forwarding
If a load follows a store and reloads the data that the store writes to memory, the data can forward directly from the store operation to the load. This process, called store to load forwarding, saves cycles by enabling the load to obtain the data directly from the store operation instead of through memory. You can take advantage of store forwarding to quickly move complex structures without losing the ability to forward the subfields. The memory control unit can handle store forwarding situations with fewer restrictions compared to previous microarchitectures.
The following rules must be met to enable store to load forwarding:
• The store must be the last store to that address, prior to the load.
• The store must contain all data being loaded.
• The load is from a write-back memory type and neither the load nor the store are non-temporal accesses.
Stores cannot forward to loads in the following cases:
• Four-byte and eight-byte loads that cross an eight-byte boundary, relative to the preceding 16- or 32-byte store.
• Any load that crosses a 16-byte boundary of a 32-byte store.
Table 2-21 to Table 2-24 detail the store to load forwarding behavior. For a given store size, all the loads that may overlap are shown and specified by 'F'. Forwarding from a 32-byte store is similar to forwarding from each of the 16-byte halves of the store. Cases that cannot forward are shown as 'N'.
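Table 2-21. Store Forwarding Conditions (1 and 2 byte stores)
Store Size | Load Size | Load Alignment: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
1 | 1 | F | | | | | | | | | | | | | | |
2 | 1 | F | F | | | | | | | | | | | | | |
2 | 2 | F | N | | | | | | | | | | | | | |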
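Table 2-22. Store Forwarding Conditions (4-16 byte stores)
Store Size | Load Size | Load Alignment: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
4 | 1 | F | F | F | F | | | | | | | | | | | |
4 | 2 | F | F | F | N | | | | | | | | | | | |
4 | 4 | F | N | N | N | | | | | | | | | | | |
8 | 1 | F | F | F | F | F | F | F | F | | | | | | | |
8 | 2 | F | F | F | F | F | F | F | N | | | | | | | |
8 | 4 | F | F | F | F | F | N | N | N | | | | | | | |
8 | 8 | F | N | N | N | N | N | N | N | | | | | | | |
16 | 1 | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F
16 | 2 | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | N
16 | 4 | F | F | F | F | F | N | N | N | F | F | F | F | F | N | N | N
16 | 8 | F | N | N | N | N | N | N | N | F | N | N | N | N | N | N | N
16 | 16 | F | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N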
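Table 2-23. 32-byte Store Forwarding Conditions (0-15 byte alignment)
Store Size | Load Size | Load Alignment: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15
32 | 1 | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F
32 | 2 | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | N
32 | 4 | F | F | F | F | F | N | N | N | F | F | F | F | F | N | N | N
32 | 8 | F | N | N | N | N | N | N | N | F | N | N | N | N | N | N | N
32 | 16 | F | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N
32 | 32 | F | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N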
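Table 2-24. 32-byte Store Forwarding Conditions (16-31 byte alignment)
Store Size | Load Size | Load Alignment: 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31
32 | 1 | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F
32 | 2 | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | N
32 | 4 | F | F | F | F | F | N | N | N | F | F | F | F | F | N | N | N
32 | 8 | F | N | N | N | N | N | N | N | F | N | N | N | N | N | N | N
32 | 16 | F | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N
32 | 32 | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N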
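The size and alignment rules behind Table 2-21 to Table 2-24 can be condensed into one predicate. The Python sketch below is an interpretation of those tables, not a statement of how the hardware decides: `can_forward` and `crosses` are hypothetical names, the load alignment is taken as a byte offset from the start of the store, and the special case for a 32-byte load at offset 0 is inferred from the 'F' entry in Table 2-23. The other forwarding conditions (last store to the address, write-back memory type, no non-temporal accesses) are outside this model.

```python
def crosses(offset: int, size: int, boundary: int) -> bool:
    """True if bytes [offset, offset + size) span a multiple of `boundary`."""
    return offset // boundary != (offset + size - 1) // boundary


def can_forward(store_size: int, load_size: int, offset: int) -> bool:
    """Size/alignment part of the store-to-load forwarding rules.

    `offset` is the load address minus the store address, in bytes.
    """
    # The store must contain all data being loaded.
    if offset < 0 or offset + load_size > store_size:
        return False
    # Inferred from Table 2-23: a full 32-byte reload at offset 0 forwards.
    if store_size == 32 and load_size == 32:
        return True
    # Any load that crosses a 16-byte boundary of a 32-byte store.
    if store_size == 32 and crosses(offset, load_size, 16):
        return False
    # 4- and 8-byte loads crossing an 8-byte boundary of a 16/32-byte store.
    if store_size >= 16 and load_size in (4, 8) and crosses(offset, load_size, 8):
        return False
    return True


# Reproduce the 4-byte-load row of Table 2-23 (32-byte store, alignment 0-15).
row = "".join("F" if can_forward(32, 4, a) else "N" for a in range(16))
print(row)
```

A quick way to use this is to check a planned struct layout: for each field that is stored wide and reloaded narrow, confirm that the (store size, load size, offset) triple forwards.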
Memory Disambiguation
A load operation may depend on a preceding store. Many microarchitectures block loads until all preceding store addresses are known. The memory disambiguator predicts which loads will not depend on any previous stores. When the disambiguator predicts that a load does not have such a dependency, the load takes its data from the L1 data cache even when the store address is unknown. This hides the load latency. Eventually, the prediction is verified. If an actual conflict is detected, the load and all succeeding instructions are re-executed.
The following loads are not disambiguated. The execution of these loads is stalled until addresses of all previous stores are known.
• Loads that cross the 16-byte boundary.
• 32-byte Intel AVX loads that are not 32-byte aligned.
The memory disambiguator always assumes dependency between loads and earlier stores that have the same address bits 0:11.
Bank Conflict
Since 16-byte loads can cover up to three banks, and two loads can happen every cycle, it is possible that six of the eight banks may be accessed per cycle, for loads. A bank conflict happens when two load accesses need the same bank (their address has the same 2-4 bit value) in different sets, at the same time. When a bank conflict occurs, one of the load accesses is recycled internally.
In many cases two loads access exactly the same bank in the same cache line, as may happen when popping operands off the stack, or any sequential accesses. In these cases, conflict does not occur and the loads are serviced simultaneously.
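The disambiguator's rule about address bits 0:11 means that a load and an earlier store whose addresses differ by a multiple of 4096 are treated as dependent even when they touch different memory. The Python check below is a minimal sketch of that comparison; `assumed_dependent` is a hypothetical helper name and the example addresses are made up for illustration.

```python
LOW_12_BITS = 0xFFF  # address bits 0:11


def assumed_dependent(load_addr: int, store_addr: int) -> bool:
    """True if the two addresses match in bits 0:11.

    Such a load is assumed to depend on the earlier store, whether or not
    the full addresses are equal.
    """
    return ((load_addr ^ store_addr) & LOW_12_BITS) == 0


# Two buffers exactly 8 KB apart: same low 12 bits, so dependency is assumed.
print(assumed_dependent(0x10000040, 0x10002040))
# Same page, 64 bytes apart: low bits differ.
print(assumed_dependent(0x10000040, 0x10000080))
```

When a copy loop reads from one buffer and writes to another, offsetting the buffers so that they do not share the low 12 address bits avoids this assumed dependency.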
2.3.5.3
Ring Interconnect and Last Level Cache
The system-on-a-chip design provides a high bandwidth bi-directional ring bus to connect between the IA cores and various sub-systems in the uncore. In the second generation Intel Core processor 2xxx series, the uncore subsystem includes a system agent, the graphics unit (GT) and the last level cache (LLC).
The LLC consists of multiple cache slices. The number of slices is equal to the number of IA cores. Each slice has a logic portion and a data array portion. The logic portion handles data coherency, memory ordering, access to the data array portion, LLC misses and writeback to memory, and more. The data array portion stores cache lines. Each slice contains a full cache port that can supply 32 bytes/cycle.
The physical addresses of data kept in the LLC data arrays are distributed among the cache slices by a hash function, such that addresses are uniformly distributed. The data array in a cache block may have 4/8/12/16 ways corresponding to 0.5M/1M/1.5M/2M block size. However, due to the address distribution among the cache blocks, from the software point of view this does not appear as a normal N-way cache.
From the processor cores and the GT view, the LLC acts as one shared cache with multiple ports and bandwidth that scales with the number of cores. The LLC hit latency, ranging between 26-31 cycles, depends on the core location relative to the LLC block, and how far the request needs to travel on the ring.
The number of cache slices increases with the number of cores, therefore the ring and LLC are not likely to be a bandwidth limiter to core operation.
The GT sits on the same ring interconnect, and uses the LLC for its data operations as well. In this respect it is very similar to an IA core. Therefore, high bandwidth graphic applications using cache bandwidth and significant cache footprint can interfere, to some extent, with core operations.
All the traffic that cannot be satisfied by the LLC, such as LLC misses, dirty line writeback, non-cacheable operations, and MMIO/IO operations, still travels through the cache-slice logic portion and the ring, to the system agent.
In the Intel Xeon Processor E5 Family, the uncore subsystem does not include the graphics unit (GT). Instead, the uncore subsystem contains many more components, including an LLC with larger capacity and snooping capabilities to support multiple processors, Intel® QuickPath Interconnect interfaces that can support multi-socket platforms, power management control hardware, and a system agent capable of supporting high-bandwidth traffic from memory and I/O devices.
In the Intel Xeon processor E5 2xxx or 4xxx families, the LLC capacity generally scales with the number of processor cores, with 2.5 MBytes per core.
2.3.5.4
Data Prefetching
Data can be speculatively loaded to the L1 DCache using software prefetching, hardware prefetching, or any combination of the two.
You can use the four Streaming SIMD Extensions (SSE) prefetch instructions to enable software-controlled prefetching. These instructions are hints to bring a cache line of data into the desired levels of the cache hierarchy. The software-controlled prefetch is intended for prefetching data, but not for prefetching code.
The rest of this section describes the various hardware prefetching mechanisms provided by Intel microarchitecture code name Sandy Bridge and their improvement over previous processors. The goal of the prefetchers is to automatically predict which data the program is about to consume. If this data is not close to the execution core or inner cache, the prefetchers bring it from the next levels of cache hierarchy and memory. Prefetching has the following effects:
• Improves performance if data is arranged sequentially in the order used in the program.
• May cause slight performance degradation due to bandwidth issues, if access patterns are sparse instead of local.
• On rare occasions, if the algorithm's working set is tuned to occupy most of the cache and unneeded prefetches evict lines required by the program, the hardware prefetcher may cause severe performance degradation due to the limited cache capacity of L1.
Data Prefetch to L1 Data Cache
Data prefetching is triggered by load operations when the following conditions are met:
• Load is from writeback memory type.
• The prefetched data is within the same 4K byte page as the load instruction that triggered it.
• No fence is in progress in the pipeline.
• Not many other load misses are in progress.
• There is not a continuous stream of stores.
Two hardware prefetchers load data to the L1 DCache:
• Data cache unit (DCU) prefetcher. This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
• Instruction pointer (IP)-based stride prefetcher. This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to 2K bytes.
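The behavior described for the IP-based stride prefetcher (per-instruction tracking, next address equal to current address plus stride, strides up to 2K bytes, same 4K page) can be modeled in a few lines. The Python class below is a toy model for reasoning about access patterns only. The class name, the table layout, and the choice to issue a prefetch once the same stride has been seen twice in a row are all assumptions; the source does not describe the detection threshold or the internal structures.

```python
from typing import Dict, Optional, Tuple

PAGE_SIZE = 4096    # prefetch must stay within the 4K byte page
MAX_STRIDE = 2048   # "strides of up to 2K bytes"


class IPStridePrefetcherModel:
    """Toy model: one (last address, last stride) entry per load instruction."""

    def __init__(self) -> None:
        self._table: Dict[int, Tuple[int, Optional[int]]] = {}

    def observe(self, ip: int, addr: int) -> Optional[int]:
        """Record a load at instruction `ip`; return a prefetch address or None."""
        prev = self._table.get(ip)
        stride = None if prev is None else addr - prev[0]
        self._table[ip] = (addr, stride)
        # Assumption: require the same non-zero stride twice in a row.
        if prev is None or stride == 0 or stride != prev[1]:
            return None
        if abs(stride) > MAX_STRIDE:
            return None
        target = addr + stride  # current address plus stride
        if target // PAGE_SIZE != addr // PAGE_SIZE:
            return None         # would leave the 4K page
        return target


model = IPStridePrefetcherModel()
forward = [model.observe(0x400123, a) for a in (0x1000, 0x1040, 0x1080)]
backward = [model.observe(0x400456, a) for a in (0x2F00, 0x2E00, 0x2D00)]
print(forward)
print(backward)
```

The model makes it easy to see why a loop whose stride changes every iteration, or exceeds 2K bytes, gives this prefetcher nothing to work with.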
Data Prefetch to the L2 and Last Level Cache
The following two hardware prefetchers fetch data from memory to the L2 cache and last level cache:
Spatial Prefetcher: This prefetcher strives to complete every cache line fetched to the L2 cache with the pair line that completes it to a 128-byte aligned chunk.
Streamer: This prefetcher monitors read requests from the L1 cache for ascending and descending sequences of addresses. Monitored read requests include L1 DCache requests initiated by load and store operations and by the hardware prefetchers, and L1 ICache requests for code fetch. When a forward or backward stream of requests is detected, the anticipated cache lines are prefetched. Prefetched cache lines must be in the same 4K page.
The streamer and spatial prefetcher prefetch the data to the last level cache. Typically data is brought also to the L2 unless the L2 cache is heavily loaded with missing demand requests.
Enhancements to the streamer include the following features:
• The streamer may issue two prefetch requests on every L2 lookup. The streamer can run up to 20 lines ahead of the load request.
• Adjusts dynamically to the number of outstanding requests per core. If there are not many outstanding requests, the streamer prefetches further ahead. If there are many outstanding requests it prefetches to the LLC only and less far ahead.
• When cache lines are far ahead, it prefetches to the last level cache only and not to the L2. This method avoids replacement of useful cache lines in the L2 cache.
• Detects and maintains up to 32 streams of data accesses. For each 4K byte page, one forward and one backward stream can be maintained.
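The "pair line" that the spatial prefetcher fetches is the other 64-byte half of the same 128-byte aligned chunk. Assuming 64-byte lines as listed in Table 2-16, its address follows from flipping bit 6 of the line address. The Python helper below shows the arithmetic; `pair_line` is a hypothetical name used only for this illustration.

```python
LINE_SIZE = 64  # bytes per cache line (Table 2-16)


def pair_line(addr: int) -> int:
    """Address of the line that completes `addr`'s line to a 128-byte chunk."""
    line_addr = addr & ~(LINE_SIZE - 1)  # align down to the 64-byte line
    return line_addr ^ LINE_SIZE         # flip bit 6: the other half of the chunk


print(hex(pair_line(0x1000)))  # lower half of the chunk pairs with the upper half
print(hex(pair_line(0x1047)))  # upper half pairs with the lower half
```

One practical consequence is that data structures accessed by different threads are affected in 128-byte units by this prefetcher, not only in 64-byte lines.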
2.3.6
System Agent
The system agent implemented in the second generation Intel Core processor family contains the following components:
• An arbiter that handles all accesses from the ring domain and from I/O (PCIe* and DMI) and routes the accesses to the right place.
• PCIe controllers that connect to external PCIe devices. The PCIe controllers have different configuration possibilities that vary with product segment specifics: x16+x4, x8+x8+x4, x8+x4+x4+x4.
• DMI controller that connects to the PCH chipset.
• Integrated display engine, Flexible Display Interconnect, and Display Port, for the internal graphic operations.
• Memory controller.
All main memory traffic is routed from the arbiter to the memory controller. The memory controller in the second generation Intel Core processor 2xxx series supports two channels of DDR, with data rates of 1066MHz, 1333MHz and 1600MHz, and 8 bytes per cycle, depending on the unit type, system configuration and DRAMs. Addresses are distributed between memory channels based on a local hash function that attempts to balance the load between the channels in order to achieve maximum bandwidth and minimum hotspot collisions.
For best performance, populate both channels with equal amounts of memory, preferably the exact same types of DIMMs. In addition, using more ranks for the same amount of memory results in somewhat better memory bandwidth, since more DRAM pages can be open simultaneously. For best performance, populate the system with the highest supported speed DRAM (1333MHz or 1600MHz data rates, depending on the maximum supported frequency) with the best DRAM timings.
The two channels have separate resources and handle memory requests independently. The memory controller contains a high-performance out-of-order scheduler that attempts to maximize memory bandwidth while minimizing latency. Each memory channel contains a 32 cache-line write-data-buffer. Writes to the memory controller are considered completed when they are written to the write-data-buffer. The write-data-buffer is flushed out to main memory at a later time, not impacting write latency.
Partial writes are not handled efficiently on the memory controller and may result in read-modify-write operations on the DDR channel if the partial writes do not complete a full cache line in time. Software should avoid creating partial write transactions whenever possible and consider alternatives, such as buffering the partial writes into full cache line writes.
The memory controller also supports high-priority isochronous requests (such as USB isochronous and Display isochronous requests). High bandwidth of memory requests from the integrated display engine takes up some of the memory bandwidth and impacts core access latency to some degree.
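The channel count, data rates, and 8 bytes per cycle quoted above give an upper bound on DRAM bandwidth. The Python calculation below works through that arithmetic. It assumes the quoted "MHz" data rates are millions of transfers per second and that both channels are populated and perfectly balanced; `peak_bandwidth_mb_s` is a hypothetical helper, and sustained bandwidth in practice is lower than this bound.

```python
def peak_bandwidth_mb_s(data_rate_mt_s: int,
                        channels: int = 2,
                        bytes_per_transfer: int = 8) -> int:
    """Theoretical peak DRAM bandwidth in MB/s (decimal megabytes)."""
    return data_rate_mt_s * channels * bytes_per_transfer


for rate in (1066, 1333, 1600):
    print(rate, "->", peak_bandwidth_mb_s(rate), "MB/s")
```

Populating only one channel halves the bound, which is the arithmetic behind the recommendation to populate both channels equally.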
2.3.7
Intel® Microarchitecture Code Name Ivy Bridge
Third generation Intel Core processors are based on Intel microarchitecture code name Ivy Bridge. Most of the features described in Section 2.3.1 - Section 2.3.6 also apply to Intel microarchitecture code name Ivy Bridge. This section covers feature differences in microarchitecture that can affect coding and performance.
Support for new instructions includes:
• Numeric conversion to and from half-precision floating-point values.
• Hardware-based random number generator compliant with NIST SP 800-90A.
• Reading and writing to FS/GS base registers in any ring to improve user-mode threading support.
For details about using the hardware-based random number generator instruction RDRAND, please refer to the article available from Intel Software Network at http://software.intel.com/en-us/articles/download-the-latest-bull-mountain-software-implementation-guide/?wapkw=bull+mountain.
A small number of microarchitectural enhancements can be beneficial to software:
• Hardware prefetch enhancement: A next-page prefetcher (NPP) is added in Intel microarchitecture code name Ivy Bridge. The NPP is triggered by sequential accesses to cache lines approaching the page boundary, either upwards or downwards.
• Zero-latency register move operation: A subset of register-to-register MOV instructions are executed at the front end, conserving scheduling and execution resources in the out-of-order engine.
• Front end enhancement: In Intel microarchitecture code name Sandy Bridge, the micro-op queue is statically partitioned to provide 28 entries for each logical processor, irrespective of software executing in single thread or multiple threads. If one logical processor is not active in Intel microarchitecture code name Ivy Bridge, then a single thread executing on that processor core can use the 56 entries in the micro-op queue. In this case, the LSD can handle larger loop structures that would require more than 28 entries.
• The latency and throughput of some instructions have been improved over those of Intel microarchitecture code name Sandy Bridge. For example, 256-bit packed floating-point divide and square root operations are faster; ROL and ROR instructions are also improved.
2.4
INTEL® CORE™ MICROARCHITECTURE AND ENHANCED INTEL® CORE™ MICROARCHITECTURE
Intel Core microarchitecture introduces the following features that enable high performance and power-efficient performance for single-threaded as well as multi-threaded workloads:
• Intel® Wide Dynamic Execution enables each processor core to fetch, dispatch, execute with high bandwidths and retire up to four instructions per cycle. Features include:
— Fourteen-stage efficient pipeline.
— Three arithmetic logical units.
— Four decoders to decode up to five instructions per cycle.
— Macro-fusion and micro-fusion to improve front end throughput.
— Peak issue rate of dispatching up to six micro-ops per cycle.
— Peak retirement bandwidth of up to four micro-ops per cycle.
— Advanced branch prediction.
— Stack pointer tracker to improve efficiency of executing function/procedure entries and exits.
• Intel® Advanced Smart Cache delivers higher bandwidth from the second level cache to the core, optimal performance and flexibility for single-threaded and multi-threaded applications. Features include:
— Optimized for multicore and single-threaded execution environments.
— 256 bit internal data path to improve bandwidth from L2 to first-level data cache.
— Unified, shared second-level cache of 4 MByte, 16 way (or 2 MByte, 8 way).
• Intel® Smart Memory Access prefetches data from memory in response to data access patterns and reduces cache-miss exposure of out-of-order execution. Features include:
— Hardware prefetchers to reduce effective latency of second-level cache misses.
— Hardware prefetchers to reduce effective latency of first-level data cache misses.
— Memory disambiguation to improve efficiency of speculative execution engine.
• Intel® Advanced Digital Media Boost improves most 128-bit SIMD instructions with single-cycle throughput and floating-point operations. Features include:
— Single-cycle throughput of most 128-bit SIMD instructions (except 128-bit shuffle, pack, unpack operations).
— Up to eight floating-point operations per cycle.
— Three issue ports available to dispatching SIMD instructions for execution.
The Enhanced Intel Core microarchitecture supports all of the features of Intel Core microarchitecture and provides a comprehensive set of enhancements.
• Intel® Wide Dynamic Execution includes several enhancements:
— A radix-16 divider replacing the previous radix-4 based divider to speed up long-latency operations such as divisions and square roots.
— Improved system primitives to speed up long-latency operations such as RDTSC, STI, CLI, and VM exit transitions.
• Intel® Advanced Smart Cache provides up to 6 MBytes of second-level cache shared between two processor cores (quad-core processors have up to 12 MBytes of L2); up to 24-way set associativity.
• Intel® Smart Memory Access supports a high-speed system bus of up to 1600 MHz and provides more efficient handling of memory operations such as split cache line loads and store-to-load forwarding situations.
• Intel® Advanced Digital Media Boost provides a 128-bit shuffler unit to speed up shuffle, pack, unpack operations; adds support for 47 SSE4.1 instructions.
In the sub-sections of 2.4.x, most of the descriptions of Intel Core microarchitecture also apply to Enhanced Intel Core microarchitecture. Differences between them are noted explicitly.
2.4.1 Intel® Core™ Microarchitecture Pipeline Overview
The pipeline of the Intel Core microarchitecture contains:
• An in-order issue front end that fetches instruction streams from memory, with four instruction decoders to supply decoded instructions (micro-ops) to the out-of-order execution core.
• An out-of-order superscalar execution core that can issue up to six micro-ops per cycle (see Table 2-26) and reorder micro-ops to execute as soon as sources are ready and execution resources are available.
• An in-order retirement unit that ensures the results of execution of micro-ops are processed and architectural states are updated according to the original program order.
Intel Core 2 Extreme processor X6800, Intel Core 2 Duo processors and Intel Xeon processor 3000, 5100 series implement two processor cores based on the Intel Core microarchitecture. Intel Core 2 Extreme quad-core processor, Intel Core 2 Quad processors and Intel Xeon processor 3200 series, 5300 series implement four processor cores. Each physical package of these quad-core processors contains two processor dies, each die containing two processor cores. The functionality of the subsystems in each core is depicted in Figure 2-6.
[Figure: pipeline block diagram. Blocks shown: Instruction Fetch and PreDecode; Instruction Queue; Microcode ROM; Decode; Rename/Alloc; Retirement Unit (Re-Order Buffer); Scheduler; execution units labeled ALU/Branch/MMX/SSE/FP Move, ALU/FAdd/MMX/SSE, ALU/FMul/MMX/SSE, Load, and Store; L1D Cache and DTLB; Shared L2 Cache; FSB at up to 10.7 GB/s.]
Figure 2-6. Intel Core Microarchitecture Pipeline Functionality
2.4.2 Front End
The front end needs to supply decoded instructions (micro-ops) and sustain the stream to a six-issue wide out-of-order engine. The components of the front end, their functions, and the performance challenges to microarchitectural design are described in Table 2-25.
Table 2-25. Components of the Front End
Component: Branch Prediction Unit (BPU)
Functions:
• Helps the instruction fetch unit fetch the most likely instruction to be executed by predicting the various branch types: conditional, indirect, direct, call, and return. Uses dedicated hardware for each type.
Performance Challenges:
• Enables speculative execution.
• Improves speculative execution efficiency by reducing the amount of code in the “non-architected path” (1) to be fetched into the pipeline.
Component: Instruction Fetch Unit
Functions:
• Prefetches instructions that are likely to be executed.
• Caches frequently-used instructions.
• Predecodes and buffers instructions, maintaining a constant bandwidth despite irregularities in the instruction stream.
Performance Challenges:
• Variable length instruction format causes unevenness (bubbles) in decode bandwidth.
• Taken branches and misaligned targets cause disruptions in the overall bandwidth delivered by the fetch unit.
Component: Instruction Queue and Decode Unit
Functions:
• Decodes up to four instructions, or up to five with macro-fusion.
• Stack pointer tracker algorithm for efficient procedure entry and exit.
• Implements the Macro-Fusion feature, providing higher performance and efficiency.
• The Instruction Queue is also used as a loop cache, enabling some loops to be executed with both higher bandwidth and lower power.
Performance Challenges:
• Varying amounts of work per instruction require expansion into variable numbers of micro-ops.
• Prefix adds a dimension of decoding complexity.
• Length Changing Prefix (LCP) can cause front end bubbles.
NOTES:
1. Code paths that the processor thought it should execute but then found out it should go in another path and therefore reverted from its initial intention.
2.4.2.1 Branch Prediction Unit
Branch prediction enables the processor to begin executing instructions long before the branch outcome is decided. All branches utilize the BPU for prediction. The BPU contains the following features:
• 16-entry Return Stack Buffer (RSB). It enables the BPU to accurately predict RET instructions.
• Front end queuing of BPU lookups. The BPU makes branch predictions for 32 bytes at a time, twice the width of the fetch engine. This enables taken branches to be predicted with no penalty. Even though this BPU mechanism generally eliminates the penalty for taken branches, software should still regard taken branches as consuming more resources than do not-taken branches.
The BPU makes the following types of predictions:
• Direct Calls and Jumps. Targets are read as a target array, without regarding the taken or not-taken prediction.
• Indirect Calls and Jumps. These may either be predicted as having a monotonic target or as having targets that vary in accordance with recent program behavior.
• Conditional branches. Predicts the branch target and whether or not the branch will be taken.
For information about optimizing software for the BPU, see Section 3.4, “Optimizing the Front End.”
2.4.2.2 Instruction Fetch Unit
The instruction fetch unit comprises the instruction translation lookaside buffer (ITLB), an instruction prefetcher, the instruction cache and the predecode logic of the instruction queue (IQ).
Instruction Cache and ITLB
An instruction fetch is a 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers. A hit in the instruction cache causes 16 bytes to be delivered to the instruction predecoder. Typical programs average slightly less than 4 bytes per instruction, depending on the code being executed. Since most instructions can be decoded by all decoders, an entire fetch can often be consumed by the decoders in one cycle.
A misaligned target reduces the number of instruction bytes by the amount of offset into the 16-byte fetch quantity. A taken branch reduces the number of instruction bytes delivered to the decoders since the bytes after the taken branch are not decoded. Branches are taken approximately every 10 instructions in typical integer code, which translates into a “partial” instruction fetch every 3 or 4 cycles.
Due to stalls in the rest of the machine, front end starvation does not usually cause performance degradation. For extremely fast code with larger instructions (such as SSE2 integer media kernels), it may be beneficial to use targeted alignment to prevent instruction starvation.
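As a rough illustration of the alignment effect described above, the following sketch computes how many instruction bytes of a 16-byte aligned fetch remain usable when a branch target lands at a given address. The function name is hypothetical, and the sketch assumes that every byte from the target to the end of the fetch line is usable; it is not a model of the actual fetch hardware.

```c
#include <stdint.h>

/* Hypothetical helper (not from the manual): number of bytes that the first
 * 16-byte aligned fetch can deliver when execution enters at `target`.
 * Assumption: the fetch line ends at the next 16-byte boundary. */
static unsigned useful_fetch_bytes(uint64_t target)
{
    const unsigned fetch_size = 16;                       /* fetch width stated in the text */
    unsigned offset = (unsigned)(target & (fetch_size - 1));
    return fetch_size - offset;                           /* bytes lost equal the offset */
}
```

Under these assumptions, a target aligned to a 16-byte boundary yields a full 16-byte fetch, while a target at offset 12 within the fetch line yields only 4 bytes.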
Instruction PreDecode
The predecode unit accepts the sixteen bytes from the instruction cache or prefetch buffers and carries out the following tasks:
• Determine the length of the instructions.
• Decode all prefixes associated with instructions.
• Mark various properties of instructions for the decoders (for example, “is branch”).
The predecode unit can write up to six instructions per cycle into the instruction queue. If a fetch contains more than six instructions, the predecoder continues to decode up to six instructions per cycle until all instructions in the fetch are written to the instruction queue. Subsequent fetches can only enter predecoding after the current fetch completes.
For a fetch of seven instructions, the predecoder decodes the first six in one cycle, and then only one in the next cycle. This process would support decoding 3.5 instructions per cycle. Even if the instructions per cycle (IPC) rate is not fully optimized, it is higher than the performance seen in most applications. In general, software usually does not have to take any extra measures to prevent instruction starvation.
The following instruction prefixes cause problems during length decoding. These prefixes can dynamically change the length of instructions and are known as length changing prefixes (LCPs):
• Operand Size Override (66H) preceding an instruction with a word immediate data.
• Address Size Override (67H) preceding an instruction with a mod R/M in real, 16-bit protected or 32-bit protected modes.
When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal queuing within the processor pipeline usually cannot hide LCP penalties.
The REX prefix (4xh) in the Intel 64 architecture instruction set can change the size of two classes of instruction: MOV offset and MOV immediate. Nevertheless, it does not cause an LCP penalty and hence is not considered an LCP.
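The predecode arithmetic above can be sketched as a small cost model. The function names are hypothetical, and the treatment of an LCP as a flat 6-cycle cost for the whole fetch (regardless of instruction count) is an assumption drawn from the sentence above, not a statement about the hardware's exact behavior.

```c
/* Hypothetical cost model (not from the manual) for predecoding one fetch.
 * n_instr: number of instructions in the 16-byte fetch.
 * has_lcp: nonzero if the fetch line contains a length-changing prefix. */
static unsigned predecode_cycles(unsigned n_instr, int has_lcp)
{
    if (has_lcp)
        return 6;                      /* slow length-decoding path (assumed flat cost) */
    return (n_instr + 5) / 6;          /* up to six instructions written per cycle */
}

/* Average instructions predecoded per cycle for a fetch without an LCP. */
static double predecode_rate(unsigned n_instr)
{
    unsigned cycles = predecode_cycles(n_instr, 0);
    return cycles ? (double)n_instr / (double)cycles : 0.0;
}
```

With these assumptions, a seven-instruction fetch takes two cycles, which gives the 3.5 instructions per cycle quoted in the text.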
2.4.2.3 Instruction Queue (IQ)
The instruction queue is 18 instructions deep. It sits between the instruction predecode unit and the instruction decoders. It sends up to five instructions per cycle, and supports one macro-fusion per cycle. It also serves as a loop cache for loops smaller than 18 instructions. The loop cache operates as described below.
A Loop Stream Detector (LSD) resides in the BPU. The LSD attempts to detect loops which are candidates for streaming from the instruction queue (IQ). When such a loop is detected, the instruction bytes are locked down and the loop is allowed to stream from the IQ until a misprediction ends it. When the loop plays back from the IQ, it provides higher bandwidth at reduced power (since much of the rest of the front end pipeline is shut off). The LSD provides the following benefits:
• No loss of bandwidth due to taken branches.
• No loss of bandwidth due to misaligned instructions.
• No LCP penalties, as the pre-decode stage has already been passed.
• Reduced front end power consumption, because the instruction cache, BPU and predecode unit can be idle.
Software should use the loop cache functionality opportunistically. Loop unrolling and other code optimizations may make the loop too big to fit into the LSD. For high performance code, loop unrolling is generally preferable for performance even when it overflows the loop cache capability.
2.4.2.4 Instruction Decode
The Intel Core microarchitecture contains four instruction decoders. The first, Decoder 0, can decode Intel 64 and IA-32 instructions up to 4 micro-ops in size. Three other decoders handle single micro-op instructions. The microsequencer can provide up to 3 micro-ops per cycle, and helps decode instructions larger than 4 micro-ops.
All decoders support the common cases of single micro-op flows, including micro-fusion, stack pointer tracking and macro-fusion. Thus, the three simple decoders are not limited to decoding single micro-op instructions. Packing instructions into a 4-1-1-1 template is not necessary and not recommended.
Macro-fusion merges two instructions into a single micro-op. Intel Core microarchitecture is capable of one macro-fusion per cycle in 32-bit operation (including compatibility sub-mode of the Intel 64 architecture), but not in 64-bit mode, because code that more often uses longer instructions (length in bytes) is less likely to take advantage of hardware support for macro-fusion.
2.4.2.5 Stack Pointer Tracker
The Intel 64 and IA-32 architectures have several commonly used instructions for parameter passing and procedure entry and exit: PUSH, POP, CALL, LEAVE and RET. These instructions implicitly update the stack pointer register (RSP), maintaining a combined control and parameter stack without software intervention. These instructions were typically implemented by several micro-ops in previous microarchitectures.
The Stack Pointer Tracker moves all these implicit RSP updates to logic contained in the decoders themselves. The feature provides the following benefits:
• Improves decode bandwidth, as PUSH, POP and RET are single micro-op instructions in Intel Core microarchitecture.
• Conserves execution bandwidth as the RSP updates do not compete for execution resources.
• Improves parallelism in the out-of-order execution engine as the implicit serial dependencies between micro-ops are removed.
• Improves power efficiency as the RSP updates are carried out on small, dedicated hardware.
2.4.2.6 Micro-fusion
Micro-fusion fuses multiple micro-ops from the same instruction into a single complex micro-op. The complex micro-op is dispatched in the out-of-order execution core. Micro-fusion provides the following performance advantages:
• Improves instruction bandwidth delivered from decode to retirement.
• Reduces power consumption as the complex micro-op represents more work in a smaller format (in terms of bit density), reducing overall “bit-toggling” in the machine for a given amount of work and virtually increasing the amount of storage in the out-of-order execution engine.
Many instructions provide register flavors and memory flavors. The flavor involving a memory operand decodes into a longer flow of micro-ops than the register version. Micro-fusion enables software to use memory-to-register operations to express the actual program behavior without worrying about a loss of decode bandwidth.
2.4.3 Execution Core
The execution core of the Intel Core microarchitecture is superscalar and can process instructions out of order. When a dependency chain causes the machine to wait for a resource (such as a second-level data cache line), the execution core executes other instructions. This increases the overall rate of instructions executed per cycle (IPC). The execution core contains the following three major components:
• Renamer — Moves micro-ops from the front end to the execution core. Architectural registers are renamed to a larger set of microarchitectural registers. Renaming eliminates false dependencies known as write-after-write and write-after-read hazards.
• Reorder buffer (ROB) — Holds micro-ops in various stages of completion, buffers completed micro-ops, updates the architectural state in order, and manages ordering of exceptions. The ROB has 96 entries to handle instructions in flight.
• Reservation station (RS) — Queues micro-ops until all source operands are ready, schedules and dispatches ready micro-ops to the available execution units. The RS has 32 entries.
The initial stages of the out-of-order core move the micro-ops from the front end to the ROB and RS. In this process, the out-of-order core carries out the following steps:
• Allocates resources to micro-ops (for example, these resources could be load or store buffers).
• Binds the micro-op to an appropriate issue port.
• Renames sources and destinations of micro-ops, enabling out-of-order execution.
• Provides data to the micro-op when the data is either an immediate value or a register value that has already been calculated.
The following list describes various types of common operations and how the core executes them efficiently:
• Micro-ops with single-cycle latency — Most micro-ops with single-cycle latency can be executed by multiple execution units, enabling multiple streams of dependent operations to be executed quickly.
• Frequently-used µops with longer latency — These micro-ops have pipelined execution units so that multiple micro-ops of these types may be executing in different parts of the pipeline simultaneously.
• Operations with data-dependent latencies — Some operations, such as division, have data-dependent latencies. Integer division parses the operands to perform the calculation only on significant portions of the operands, thereby speeding up common cases of dividing by small numbers.
• Floating-point operations with fixed latency for operands that meet certain restrictions — Operands that do not fit these restrictions are considered exceptional cases and are executed with higher latency and reduced throughput. The lower-throughput cases do not affect latency and throughput for more common cases.
• Memory operands with variable latency, even in the case of an L1 cache hit — Loads that are not known to be safe from forwarding may wait until a store-address is resolved before executing. The memory order buffer (MOB) accepts and processes all memory operations. See Section 2.4.4 for more information about the MOB.
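One way software can benefit from pipelined execution units for longer-latency micro-ops is to keep several independent dependency chains in flight. The sketch below sums an array with three separate accumulators, so that successive floating-point adds do not have to wait on one another. The function name is hypothetical, and the choice of three accumulators is an assumption based on the 3-cycle latency and 1-cycle throughput listed for FP ADD in Table 2-26. Reassociating floating-point additions this way can change rounding, and a compiler may restructure the loop.

```c
#include <stddef.h>

/* Hypothetical example (not from the manual): sum with three independent
 * accumulators so that up to three FP adds can overlap in a pipelined adder. */
static double sum3(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0;
    size_t i = 0;

    /* Main loop: each accumulator forms its own dependency chain. */
    for (; i + 3 <= n; i += 3) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
    }
    /* Remainder: fewer than three elements left. */
    for (; i < n; ++i)
        s0 += a[i];

    return (s0 + s1) + s2;
}
```

A single-accumulator loop would instead serialize every add behind the previous one, so its speed would be bounded by the add latency rather than the add throughput.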
2.4.3.1 Issue Ports and Execution Units
The scheduler can dispatch up to six micro-ops per cycle through the issue ports. The issue ports of Intel Core microarchitecture and Enhanced Intel Core microarchitecture are depicted in Table 2-26. The former is denoted by its CPUID signature of DisplayFamily_DisplayModel value of 06_0FH, the latter by the corresponding signature value of 06_17H. The table provides latency and throughput data, in cycles, of common integer and floating-point (FP) operations for each issue port.
Table 2-26. Issue Ports of Intel Core Microarchitecture and Enhanced Intel Core Microarchitecture
Executable operations           Latency, Throughput     Latency, Throughput     Comment (1)
                                Signature = 06_0FH      Signature = 06_17H
Integer ALU                     1, 1                    1, 1                    Includes 64-bit mode integer MUL; Issue port 0; Writeback port 0
Integer SIMD ALU                1, 1                    1, 1
FP/SIMD/SSE2 Move and Logic     1, 1                    1, 1
Single-precision (SP) FP MUL    4, 1                    4, 1                    Issue port 0; Writeback port 0
Double-precision FP MUL         5, 1                    5, 1
FP MUL (X87)                    5, 2                    5, 2                    Issue port 0; Writeback port 0
FP Shuffle                      1, 1                    1, 1                    FP shuffle does not handle QW shuffle.
DIV/SQRT
Integer ALU                     1, 1                    1, 1                    Excludes 64-bit mode integer MUL; Issue port 1; Writeback port 1
Integer SIMD ALU                1, 1                    1, 1
FP/SIMD/SSE2 Move and Logic     1, 1                    1, 1
FP ADD                          3, 1                    3, 1                    Issue port 1; Writeback port 1
QW Shuffle                      1, 1 (2)                1, 1 (3)
Integer loads                   3, 1                    3, 1                    Issue port 2; Writeback port 2
FP loads                        4, 1                    4, 1
Store address (4)               3, 1                    3, 1                    Issue port 3
Store data (5)                                                                  Issue port 4
Integer ALU                     1, 1                    1, 1                    Issue port 5; Writeback port 5
Integer SIMD ALU                1, 1                    1, 1
FP/SIMD/SSE2 Move and Logic     1, 1                    1, 1
QW shuffles                     1, 1 (2)                1, 1 (3)                Issue port 5; Writeback port 5
128-bit Shuffle/Pack/Unpack     2-4, 2-4 (6)            1-3, 1 (7)
NOTES:
1. Mixing operations of different latencies that use the same port can result in writeback bus conflicts; this can reduce overall throughput.
2. 128-bit instructions execute with longer latency and reduced throughput.
3. Uses 128-bit shuffle unit in port 5.
4. Prepares the store forwarding and store retirement logic with the address of the data being stored.
5. Prepares the store forwarding and store retirement logic with the data being stored.
6. Varies with instructions; 128-bit instructions are executed using QW shuffle units.
7. Varies with instructions; 128-bit shuffle unit replaces QW shuffle units in Intel Core microarchitecture.
In each cycle, the RS can dispatch up to six micro-ops. Each cycle, up to 4 results may be written back to the RS and ROB, to be used as early as the next cycle by the RS. This high execution bandwidth enables execution bursts to keep up with the functional expansion of the micro-fused micro-ops that are decoded and retired.
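The DisplayFamily_DisplayModel signatures used to label the columns of Table 2-26 are derived from the value that CPUID leaf 01H returns in EAX. The sketch below decodes such a value following the architectural bit layout (family in bits 11:8, model in bits 7:4, extended model in bits 19:16, extended family in bits 27:20). The struct and function names are hypothetical, and the sketch takes the EAX value as a parameter instead of executing CPUID so that it stays portable.

```c
#include <stdint.h>

/* Hypothetical container for a decoded signature. */
struct cpu_signature {
    unsigned family;   /* DisplayFamily */
    unsigned model;    /* DisplayModel  */
};

/* Decode DisplayFamily and DisplayModel from a CPUID.01H:EAX value. */
static struct cpu_signature decode_signature(uint32_t eax)
{
    unsigned family_id  = (eax >> 8)  & 0xF;
    unsigned model_id   = (eax >> 4)  & 0xF;
    unsigned ext_family = (eax >> 20) & 0xFF;
    unsigned ext_model  = (eax >> 16) & 0xF;
    struct cpu_signature s;

    /* The extended family is added only when the base family is 0FH. */
    s.family = (family_id == 0xF) ? family_id + ext_family : family_id;
    /* The extended model is used only for base family 06H or 0FH. */
    s.model = (family_id == 0x6 || family_id == 0xF)
                  ? (ext_model << 4) + model_id
                  : model_id;
    return s;
}
```

For instance, a sample EAX value of 000006FBH decodes to 06_0FH and a sample value of 00010676H decodes to 06_17H; these inputs are chosen only to exercise the two signatures named in the table.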
The execution core contains the following three execution stacks:
• SIMD integer.
• Regular integer.
• x87/SIMD floating-point.
The execution core also contains connections to and from the memory cluster. See Figure 2-7.
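[Figure: execution core block diagram. Blocks shown: EXE stacks for SIMD Integer, Integer, Integer/SIMD MUL and Floating Point, each labeled with ports 0, 1, 5; Data Cache Unit with DTLB, memory ordering and store forwarding; Load (port 2), Store (address) (port 3) and Store (data) (port 4).]
Figure 2-7. Execution Core of Intel Core Microarchitecture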
Notice the two dark squares inside the execution block (in grey) that appear in the path connecting the integer and SIMD integer stacks to the floating-point stack. The dark-colored squares in Figure 2-7 represent an extra cycle of latency, called a bypass delay. Data from the L1 cache also has one extra cycle of latency to the floating-point unit.
2.4.4 Intel® Advanced Memory Access
The Intel Core microarchitecture contains an instruction cache and a first-level data cache in each core. The two cores share a 2 or 4-MByte L2 cache. All caches are writeback and non-inclusive. Each core contains:
• L1 data cache, known as the data cache unit (DCU) — The DCU can handle multiple outstanding cache misses and continue to service incoming stores and loads. It supports maintaining cache coherency. The DCU has the following specifications:
— 32-KBytes size.
— 8-way set associative.
— 64-bytes line size.
• Data translation lookaside buffer (DTLB) — The DTLB in Intel Core microarchitecture implements two levels of hierarchy. Each level of the DTLB has multiple entries and can support either 4-KByte pages or large pages. The entries of the inner level (DTLB0) are used for loads. The entries in the outer level (DTLB1) support store operations and loads that missed DTLB0. All entries are 4-way associative. Here is a list of entries in each DTLB:
— DTLB1 for large pages: 32 entries.
— DTLB1 for 4-KByte pages: 256 entries.
— DTLB0 for large pages: 16 entries.
— DTLB0 for 4-KByte pages: 16 entries.
A DTLB0 miss and DTLB1 hit causes a penalty of 2 cycles. Software only pays this penalty if the DTLB0 is used in some dispatch cases. The delays associated with a miss to the DTLB1 and PMH are largely non-blocking due to the design of Intel Smart Memory Access.
• Page miss handler (PMH)
• A memory ordering buffer (MOB) — Which:
— Enables loads and stores to issue speculatively and out of order.
— Ensures retired loads and stores have the correct data upon retirement.
— Ensures loads and stores follow memory ordering rules of the Intel 64 and IA-32 architectures.
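The DTLB entry counts listed above translate directly into the amount of memory each level can map at once. The helper below, a hypothetical illustration, multiplies entries by page size; the term "reach" and the function name are not from this section. The large-page size is passed as a parameter because this section does not state it.

```c
#include <stddef.h>

/* Hypothetical helper: bytes of address space covered by a TLB level
 * that holds `entries` translations for pages of `page_bytes` each. */
static size_t tlb_reach(size_t entries, size_t page_bytes)
{
    return entries * page_bytes;
}
```

With 4-KByte pages, the 16 DTLB0 entries cover 64 KBytes and the 256 DTLB1 entries cover 1 MByte, so a load working set spread over more than 16 distinct 4-KByte pages can be expected to miss DTLB0 at least occasionally.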
The memory cluster of the Intel Core microarchitecture uses the following to speed up memory operations:
• 128-bit load and store operations.
• Data prefetching to L1 caches.
• Data prefetch logic for prefetching to the L2 cache.
• Store forwarding.
• Memory disambiguation.
• 8 fill buffer entries.
• 20 store buffer entries.
• Out-of-order execution of memory operations.
• Pipelined read-for-ownership operation (RFO).
For information on optimizing software for the memory cluster, see Section 3.6, “Optimizing Memory Accesses.”
2.4.4.1 Loads and Stores
The Intel Core microarchitecture can execute up to one 128-bit load and up to one 128-bit store per cycle, each to different memory locations. The microarchitecture enables execution of memory operations out of order with respect to other instructions and with respect to other memory operations.
Loads can:
• Issue before preceding stores when the load address and store address are known not to conflict.
• Be carried out speculatively, before preceding branches are resolved.
• Take cache misses out of order and in an overlapped manner.
• Issue before preceding stores, speculating that the store is not going to be to a conflicting address.
Loads cannot:
• Speculatively take any sort of fault or trap.
• Speculatively access the uncacheable memory type.
Faulting or uncacheable loads are detected and wait until retirement, when they update the programmer-visible state. x87 and floating-point SIMD loads add 1 additional clock of latency.
Stores to memory are executed in two phases:
• Execution phase — Prepares the store buffers with address and data for store forwarding. Consumes dispatch ports, which are ports 3 and 4.
• Completion phase — The store is retired to programmer-visible memory. It may compete for cache banks with executing loads. Store retirement is maintained as a background task by the memory order buffer, moving the data from the store buffers to the L1 cache.
2.4.4.2 Data Prefetch to L1 caches
Intel Core microarchitecture provides two hardware prefetchers to speed up data accessed by a program by prefetching to the L1 data cache:
• Data cache unit (DCU) prefetcher — This prefetcher, also known as the streaming prefetcher, is triggered by an ascending access to very recently loaded data. The processor assumes that this access is part of a streaming algorithm and automatically fetches the next line.
• Instruction pointer (IP)-based strided prefetcher — This prefetcher keeps track of individual load instructions. If a load instruction is detected to have a regular stride, then a prefetch is sent to the next address, which is the sum of the current address and the stride. This prefetcher can prefetch forward or backward and can detect strides of up to half of a 4-KByte page, or 2 KBytes.
Data prefetching works on loads only when the following conditions are met:
• Load is from writeback memory type.
• Prefetch request is within the page boundary of 4 KBytes.
• No fence or lock is in progress in the pipeline.
• Not many other load misses are in progress.
• The bus is not very busy.
• There is not a continuous stream of stores.
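The address arithmetic of the IP-based strided prefetcher, together with the 4-KByte page condition from the list above, can be sketched as follows. This is an illustrative model with hypothetical names; it covers only the stride limit and the page-boundary check, and ignores the other conditions (memory type, fences, outstanding misses, bus load, store streams).

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical model: given the current load address and a detected stride,
 * report whether a prefetch would be eligible and, if so, its address.
 * Assumptions: strides of up to 2 KBytes in either direction qualify, and
 * the prefetch must stay within the same 4-KByte page as the load. */
static bool ip_stride_prefetch(uint64_t addr, int64_t stride,
                               uint64_t *prefetch_addr)
{
    const int64_t  max_stride = 2048;
    const uint64_t page_mask  = ~(uint64_t)4095;
    uint64_t next;

    if (stride == 0 || stride > max_stride || stride < -max_stride)
        return false;                       /* no usable stride */

    next = addr + (uint64_t)stride;         /* current address plus stride */
    if ((next & page_mask) != (addr & page_mask))
        return false;                       /* would cross the page boundary */

    *prefetch_addr = next;
    return true;
}
```

Under this model, a load at 1000H with a 64-byte stride yields a prefetch to 1040H, while a load at 1FC0H with a 128-byte stride yields none because the next address falls in the following page.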
DCU Prefetching has the following effects:
• Improves performance if data in large structures is arranged sequentially in the order used in the program.
• May cause slight performance degradation due to bandwidth issues if access patterns are sparse instead of local.
• On rare occasions, if the algorithm's working set is tuned to occupy most of the cache and unneeded prefetches evict lines required by the program, the hardware prefetcher may cause severe performance degradation due to the limited capacity of the L1 cache.
In contrast to hardware prefetchers, which rely on hardware to anticipate data traffic, software prefetch instructions rely on the programmer to anticipate cache miss traffic. Software prefetch instructions act as hints to bring a cache line of data into the desired levels of the cache hierarchy. The software-controlled prefetch is intended for prefetching data, but not for prefetching code.
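A minimal sketch of a software prefetch hint in C follows. It uses the `__builtin_prefetch` builtin available in GCC and Clang, which is a compiler facility and not something this section prescribes; on other compilers the macro expands to nothing. The macro name, the function name and the prefetch distance of 16 elements are all assumptions for illustration; a suitable distance depends on the workload and should be measured.

```c
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch with high temporal locality (GCC/Clang builtin). */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)   /* no-op fallback on other compilers */
#endif

enum { PREFETCH_DISTANCE = 16 };     /* elements ahead; hypothetical value */

/* Hypothetical example: sum an array while hinting at data needed soon. */
static long sum_with_prefetch(const int *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + PREFETCH_DISTANCE < n)
            PREFETCH_READ(&a[i + PREFETCH_DISTANCE]);  /* hint only */
        total += a[i];
    }
    return total;
}
```

Because the prefetch is only a hint, the result is the same with or without it. For a simple sequential walk like this one the hardware prefetchers described above may already cover the access pattern, so the example shows the mechanism and does not imply a speedup.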
2.4.4.3 Data Prefetch Logic
Data prefetch logic (DPL) prefetches data to the second-level (L2) cache based on past request patterns of the DCU from the L2. The DPL maintains two independent arrays to store addresses from the DCU: one for upstreams (12 entries) and one for downstreams (4 entries). The DPL tracks accesses to one 4-KByte page in each entry. If an accessed page is not in any of these arrays, then an array entry is allocated.
The DPL monitors DCU reads for incremental sequences of requests, known as streams. Once the DPL detects the second access of a stream, it prefetches the next cache line. For example, when the DCU requests the cache lines A and A+1, the DPL assumes the DCU will need cache line A+2 in the near future. If the DCU then reads A+2, the DPL prefetches cache line A+3. The DPL works similarly for “downward” loops.
The Intel Pentium M processor introduced DPL. The Intel Core microarchitecture added the following features to DPL:
• The DPL can detect more complicated streams, such as when the stream skips cache lines. DPL may issue 2 prefetch requests on every L2 lookup. The DPL in the Intel Core microarchitecture can run up to 8 lines ahead from the load request.
• DPL in the Intel Core microarchitecture adjusts dynamically to bus bandwidth and the number of requests. DPL prefetches far ahead if the bus is not busy, and less far ahead if the bus is busy.
• DPL adjusts to various applications and system configurations.
Entries for the two cores are handled separately.
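The basic stream detection described in this section (requests for cache lines A and A+1 trigger a prefetch of A+2) can be sketched as a toy model. The struct and function names are hypothetical. The model tracks a single upward stream in one entry and deliberately ignores per-page allocation, downward streams, skipped lines, multiple prefetches per lookup, and the bandwidth-based adjustment listed above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical single-entry tracker for one ascending stream. */
struct dpl_entry {
    bool     valid;       /* has any request been observed yet */
    uint64_t last_line;   /* cache-line number of the most recent request */
};

/* Observe a DCU request for cache line `line`. Returns true and stores the
 * line to prefetch when the request continues an ascending stream, that is,
 * when it is the second or later consecutive line. */
static bool dpl_observe(struct dpl_entry *e, uint64_t line,
                        uint64_t *prefetch_line)
{
    bool continues = e->valid && line == e->last_line + 1;

    e->valid = true;
    e->last_line = line;
    if (continues)
        *prefetch_line = line + 1;   /* next line in the stream */
    return continues;
}
```

Feeding the model lines 100, 101, 102 produces no prefetch on the first request, then prefetches of 102 and 103, matching the A, A+1, A+2 example in the text.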
2.4.4.4
Store Forwarding
I f a load follows a st ore and reloads t he dat a t hat t he st ore writ es t o m em ory, t he I nt el Core m icroarchit ect ure can forward t he dat a direct ly from t he st ore t o t he load. This process, called st ore t o load forwarding, saves cycles by enabling t he load t o obt ain t he dat a direct ly from t he st ore operat ion inst ead of t hrough m em ory. The following rules m ust be m et for st ore t o load forwarding t o occur:
• The store must be the last store to that address prior to the load.
• The store must be equal or greater in size than the size of data being loaded.
• The load cannot cross a cache line boundary.
• The load cannot cross an 8-Byte boundary. 16-Byte loads are an exception to this rule.
• The load must be aligned to the start of the store address, except for the following exceptions:
  — An aligned 64-bit store may forward either of its 32-bit halves.
  — An aligned 128-bit store may forward any of its 32-bit quarters.
  — An aligned 128-bit store may forward either of its 64-bit halves.
Software can use the exceptions to the last rule to move complex structures without losing the ability to forward the subfields.
In Enhanced Intel Core microarchitecture, the alignment restrictions on store forwarding have been relaxed. Enhanced Intel Core microarchitecture permits store forwarding to proceed in several situations where the succeeding load is not aligned to the preceding store. Figure 2-8 shows six situations (in gradient-filled background) of store forwarding that are permitted in Enhanced Intel Core microarchitecture but not in Intel Core microarchitecture. The cases with backward slash background depict store forwarding that can proceed in both Intel Core microarchitecture and Enhanced Intel Core microarchitecture.
[Figure: 32-bit and 64-bit stores shown against 64-bit, 32-bit, 16-bit and 8-bit loads at byte offsets 0 through 7 relative to 8-byte boundaries, including examples of 7-byte and 1-byte misalignment. Legend: store-forwarding (SF) cannot proceed; SF proceeds in Enhanced Intel Core microarchitecture; SF proceeds.]
Figure 2-8. Store-Forwarding Enhancements in Enhanced Intel Core Microarchitecture
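The Intel Core microarchitecture rules listed earlier in this section can be written down as a small predicate. The sketch below is an illustration, not a model of the hardware: the function name `can_forward_intel_core` is hypothetical, addresses and sizes are in bytes, a 64-byte cache line is assumed, and the first rule (the store is the last store to that address) is assumed to hold already. The relaxed Enhanced Intel Core cases of Figure 2-8 are not modeled.

```python
CACHE_LINE_BYTES = 64

# (store size, load size) pairs covered by the listed alignment exceptions:
# 64-bit store -> 32-bit halves, 128-bit store -> 32-bit quarters or 64-bit halves.
SUBFIELD_EXCEPTIONS = {(8, 4), (16, 4), (16, 8)}


def crosses_boundary(address, size, boundary):
    """True if [address, address + size) spans a multiple of `boundary`."""
    return address // boundary != (address + size - 1) // boundary


def can_forward_intel_core(store_addr, store_size, load_addr, load_size):
    """Apply the listed rules; assumes the store is the last one to that address."""
    if store_size < load_size:
        return False
    if crosses_boundary(load_addr, load_size, CACHE_LINE_BYTES):
        return False
    if load_size != 16 and crosses_boundary(load_addr, load_size, 8):
        return False
    if load_addr == store_addr:
        return True  # load aligned to the start of the store
    offset = load_addr - store_addr
    store_is_aligned = store_addr % store_size == 0
    return (store_is_aligned
            and (store_size, load_size) in SUBFIELD_EXCEPTIONS
            and 0 < offset < store_size
            and offset % load_size == 0)


# Upper 32-bit half of an aligned 64-bit store, then a 16-bit load from its middle.
print(can_forward_intel_core(0x1000, 8, 0x1004, 4))
print(can_forward_intel_core(0x1000, 8, 0x1002, 2))
```

A predicate like this can serve as a quick check when laying out structure fields that are written with one wide store and read back piecewise.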
2.4.4.5
Memory Disambiguation
A load instruction micro-op may depend on a preceding store. Many microarchitectures block loads until all preceding store addresses are known. The memory disambiguator predicts which loads will not depend on any previous stores. When the disambiguator predicts that a load does not have such a dependency, the load takes its data from the L1 data cache. Eventually, the prediction is verified. If an actual conflict is detected, the load and all succeeding instructions are re-executed.
2.4.5
Intel® Advanced Smart Cache
The Intel Core microarchitecture optimized a number of features for two processor cores on a single die. The two cores share a second-level cache and a bus interface unit, collectively known as Intel Advanced Smart Cache. This section describes the components of Intel Advanced Smart Cache. Figure 2-9 illustrates the architecture of the Intel Advanced Smart Cache.
[Figure: Core 0 and Core 1, each with Branch Prediction, Fetch/Decode, Execution, Retirement, L1 Data Cache and L1 Instr. Cache, sharing an L2 Cache and a Bus Interface Unit connected to the System Bus.]
Figure 2-9. Intel Advanced Smart Cache Architecture
Table 2-27 details the parameters of caches in the Intel Core microarchitecture. For information on enumerating the cache hierarchy identification using the deterministic cache parameter leaf of CPUID instruction, see the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.
Table 2-27. Cache Parameters of Processors based on Intel Core Microarchitecture
Level                              Capacity      Associativity (ways)  Line Size (bytes)  Access Latency (clocks)  Access Throughput (clocks)  Write Update Policy
First Level                        32 KB         8                     64                 3                        1                           Writeback
Instruction                        32 KB         8                     N/A                N/A                      N/A                         N/A
Second Level (Shared L2) (note 1)  2, 4 MB       8 or 16               64                 14 (note 2)              2                           Writeback
Second Level (Shared L2) (note 3)  3, 6 MB       12 or 24              64                 15 (note 2)              2                           Writeback
Third Level (note 4)               8, 12, 16 MB  16                    64                 ~110                     12                          Writeback
NOTES:
1. Intel Core microarchitecture (CPUID signature DisplayFamily = 06H, DisplayModel = 0FH).
2. Software-visible latency will vary depending on access patterns and other factors.
3. Enhanced Intel Core microarchitecture (CPUID signature DisplayFamily = 06H, DisplayModel = 17H or 1DH).
4. Enhanced Intel Core microarchitecture (CPUID signature DisplayFamily = 06H, DisplayModel = 1DH).
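The capacity, associativity and line size columns of Table 2-27 are related by the usual set-associative identity, sets = capacity / (ways × line size). The table does not state this relationship or the number of sets; the sketch below applies the general formula under the assumptions that capacities use binary units (1 KB = 1024 bytes) and that the paired entries correspond in order (2 MB with 8 ways, 4 MB with 16 ways, and so on). The helper name `num_sets` is hypothetical.

```python
KB = 1024
MB = 1024 * KB


def num_sets(capacity_bytes, ways, line_bytes=64):
    """Number of sets in a set-associative cache: capacity / (ways * line size)."""
    sets, remainder = divmod(capacity_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("capacity is not a whole number of sets")
    return sets


print(num_sets(32 * KB, 8))   # first-level data cache
print(num_sets(2 * MB, 8))    # shared L2, 2 MB / 8 ways (assumed pairing)
print(num_sets(6 * MB, 24))   # shared L2, 6 MB / 24 ways (assumed pairing)
```

Under these assumptions the L2 variants differ in associativity rather than in the number of sets, which matters when reasoning about which addresses compete for the same set.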
2.4.5.1
Loads
When an instruction reads data from a memory location that has write-back (WB) type, the processor looks for the cache line that contains this data in the caches and memory in the following order:
1. DCU of the initiating core.
2. DCU of the other core and second-level cache.
3. System memory.
The cache line is taken from the DCU of the other core only if it is modified, ignoring the cache line availability or state in the L2 cache.
Table 2-28 shows the characteristics of fetching the first four bytes of different localities from the memory cluster. The latency column provides an estimate of access latency. However, the actual latency can vary depending on the load of cache, memory components, and their parameters.
Table 2-28. Characteristics of Load and Store Operations in Intel Core Microarchitecture
Data Locality                            Load Latency                  Load Throughput               Store Latency                 Store Throughput
DCU                                      3                             1                             2                             1
DCU of the other core in modified state  14 + 5.5 bus cycles           14 + 5.5 bus cycles           14 + 5.5 bus cycles
2nd-level cache                          14                            3                             14                            3
Memory                                   14 + 5.5 bus cycles + memory  Depends on bus read protocol  14 + 5.5 bus cycles + memory  Depends on bus write protocol
Sometimes a modified cache line has to be evicted to make space for a new cache line. The modified cache line is evicted in parallel to bringing the new data and does not require additional latency. However, when data is written back to memory, the eviction uses cache bandwidth and possibly bus bandwidth as well. Therefore, when multiple cache misses require the eviction of modified lines within a short time, there is an overall degradation in cache response time.
2.4.5.2
Stores
When an instruction writes data to a memory location that has WB memory type, the processor first ensures that the line is in Exclusive or Modified state in its own DCU. The processor looks for the cache line in the following locations, in the specified order:
1. DCU of initiating core.
2. DCU of the other core and L2 cache.
3. System memory.
The cache line is taken from the DCU of the other core only if it is modified, ignoring the cache line availability or state in the L2 cache. After reading for ownership is completed, the data is written to the first-level data cache and the line is marked as modified.
Reading for ownership and storing the data happens after instruction retirement and follows the order of retirement. Therefore, the store latency does not affect the store instruction itself. However, several sequential stores may have cumulative latency that can affect performance. Table 2-28 presents store latencies depending on the initial cache line location.
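Several Table 2-28 entries combine two clock domains, for example "14 + 5.5 bus cycles". A rough way to compare them with the pure core-clock entries is to convert the bus cycles using the core-to-bus clock ratio. The sketch below assumes that the leading 14 is in core clocks (as for the second-level cache row) and that the ratio is known; the function name, the `memory_core_clocks` parameter and the example ratio of 8 are hypothetical and do not come from the table.

```python
def latency_in_core_clocks(core_to_bus_ratio, memory_core_clocks=0.0):
    """Convert '14 + 5.5 bus cycles (+ memory)' into core clocks.

    core_to_bus_ratio: core clock frequency divided by bus clock frequency.
    memory_core_clocks: extra memory access time, already in core clocks.
    """
    return 14 + 5.5 * core_to_bus_ratio + memory_core_clocks


# Line held modified in the other core's DCU, hypothetical ratio of 8.
print(latency_in_core_clocks(8))
```

With a ratio of 8 the estimate is 58 core clocks, roughly four times the 14-clock second-level cache latency, which illustrates why modified-line sharing between cores is comparatively expensive.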
2.5
INTEL® MICROARCHITECTURE CODE NAME NEHALEM
Intel microarchitecture code name Nehalem provides the foundation for many innovative features of Intel Core i7 processors and Intel Xeon processor 3400, 5500, and 7500 series. It builds on the success of 45 nm enhanced Intel Core microarchitecture and provides the following feature enhancements:
• Enhanced processor core
  — Improved branch prediction and recovery from misprediction.
  — Enhanced loop streaming to improve front end performance and reduce power consumption.
  — Deeper buffering in out-of-order engine to extract parallelism.
  — Enhanced execution units to provide acceleration in CRC, string/text processing and data shuffling.
• Hyper-Threading Technology
  — Provides two hardware threads (logical processors) per core.
  — Takes advantage of 4-wide execution engine, large L3, and massive memory bandwidth.
• Smart Memory Access
  — Integrated memory controller provides low-latency access to system memory and scalable memory bandwidth.
  — New cache hierarchy organization with shared, inclusive L3 to reduce snoop traffic.
  — Two level TLBs and increased TLB size.
  — Fast unaligned memory access.
• Dedicated Power management Innovations
  — Integrated microcontroller with optimized embedded firmware to manage power consumption.
  — Embedded real-time sensors for temperature, current, and power.
  — Integrated power gate to turn off/on per-core power consumption.
  — Versatility to reduce power consumption of memory, link subsystems.
Intel microarchitecture code name Westmere is a 32 nm version of Intel microarchitecture code name Nehalem. All of the features of the latter also apply to the former.
2.5.1
Microarchitecture Pipeline
Intel microarchitecture code name Nehalem continues the four-wide microarchitecture pipeline pioneered by the 65 nm Intel Core microarchitecture. Figure 2-10 illustrates the basic components of the pipeline of Intel microarchitecture code name Nehalem as implemented in Intel Core i7 processor. Only two of the four cores are sketched in the Figure 2-10 pipeline diagram.
[Figure: two cores, each with Instruction Fetch and PreDecode, Instruction Queue, Microcode ROM, Decode, Rename/Alloc, Retirement Unit (Re-Order Buffer), Scheduler, EXE Unit Clusters 0, 1 and 5, Load, Store, L1D Cache and DTLB, and L2 Cache; below the cores, the other L2 caches, the inclusive L3 cache shared by all cores, and the Intel QPI link logic.]
Figure 2-10. Intel Microarchitecture Code Name Nehalem Pipeline Functionality
The length of the pipeline in Intel microarchitecture code name Nehalem is two cycles longer than its predecessor in 45 nm Intel Core 2 processor family, as measured by branch misprediction delay. The front end can decode up to 4 instructions in one cycle and supports two hardware threads by decoding the instruction streams between two logical processors in alternate cycles. The front end includes enhancements in branch handling, loop detection, MSROM throughput, etc. These are discussed in subsequent sections.
The scheduler (or reservation station) can dispatch up to six micro-ops in one cycle through six issue ports (five issue ports are shown in Figure 2-10; store operation involves separate ports for store address and store data but is depicted as one in the diagram).
The out-of-order engine has many execution units that are arranged in three execution clusters shown in Figure 2-10. It can retire four micro-ops in one cycle, same as its predecessor.
2.5.2
Front End Overview
Figure 2-11 depicts the key components of the front end of the microarchitecture. The instruction fetch unit (IFU) can fetch up to 16 bytes of aligned instruction bytes each cycle from the instruction cache to the instruction length decoder (ILD). The instruction queue (IQ) buffers the ILD-processed instructions and can deliver up to four instructions in one cycle to the instruction decoder.
[Figure: ICache, I Fetch U and Br. Predict U feeding the Instr. Length Decoder (ILD), the Instr. Queue (IQ) and the Instr. Decoder; the MSROM delivers 4 micro-ops per cycle; the Instr. Decoder Queue (IDQ), which contains the LSD, receives 4 micro-ops per cycle max.]
Figure 2-11. Front End of Intel Microarchitecture Code Name Nehalem
The instruction decoder has four decoder units. Three of them can each decode one simple instruction per cycle. The other decoder unit can decode one instruction every cycle, either a simple instruction or a complex instruction made up of several micro-ops. Instructions made up of more than four micro-ops are delivered from the MSROM. Up to four micro-ops can be delivered each cycle to the instruction decoder queue (IDQ).
The loop stream detector is located inside the IDQ to improve power consumption and front end efficiency for loops with a short sequence of instructions.
The instruction decoder supports micro-fusion to improve front end throughput and to increase the effective size of queues in the scheduler and re-order buffer (ROB). The rules for micro-fusion are similar to those of Intel Core microarchitecture.
The instruction queue also supports macro-fusion to combine adjacent instructions into one micro-op where possible. In previous generations of Intel Core microarchitecture, macro-fusion support for the CMP/Jcc sequence is limited to the CF and ZF flags, and macro-fusion is not supported in 64-bit mode.
In Intel microarchitecture code name Nehalem, macro-fusion is supported in 64-bit mode, and the following instruction sequences are supported:
• CMP or TEST can be fused when comparing (unchanged):
  REG-REG. For example: CMP EAX,ECX; JZ label
  REG-IMM. For example: CMP EAX,0x80; JZ label
  REG-MEM. For example: CMP EAX,[ECX]; JZ label
  MEM-REG. For example: CMP [EAX],ECX; JZ label
• TEST can be fused with all conditional jumps (unchanged).
• CMP can be fused with the following conditional jumps. These conditional jumps check carry flag (CF) or zero flag (ZF). The list of macro-fusion-capable conditional jumps is (unchanged):
  JA or JNBE
  JAE or JNB or JNC
  JE or JZ
  JNA or JBE
  JNAE or JC or JB
  JNE or JNZ
• CMP can be fused with the following conditional jumps in Intel microarchitecture code name Nehalem (this is an enhancement):
  JL or JNGE
  JGE or JNL
  JLE or JNG
  JG or JNLE
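The macro-fusion rules above reduce to a lookup on the first instruction and the conditional jump. The sketch below encodes only what the lists state. The function name `can_macro_fuse` and its parameters are hypothetical, operand forms are not checked, and the set used for "all conditional jumps" is an assumed enumeration of the flag-based Jcc mnemonics.

```python
# Jumps that test only CF and/or ZF (fusible with CMP on both generations).
CF_ZF_JUMPS = {"JA", "JNBE", "JAE", "JNB", "JNC", "JE", "JZ",
               "JNA", "JBE", "JNAE", "JC", "JB", "JNE", "JNZ"}
# Signed-comparison jumps (fusible with CMP only on Nehalem).
SIGNED_JUMPS = {"JL", "JNGE", "JGE", "JNL", "JLE", "JNG", "JG", "JNLE"}
# Assumption: remaining flag-based Jcc mnemonics, used for the TEST rule.
OTHER_JCC = {"JO", "JNO", "JS", "JNS", "JP", "JPE", "JNP", "JPO"}
ALL_JCC = CF_ZF_JUMPS | SIGNED_JUMPS | OTHER_JCC


def can_macro_fuse(first, jcc, nehalem=True, mode64=False):
    """Return True if the pair `first` + `jcc` is listed as macro-fusible."""
    first, jcc = first.upper(), jcc.upper()
    if mode64 and not nehalem:
        return False  # earlier generations: no macro-fusion in 64-bit mode
    if first == "TEST":
        return jcc in ALL_JCC
    if first == "CMP":
        if jcc in CF_ZF_JUMPS:
            return True
        return nehalem and jcc in SIGNED_JUMPS
    return False


print(can_macro_fuse("CMP", "JL"), can_macro_fuse("CMP", "JL", nehalem=False))
```

The practical consequence for compilers and hand-written assembly is that signed loop-exit comparisons such as CMP followed by JL fuse on Nehalem but not on the earlier generations.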
The hardware improves branch handling in several ways. The branch target buffer has been enlarged to increase the accuracy of branch predictions. Renaming is supported with the return stack buffer to reduce mispredictions of return instructions in the code. Furthermore, hardware enhancements improve the handling of branch misprediction by expediting resource reclamation, so that the front end does not wait to decode instructions in an architected code path (the code path in which instructions will reach retirement) while resources are allocated to executing a mispredicted code path. Instead, a new micro-op stream can start forward progress as soon as the front end decodes the instructions in the architected code path.
2.5.3
Execution Engine
The IDQ (Figure 2-11) delivers the micro-op stream to the allocation/renaming stage (Figure 2-10) of the pipeline. The out-of-order engine supports up to 128 micro-ops in flight. Each micro-op must be allocated the following resources: an entry in the re-order buffer (ROB), an entry in the reservation station (RS), and a load/store buffer if a memory access is required.
The allocator also renames the register file entry of each micro-op in flight. The input data associated with a micro-op are generally read either from the ROB or from the retired register file.
The RS is expanded to 36 entries deep (compared to 32 entries in the previous generation). It can dispatch up to six micro-ops in one cycle if the micro-ops are ready to execute. The RS dispatches a micro-op through an issue port to a specific execution cluster; each cluster may contain a collection of integer/FP/SIMD execution units.
The result from the execution unit executing a micro-op is written back to the register file, or forwarded through a bypass network to a micro-op in flight that needs the result. Intel microarchitecture code name Nehalem can support write back throughput of one register file write per cycle per port. The bypass network consists of three domains of integer/FP/SIMD. Forwarding the result within the same bypass domain from a producer micro-op to a consumer micro-op is done efficiently in hardware without delay. Forwarding the result across different bypass domains may be subject to additional bypass delays. The bypass delays may be visible to software in addition to the latency and throughput characteristics of individual execution units. The bypass delays between a producer micro-op and a consumer micro-op across different bypass domains are shown in Table 2-29.
Table 2-29. Bypass Delay Between Producer and Consumer Micro-ops (cycles)
          FP   Integer   SIMD
FP        0    2         2
Integer   2    0         1
SIMD      2    1         0
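Table 2-29 can be used to estimate the latency of a short dependency chain that crosses bypass domains. The sketch below assumes that the bypass delay simply adds to the execution latency of each producer along the chain; the table itself gives only the per-crossing delay, and the helper name `chain_latency` is hypothetical.

```python
# Bypass delay in cycles, indexed by (producer domain, consumer domain).
BYPASS_DELAY = {
    ("FP", "FP"): 0, ("FP", "Integer"): 2, ("FP", "SIMD"): 2,
    ("Integer", "FP"): 2, ("Integer", "Integer"): 0, ("Integer", "SIMD"): 1,
    ("SIMD", "FP"): 2, ("SIMD", "Integer"): 1, ("SIMD", "SIMD"): 0,
}


def chain_latency(ops):
    """Sum latencies and bypass delays for a dependency chain.

    ops: list of (domain, latency_in_cycles) in dependency order.
    """
    total = 0
    previous_domain = None
    for domain, latency in ops:
        if previous_domain is not None:
            total += BYPASS_DELAY[(previous_domain, domain)]
        total += latency
        previous_domain = domain
    return total


# A 1-cycle integer op feeding a 1-cycle SIMD op: 1 + 1 (bypass) + 1.
print(chain_latency([("Integer", 1), ("SIMD", 1)]))
```

Keeping a dependent sequence within one domain where the instruction set offers a choice avoids these extra cycles.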
2.5.3.1
Issue Ports and Execution Units
Table 2-30 summarizes the key characteristics of the issue ports and the execution unit latency/throughputs for common operations in the microarchitecture.
Table 2-30. Issue Ports of Intel Microarchitecture Code Name Nehalem
Port     Executable operations           Latency  Throughput  Domain   Comment
Port 0   Integer ALU                     1        1           Integer
         Integer Shift                   1        1
Port 0   Integer SIMD ALU                1        1           SIMD
         Integer SIMD Shuffle            1        1
Port 0   Single-precision (SP) FP MUL    4        1           FP
         Double-precision FP MUL         5        1
         FP MUL (X87)                    5        1
         FP/SIMD/SSE2 Move and Logic     1        1
         FP Shuffle                      1        1
         DIV/SQRT
Port 1   Integer ALU                     1        1           Integer
         Integer LEA                     1        1
         Integer Mul                     3        1
Port 1   Integer SIMD MUL                1        1           SIMD
         Integer SIMD Shift              1        1
         PSAD                            3        1
         StringCompare
Port 1   FP ADD                          3        1           FP
Port 2   Integer loads                   4        1           Integer
Port 3   Store address                   5        1           Integer
Port 4   Store data                                           Integer
Port 5   Integer ALU                     1        1           Integer
         Integer Shift                   1        1
         Jmp                             1        1
Port 5   Integer SIMD ALU                1        1           SIMD
         Integer SIMD Shuffle            1        1
Port 5   FP/SIMD/SSE2 Move and Logic     1        1           FP
2.5.4
Cache and Memory Subsystem
Intel microarchitecture code name Nehalem contains an instruction cache, a first-level data cache and a second-level unified cache in each core (see Figure 2-10). Each physical processor may contain several processor cores and a shared collection of sub-systems that are referred to as “uncore”. Specifically in Intel Core i7 processor, the uncore provides a unified third-level cache shared by all cores in the physical processor, Intel QuickPath Interconnect links and associated logic. The L1 and L2 caches are writeback and non-inclusive.
The shared L3 cache is writeback and inclusive, such that a cache line that exists in either L1 data cache, L1 instruction cache, or unified L2 cache also exists in L3. The L3 is designed to use the inclusive nature to minimize snoop traffic between processor cores. Table 2-31 lists characteristics of the cache hierarchy. The latency of L3 access may vary as a function of the frequency ratio between the processor and the uncore sub-system.
Table 2-31. Cache Parameters of Intel Core i7 Processors
Level                             Capacity  Associativity (ways)  Line Size (bytes)  Access Latency (clocks)  Access Throughput (clocks)  Write Update Policy
First Level Data                  32 KB     8                     64                 4                        1                           Writeback
Instruction                       32 KB     4                     N/A                N/A                      N/A                         N/A
Second Level                      256 KB    8                     64                 10 (note 1)              Varies                      Writeback
Third Level (Shared L3) (note 2)  8 MB      16                    64                 35-40+ (note 2)          Varies                      Writeback
NOTES:
1. Software-visible latency will vary depending on access patterns and other factors.
2. Minimal L3 latency is 35 cycles if the frequency ratio between core and uncore is unity.
The Intel microarchitecture code name Nehalem implements two levels of translation lookaside buffer (TLB). The first level consists of separate TLBs for data and code. DTLB0 handles address translation for data accesses; it provides 64 entries to support 4KB pages and 32 entries for large pages. The ITLB provides 64 entries (per thread) for 4KB pages and 7 entries (per thread) for large pages.
The second level TLB (STLB) handles both code and data accesses for 4KB pages. It supports 4KB page translation operations that missed DTLB0 or ITLB. All entries are 4-way associative. Here is a list of entries in each DTLB:
• STLB for 4-KByte pages: 512 entries (services both data and instruction look-ups).
• DTLB0 for large pages: 32 entries.
• DTLB0 for 4-KByte pages: 64 entries.
A DTLB0 miss and STLB hit causes a penalty of 7 cycles. Software only pays this penalty if the DTLB0 is used in some dispatch cases. The delays associated with a miss to the STLB and PMH are largely non-blocking.
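The entry counts above translate into the amount of memory each TLB level can map at once, often called TLB reach (entries × page size). The term and the calculation are not part of the text above; the sketch below is a simple arithmetic illustration for the 4-KByte page entries, with a hypothetical helper name. Large pages are left out because the text does not state their size.

```python
PAGE_4K = 4 * 1024


def tlb_reach_bytes(entries, page_bytes):
    """Memory covered when every TLB entry maps a distinct page."""
    return entries * page_bytes


dtlb0_reach = tlb_reach_bytes(64, PAGE_4K)   # first-level data TLB, 4-KByte pages
stlb_reach = tlb_reach_bytes(512, PAGE_4K)   # second-level TLB, 4-KByte pages
print(dtlb0_reach // 1024, "KB;", stlb_reach // (1024 * 1024), "MB")
```

A working set of 4-KByte pages larger than 256 KB can therefore miss DTLB0 and rely on the STLB, and one larger than 2 MB can miss both levels.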
2.5.5
Load and Store Operation Enhancements
The memory cluster of Intel microarchitecture code name Nehalem provides the following enhancements to speed up memory operations:
• Peak issue rate of one 128-bit load and one 128-bit store operation per cycle.
• Deeper buffers for load and store operations: 48 load buffers, 32 store buffers and 10 fill buffers.
• Fast unaligned memory access and robust handling of memory alignment hazards.
• Improved store-forwarding for aligned and non-aligned scenarios.
• Store forwarding for most address alignments.
2.5.5.1
Efficient Handling of Alignment Hazards
The cache and memory subsystems handle a significant percentage of instructions in every workload. Different address alignment scenarios will produce varying performance impact for memory and cache operations. For example, the 1-cycle throughput of L1 (see Table 2-32) generally applies to naturally-aligned loads from L1 cache. But using unaligned load instructions (e.g. MOVUPS, MOVUPD, MOVDQU, etc.) to access data from L1 will experience varying amounts of delay depending on specific microarchitectures and alignment scenarios.
Table 2-32. Performance Impact of Address Alignments of MOVDQU from L1
Throughput (cycle)                Intel Core i7 Processor  45 nm Intel Core Microarchitecture  65 nm Intel Core Microarchitecture
Alignment Scenario                06_1AH                   06_17H                              06_0FH
16B aligned                       1                        2                                   2
Not-16B aligned, not cache split  1                        ~2                                  ~2
Split cache line boundary         ~4.5                     ~20                                 ~20
Table 2-32 lists approximate throughput of issuing MOVDQU instructions with different address alignment scenarios to load data from the L1 cache. If a 16-byte load spans across a cache line boundary, previous microarchitecture generations will experience significant software-visible delays.
Intel microarchitecture code name Nehalem provides hardware enhancements to reduce the delays of handling different address alignment scenarios including cache line splits.
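Which row of Table 2-32 applies to a given 16-byte load depends only on the load address. The sketch below classifies an address into the three alignment scenarios, assuming a 64-byte cache line; the function names are hypothetical.

```python
CACHE_LINE_BYTES = 64


def splits_cache_line(address, size=16, line=CACHE_LINE_BYTES):
    """True if the access [address, address + size) spans two cache lines."""
    return address // line != (address + size - 1) // line


def movdqu_scenario(address):
    """Map a 16-byte load address to the matching Table 2-32 row label."""
    if splits_cache_line(address):
        return "Split cache line boundary"
    if address % 16 == 0:
        return "16B aligned"
    return "Not-16B aligned, not cache split"


for addr in (0x1000, 0x1008, 0x1038):
    print(hex(addr), movdqu_scenario(addr))
```

A check of this kind can be applied to buffer base addresses and strides to see how often a loop would fall into the cache-split row.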
2.5.5.2
Store Forwarding Enhancement
When a load follows a store and reloads the data that the store writes to memory, the microarchitecture can forward the data directly from the store to the load in many cases. This situation, called store to load forwarding, saves several cycles by enabling the load to obtain the data directly from the store operation instead of through the memory system.
Several general rules must be met for store to load forwarding to proceed without delay:
• The store must be the last store to that address prior to the load.
• The store must be equal or greater in size than the size of data being loaded.
• The load data must be completely contained in the preceding store.
Specific address alignment and data sizes between the store and load operations will determine whether a store-forward situation may proceed with data forwarding or experience a delay via the cache/memory sub-system. The 45 nm Enhanced Intel Core microarchitecture offers more flexible address alignment and data size requirements than previous microarchitectures. Intel microarchitecture code name Nehalem offers additional enhancements by allowing more situations to forward data expeditiously.
The store-forwarding situations with respect to store operations of 16 bytes are illustrated in Figure 2-12.
Figure 2-12. Store-Forwarding Scenarios of 16-Byte Store Operations
Intel microarchitecture code name Nehalem allows store-to-load forwarding to proceed regardless of store address alignment (the white space in the diagram does not correspond to an applicable store-to-load scenario). Figure 2-13 illustrates situations for store operations of 8 bytes or less.
Figure 2-13. Store-Forwarding Enhancement in Intel Microarchitecture Code Name Nehalem
2.5.6
REP String Enhancement
The REP prefix in conjunction with a MOVS/STOS instruction and a count value in ECX is frequently used to implement library functions such as memcpy()/memset(). These are referred to as "REP string" instructions. Each iteration of these instructions can copy/write a constant value in byte/word/dword/qword granularity. The performance characteristics of using REP string can be attributed to two components: startup overhead and data transfer throughput.
The two components of the performance characteristics of REP string vary further depending on granularity, alignment, and/or count values. Generally, MOVSB is used to handle very small chunks of data. Therefore, the processor implementation of REP MOVSB is optimized to handle ECX < 4. Using REP MOVSB with ECX > 3 will achieve low data throughput due to not only byte-granular data transfer but also additional startup overhead. The latency for MOVSB is 9 cycles if ECX < 4; otherwise REP MOVSB with ECX > 9 has a 50-cycle startup cost.
For REP string of larger granularity data transfer, as the ECX value increases, the startup overhead of REP string exhibits a step-wise increase:
• Short string (ECX <= 12): the latency of REP MOVSW/MOVSD/MOVSQ is about 20 cycles.
• Fast string (ECX >= 76, excluding REP MOVSB): the processor implementation provides hardware optimization by moving as many pieces of data in 16 bytes as possible. The latency of REP string will vary if one of the 16-byte data transfers spans across a cache line boundary:
  — Split-free: the latency consists of a startup cost of about 40 cycles and each 64 bytes of data adds 4 cycles.
  — Cache splits: the latency consists of a startup cost of about 35 cycles and each 64 bytes of data adds 6 cycles.
• Intermediate string lengths: the latency of REP MOVSW/MOVSD/MOVSQ has a startup cost of about 15 cycles plus one cycle for each iteration of the data movement in word/dword/qword.
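The three regimes above can be combined into a rough latency estimator for REP MOVSW/MOVSD/MOVSQ. This is a sketch built only from the approximate figures in the list: the function name is hypothetical, the per-64-byte cost is applied proportionally to the total byte count (an assumption), and the result should be read as an order-of-magnitude estimate rather than a measurement.

```python
def rep_movs_latency_cycles(count, element_bytes, cache_split=False):
    """Approximate latency of REP MOVSW/MOVSD/MOVSQ with ECX = count.

    element_bytes: 2 (MOVSW), 4 (MOVSD) or 8 (MOVSQ).
    cache_split: whether the 16-byte transfers span cache line boundaries.
    """
    if element_bytes not in (2, 4, 8):
        raise ValueError("element_bytes must be 2, 4 or 8")
    if count <= 12:                       # short string
        return 20.0
    if count >= 76:                       # fast string
        startup, per_64_bytes = (35, 6) if cache_split else (40, 4)
        return startup + per_64_bytes * (count * element_bytes) / 64
    return 15.0 + count                   # intermediate: one cycle per iteration


print(rep_movs_latency_cycles(8, 4))
print(rep_movs_latency_cycles(40, 4))
print(rep_movs_latency_cycles(128, 8))
print(rep_movs_latency_cycles(128, 8, cache_split=True))
```

For a 1-KByte copy with REP MOVSQ the estimate is about 104 cycles when split-free and about 131 cycles with cache splits, which shows the per-64-byte term dominating the startup cost for larger copies.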
Intel microarchitecture code name Nehalem improves the performance of REP strings significantly over previous microarchitectures in several ways:
• Startup overhead has been reduced in most cases relative to the previous microarchitecture.
• Data transfer throughput is improved over the previous generation.
• In order for REP string to operate in “fast string” mode, previous microarchitectures require address alignment. In Intel microarchitecture code name Nehalem, REP string can operate in “fast string” mode even if the address is not aligned to 16 bytes.
2.5.7
Enhancements for System Software
In addition to microarchitectural enhancements that can benefit both application-level and system-level software, Intel microarchitecture code name Nehalem enhances several operations that primarily benefit system software.
Lock primitives: Synchronization primitives using the Lock prefix (e.g. XCHG, CMPXCHG8B) execute with significantly lower latency than on previous microarchitectures.
VMM overhead improvements: VMX transitions between a Virtual Machine (VM) and its supervisor (the VMM) can take thousands of cycles each time on previous microarchitectures. The latency of VMX transitions has been reduced in processors based on Intel microarchitecture code name Nehalem.
2.5.8
Efficiency Enhancements for Power Consumption
Intel microarchitecture code name Nehalem is designed not only for high performance and power-efficient performance under a wide range of loading situations; it also features enhancements for low power consumption while the system idles. Intel microarchitecture code name Nehalem supports processor-specific C6 states, which have the lowest leakage power consumption that the OS can manage through ACPI and OS power management mechanisms.
2.5.9
Hyper-Threading Technology Support in Intel® Microarchitecture Code Name Nehalem
Intel microarchitecture code name Nehalem supports Hyper-Threading Technology (HT). Its implementation of HT provides two logical processors sharing most execution/cache resources in each core. The HT implementation in Intel microarchitecture code name Nehalem differs from previous generations of HT implementations using Intel NetBurst microarchitecture in several areas:
• Intel microarchitecture code name Nehalem provides four-wide execution engine, more functional execution units coupled to three issue ports capable of issuing computational operations.
• Intel microarchitecture code name Nehalem supports integrated memory controller that can provide peak memory bandwidth of up to 25.6 GB/sec in Intel Core i7 processor.
• Deeper buffering and enhanced resource sharing/partition policies:
  — Replicated resource for HT operation: register state, renamed return stack buffer, large-page ITLB.
  — Partitioned resources for HT operation: load buffers, store buffers, re-order buffers, small-page ITLB are statically allocated between two logical processors.
  — Competitively-shared resource during HT operation: the reservation station, cache hierarchy, fill buffers, both DTLB0 and STLB.
  — Alternating during HT operation: front end operation generally alternates between two logical processors to ensure fairness.
  — HT unaware resources: execution units.
2.6
INTEL® HYPER-THREADING TECHNOLOGY
Intel® Hyper-Threading Technology (HT Technology) enables software to take advantage of task-level, or thread-level parallelism by providing multiple logical processors within a physical processor package, or within each processor core in a physical processor package. In its first implementation in the Intel Xeon processor, Hyper-Threading Technology makes a single physical processor (or a processor core) appear as two or more logical processors. Intel Xeon Phi processors based on the Knights Landing microarchitecture support 4 logical processors in each processor core; see Chapter 15 for detailed information of Hyper-Threading Technology that is implemented in the Knights Landing microarchitecture.
Most Intel Architecture processor families support Hyper-Threading Technology with two logical processors in each processor core, or in a physical processor in early implementations. The rest of this section describes features of the early implementation of Hyper-Threading Technology. Most of the descriptions also apply to later Hyper-Threading Technology implementations supporting two logical processors. The microarchitecture sections in this chapter provide additional details to individual microarchitecture and enhancements to Hyper-Threading Technology.
The two logical processors each have a complete set of architectural registers while sharing one single physical processor's resources. By maintaining the architecture state of two processors, an HT Technology capable processor looks like two processors to software, including operating system and application code.
By sharing resources needed for peak demands between two logical processors, HT Technology is well suited for multiprocessor systems to provide an additional performance boost in throughput when compared to traditional MP systems.
Figure 2-14 shows a typical bus-based symmetric multiprocessor (SMP) based on processors supporting HT Technology. Each logical processor can execute a software thread, allowing a maximum of two software threads to execute simultaneously on one physical processor.
The t wo soft ware t hreads execut e sim ult aneously, m eaning t hat in t he sam e clock cycle an “ add” operat ion from logical processor 0 and anot her “ add” operat ion and load from logical processor 1 can be execut ed sim ult aneously by t he execut ion engine. I n t he first im plem ent at ion of HT Technology, t he physical execut ion resources are shared and t he archit ect ure st at e is duplicat ed for each logical processor. This m inim izes t he die area cost of im plem ent ing HT Technology while st ill achieving perform ance gains for m ult it hreaded applicat ions or m ult it asking workloads.
[Figure: two physical processors attached to a common system bus; each processor contains two architectural states and two local APICs sharing one execution engine and one bus interface.]
Figure 2-14. Hyper-Threading Technology on an SMP
The performance potential of HT Technology comes from:
• The fact that operating systems and user programs can schedule processes or threads to execute simultaneously on the logical processors in each physical processor.
• The ability to use on-chip execution resources at a higher level than when only a single thread is consuming the execution resources; a higher level of resource utilization can lead to higher system throughput.
2.6.1
Processor Resources and HT Technology
The majority of microarchitecture resources in a physical processor are shared between the logical processors. Only a few small data structures were replicated for each logical processor. This section describes how resources are shared, partitioned or replicated.
2.6.1.1
Replicated Resources
The architectural state is replicated for each logical processor. The architecture state consists of registers that are used by the operating system and application code to control program behavior and store data for computations. This state includes the eight general-purpose registers, the control registers, machine state registers, debug registers, and others. There are a few exceptions, most notably the memory type range registers (MTRRs) and the performance monitoring resources. For a complete list of the architecture state and exceptions, see the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volumes 3A, 3B & 3C.
Other resources such as instruction pointers and register renaming tables were replicated to simultaneously track execution and state changes of the two logical processors. The return stack predictor is replicated to improve branch prediction of return instructions.
In addition, a few buffers (for example, the 2-entry instruction streaming buffers) were replicated to reduce complexity.
2.6.1.2
Partitioned Resources
Several buffers are shared by limiting the use of each logical processor to half the entries. These are referred to as partitioned resources. Reasons for this partitioning include:
• Operational fairness.
• Allowing operations from one logical processor to bypass operations of the other logical processor that may have stalled.
For example, a cache miss, a branch misprediction, or instruction dependencies may prevent a logical processor from making forward progress for some number of cycles. The partitioning prevents the stalled logical processor from blocking forward progress of the other.
In general, the buffers for staging instructions between major pipe stages are partitioned. These buffers include µop queues after the execution trace cache, the queues after the register rename stage, the reorder buffer which stages instructions for retirement, and the load and store buffers.
In the case of load and store buffers, partitioning also provided an easier implementation to maintain memory ordering for each logical processor and detect memory ordering violations.
2.6.1.3
Shared Resources
Most resources in a physical processor are fully shared to improve the dynamic utilization of the resource, including caches and all the execution units. Some shared resources which are linearly addressed, like the DTLB, include a logical processor ID bit to distinguish whether the entry belongs to one logical processor or the other.
The first level cache can operate in two modes depending on a context-ID bit:
• Shared mode: The L1 data cache is fully shared by two logical processors.
• Adaptive mode: In adaptive mode, memory accesses using the page directory are mapped identically across logical processors sharing the L1 data cache.
The other resources are fully shared.
2.6.2
Microarchitecture Pipeline and HT Technology
This section describes the HT Technology microarchitecture and how instructions from the two logical processors are handled between the front end and the back end of the pipeline.
Although instructions originating from two programs or two threads execute simultaneously and not necessarily in program order in the execution core and memory hierarchy, the front end and back end contain several selection points to select between instructions from the two logical processors. All selection points alternate between the two logical processors unless one logical processor cannot make use of a pipeline stage. In this case, the other logical processor has full use of every cycle of the pipeline stage. Reasons why a logical processor may not use a pipeline stage include cache misses, branch mispredictions, and instruction dependencies.
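The alternation rule described above can be sketched as a toy model in C. This is only an illustration: the function name select_logical_processor and its parameters are hypothetical, and real selection points are implemented in hardware and are not visible to software.

```c
#include <stdbool.h>

/* Toy model of one pipeline selection point (illustrative only).
 * last:   logical processor (0 or 1) granted in the previous cycle
 * ready0: true if logical processor 0 can use the stage this cycle
 * ready1: true if logical processor 1 can use the stage this cycle
 * Returns the logical processor granted this cycle (0 or 1),
 * or -1 if neither can use the stage. */
static int select_logical_processor(int last, bool ready0, bool ready1)
{
    if (ready0 && ready1)
        return 1 - last;   /* both ready: alternate between them */
    if (ready0)
        return 0;          /* only LP0 ready: it may use every cycle */
    if (ready1)
        return 1;          /* only LP1 ready: it may use every cycle */
    return -1;             /* the stage idles this cycle */
}
```

Calling the function once per simulated cycle, and feeding the returned value back in as last, reproduces the behavior in the text: strict alternation while both logical processors are ready, and full use of the stage by one of them while the other is stalled.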
2.6.3
Front End Pipeline
The execution trace cache is shared between two logical processors. Execution trace cache access is arbitrated by the two logical processors every clock. If a cache line is fetched for one logical processor in one clock cycle, the next clock cycle a line would be fetched for the other logical processor provided that both logical processors are requesting access to the trace cache.
If one logical processor is stalled or is unable to use the execution trace cache, the other logical processor can use the full bandwidth of the trace cache until the initial logical processor's instruction fetches return from the L2 cache.
After fetching the instructions and building traces of µops, the µops are placed in a queue. This queue decouples the execution trace cache from the register rename pipeline stage. As described earlier, if both logical processors are active, the queue is partitioned so that both logical processors can make independent forward progress.
2.6.4
Execution Core
The core can dispatch up to six µops per cycle, provided the µops are ready to execute. Once the µops are placed in the queues waiting for execution, there is no distinction between instructions from the two logical processors. The execution core and memory hierarchy are also oblivious to which instructions belong to which logical processor.
After execution, instructions are placed in the re-order buffer. The re-order buffer decouples the execution stage from the retirement stage. The re-order buffer is partitioned such that each logical processor uses half the entries.
2.6.5
Retirement
The retirement logic tracks when instructions from the two logical processors are ready to be retired. It retires the instructions in program order for each logical processor by alternating between the two logical processors. If one logical processor is not ready to retire any instructions, then all retirement bandwidth is dedicated to the other logical processor.
Once stores have retired, the processor needs to write the store data into the level-one data cache. Selection logic alternates between the two logical processors to commit store data to the cache.
2.7
INTEL® 64 ARCHITECTURE
Intel 64 architecture supports almost all features in the IA-32 Intel architecture and extends support to run 64-bit OS and 64-bit applications in 64-bit linear address space. Intel 64 architecture provides a new operating mode, referred to as IA-32e mode, and increases the linear address space for software to 64 bits and supports physical address space up to 40 bits.
IA-32e mode consists of two sub-modes: (1) compatibility mode enables a 64-bit operating system to run most legacy 32-bit software unmodified, (2) 64-bit mode enables a 64-bit operating system to run applications written to access 64-bit linear address space.
In the 64-bit mode of Intel 64 architecture, software may access:
• 64-bit flat linear addressing.
• 8 additional general-purpose registers (GPRs).
• 8 additional registers (XMM) for streaming SIMD extensions (SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AESNI, PCLMULQDQ).
  - Sixteen 256-bit YMM registers (whose lower 128 bits are overlaid to the respective XMM registers) if AVX, F16C, AVX2 or FMA are supported.
• 64-bit-wide GPRs and instruction pointers.
• Uniform byte-register addressing.
• Fast interrupt-prioritization mechanism.
• A new instruction-pointer relative-addressing mode.
2.8
SIMD TECHNOLOGY
SIMD computations (see Figure 2-15) were introduced to the architecture with MMX technology. MMX technology allows SIMD computations to be performed on packed byte, word, and doubleword integers. The integers are contained in a set of eight 64-bit registers called MMX registers (see Figure 2-16).
The Pentium III processor extended the SIMD computation model with the introduction of the Streaming SIMD Extensions (SSE). SSE allows SIMD computations to be performed on operands that contain four packed single-precision floating-point data elements. The operands can be in memory or in a set of eight 128-bit XMM registers (see Figure 2-16). SSE also extended SIMD computational capability by adding additional 64-bit MMX instructions.
Figure 2-15 shows a typical SIMD computation. Two sets of four packed data elements (X1, X2, X3, and X4, and Y1, Y2, Y3, and Y4) are operated on in parallel, with the same operation being performed on each corresponding pair of data elements (X1 and Y1, X2 and Y2, X3 and Y3, and X4 and Y4). The results of the four parallel computations are stored as a set of four packed data elements.
[Figure: source operands X4, X3, X2, X1 and Y4, Y3, Y2, Y1 feed four parallel OP units, producing X4 op Y4, X3 op Y3, X2 op Y2, X1 op Y1.]
Figure 2-15. Typical SIMD Operations
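The computation in Figure 2-15 can be sketched in C as follows. This is a minimal sketch, not code from this manual: the function name add4 is hypothetical, the operation is assumed to be a single-precision add, and the SSE path is used only when the compiler defines __SSE__ (otherwise an equivalent scalar loop is compiled).

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics */
#endif

/* Adds four packed single-precision elements: out[i] = x[i] + y[i]. */
static void add4(const float x[4], const float y[4], float out[4])
{
#if defined(__SSE__)
    __m128 vx = _mm_loadu_ps(x);             /* load X1..X4 */
    __m128 vy = _mm_loadu_ps(y);             /* load Y1..Y4 */
    _mm_storeu_ps(out, _mm_add_ps(vx, vy));  /* one packed add covers all four pairs */
#else
    for (size_t i = 0; i < 4; ++i)           /* scalar fallback, same result */
        out[i] = x[i] + y[i];
#endif
}
```

The unaligned load and store intrinsics are used so that the sketch makes no assumption about the alignment of the caller's arrays.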
The Pentium 4 processor further extended the SIMD computation model with the introduction of Streaming SIMD Extensions 2 (SSE2) and Streaming SIMD Extensions 3 (SSE3); the Intel Xeon processor 5100 series introduced Supplemental Streaming SIMD Extensions 3 (SSSE3).
SSE2 works with operands in either memory or in the XMM registers. The technology extends SIMD computations to process packed double-precision floating-point data elements and 128-bit packed integers. There are 144 instructions in SSE2 that operate on two packed double-precision floating-point data elements or on 16 packed byte, 8 packed word, 4 doubleword, and 2 quadword integers.
SSE3 enhances x87, SSE and SSE2 by providing 13 instructions that can accelerate application performance in specific areas. These include video processing, complex arithmetics, and thread synchronization. SSE3 complements SSE and SSE2 with instructions that process SIMD data asymmetrically, facilitate horizontal computation, and help avoid loading cache line splits. See Figure 2-16.
SSSE3 provides additional enhancement for SIMD computation with 32 instructions on digital video and signal processing.
SSE4.1, SSE4.2 and AESNI are additional SIMD extensions that provide acceleration for applications in media processing, text/lexical processing, and block encryption/decryption.
The SIMD extensions operate the same way in Intel 64 architecture as in IA-32 architecture, with the following enhancements:
• 128-bit SIMD instructions referencing XMM registers can access 16 XMM registers in 64-bit mode.
• Instructions that reference 32-bit general purpose registers can access 16 general purpose registers in 64-bit mode.
[Figure: eight 64-bit MMX registers, MM0 through MM7, and eight 128-bit XMM registers, XMM0 through XMM7.]
Figure 2-16. SIMD Instruction Register Usage
SIMD improves the performance of 3D graphics, speech recognition, image processing, scientific applications and applications that have the following characteristics:
• Inherently parallel.
• Recurring memory access patterns.
• Localized recurring operations performed on the data.
• Data-independent control flow.
2.9
SUMMARY OF SIMD TECHNOLOGIES AND APPLICATION LEVEL EXTENSIONS
SIMD floating-point instructions fully support the IEEE Standard 754 for Binary Floating-Point Arithmetic. They are accessible from all IA-32 execution modes: protected mode, real address mode, and Virtual 8086 mode.
SSE, SSE2, and MMX technologies are architectural extensions. Existing software will continue to run correctly, without modification, on Intel microprocessors that incorporate these technologies. Existing software will also run correctly in the presence of applications that incorporate SIMD technologies.
SSE and SSE2 instructions also introduced cacheability and memory ordering instructions that can improve cache usage and application performance.
For more on SSE, SSE2, SSE3 and MMX technologies, see the following chapters in the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1:
• Chapter 9, "Programming with Intel® MMX™ Technology".
• Chapter 10, "Programming with Streaming SIMD Extensions (SSE)".
• Chapter 11, "Programming with Streaming SIMD Extensions 2 (SSE2)".
• Chapter 12, "Programming with SSE3, SSSE3 and SSE4".
• Chapter 14, "Programming with AVX, FMA and AVX2".
• Chapter 15, "Programming with Intel® Transactional Synchronization Extensions".
2.9.1
MMX™ Technology
MMX Technology introduced:
• 64-bit MMX registers.
• Support for SIMD operations on packed byte, word, and doubleword integers.
MMX instructions are useful for multimedia and communications software.
2.9.2
Streaming SIMD Extensions
Streaming SIMD extensions introduced:
• 128-bit XMM registers.
• 128-bit data type with four packed single-precision floating-point operands.
• Data prefetch instructions.
• Non-temporal store instructions and other cacheability and memory ordering instructions.
• Extra 64-bit SIMD integer support.
SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding.
2.9.3
Streaming SIMD Extensions 2
Streaming SIMD extensions 2 add the following:
• 128-bit data type with two packed double-precision floating-point operands.
• 128-bit data types for SIMD integer operation on 16-byte, 8-word, 4-doubleword, or 2-quadword integers.
• Support for SIMD arithmetic on 64-bit integer operands.
• Instructions for converting between new and existing data types.
• Extended support for data shuffling.
• Extended support for cacheability and memory ordering operations.
SSE2 instructions are useful for 3D graphics, video decoding/encoding, and encryption.
2.9.4
Streaming SIMD Extensions 3
Streaming SIMD extensions 3 add the following:
• SIMD floating-point instructions for asymmetric and horizontal computation.
• A special-purpose 128-bit load instruction to avoid cache line splits.
• An x87 FPU instruction to convert to integer independent of the floating-point control word (FCW).
• Instructions to support thread synchronization.
SSE3 instructions are useful for scientific, video and multi-threaded applications.
2.9.5
Supplemental Streaming SIMD Extensions 3
The Supplemental Streaming SIMD Extensions 3 introduce 32 new instructions to accelerate eight types of computations on packed integers. These include:
• 12 instructions that perform horizontal addition or subtraction operations.
• 6 instructions that evaluate the absolute values.
• 2 instructions that perform multiply and add operations and speed up the evaluation of dot products.
• 2 instructions that accelerate packed-integer multiply operations and produce integer values with scaling.
• 2 instructions that perform a byte-wise, in-place shuffle according to the second shuffle control operand.
• 6 instructions that negate packed integers in the destination operand if the sign of the corresponding element in the source operand is negative.
• 2 instructions that align data from the composite of two operands.
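The byte-wise, in-place shuffle listed above can be illustrated with a scalar C model. This is a sketch of the 128-bit PSHUFB behavior under stated assumptions: the function name pshufb_model is hypothetical, and the model works on plain byte arrays rather than XMM registers.

```c
#include <stdint.h>

/* Scalar model of a 128-bit byte shuffle (PSHUFB-style semantics).
 * For each byte position i:
 *   - if bit 7 of ctrl[i] is set, the result byte is zero;
 *   - otherwise the low 4 bits of ctrl[i] select a byte of the
 *     original destination operand.
 * The result overwrites dst (in-place shuffle). */
static void pshufb_model(uint8_t dst[16], const uint8_t ctrl[16])
{
    uint8_t src[16];
    for (int i = 0; i < 16; ++i)
        src[i] = dst[i];                 /* snapshot the original bytes */
    for (int i = 0; i < 16; ++i)
        dst[i] = (ctrl[i] & 0x80) ? 0 : src[ctrl[i] & 0x0F];
}
```

A control operand holding 15, 14, ..., 0 reverses the byte order of the destination, which is a common use of this kind of shuffle.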
2.9.6
SSE4.1
SSE4.1 introduces 47 new instructions to accelerate video, imaging and 3D applications. SSE4.1 also improves compiler vectorization and significantly increases support for packed dword computation. These include:
• Two instructions perform packed dword multiplies.
• Two instructions perform floating-point dot products with input/output selects.
• One instruction provides a streaming hint for WC loads.
• Six instructions simplify packed blending.
• Eight instructions expand support for packed integer MIN/MAX.
• Four instructions support floating-point round with selectable rounding mode and precision exception override.
• Seven instructions improve data insertion and extraction from XMM registers.
• Twelve instructions improve packed integer format conversions (sign and zero extensions).
• One instruction improves SAD (sum absolute difference) generation for small block sizes.
• One instruction aids horizontal searching operations of word integers.
• One instruction improves masked comparisons.
• One instruction adds qword packed equality comparisons.
• One instruction adds dword packing with unsigned saturation.
2.9.7
SSE4.2
SSE4.2 introduces 7 new instructions. These include:
• A 128-bit SIMD integer instruction for comparing 64-bit integer data elements.
• Four string/text processing instructions providing a rich set of primitives. These primitives can accelerate:
  - Basic and advanced string library functions from strlen, strcmp, to strcspn.
  - Delimiter processing, token extraction for lexing of text streams.
  - Parser, schema validation including XML processing.
• A general-purpose instruction for accelerating cyclic redundancy checksum signature calculations.
• A general-purpose instruction for calculating bit count population of integer numbers.
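The last two items can be illustrated with portable software equivalents. The sketch below assumes the Castagnoli polynomial (CRC-32C, reflected form 0x82F63B78), which is the polynomial used by the SSE4.2 CRC32 instruction; the initial value and final inversion follow the common CRC-32C convention and are applied by software, not by the instruction. The function names crc32c_sw and popcount64_sw are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
 * A hardware CRC32 instruction performs the accumulate step for
 * 1, 2, 4 or 8 bytes at a time; this loop handles one bit at a time. */
static uint32_t crc32c_sw(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;                  /* conventional initial value */
    for (size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;                    /* conventional final inversion */
}

/* Software population count: number of bits set in x. */
static unsigned popcount64_sw(uint64_t x)
{
    unsigned n = 0;
    while (x) {
        x &= x - 1;                              /* clear the lowest set bit */
        ++n;
    }
    return n;
}
```

The standard CRC-32C check value for the nine ASCII bytes "123456789" is 0xE3069283, which is a convenient way to validate either a software or a hardware-assisted implementation.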
2.9.8
AESNI and PCLMULQDQ
Seven new instructions are introduced. Six of them, referred to as AESNI, are primitives for accelerating algorithms based on the AES encryption/decryption standard. The seventh, PCLMULQDQ, performs carry-less multiplication of two binary numbers up to 64 bits wide and accelerates general-purpose block encryption.
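Carry-less multiplication is ordinary shift-and-add multiplication with XOR in place of addition, so no carries propagate between bit positions. The following C sketch models the operation in software; the function name clmul64_sw and the hi/lo output convention are assumptions made for this illustration.

```c
#include <stdint.h>

/* Carry-less (polynomial over GF(2)) multiply of two 64-bit values.
 * The 128-bit product is returned as two 64-bit halves, *hi and *lo. */
static void clmul64_sw(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t h = 0, l = 0;
    for (int i = 0; i < 64; ++i) {
        if ((b >> i) & 1u) {
            l ^= a << i;                /* XOR instead of add: no carries */
            if (i != 0)
                h ^= a >> (64 - i);     /* bits shifted past bit 63 */
        }
    }
    *hi = h;
    *lo = l;
}
```

For example, 3 times 3 is 5 in carry-less arithmetic, because (x + 1)(x + 1) = x^2 + 1 when coefficients are reduced modulo 2.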
Typically, algorithms based on the AES standard involve transformation of block data over multiple iterations via several primitives. The AES standard supports cipher keys of sizes 128, 192, and 256 bits. The respective cipher key sizes correspond to 10, 12, and 14 rounds of iteration.
AES encryption involves processing 128-bit input data (plaintext) through a finite number of iterative operations, referred to as "AES rounds", into a 128-bit encrypted block (ciphertext). Decryption follows the reverse direction of iterative operation using the "equivalent inverse cipher" instead of the "inverse cipher".
The cryptographic processing at each round involves two input data: one is the "state", the other is the "round key". Each round uses a different "round key". The round keys are derived from the cipher key using a "key schedule" algorithm. The "key schedule" algorithm is independent of the data processing of encryption/decryption, and can be carried out independently from the encryption/decryption phase.
The AES extensions provide two primitives to accelerate AES rounds on encryption, two primitives for AES rounds on decryption using the equivalent inverse cipher, and two instructions to support the AES key expansion procedure.
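The relationship between key size and round count stated above follows the AES rule Nr = Nk + 6, where Nk is the key length in 32-bit words. A minimal C sketch, with the hypothetical helper name aes_rounds:

```c
/* Number of AES rounds for a cipher-key size given in bits.
 * Nr = Nk + 6, where Nk is the key length in 32-bit words.
 * Returns 0 for a key size that AES does not define. */
static int aes_rounds(int key_bits)
{
    if (key_bits != 128 && key_bits != 192 && key_bits != 256)
        return 0;
    return key_bits / 32 + 6;
}
```

Because each round uses a different round key and one more key is applied before the first round, the key schedule must produce Nr + 1 round keys of 128 bits each.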
2.9.9
Intel® Advanced Vector Extensions
Intel® Advanced Vector Extensions (Intel AVX) offers comprehensive architectural enhancements over previous generations of Streaming SIMD Extensions. Intel AVX introduces the following architectural enhancements:
• Support for 256-bit wide vectors and SIMD register set.
• 256-bit floating-point instruction set enhancement with up to 2X performance gain relative to 128-bit Streaming SIMD extensions.
• Instruction syntax support for generalized three-operand syntax to improve instruction programming flexibility and efficient encoding of new instruction extensions.
• Enhancement of legacy 128-bit SIMD instruction extensions to support three-operand syntax and to simplify compiler vectorization of high-level language expressions.
• Support for flexible deployment of 256-bit AVX code, 128-bit AVX code, legacy 128-bit code and scalar code.
The Intel AVX instruction set and 256-bit register state management details are described in IA-32 Intel® Architecture Software Developer's Manual, Volumes 2A, 2B and 3A. Optimization techniques for Intel AVX are discussed in Chapter 11, "Optimization for Intel AVX, FMA, and AVX2".
2.9.10
Half-Precision Floating-Point Conversion (F16C)
VCVTPH2PS and VCVTPS2PH are two instructions supporting half-precision floating-point data type conversion to and from single-precision floating-point data types. These two instructions extend the same programming model as Intel AVX.
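To make the half-precision format concrete, the following C sketch converts one 16-bit half-precision value to single precision in software, which is the per-element operation that VCVTPH2PS performs on packed data. The function name half_to_float is hypothetical, and the sketch assumes that float is an IEEE 754 binary32 type.

```c
#include <stdint.h>
#include <string.h>

/* Converts an IEEE 754 half-precision value (1 sign bit, 5 exponent bits,
 * 10 fraction bits) to single precision. Assumes float is binary32. */
static float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                          /* signed zero */
        } else {
            int e = -1;                           /* subnormal: normalize it */
            do {
                man <<= 1;
                ++e;
            } while ((man & 0x400u) == 0);
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(127 - 15 - e) << 23) | (man << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);  /* infinity or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13); /* rebias 15 -> 127 */
    }

    float f;
    memcpy(&f, &bits, sizeof f);                  /* reinterpret the bit pattern */
    return f;
}
```

Every half-precision value is exactly representable in single precision, so this direction needs no rounding; the opposite conversion (VCVTPS2PH) must round and therefore takes a rounding-control operand.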
2.9.11
RDRAND
The RDRAND instruction retrieves a random number supplied by a cryptographically secure, deterministic random bit generator (DRBG). The DRBG is designed to meet the NIST SP 800-90A standard.
2.9.12
Fused-Multiply-ADD (FMA) Extensions
FMA extensions enhance Intel AVX with high-throughput arithmetic capabilities covering fused multiply-add, fused multiply-subtract, fused multiply add/subtract interleave, and signed-reversed multiply on fused multiply-add and multiply-subtract operations. FMA extensions provide 36 256-bit floating-point instructions to perform computation on 256-bit vectors, and additional 128-bit and scalar FMA instructions.
2.9.13
Intel AVX2
Intel AVX2 extends Intel AVX by promoting most of the 128-bit SIMD integer instructions with 256-bit numeric processing capabilities. AVX2 instructions follow the same programming model as AVX instructions.
In addition, AVX2 provides enhanced functionality for broadcast/permute operations on data elements, vector shift instructions with variable-shift count per data element, and instructions to fetch non-contiguous data elements from memory.
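Two of these capabilities, gathering non-contiguous elements and shifting each element by its own count, can be described with scalar C models. These are simplified sketches: the function names are hypothetical, the gather model omits the scale factor and mask operand that the real gather instructions take, and the shift model follows the rule that a logical shift count of 32 or more produces zero.

```c
#include <stdint.h>

/* Scalar model of an 8-element dword gather: out[i] = base[index[i]]. */
static void gather8_model(const int32_t *base, const int32_t index[8],
                          int32_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = base[index[i]];
}

/* Scalar model of a per-element logical left shift on 8 dwords.
 * Each element uses its own count; counts of 32 or more yield 0. */
static void shift_left_var8_model(const uint32_t in[8],
                                  const uint32_t count[8],
                                  uint32_t out[8])
{
    for (int i = 0; i < 8; ++i)
        out[i] = (count[i] < 32) ? (in[i] << count[i]) : 0;
}
```

Without these instructions, both patterns require a scalar loop or a sequence of insert/extract operations, which is why they matter for compiler vectorization of indexed loads and data-dependent shifts.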
2.9.14
General-Purpose Bit-Processing Instructions
The fourth generation Intel Core processor family introduces a collection of bit processing instructions that operate on the general purpose registers. The majority of these instructions use the VEX-prefix encoding scheme to provide non-destructive source operand syntax.
These instructions are enumerated by three separate feature flags reported by CPUID. For details, see Section 5.1 of Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1 and Chapters 3 and 4 of Intel® 64 and IA-32 Architectures Software Developer's Manual, Volumes 2A, 2B & 2C.
2.9.15
Intel® Transactional Synchronization Extensions
The fourth generation Intel Core processor family introduces Intel® Transactional Synchronization Extensions (Intel TSX), which aim to improve the performance of lock-protected critical sections of multithreaded applications while maintaining the lock-based programming model.
For background and details, see Chapter 15, "Programming with Intel® Transactional Synchronization Extensions" of Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1.
Software tuning recommendations for using Intel TSX on lock-protected critical sections of multithreaded applications are described in Chapter 12, "Intel® TSX Recommendations".
2.9.16
RDSEED
The Intel Core M processor family introduces the RDSEED, ADCX and ADOX instructions.
The RDSEED instruction retrieves a random number supplied by a cryptographically secure, enhanced non-deterministic random bit generator (Enhanced NRBG). The NRBG is designed to meet the NIST SP 800-90B and NIST SP 800-90C standards.
2.9.17
ADCX and ADOX Instructions
The ADCX and ADOX instructions, in conjunction with the MULX instruction, enable software to speed up calculations that require large integer numerics. Details can be found at https://www-ssl.intel.com/content/www/us/en/intelligent-systems/intel-technology/ia-large-integer-arithmetic-paper.html? and http://www.intel.com/content/www/us/en/intelligent-systems/intel-technology/large-integer-squaring-ia-paper.html.
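Large integer arithmetic operates on numbers stored as arrays of 64-bit limbs, and its cost is dominated by carry propagation between limbs. The portable C sketch below shows a single carry chain for multi-limb addition; the function name add_limbs and the little-endian limb order are assumptions made for this illustration. ADCX and ADOX keep their carries in different flags (CF and OF respectively), which lets two such chains be interleaved in a multiply loop without one chain overwriting the other's carry.

```c
#include <stddef.h>
#include <stdint.h>

/* r = a + b for n-limb integers stored least-significant limb first.
 * Returns the carry out of the most significant limb (0 or 1). */
static unsigned add_limbs(uint64_t *r, const uint64_t *a, const uint64_t *b,
                          size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t s = a[i] + carry;
        unsigned c1 = s < carry;     /* carry out of a[i] + carry */
        r[i] = s + b[i];
        unsigned c2 = r[i] < s;      /* carry out of s + b[i] */
        carry = c1 | c2;             /* at most one of the two can be set */
    }
    return carry;
}
```

In portable C the carry has to be recomputed with comparisons on every limb; an add-with-carry instruction obtains it directly from the flags register.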
CHAPTER 3
GENERAL OPTIMIZATION GUIDELINES
This chapter discusses general optimization techniques that can improve the performance of applications running on processors based on Intel microarchitecture code name Haswell, Ivy Bridge, Sandy Bridge, Westmere, Nehalem, Enhanced Intel Core microarchitecture and Intel Core microarchitectures. These techniques take advantage of microarchitectural features described in Chapter 2, "Intel® 64 and IA-32 Processor Architectures." Optimization guidelines focusing on Intel multi-core processors, Hyper-Threading Technology and 64-bit mode applications are discussed in Chapter 8, "Multicore and Hyper-Threading Technology," and Chapter 9, "64-bit Mode Coding Guidelines."
Practices that optimize performance focus on three areas:
• Tools and techniques for code generation.
• Analysis of the performance characteristics of the workload and its interaction with microarchitectural sub-systems.
• Tuning code to the target microarchitecture (or families of microarchitecture) to improve performance.
Some hints on using tools are summarized first to simplify the first two tasks. The rest of the chapter focuses on recommendations for code generation or code tuning to the target microarchitectures.
This chapter explains optimization techniques for the Intel C++ Compiler, the Intel Fortran Compiler, and other compilers.
3.1
PERFORMANCE TOOLS
Intel offers several tools to help optimize application performance, including compilers, a performance analyzer and multithreading tools.
3.1.1
Intel® C++ and Fortran Compilers
Intel compilers support multiple operating systems (Windows*, Linux*, Mac OS* and embedded). The Intel compilers optimize performance and give application developers access to advanced features:
• Flexibility to target 32-bit or 64-bit Intel processors for optimization.
• Compatibility with many integrated development environments or third-party compilers.
• Automatic optimization features to take advantage of the target processor's architecture.
• Automatic compiler optimization reduces the need to write different code for different processors.
• Common compiler features that are supported across Windows, Linux and Mac OS include:
  - General optimization settings.
  - Cache-management features.
  - Interprocedural optimization (IPO) methods.
  - Profile-guided optimization (PGO) methods.
  - Multithreading support.
  - Floating-point arithmetic precision and consistency support.
  - Compiler optimization and vectorization reports.
3.1.2
General Compiler Recommendations
Generally speaking, a compiler that has been tuned for the target microarchitecture can be expected to match or outperform hand-coding. However, if performance problems are noted with the compiled code, some compilers (like Intel C++ and Fortran Compilers) allow the coder to insert intrinsics or inline assembly in order to exert control over what code is generated. If inline assembly is used, the user must verify that the code generated is of good quality and yields good performance.
Default compiler switches are targeted for common cases. An optimization may be made to the compiler default if it is beneficial for most programs. If the root cause of a performance problem is a poor choice on the part of the compiler, using different switches or compiling the targeted module with a different compiler may be the solution.
3.1.3
VTune™ Performance Analyzer
VTune uses performance monitoring hardware to collect statistics and coding information of your application and its interaction with the microarchitecture. This allows software engineers to measure performance characteristics of the workload for a given microarchitecture. VTune supports all current and past Intel processor families.
The VTune Performance Analyzer provides two kinds of feedback:
• Indication of a performance improvement gained by using a specific coding recommendation or microarchitectural feature.
• Information on whether a change in the program has improved or degraded performance with respect to a particular metric.
The VTune Performance Analyzer also provides measures for a number of workload characteristics, including:
• Retirement throughput of instruction execution as an indication of the degree of extractable instruction-level parallelism in the workload.
• Data traffic locality as an indication of the stress point of the cache and memory hierarchy.
• Data traffic parallelism as an indication of the degree of effectiveness of amortization of data access latency.
NOTE
Improving performance in one part of the machine does not necessarily bring significant gains to overall performance. It is possible to degrade overall performance by improving performance for some particular metric.
Where appropriate, coding recommendations in this chapter include descriptions of the VTune Performance Analyzer events that provide measurable data on the performance gain achieved by following the recommendations. For more on using the VTune analyzer, refer to the application's online help.
3.2
PROCESSOR PERSPECTIVES
Many coding recommendations work well across modern microarchitectures, from Intel Core microarchitecture to the Haswell microarchitecture. However, there are situations where a recommendation may benefit one microarchitecture more than another. Some of these are:
• Instruction decode throughput is important. Additionally, taking advantage of the decoded ICache, Loop Stream Detector and macro-fusion can further improve front end performance.
• Generate code to take advantage of the 4 decoders, and employ micro-fusion and macro-fusion so that each of the three simple decoders is not restricted to handling simple instructions consisting of one micro-op.
• On processors based on Sandy Bridge, Ivy Bridge and Haswell microarchitectures, the code size for optimal front end performance is associated with the decoded ICache.
• Dependencies for partial register writes can incur varying degrees of penalties. To avoid false dependences from partial register updates, use full register updates and extended moves.
• Use appropriate instructions that support dependence-breaking (e.g. PXOR, SUB, XOR, XORPS).
• Hardware prefetching can reduce the effective memory latency for data and instruction accesses in general. But different microarchitectures may require some custom modifications to adapt to the specific hardware prefetch implementation of each microarchitecture.
3.2.1
CPUID Dispatch Strategy and Compatible Code Strategy
When optimum performance on all processor generations is desired, applications can take advantage of the CPUID instruction to identify the processor generation and integrate processor-specific instructions into the source code. The Intel C++ Compiler supports the integration of different versions of the code for different target processors. The selection of which code to execute at runtime is made based on the CPU identifiers. Binary code targeted for different processor generations can be generated under the control of the programmer or by the compiler.
For applications that target multiple generations of microarchitectures, and where minimum binary code size and a single code path are important, a compatible code strategy is the best. Optimizing applications using techniques developed for the Intel Core microarchitecture and combined with Intel microarchitecture code name Nehalem is likely to improve code efficiency and scalability when running on processors based on current and future generations of Intel 64 and IA-32 processors.
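As a minimal sketch of the dispatch strategy described above, the C fragment below resolves one of two implementations of a summation routine at first use, based on a CPU feature check. It assumes a GCC- or Clang-compatible compiler, whose `__builtin_cpu_supports` builtin is implemented on top of CPUID. The names `sum_generic`, `sum_avx2_path`, `select_sum` and `sum_dispatch` are hypothetical, and the "AVX2 path" is only a placeholder for genuinely processor-specific code.

```c
#include <stddef.h>

typedef long (*sum_fn)(const int *v, size_t n);

/* Baseline path: safe on every processor generation. */
static long sum_generic(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Placeholder for a processor-specific path (hypothetical). A real build
 * would place AVX2 intrinsics or separately compiled code here. */
static long sum_avx2_path(const int *v, size_t n)
{
    long total = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)                  /* unrolled by four */
        total += (long)v[i] + v[i + 1] + v[i + 2] + v[i + 3];
    for (; i < n; i++)                          /* remainder */
        total += v[i];
    return total;
}

/* Pick a code path from the CPU identifiers. */
static sum_fn select_sum(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))
        return sum_avx2_path;
#endif
    return sum_generic;
}

/* Public entry point: resolves the path once and caches it.
 * The cache is not thread-safe in this sketch. */
long sum_dispatch(const int *v, size_t n)
{
    static sum_fn chosen = NULL;
    if (chosen == NULL)
        chosen = select_sum();
    return chosen(v, n);
}
```

Callers use only `sum_dispatch`, so the rest of the program keeps a single code path. A compatible code strategy would instead ship `sum_generic` alone and accept the smaller binary.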
3.2.2
Transparent Cache-Parameter Strategy
If the CPUID instruction supports function leaf 4, also known as the deterministic cache parameter leaf, the leaf reports cache parameters for each level of the cache hierarchy in a deterministic and forward-compatible manner across Intel 64 and IA-32 processor families.
For coding techniques that rely on specific parameters of a cache level, using the deterministic cache parameter allows software to implement techniques in a way that is forward-compatible with future generations of Intel 64 and IA-32 processors, and cross-compatible with processors equipped with different cache sizes.
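The sketch below shows one way software might read a cache size from leaf 4 instead of hard-coding it. It assumes the field layout that the Intel SDM documents for CPUID leaf 04H (EBX holds ways, partitions and line size, each minus one; ECX holds sets minus one) and a GCC- or Clang-style `<cpuid.h>`. The function names `cache_size_bytes` and `query_cache_size` are hypothetical.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

/* Decode the cache size from the EBX and ECX values of CPUID leaf 4.
 * EBX[31:22] = ways - 1, EBX[21:12] = partitions - 1,
 * EBX[11:0]  = line size - 1, ECX = sets - 1. */
uint32_t cache_size_bytes(uint32_t ebx, uint32_t ecx)
{
    uint32_t ways       = ((ebx >> 22) & 0x3FF) + 1;
    uint32_t partitions = ((ebx >> 12) & 0x3FF) + 1;
    uint32_t line_size  = (ebx & 0xFFF) + 1;
    uint32_t sets       = ecx + 1;
    return ways * partitions * line_size * sets;
}

/* Return the size in bytes of the cache described by sub-leaf `index`,
 * or 0 if leaf 4 is unavailable or the sub-leaf describes no cache. */
uint32_t query_cache_size(uint32_t index)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(4, index, &eax, &ebx, &ecx, &edx))
        return 0;
    if ((eax & 0x1F) == 0)          /* cache type 0: no more caches */
        return 0;
    return cache_size_bytes(ebx, ecx);
#else
    (void)index;                    /* not an x86 target */
    return 0;
#endif
}
```

A blocking or tiling routine could call `query_cache_size` for successive sub-leaf indices at startup and size its working set from the result, rather than assuming the cache sizes of one particular processor.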
3.2.3
Threading Strategy and Hardware Multithreading Support
Intel 64 and IA-32 processor families offer hardware multithreading support in two forms: dual-core technology and HT Technology.
To fully harness the performance potential of hardware multithreading in current and future generations of Intel 64 and IA-32 processors, software must embrace a threaded approach in application design. At the same time, to address the widest range of installed machines, multi-threaded software should be able to run without failure on a single processor without hardware multithreading support and should achieve performance on a single logical processor that is comparable to an unthreaded implementation (if such comparison can be made). This generally requires architecting a multi-threaded application to minimize the overhead of thread synchronization. Additional guidelines on multithreading are discussed in Chapter 8, “Multicore and Hyper-Threading Technology.”
3.3
CODING RULES, SUGGESTIONS AND TUNING HINTS
This section includes rules, suggestions and hints. They are targeted for engineers who are:
• Modifying source code to enhance performance (user/source rules).
• Writing assemblers or compilers (assembly/compiler rules).
• Doing detailed performance tuning (tuning suggestions).
Coding recommendations are ranked in importance using two measures:
• Local impact (high, medium, or low) refers to a recommendation's effect on the performance of a given instance of code.
• Generality (high, medium, or low) measures how often such instances occur across all application domains. Generality may also be thought of as “frequency”.
These recommendations are approximate. They can vary depending on coding style, application domain, and other factors.
The purpose of the high, medium, and low (H, M, and L) priorities is to suggest the relative level of performance gain one can expect if a recommendation is implemented.
Because it is not possible to predict the frequency of a particular code instance in applications, priority hints cannot be directly correlated to application-level performance gain. In cases in which application-level performance gain has been observed, we have provided a quantitative characterization of the gain (for information only). In cases in which the impact has been deemed inapplicable, no priority is assigned.
3.4
OPTIMIZING THE FRONT END
Optimizing the front end covers two aspects:
• Maintaining a steady supply of micro-ops to the execution engine: Mispredicted branches can disrupt streams of micro-ops, or cause the execution engine to waste execution resources on executing streams of micro-ops in the non-architected code path. Much of the tuning in this respect focuses on working with the Branch Prediction Unit. Common techniques are covered in Section 3.4.1, “Branch Prediction Optimization.”
• Supplying streams of micro-ops to utilize the execution bandwidth and retirement bandwidth as much as possible: For Intel Core microarchitecture and Intel Core Duo processor family, this aspect focuses on maintaining high decode throughput. In Intel microarchitecture code name Sandy Bridge, this aspect focuses on keeping the hot code running from the Decoded ICache. Techniques to maximize decode throughput for Intel Core microarchitecture are covered in Section 3.4.2, “Fetch and Decode Optimization.”
3.4.1
Branch Prediction Optimization
Branch optimizations have a significant impact on performance. By understanding the flow of branches and improving their predictability, you can increase the speed of code significantly.
Optimizations that help branch prediction are:
• Keep code and data on separate pages. This is very important; see Section 3.6, “Optimizing Memory Accesses,” for more information.
• Eliminate branches whenever possible.
• Arrange code to be consistent with the static branch prediction algorithm.
• Use the PAUSE instruction in spin-wait loops.
• Inline functions and pair up calls and returns.
• Unroll as necessary so that repeatedly-executed loops have sixteen or fewer iterations (unless this causes an excessive code size increase).
• Avoid putting two conditional branch instructions in a loop so that both have the same branch target address and, at the same time, belong to (i.e. have their last bytes' addresses within) the same 16-byte aligned code block.
3.4.1.1
Eliminating Branches
Eliminating branches improves performance because:
• It reduces the possibility of mispredictions.
• It reduces the number of required branch target buffer (BTB) entries. Conditional branches that are never taken do not consume BTB resources.
There are four principal ways of eliminating branches:
• Arrange code to make basic blocks contiguous.
• Unroll loops, as discussed in Section 3.4.1.7, “Loop Unrolling.”
• Use the CMOV instruction.
• Use the SETCC instruction.
The following rules apply to branch elimination:
Assembly/Compiler Coding Rule 1. (MH impact, M generality) Arrange code to make basic blocks contiguous and eliminate unnecessary branches.
Assembly/Compiler Coding Rule 2. (M impact, ML generality) Use the SETCC and CMOV instructions to eliminate unpredictable conditional branches where possible. Do not do this for predictable branches. Do not use these instructions to eliminate all unpredictable conditional branches (because using these instructions will incur execution overhead due to the requirement for executing both paths of a conditional branch). In addition, converting a conditional branch to SETCC or CMOV trades off control flow dependence for data dependence and restricts the capability of the out-of-order engine. When tuning, note that all Intel 64 and IA-32 processors usually have very high branch prediction rates. Consistently mispredicted branches are generally rare. Use these instructions only if the increase in computation time is less than the expected cost of a mispredicted branch.
Consider a line of C code that has a condition dependent upon one of the constants:
    X = (A < B) ? CONST1 : CONST2;
This code conditionally compares two values, A and B. If the condition is true, X is set to CONST1; otherwise it is set to CONST2. An assembly code sequence equivalent to the above C code can contain branches that are not predictable if there is no correlation in the two values.
Example 3-1 shows the assembly code with unpredictable branches. The unpredictable branches can be removed with the use of the SETCC instruction. Example 3-2 shows optimized code that has no branches.

Example 3-1. Assembly Code with an Unpredictable Branch
    cmp a, b            ; Condition
    jge L30             ; Conditional branch (taken when A >= B)
    mov ebx, const1     ; ebx holds X
    jmp L31             ; Unconditional branch
L30:
    mov ebx, const2
L31:

Example 3-2. Code Optimization to Eliminate Branches
    xor ebx, ebx        ; Clear ebx (X in the C code)
    cmp A, B
    setge bl            ; ebx = 0 or 1
                        ; (the complement of the C condition)
    sub ebx, 1          ; ebx = 11...11 or 00...00
    and ebx, CONST3     ; CONST3 = CONST1-CONST2
    add ebx, CONST2     ; ebx = CONST1 or CONST2
The optimized code in Example 3-2 sets EBX to zero, then compares A and B. If A is greater than or equal to B, EBX is set to one. Then EBX is decreased and AND'd with the difference of the constant values. This sets EBX to either zero or the difference of the values. By adding CONST2 back to EBX, the correct value is written to EBX. When CONST2 is equal to zero, the last instruction can be deleted.
Another way to remove branches is to use the CMOV and FCMOV instructions. Example 3-3 shows how to change a TEST and branch instruction sequence using CMOV to eliminate a branch. If the TEST sets the equal flag, the value in EBX will be moved to EAX. This branch is data-dependent, and is representative of an unpredictable branch.
Example 3-3. Eliminating Branch with CMOV Instruction
    test ecx, ecx
    jne 1H
    mov eax, ebx
1H:
    ; To optimize code, combine jne and mov into one cmovcc
    ; instruction that checks the equal flag
    test ecx, ecx       ; Test the flags
    cmove eax, ebx      ; If the equal flag is set, move
                        ; ebx to eax; the 1H: tag is no longer needed
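The arithmetic behind Example 3-2 can also be written at source level. The C sketch below mirrors the SETGE/SUB/AND/ADD sequence step by step; the function name `select_const` is hypothetical, and the operands are assumed to be signed 32-bit values, matching the signed SETGE condition. A compiler may already generate SETCC or CMOV for the plain conditional expression, so this form is an illustration of the technique rather than a recommendation to rewrite every ternary by hand.

```c
#include <stdint.h>

/* Branch-free form of  x = (a < b) ? c1 : c2  for signed a and b. */
int32_t select_const(int32_t a, int32_t b, int32_t c1, int32_t c2)
{
    uint32_t mask = (uint32_t)(a >= b);             /* SETGE: 1 if a >= b, else 0 */
    mask -= 1;                                      /* SUB: 0 if a >= b, all ones if a < b */
    uint32_t diff = (uint32_t)c1 - (uint32_t)c2;    /* CONST3 = CONST1 - CONST2 */
    return (int32_t)((mask & diff) + (uint32_t)c2); /* AND, then ADD CONST2 */
}
```

Both outcomes are computed with plain data dependences, which is the trade-off Rule 2 describes: no branch to mispredict, but the result cannot be produced until the comparison completes.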
3.4.1.2
Spin-Wait and Idle Loops
The Pentium 4 processor introduces a new PAUSE instruction; the instruction is architecturally a NOP on Intel 64 and IA-32 processor implementations.
To the Pentium 4 and later processors, this instruction acts as a hint that the code sequence is a spin-wait loop. Without a PAUSE instruction in such loops, the Pentium 4 processor may suffer a severe penalty when exiting the loop because the processor may detect a possible memory order violation. Inserting the PAUSE instruction significantly reduces the likelihood of a memory order violation and as a result improves performance.
In Example 3-4, the code spins until memory location A matches the value stored in the register EAX. Such code sequences are common when protecting a critical section, in producer-consumer sequences, for barriers, or other synchronization.

Example 3-4. Use of PAUSE Instruction
lock:   cmp eax, a
        jne loop
        ; Code in critical section:
loop:   pause
        cmp eax, a
        jne loop
        jmp lock
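In C, the same spin-wait pattern is usually expressed with the `_mm_pause()` intrinsic, which compiles to PAUSE. The sketch below assumes a C11 compiler with `<stdatomic.h>` and, on x86 targets, `<immintrin.h>`; the names `cpu_relax` and `spin_until_equal` are hypothetical, and the spin counter exists only to make the behavior observable.

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define cpu_relax() _mm_pause()      /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)        /* no-op stand-in on other targets */
#endif

/* Spin until *flag equals `expected`. Returns the number of PAUSE
 * iterations executed, which is useful only for illustration. */
unsigned long spin_until_equal(atomic_int *flag, int expected)
{
    unsigned long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) != expected) {
        cpu_relax();                 /* hint: this is a spin-wait loop */
        spins++;
    }
    return spins;
}
```

Production code would normally bound the spin and fall back to an operating system wait primitive; that policy is outside the scope of this sketch.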
3.4.1.3
Static Prediction
Branches that do not have a history in the BTB (see Section 3.4.1, “Branch Prediction Optimization”) are predicted using a static prediction algorithm:
• Predict unconditional branches to be taken.
• Predict indirect branches to be NOT taken.
The following rule applies to static elimination:
Assembly/Compiler Coding Rule 3. (M impact, H generality) Arrange code to be consistent with the static branch prediction algorithm: make the fall-through code following a conditional branch be the likely target for a branch with a forward target, and make the fall-through code following a conditional branch be the unlikely target for a branch with a backward target.
Example 3-5 illustrates the static branch prediction algorithm. The body of an IF-THEN conditional is predicted.

Example 3-5. Static Branch Prediction Algorithm
    // Forward conditional branches not taken (fall through)
    IF {...
      ↓
    }
    IF {...
      ↓
    }
    // Backward conditional branches are taken
    LOOP {...
      ↑ −−
    }
    // Unconditional branches taken
    JMP ------→

Example 3-6 and Example 3-7 provide basic rules for a static prediction algorithm. In Example 3-6, the backward branch (JC BEGIN) is not in the BTB the first time through; therefore, the BTB does not issue a prediction. The static predictor, however, will predict the branch to be taken, so a misprediction will not occur.
Example 3-6. Static Taken Prediction
Begin:  mov  eax, mem32
        and  eax, ebx
        imul eax, edx
        shld eax, 7
        jc   Begin

The first branch instruction (JC BEGIN) in Example 3-7 is a conditional forward branch. It is not in the BTB the first time through, but the static predictor will predict the branch to fall through. The static prediction algorithm correctly predicts that the CALL CONVERT instruction will be taken, even before the branch has any branch history in the BTB.

Example 3-7. Static Not-Taken Prediction
        mov  eax, mem32
        and  eax, ebx
        imul eax, edx
        shld eax, 7
        jc   Begin
        mov  eax, 0
Begin:  call Convert
The Intel Core microarchitecture does not use the static prediction heuristic. However, to maintain consistency across Intel 64 and IA-32 processors, software should maintain the static prediction heuristic as the default.
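At source level, one way to nudge a compiler toward the layout that Rule 3 asks for is a likelihood hint. The sketch below assumes GCC or Clang, whose `__builtin_expect` builtin tells the compiler which outcome of a condition is expected; compilers commonly respond by placing the likely path on the fall-through side of the conditional branch, but the resulting layout is not guaranteed. The macro `UNLIKELY` and the function `sum_samples` are hypothetical names.

```c
#include <stddef.h>

#if defined(__GNUC__)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define UNLIKELY(x) (x)              /* no hint available */
#endif

/* Sum non-negative samples; a negative sample is a rare error case.
 * Returns 0 on success, -1 if a negative sample was found. */
int sum_samples(const int *samples, size_t n, long *out)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (UNLIKELY(samples[i] < 0))
            return -1;               /* rare path, hinted as unlikely */
        total += samples[i];         /* common path */
    }
    *out = total;
    return 0;
}
```

Inspecting the generated assembly is the only reliable way to confirm that the error path was moved out of the fall-through sequence on a given compiler and optimization level.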
3.4.1.4
Inlining, Calls and Returns
The return address stack mechanism augments the static and dynamic predictors to optimize specifically for calls and returns. It holds 16 entries, which is large enough to cover the call depth of most programs. If there is a chain of more than 16 nested calls and more than 16 returns in rapid succession, performance may degrade.
The trace cache in Intel NetBurst microarchitecture maintains branch prediction information for calls and returns. As long as the trace with the call or return remains in the trace cache and the call and return targets remain unchanged, the depth limit of the return address stack described above will not impede performance.
To enable the use of the return stack mechanism, calls and returns must be matched in pairs. If this is done, the likelihood of exceeding the stack depth in a manner that will impact performance is very low.
The following rules apply to inlining, calls, and returns:
Assembly/Compiler Coding Rule 4. (MH impact, MH generality) Near calls must be matched with near returns, and far calls must be matched with far returns. Pushing the return address on the stack and jumping to the routine to be called is not recommended since it creates a mismatch in calls and returns.
Calls and returns are expensive; use inlining for the following reasons:
• Parameter passing overhead can be eliminated.
• In a compiler, inlining a function exposes more opportunity for optimization.
• If the inlined routine contains branches, the additional context of the caller may improve branch prediction within the routine.
• A mispredicted branch can lead to performance penalties inside a small function that are larger than those that would occur if that function is inlined.
Assembly/Compiler Coding Rule 5. (MH impact, MH generality) Selectively inline a function if doing so decreases code size or if the function is small and the call site is frequently executed.
Assembly/Compiler Coding Rule 6. (H impact, H generality) Do not inline a function if doing so increases the working set size beyond what will fit in the trace cache.
Assembly/Compiler Coding Rule 7. (ML impact, ML generality) If there are more than 16 nested calls and returns in rapid succession, consider transforming the program with inline to reduce the call depth.
Assembly/Compiler Coding Rule 8. (ML impact, ML generality) Favor inlining small functions that contain branches with poor prediction rates. If a branch misprediction results in a RETURN being prematurely predicted as taken, a performance penalty may be incurred.
Assembly/Compiler Coding Rule 9. (L impact, L generality) If the last statement in a function is a call to another function, consider converting the call to a jump. This will save the call/return overhead as well as an entry in the return stack buffer.
Assembly/Compiler Coding Rule 10. (M impact, L generality) Do not put more than four branches in a 16-byte chunk.
Assembly/Compiler Coding Rule 11. (M impact, L generality) Do not put more than two end loop branches in a 16-byte chunk.
3.4.1.5
Code Alignment
Careful arrangement of code can enhance cache and memory locality. Likely sequences of basic blocks should be laid out contiguously in memory. This may involve removing unlikely code, such as code to handle error conditions, from the sequence. See Section 3.7, “Prefetching,” on optimizing the instruction prefetcher.
Assembly/Compiler Coding Rule 12. (M impact, H generality) All branch targets should be 16-byte aligned.
Assembly/Compiler Coding Rule 13. (M impact, H generality) If the body of a conditional is not likely to be executed, it should be placed in another part of the program. If it is highly unlikely to be executed and code locality is an issue, it should be placed on a different code page.
3.4.1.6
Branch Type Selection
The default predicted target for indirect branches and calls is the fall-through path. Fall-through prediction is overridden if and when a hardware prediction is available for that branch. The predicted branch target from branch prediction hardware for an indirect branch is the previously executed branch target.
The default prediction to the fall-through path is only a significant issue if no branch prediction is available, due to poor code locality or pathological branch conflict problems. For indirect calls, predicting the fall-through path is usually not an issue, since execution will likely return to the instruction after the associated return.
Placing data immediately following an indirect branch can cause a performance problem. If the data consists of all zeros, it looks like a long stream of ADDs to memory destinations and this can cause resource conflicts and slow down branch recovery. Also, data immediately following indirect branches may appear as branches to the branch prediction hardware, which can branch off to execute other data pages. This can lead to subsequent self-modifying code problems.
Assembly/Compiler Coding Rule 14. (M impact, L generality) When indirect branches are present, try to put the most likely target of an indirect branch immediately following the indirect branch. Alternatively, if indirect branches are common but they cannot be predicted by branch prediction hardware, then follow the indirect branch with a UD2 instruction, which will stop the processor from decoding down the fall-through path.
Indirect branches resulting from code constructs (such as switch statements, computed GOTOs or calls through pointers) can jump to an arbitrary number of locations. If the code sequence is such that the target destination of a branch goes to the same address most of the time, then the BTB will predict accurately most of the time. Since only one taken (non-fall-through) target can be stored in the BTB, indirect branches with multiple taken targets may have lower prediction rates.
The effective number of targets stored may be increased by introducing additional conditional branches. Adding a conditional branch to a target is fruitful if:
• The branch direction is correlated with the branch history leading up to that branch; that is, not just the last target, but how it got to this branch.
• The source/target pair is common enough to warrant using the extra branch prediction capacity. This may increase the number of overall branch mispredictions, while improving the misprediction of indirect branches. The profitability is lower if the number of mispredicting branches is very large.
User/Source Coding Rule 1. (M impact, L generality) If an indirect branch has two or more common taken targets and at least one of those targets is correlated with branch history leading up to the branch, then convert the indirect branch to a tree where one or more indirect branches are preceded by conditional branches to those targets. Apply this “peeling” procedure to the common target of an indirect branch that correlates to branch history.
The purpose of this rule is to reduce the total number of mispredictions by enhancing the predictability of branches (even at the expense of adding more branches). The added branches must be predictable for this to be worthwhile. One reason for such predictability is a strong correlation with preceding branch history. That is, the directions taken on preceding branches are a good indicator of the direction of the branch under consideration.
Example 3-8 shows a simple example of the correlation between a target of a preceding conditional branch and a target of an indirect branch.

Example 3-8. Indirect Branch With Two Favored Targets
function ()
{
    int n = rand();          // random integer 0 to RAND_MAX
    if ( ! (n & 0x01) ) {    // n will be 0 half the times
        n = 0;               // updates branch history to predict taken
    }
    // indirect branches with multiple taken targets
    // may have lower prediction rates
    switch (n) {
        case 0: handle_0(); break;   // common target, correlated with
                                     // branch history that is forward taken
        case 1: handle_1(); break;   // uncommon
        case 3: handle_3(); break;   // uncommon
        default: handle_other();     // common target
    }
}

Correlation can be difficult to determine analytically, for a compiler and for an assembly language programmer. It may be fruitful to evaluate performance with and without peeling to get the best performance from a coding effort.
An example of peeling out the most favored target of an indirect branch with correlated branch history is shown in Example 3-9.
Example 3-9. A Peeling Technique to Reduce Indirect Branch Misprediction
function ()
{
    int n = rand();          // Random integer 0 to RAND_MAX
    if ( ! (n & 0x01) )
        n = 0;               // n will be 0 half the times
    if (!n)
        handle_0();          // Peel out the most common target
                             // with correlated branch history
    else {
        switch (n) {
            case 1: handle_1(); break;   // Uncommon
            case 3: handle_3(); break;   // Uncommon
            default: handle_other();     // Make the favored target in
                                         // the fall-through path
        }
    }
}
3.4.1.7
Loop Unrolling
Benefits of unrolling loops are:
• Unrolling amortizes the branch overhead, since it eliminates branches and some of the code to manage induction variables.
• Unrolling allows one to aggressively schedule (or pipeline) the loop to hide latencies. This is useful if you have enough free registers to keep variables live as you stretch out the dependence chain to expose the critical path.
• Unrolling exposes the code to various other optimizations, such as removal of redundant loads, common subexpression elimination, and so on.
The potential costs of unrolling loops are:
• Excessive unrolling or unrolling of very large loops can lead to increased code size. This can be harmful if the unrolled loop no longer fits in the trace cache (TC).
• Unrolling loops whose bodies contain branches increases demand on BTB capacity. If the number of iterations of the unrolled loop is 16 or fewer, the branch predictor should be able to correctly predict branches in the loop body that alternate direction.
Assembly/Compiler Coding Rule 15. (H impact, M generality) Unroll small loops until the overhead of the branch and induction variable accounts (generally) for less than 10% of the execution time of the loop.
Assembly/Compiler Coding Rule 16. (H impact, M generality) Avoid unrolling loops excessively; this may thrash the trace cache or instruction cache.
Assembly/Compiler Coding Rule 17. (M impact, M generality) Unroll loops that are frequently executed and have a predictable number of iterations to reduce the number of iterations to 16 or fewer. Do this unless it increases code size so that the working set no longer fits in the trace or instruction cache. If the loop body contains more than one conditional branch, then unroll so that the number of iterations is 16/(# conditional branches).
Example 3-10 shows how unrolling enables other optimizations.

Example 3-10. Loop Unrolling
Before unrolling:
    do i = 1, 100
        if ( i mod 2 == 0 ) then a( i ) = x
        else a( i ) = y
    enddo
After unrolling:
    do i = 1, 100, 2
        a( i ) = y
        a( i+1 ) = x
    enddo

In this example, the loop that executes 100 times assigns X to every even-numbered element and Y to every odd-numbered element. By unrolling the loop you can make assignments more efficiently, removing one branch in the loop body.
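For readers more familiar with C, the transformation in Example 3-10 can be sketched as follows. The function names `fill_rolled` and `fill_unrolled` are hypothetical; indices follow the 1-based convention of the example, so element 0 of the array is left unused, and the unrolled version assumes an even iteration count.

```c
#include <stddef.h>

/* Before unrolling: one parity test, and hence one conditional branch,
 * per iteration. Indices are 1-based as in Example 3-10; a[0] is unused. */
void fill_rolled(int *a, size_t n, int x, int y)
{
    for (size_t i = 1; i <= n; i++) {
        if (i % 2 == 0)
            a[i] = x;
        else
            a[i] = y;
    }
}

/* After unrolling by two: the parity test disappears. Assumes n is even. */
void fill_unrolled(int *a, size_t n, int x, int y)
{
    for (size_t i = 1; i + 1 <= n; i += 2) {
        a[i] = y;        /* odd index */
        a[i + 1] = x;    /* even index */
    }
}
```

With `n` equal to 100, the unrolled loop runs 50 iterations and performs the same 100 stores without any branch inside the loop body.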
3.4.1.8
Compiler Support for Branch Prediction
Compilers generate code that improves the efficiency of branch prediction in Intel processors. The Intel C++ Compiler accomplishes this by:
• Keeping code and data on separate pages.
• Using conditional move instructions to eliminate branches.
• Generating code consistent with the static branch prediction algorithm.
• Inlining where appropriate.
• Unrolling if the number of iterations is predictable.
With profile-guided optimization, the compiler can lay out basic blocks to eliminate branches for the most frequently executed paths of a function or at least improve their predictability. Branch prediction need not be a concern at the source level. For more information, see Intel C++ Compiler documentation.
3.4.2
Fetch and Decode Optimization
Intel Core microarchitecture provides several mechanisms to increase front end throughput. Techniques to take advantage of some of these features are discussed below.
3.4.2.1
Optimizing for Micro-fusion
An instruction that operates on a register and a memory operand decodes into more micro-ops than its corresponding register-register version. Replacing the equivalent work of the former instruction using the register-register version usually requires a sequence of two instructions. The latter sequence is likely to result in reduced fetch bandwidth.
Assembly/Compiler Coding Rule 18. (ML impact, M generality) For improving fetch/decode throughput, give preference to the memory flavor of an instruction over the register-only flavor of the same instruction, if such an instruction can benefit from micro-fusion.
The following examples are some of the types of micro-fusions that can be handled by all decoders:
• All stores to memory, including store immediate. Stores execute internally as two separate micro-ops: store-address and store-data.
• All “read-modify” (load+op) instructions between register and memory, for example:
    ADDPS XMM9, OWORD PTR [RSP+40]
    FADD DOUBLE PTR [RDI+RSI*8]
    XOR RAX, QWORD PTR [RBP+32]
• All instructions of the form “load and jump,” for example:
    JMP [RDI+200]
    RET
• CMP and TEST with immediate operand and memory.
An Intel 64 instruction with RIP relative addressing is not micro-fused in the following cases:
• When an additional immediate is needed, for example:
    CMP [RIP+400], 27
    MOV [RIP+3000], 142
• When an RIP is needed for control flow purposes, for example:
    JMP [RIP+5000000]
In these cases, Intel Core microarchitecture and Intel microarchitecture code name Sandy Bridge provide a 2 micro-op flow from decoder 0, resulting in a slight loss of decode bandwidth since the 2 micro-op flow must be steered to decoder 0 from the decoder with which it was aligned.
RIP addressing may be common in accessing global data. Since it will not benefit from micro-fusion, a compiler may consider accessing global data with other means of memory addressing.
3.4.2.2
Optimizing for Macro-fusion
Macro-fusion merges two instructions to a single micro-op. Intel Core microarchitecture performs this hardware optimization under limited circumstances.
The first instruction of the macro-fused pair must be a CMP or TEST instruction. This instruction can be REG-REG, REG-IMM, or a micro-fused REG-MEM comparison. The second instruction (adjacent in the instruction stream) should be a conditional branch.
Since these pairs are a common ingredient in basic iterative programming sequences, macro-fusion improves performance even on un-recompiled binaries. All of the decoders can decode one macro-fused pair per cycle, with up to three other instructions, resulting in a peak decode bandwidth of 5 instructions per cycle.
Each macro-fused instruction executes with a single dispatch. This process reduces latency, which in this case shows up as a cycle removed from the branch mispredict penalty. Software also gains all other fusion benefits: increased rename and retire bandwidth, more storage for instructions in-flight, and power savings from representing more work in fewer bits.
The following list details when you can use macro-fusion:
• CMP or TEST can be fused when comparing:
    REG-REG. For example: CMP EAX,ECX; JZ label
    REG-IMM. For example: CMP EAX,0x80; JZ label
    REG-MEM. For example: CMP EAX,[ECX]; JZ label
    MEM-REG. For example: CMP [EAX],ECX; JZ label
• TEST can be fused with all conditional jumps.
• CMP can be fused with only the following conditional jumps in Intel Core microarchitecture. These conditional jumps check the carry flag (CF) or zero flag (ZF). The macro-fusion-capable conditional jumps are:
    JA or JNBE
    JAE or JNB or JNC
    JE or JZ
    JNA or JBE
    JNAE or JC or JB
    JNE or JNZ
• CMP and TEST cannot be fused when comparing MEM-IMM (e.g. CMP [EAX],0x80; JZ label).
• Macro-fusion is not supported in 64-bit mode for Intel Core microarchitecture.
• Intel microarchitecture code name Nehalem supports the following enhancements in macro-fusion:
  — CMP can be fused with the following conditional jumps (which were not supported in Intel Core microarchitecture):
      JL or JNGE
      JGE or JNL
      JLE or JNG
      JG or JNLE
  — Macro-fusion is supported in 64-bit mode.
• Enhanced macro-fusion support in Intel microarchitecture code name Sandy Bridge is summarized in Table 3-1, with additional information in Section 2.3.2.1 and Example 3-15:
Table 3-1. Macro-Fusible Instructions in Intel Microarchitecture Code Name Sandy Bridge
Instructions                       TEST  AND  CMP  ADD  SUB  INC  DEC
JO/JNO                             Y     Y    N    N    N    N    N
JC/JB/JAE/JNB                      Y     Y    Y    Y    Y    N    N
JE/JZ/JNE/JNZ                      Y     Y    Y    Y    Y    Y    Y
JNA/JBE/JA/JNBE                    Y     Y    Y    Y    Y    N    N
JS/JNS/JP/JPE/JNP/JPO              Y     Y    N    N    N    N    N
JL/JNGE/JGE/JNL/JLE/JNG/JG/JNLE    Y     Y    Y    Y    Y    Y    Y
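The pairings in Table 3-1 can be captured as a small lookup, which may be handy in a code generator or a lint pass. The following is a minimal sketch in C; the enum names and the function snb_can_macro_fuse are hypothetical, and the sketch covers only the instruction/jump pairing, not operand-form restrictions such as MEM-IMM.

```c
#include <assert.h>
#include <stdbool.h>

/* Column order follows Table 3-1. Names are illustrative only. */
enum first_insn { OP_TEST, OP_AND, OP_CMP, OP_ADD, OP_SUB, OP_INC, OP_DEC, OP_COUNT };

/* Row order follows Table 3-1; each class groups the jumps of one row. */
enum jcc_class {
    JCC_OVERFLOW,     /* JO/JNO */
    JCC_CARRY,        /* JC/JB/JAE/JNB */
    JCC_ZERO,         /* JE/JZ/JNE/JNZ */
    JCC_BELOW_ABOVE,  /* JNA/JBE/JA/JNBE */
    JCC_SIGN_PARITY,  /* JS/JNS/JP/JPE/JNP/JPO */
    JCC_SIGNED_CMP,   /* JL/JNGE/JGE/JNL/JLE/JNG/JG/JNLE */
    JCC_COUNT
};

static const bool fusible[JCC_COUNT][OP_COUNT] = {
    /*                   TEST AND CMP ADD SUB INC DEC */
    [JCC_OVERFLOW]    = { 1,   1,  0,  0,  0,  0,  0 },
    [JCC_CARRY]       = { 1,   1,  1,  1,  1,  0,  0 },
    [JCC_ZERO]        = { 1,   1,  1,  1,  1,  1,  1 },
    [JCC_BELOW_ABOVE] = { 1,   1,  1,  1,  1,  0,  0 },
    [JCC_SIGN_PARITY] = { 1,   1,  0,  0,  0,  0,  0 },
    [JCC_SIGNED_CMP]  = { 1,   1,  1,  1,  1,  1,  1 },
};

/* Returns true when Table 3-1 marks the pair with "Y". */
bool snb_can_macro_fuse(enum first_insn op, enum jcc_class jcc)
{
    assert((int)op >= 0 && op < OP_COUNT);
    assert((int)jcc >= 0 && jcc < JCC_COUNT);
    return fusible[jcc][op];
}
```

For instance, snb_can_macro_fuse(OP_INC, JCC_CARRY) is false because INC does not write the carry flag, while snb_can_macro_fuse(OP_DEC, JCC_ZERO) is true.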
Assembly/Compiler Coding Rule 19. (M impact, ML generality) Employ macro-fusion where possible using instruction pairs that support macro-fusion. Prefer TEST over CMP if possible. Use unsigned variables and unsigned jumps when possible. Try to logically verify that a variable is non-negative at the time of comparison. Avoid CMP or TEST of MEM-IMM flavor when possible. However, do not add other instructions to avoid using the MEM-IMM flavor.
Example 3-11. Macro-fusion, Unsigned Iteration Count

Without Macro-fusion
C code:
    for (int i = 0; i < 1000; i++)            /* see note 1 */
        a++;
Disassembly:
    for (int i = 0; i < 1000; i++)
        mov dword ptr [ i ], 0
        jmp First
    Loop:
        mov eax, dword ptr [ i ]
        add eax, 1
        mov dword ptr [ i ], eax
    First:
        cmp dword ptr [ i ], 3E8H             ; see note 3
        jge End
    a++;
        mov eax, dword ptr [ a ]
        add eax, 1
        mov dword ptr [ a ], eax
        jmp Loop
    End:

With Macro-fusion
C code:
    for (unsigned int i = 0; i < 1000; i++)   /* see note 2 */
        a++;
Disassembly:
    for (unsigned int i = 0; i < 1000; i++)
        xor eax, eax
        mov dword ptr [ i ], eax
        jmp First
    Loop:
        mov eax, dword ptr [ i ]
        add eax, 1
        mov dword ptr [ i ], eax
    First:
        cmp eax, 3E8H                         ; see note 4
        jae End
    a++;
        mov eax, dword ptr [ a ]
        add eax, 1
        mov dword ptr [ a ], eax
        jmp Loop
    End:

NOTES:
1. Signed iteration count inhibits macro-fusion.
2. Unsigned iteration count is compatible with macro-fusion.
3. CMP MEM-IMM, JGE inhibit macro-fusion.
4. CMP REG-IMM, JAE permits macro-fusion.
Example 3-12. Macro-fusion, If Statement

Without Macro-fusion
C code:
    int a = 7;                                /* see note 1 */
    if ( a < 77 )
        a++;
    else
        a--;
Disassembly:
    int a = 7;
        mov dword ptr [ a ], 7
    if (a < 77)
        cmp dword ptr [ a ], 4DH              ; see note 3
        jge Dec
    a++;
        mov eax, dword ptr [ a ]
        add eax, 1
        mov dword ptr [ a ], eax
    else
        jmp End
    a--;
    Dec:
        mov eax, dword ptr [ a ]
        sub eax, 1
        mov dword ptr [ a ], eax
    End:

With Macro-fusion
C code:
    unsigned int a = 7;                       /* see note 2 */
    if ( a < 77 )
        a++;
    else
        a--;
Disassembly:
    unsigned int a = 7;
        mov dword ptr [ a ], 7
    if ( a < 77 )
        mov eax, dword ptr [ a ]
        cmp eax, 4DH
        jae Dec
    a++;
        add eax, 1
        mov dword ptr [ a ], eax
    else
        jmp End
    a--;
    Dec:
        sub eax, 1
        mov dword ptr [ a ], eax
    End:

NOTES:
1. Signed iteration count inhibits macro-fusion.
2. Unsigned iteration count is compatible with macro-fusion.
3. CMP MEM-IMM, JGE inhibit macro-fusion.

Assembly/Compiler Coding Rule 20. (M impact, ML generality) Software can enable macro-fusion when it can be logically determined that a variable is non-negative at the time of comparison; use TEST appropriately to enable macro-fusion when comparing a variable with 0.

Example 3-13. Macro-fusion, Signed Variable

Without Macro-fusion
        test ecx, ecx
        jle OutSideTheIF
        cmp ecx, 64H
        jge OutSideTheIF
    OutSideTheIF:

With Macro-fusion
        test ecx, ecx
        jle OutSideTheIF
        cmp ecx, 64H
        jae OutSideTheIF
    OutSideTheIF:
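The reasoning behind Example 3-13 is that once a variable has been shown to be positive, a signed and an unsigned comparison against a non-negative constant give the same answer, so the unsigned jump can be used. The following is a minimal C sketch of that equivalence; the function names are hypothetical, and whether a given compiler actually emits JAE for the unsigned form is an assumption, not something the sketch guarantees.

```c
#include <assert.h>
#include <stdbool.h>

/* Signed form: mirrors TEST/JLE followed by CMP/JGE. */
bool in_range_signed(int x)
{
    if (x <= 0)
        return false;
    return x < 100;                 /* signed compare against 64H */
}

/* Unsigned form: x > 0 is already established, so the cast does not
   change the value and the second compare may be unsigned (JAE). */
bool in_range_unsigned(int x)
{
    if (x <= 0)
        return false;
    return (unsigned int)x < 100u;
}

/* Compares both forms over [lo, hi]; hi must be below INT_MAX. */
void check_equivalence(int lo, int hi)
{
    for (int x = lo; x <= hi; x++)
        assert(in_range_signed(x) == in_range_unsigned(x));
}
```

Calling check_equivalence(-1000, 1000) exercises negative values, zero, and both sides of the upper bound.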
For either a signed or an unsigned variable 'a', "CMP a,0" and "TEST a,a" produce the same result as far as the flags are concerned. Since TEST can be macro-fused more often, software can use "TEST a,a" to replace "CMP a,0" for the purpose of enabling macro-fusion.

Example 3-14. Macro-fusion, Signed Comparison
C Code           Without Macro-fusion    With Macro-fusion
if (a == 0)      cmp a, 0                test a, a
                 jne lbl                 jne lbl
                 ...                     ...
                 lbl:                    lbl:
if ( a >= 0)     cmp a, 0                test a, a
                 jl lbl                  jl lbl
                 ...                     ...
                 lbl:                    lbl:
Intel microarchitecture code name Sandy Bridge enables more arithmetic and logic instructions to macro-fuse with conditional branches. In loops where the ALU ports are already congested, performing one of these macro-fusions can relieve the pressure, as the macro-fused instruction consumes only port 5, instead of an ALU port plus port 5.
In Example 3-15, the "add/cmp/jnz" loop contains two ALU instructions that can be dispatched via port 0, 1, or 5, so there is a higher probability that port 5 binds to one of the ALU instructions, causing JNZ to wait a cycle. In the "sub/jnz" loop, the likelihood that ADD/SUB/JNZ can be dispatched in the same cycle is increased, because only one ALU instruction (the ADD) remains free to bind to port 0, 1, or 5 while the macro-fused SUB/JNZ pair uses port 5.
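At the source level, the two loop-control styles compared in Example 3-15 correspond to counting up against a limit versus counting down to zero. The following C sketch shows both forms computing the same sum; the function names are hypothetical, and whether a compiler turns the second form into a SUB/JNZ (or DEC/JNZ) pair is an assumption that should be checked in the generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count-up form: an increment plus a separate compare against len. */
uint32_t sum_count_up(const uint32_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Count-down form: the decrement itself produces the flags the branch
   tests. Indexing with buf[i - 1] plays the role of the "buff - 4"
   base adjustment in the assembly version. */
uint32_t sum_count_down(const uint32_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    uint32_t sum = 0;
    for (size_t i = len; i != 0; i--)
        sum += buf[i - 1];
    return sum;
}
```

Both functions visit every element exactly once, so they return the same value (modulo 2^32) for any input; only the order of traversal differs.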
Example 3-15. Additional Macro-fusion Benefit in Intel Microarchitecture Code Name Sandy Bridge

Add + cmp + jnz alternative
        lea rdx, buff
        xor rcx, rcx
        xor eax, eax
    loop:
        add eax, [rdx + 4 * rcx]
        add rcx, 1
        cmp rcx, LEN
        jnz loop

Loop control with sub + jnz
        lea rdx, buff - 4
        mov rcx, LEN              ; counter starts at LEN and counts down to 0
        xor eax, eax
    loop:
        add eax, [rdx + 4 * rcx]
        sub rcx, 1
        jnz loop

3.4.2.3
Length-Changing Prefixes (LCP)
An instruction can be up to 15 bytes in length. Some prefixes can dynamically change the length of an instruction that the decoder must recognize. Typically, the pre-decode unit will estimate the length of an instruction in the byte stream assuming the absence of LCP. When the predecoder encounters an LCP in the fetch line, it must use a slower length decoding algorithm. With the slower length decoding algorithm, the predecoder decodes the fetch in 6 cycles, instead of the usual 1 cycle. Normal queuing throughout the machine pipeline generally cannot hide LCP penalties.
The prefixes that can dynamically change the length of an instruction include:
• Operand size prefix (0x66).
• Address size prefix (0x67).
The instruction MOV DX, 01234h is subject to LCP stalls in processors based on Intel Core microarchitecture, and in Intel Core Duo and Intel Core Solo processors. Instructions that contain imm16 as part of their fixed encoding but do not require LCP to change the immediate size are not subject to LCP stalls. The REX prefix (4xh) in 64-bit mode can change the size of two classes of instruction, but does not cause an LCP penalty.
If the LCP stall happens in a tight loop, it can cause significant performance degradation. When decoding is not a bottleneck, as in floating-point heavy code, isolated LCP stalls usually do not cause performance degradation.
Assembly/Compiler Coding Rule 21. (MH impact, MH generality) Favor generating code using imm8 or imm32 values instead of imm16 values. If imm16 is needed, load the equivalent imm32 into a register and use the word value in the register instead.
Double LCP Stalls
Instructions that are subject to LCP stalls and cross a 16-byte fetch line boundary can cause the LCP stall to trigger twice. The following alignment situations can cause LCP stalls to trigger twice:
• An instruction is encoded with a MODR/M and SIB byte, and the fetch line boundary crossing is between the MODR/M and the SIB bytes.
• An instruction that starts at offset 13 of a fetch line references a memory location using register and immediate byte offset addressing mode.
The first stall is for the 1st fetch line, and the 2nd stall is for the 2nd fetch line. A double LCP stall causes a decode penalty of 11 cycles.
The following examples cause an LCP stall once, regardless of the fetch-line location of the first byte of the instruction:
    ADD DX, 01234H
    ADD word ptr [EDX], 01234H
    ADD word ptr 012345678H[EDX], 01234H
    ADD word ptr [012345678H], 01234H
The following instructions cause a double LCP stall when starting at offset 13 of a fetch line:
    ADD word ptr [EDX+ESI], 01234H
    ADD word ptr 012H[EDX], 01234H
    ADD word ptr 012345678H[EDX+ESI], 01234H
To avoid double LCP stalls, do not use instructions subject to LCP stalls that use SIB byte encoding or an addressing mode with byte displacement.
False LCP Stalls
False LCP stalls have the same characteristics as LCP stalls, but occur on instructions that do not have any imm16 value.
False LCP stalls occur when instructions with LCP (a) are encoded using the F7 opcodes, and (b) are located at offset 14 of a fetch line. These instructions are: not, neg, div, idiv, mul, and imul. A false LCP stall occurs because the instruction length decoder cannot determine the length of the instruction before the next fetch line, which holds the exact opcode of the instruction in its MODR/M byte.
The following techniques can help avoid false LCP stalls:
• Upcast all short operations from the F7 group of instructions to long, using the full 32-bit version.
• Ensure that the F7 opcode never starts at offset 14 of a fetch line.
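Both techniques can be illustrated with a short C sketch. The first function is a hypothetical checker that flags an encoded instruction matching the false LCP conditions described above (it assumes the instruction begins with a single 0x66 prefix and that "located at offset 14" refers to the first byte of the instruction). The second shows why upcasting is safe for NEG: negating at 32-bit width and keeping the low word gives the same 16-bit result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical check: 'code' points at the instruction bytes and
   'address' is where its first byte is placed. */
bool false_lcp_risk(const uint8_t *code, size_t len, uint64_t address)
{
    assert(code != NULL && len >= 2);
    return code[0] == 0x66u && code[1] == 0xF7u && (address % 16u) == 14u;
}

/* 16-bit negate performed at 32-bit width. */
uint16_t neg16_via_32(uint16_t value)
{
    uint32_t wide = value;
    if (value & 0x8000u)
        wide |= 0xFFFF0000u;   /* sign-extend, as MOVSX would */
    wide = 0u - wide;          /* NEG on the 32-bit register */
    return (uint16_t)wide;     /* store only the low word */
}
```

The byte sequence 66 F7 18 (NEG on a 16-bit memory operand in 32-bit mode) is flagged only when it is placed at offset 14 of a 16-byte line; the same operation without the 0x66 prefix is never flagged.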
Assembly/Compiler Coding Rule 22. (M impact, ML generality) Ensure that instructions using the 0xF7 opcode byte do not start at offset 14 of a fetch line; and avoid using these instructions to operate on 16-bit data: upcast short data to 32 bits.

Example 3-16. Avoiding False LCP Delays with 0xF7 Group Instructions

A Sequence Causing Delay in the Decoder
        neg word ptr a

Alternate Sequence to Avoid Delay
        movsx eax, word ptr a
        neg eax
        mov word ptr a, AX

3.4.2.4
Optimizing the Loop Stream Detector (LSD)
Loops that fit the following criteria are detected by the LSD and replayed from the instruction queue to feed the decoder in Intel Core microarchitecture:
• Must be less than or equal to four 16-byte fetches.
• Must be less than or equal to 18 instructions.
• Can contain no more than four taken branches and none of them can be a RET.
• Should usually have more than 64 iterations.
The Loop Stream Detector in Intel microarchitecture code name Nehalem is improved by:
• Caching decoded micro-operations in the instruction decoder queue (IDQ, see Section 2.5.2) to feed the rename/alloc stage.
• The size of the LSD is increased to 28 micro-ops.
The LSD and micro-op queue implementation continue to improve in Sandy Bridge and Haswell microarchitectures. They have the following characteristics:

Table 3-2. Small Loop Criteria Detected by Sandy Bridge and Haswell Microarchitectures
Sandy Bridge and Ivy Bridge microarchitectures                             Haswell microarchitecture
Up to 8 chunk fetches of 32 instruction bytes                              8 chunk fetches if HTT active, 11 chunk fetches if HTT off
Up to 28 micro-ops                                                         28 micro-ops if HTT active, 56 micro-ops if HTT off
All micro-ops resident in Decoded ICache (i.e. DSB), but not from MSROM    All micro-ops resident in DSB, including micro-ops from MSROM
No more than 8 taken branches                                              Relaxed
Exclude CALL and RET                                                       Exclude CALL and RET
Mismatched stack operation disqualify                                      Same
Many calculation-intensive loops, searches and software string moves match these characteristics. These loops exceed the BPU prediction capacity and always terminate in a branch misprediction.
Assembly/Compiler Coding Rule 23. (MH impact, MH generality) Break up a loop with a long sequence of instructions into loops of shorter instruction blocks, each no larger than the size of the LSD.
Assembly/Compiler Coding Rule 24. (MH impact, M generality) Avoid unrolling loops containing LCP stalls, if the unrolled block exceeds the size of the LSD.
3.4.2.5
Exploit LSD Micro-op Emission Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge
The LSD holds micro-ops that construct small "infinite" loops. Micro-ops from the LSD are allocated in the out-of-order engine. The loop in the LSD ends with a taken branch to the beginning of the loop. The taken branch at the end of the loop is always the last micro-op allocated in the cycle. The instruction at the beginning of the loop is always allocated at the next cycle. If the code performance is bound by front end bandwidth, unused allocation slots result in a bubble in allocation, and can cause performance degradation.
Allocation bandwidth in Intel microarchitecture code name Sandy Bridge is four micro-ops per cycle. Performance is best when the number of micro-ops in the LSD results in the least number of unused allocation slots. You can use loop unrolling to control the number of micro-ops that are in the LSD.
In Example 3-17, the code sums all array elements. The original code adds one element per iteration. It has three micro-ops per iteration, all allocated in one cycle. Code throughput is one load per cycle. When unrolling the loop once there are five micro-ops per iteration, which are allocated in two cycles. Code throughput is still one load per cycle. Therefore there is no performance gain. When unrolling the loop twice there are seven micro-ops per iteration, still allocated in two cycles. Since two loads can be executed in each cycle, this code has a potential throughput of three load operations in two cycles.

Example 3-17. Unrolling Loops in LSD to Optimize Emission Bandwidth

No Unrolling
    lp: add eax, [rsi + 4* rcx]
        dec rcx
        jnz lp

Unroll once
    lp: add eax, [rsi + 4* rcx]
        add eax, [rsi + 4* rcx +4]
        add rcx, -2
        jnz lp

Unroll Twice
    lp: add eax, [rsi + 4* rcx]
        add eax, [rsi + 4* rcx +4]
        add eax, [rsi + 4* rcx + 8]
        add rcx, -3
        jnz lp
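The arithmetic behind Example 3-17 can be reproduced with a few lines of C. This is a simplified model that assumes, as the text states, four allocation slots per cycle and that an iteration never shares an allocation cycle with the next one; the function names are hypothetical.

```c
#include <assert.h>

#define ALLOC_WIDTH 4u   /* micro-ops allocated per cycle, per the text */

/* Cycles needed to allocate one iteration. The taken branch ends the
   allocation cycle, so a partially filled last cycle is not shared. */
unsigned alloc_cycles(unsigned uops_per_iter)
{
    assert(uops_per_iter > 0);
    return (uops_per_iter + ALLOC_WIDTH - 1u) / ALLOC_WIDTH;
}

/* Allocation slots left unused in each iteration. */
unsigned wasted_slots(unsigned uops_per_iter)
{
    return alloc_cycles(uops_per_iter) * ALLOC_WIDTH - uops_per_iter;
}

/* Loads per cycle when allocation is the only limit. */
double loads_per_cycle(unsigned loads_per_iter, unsigned uops_per_iter)
{
    return (double)loads_per_iter / (double)alloc_cycles(uops_per_iter);
}
```

With the micro-op counts from the text (3, 5, and 7 for one, two, and three loads), the model gives 1, 2, and 2 allocation cycles, that is 1.0, 1.0, and 1.5 loads per cycle, and shows that the five micro-op version wastes three of its eight slots.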
3.4.2.6
Optimization for Decoded ICache
The Decoded ICache is a new feature in Intel microarchitecture code name Sandy Bridge. Running the code from the Decoded ICache has two advantages:
• Higher bandwidth of micro-ops feeding the out-of-order engine.
• The front end does not need to decode the code that is in the Decoded ICache. This saves power.
There is overhead in switching between the Decoded ICache and the legacy decode pipeline. If your code switches frequently between the front end and the Decoded ICache, the penalty may be higher than running only from the legacy pipeline.
To ensure "hot" code is feeding from the Decoded ICache:
• Make sure each hot code block is less than about 500 instructions. Specifically, do not unroll to more than 500 instructions in a loop. This should enable Decoded ICache residency even when hyperthreading is enabled.
• For applications with very large blocks of calculations inside a loop, consider loop-fission: split the loop into multiple loops that fit in the Decoded ICache, rather than a single loop that overflows.
• If an application can be sure to run with only one thread per core, it can increase hot code block size to about 1000 instructions.
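Loop fission, mentioned in the list above, is a source-level transformation. The following C sketch shows the shape of it; the function names are hypothetical, the two one-line statements merely stand in for large blocks of calculation, and the transformation is valid here only because the output arrays are assumed not to overlap the inputs or each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single loop: both calculations share one (potentially large) body. */
void single_loop(const uint32_t *a, const uint32_t *b,
                 uint32_t *x, uint32_t *y, size_t n)
{
    assert(a && b && x && y);
    for (size_t i = 0; i < n; i++) {
        x[i] = a[i] * 2u + b[i];   /* stands in for large block A */
        y[i] = a[i] - b[i] * 3u;   /* stands in for large block B */
    }
}

/* After fission: two loops, each with a smaller code footprint. */
void split_loops(const uint32_t *a, const uint32_t *b,
                 uint32_t *x, uint32_t *y, size_t n)
{
    assert(a && b && x && y);
    for (size_t i = 0; i < n; i++)
        x[i] = a[i] * 2u + b[i];
    for (size_t i = 0; i < n; i++)
        y[i] = a[i] - b[i] * 3u;
}
```

The split version reads the input arrays twice, so the front-end benefit has to be weighed against the extra memory traffic.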
Dense Read-Modify-Write Code
The Decoded ICache can hold only up to 18 micro-ops per 32-byte aligned memory chunk. Therefore, code with a high concentration of instructions that are encoded in a small number of bytes, yet have many micro-ops, may overflow the 18 micro-op limitation and not enter the Decoded ICache. Read-modify-write (RMW) instructions are a good example of such instructions.
RMW instructions accept one memory source operand, one register source operand, and use the source memory operand as the destination. The same functionality can be achieved by two or three instructions: the first reads the memory source operand, the second performs the operation with the second register source operand, and the last writes the result back to memory. These instructions usually result in the same number of micro-ops but use more bytes to encode the same functionality.
One case where RMW instructions may be used extensively is when the compiler optimizes aggressively for code size.
Here are some possible solutions to fit the hot code in the Decoded ICache:
• Replace RMW instructions with two or three instructions that have the same functionality. For example, "adc [rdi], rcx" is only three bytes long; the equivalent sequence "adc rax, [rdi]" + "mov [rdi], rax" has a footprint of six bytes.
• Align the code so that the dense part is broken down among two different 32-byte chunks. This solution is useful when using a tool that aligns code automatically, and is indifferent to code changes.
• Spread the code by adding multiple byte NOPs in the loop. Note that this solution adds micro-ops for execution.
Align Unconditional Branches for Decoded ICache
For code entering the Decoded ICache, each unconditional branch is the last micro-op occupying a Decoded ICache Way. Therefore, only three unconditional branches per 32-byte aligned chunk can enter the Decoded ICache.
Unconditional branches are frequent in jump tables and switch statements. Below are examples for these constructs, and methods for writing them so that they fit in the Decoded ICache.
Compilers create jump tables for C++ virtual class methods or DLL dispatch tables. Each unconditional branch consumes five bytes; therefore up to seven of them can be associated with a 32-byte chunk. Thus jump tables may not fit in the Decoded ICache if the unconditional branches are too dense in each 32-byte aligned chunk. This can cause performance degradation for code executing before and after the branch table.
The solution is to add multi-byte NOP instructions among the branches in the branch table. This may increase code size and should be used cautiously. However, these NOPs are not executed and therefore have no penalty in later pipe stages.
Switch-Case constructs represent a similar situation. Each evaluation of a case condition results in an unconditional branch. The same solution of using multi-byte NOPs can apply for every three consecutive unconditional branches that fit inside an aligned 32-byte chunk.

Two Branches in a Decoded ICache Way
The Decoded ICache can hold up to two branches in a way. Dense branches in a 32-byte aligned chunk, or their ordering with other instructions, may prohibit all the micro-ops of the instructions in the chunk from entering the Decoded ICache. This does not happen often. When it does happen, you can space the code with NOP instructions where appropriate. Make sure that these NOP instructions are not part of hot code.
Assembly/Compiler Coding Rule 25. (M impact, M generality) Avoid putting explicit references to ESP in a sequence of stack operations (POP, PUSH, CALL, RET).
3.4.2.7
Other Decoding Guidelines
Assembly/Compiler Coding Rule 26. (ML impact, L generality) Use simple instructions that are less than eight bytes in length.
Assembly/Compiler Coding Rule 27. (M impact, MH generality) Avoid using prefixes to change the size of immediate and displacement.
Long instructions (more than seven bytes) may limit the number of decoded instructions per cycle. Each prefix adds one byte to the length of an instruction, possibly limiting the decoder's throughput. In addition, multiple prefixes can only be decoded by the first decoder. These prefixes also incur a delay when decoded. If multiple prefixes or a prefix that changes the size of an immediate or displacement cannot be avoided, schedule them behind instructions that stall the pipe for some other reason.
3.5
OPTIMIZING THE EXECUTION CORE
The superscalar, out-of-order execution core(s) in recent generations of microarchitectures contain multiple execution hardware resources that can execute multiple micro-ops in parallel. These resources generally ensure that micro-ops execute efficiently and proceed with fixed latencies. General guidelines to make use of the available parallelism are:
• Follow the rules (see Section 3.4) to maximize useful decode bandwidth and front end throughput. These rules include favoring single micro-op instructions and taking advantage of micro-fusion, the stack pointer tracker and macro-fusion.
• Maximize rename bandwidth. Guidelines are discussed in this section and include properly dealing with partial registers, ROB read ports and instructions which cause side-effects on flags.
• Follow the scheduling recommendations on sequences of instructions so that multiple dependency chains are alive in the reservation station (RS) simultaneously, thus ensuring that your code utilizes maximum parallelism.
• Avoid hazards and minimize delays that may occur in the execution core, allowing the dispatched micro-ops to make progress and be ready for retirement quickly.
3.5.1
Instruction Selection
Some execution units are not pipelined; this means that micro-ops cannot be dispatched in consecutive cycles and the throughput is less than one per cycle.
It is generally a good starting point to select instructions by considering the number of micro-ops associated with each instruction, favoring in the order of: single micro-op instructions, simple instructions with fewer than 4 micro-ops, and last, instructions requiring the microsequencer ROM (micro-ops which are executed out of the microsequencer involve extra overhead).
Assembly/Compiler Coding Rule 28. (M impact, H generality) Favor single-micro-operation instructions. Also favor instructions with shorter latencies.
A compiler may already be doing a good job on instruction selection. If so, user intervention usually is not necessary.
Assembly/Compiler Coding Rule 29. (M impact, L generality) Avoid prefixes, especially multiple non-0F-prefixed opcodes.
Assembly/Compiler Coding Rule 30. (M impact, L generality) Do not use many segment registers.
Assembly/Compiler Coding Rule 31. (M impact, M generality) Avoid using complex instructions (for example, enter, leave, or loop) that have more than four µops and require multiple cycles to decode. Use sequences of simple instructions instead.
Assembly/Compiler Coding Rule 32. (MH impact, M generality) Use push/pop to manage stack space and address adjustments between function calls/returns instead of enter/leave. Using the enter instruction with non-zero immediates can experience significant delays in the pipeline in addition to misprediction.
Theoretically, arranging instruction sequences to match the 4-1-1-1 template applies to processors based on Intel Core microarchitecture. However, with macro-fusion and micro-fusion capabilities in the front end, attempts to schedule instruction sequences using the 4-1-1-1 template will likely provide diminishing returns.
Instead, software should follow these additional decoder guidelines:
• If you need to use multiple micro-op, non-microsequenced instructions, try to separate them by a few single micro-op instructions. The following instructions are examples of multiple micro-op instructions not requiring the micro-sequencer:
    ADC/SBB
    CMOVcc
    Read-modify-write instructions
• If a series of multiple micro-op instructions cannot be separated, try breaking the series into a different equivalent instruction sequence. For example, a series of read-modify-write instructions may go faster if sequenced as a series of read-modify + store instructions. This strategy could improve performance even if the new code sequence is larger than the original one.
3.5.1.1
Use of the INC and DEC Instructions
The INC and DEC instructions modify only a subset of the bits in the flag register. This creates a dependence on all previous writes of the flag register. This is especially problematic when these instructions are on the critical path because they are used to change an address for a load on which many other instructions depend.
Assembly/Compiler Coding Rule 33. (M impact, H generality) INC and DEC instructions should be replaced with ADD or SUB instructions, because ADD and SUB overwrite all flags, whereas INC and DEC do not, therefore creating false dependencies on earlier instructions that set the flags.
3.5.1.2
Integer Divide
Typically, an integer divide is preceded by a CWD or CDQ instruction. Depending on the operand size, divide instructions use DX:AX or EDX:EAX for the dividend. The CWD or CDQ instructions sign-extend AX or EAX into DX or EDX, respectively. These instructions have denser encoding than a shift and move would be, but they generate the same number of micro-ops. If AX or EAX is known to be positive, replace these instructions with:
    xor dx, dx
or
    xor edx, edx
Modern compilers typically can transform high-level language expressions involving integer division, where the divisor is an integer constant known at compile time, into a faster sequence using the IMUL instruction instead. Thus programmers should minimize integer division expressions with a divisor whose value cannot be known at compile time.
Alternately, if certain known divisor values are favored over other unknown ranges, software may consider isolating the few favored, known divisor values into constant-divisor expressions.
Section 9.2.4 describes in more detail how to use MUL/IMUL to replace integer divisions.
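Isolating favored divisor values can be done with an ordinary switch. The sketch below is illustrative only: the function name is hypothetical, and the divisors 10 and 16 are assumed to be the values that dominate at run time. Whether the compiler replaces each constant division with a multiply or shift sequence depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stdint.h>

/* 'divisor' is known only at run time, but 10 and 16 are assumed to be
   by far the most frequent values in this hypothetical workload. */
uint32_t divide_favoring_known(uint32_t value, uint32_t divisor)
{
    assert(divisor != 0);
    switch (divisor) {
    case 10:
        return value / 10;       /* constant divisor: compiler may avoid DIV */
    case 16:
        return value / 16;       /* power of two: typically a shift */
    default:
        return value / divisor;  /* general case keeps the divide */
    }
}
```

The result is identical to value / divisor for every input; only the instruction sequence used on the favored paths is expected to change.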
3.5.1.3
Using LEA
In Intel microarchitecture code name Sandy Bridge, there are two significant changes to the performance characteristics of the LEA instruction:
• LEA can be dispatched via port 1 and 5 in most cases, doubling the throughput over prior generations. However, this applies only to LEA instructions with one or two source operands.

Example 3-18. Independent Two-Operand LEA Example
        mov edx, N
        mov eax, X
        mov ecx, Y
    loop:
        lea ecx, [ecx + ecx*2]    ; ecx = ecx * 3
        lea eax, [eax + eax*4]    ; eax = eax * 5
        and ecx, 0xff
        and eax, 0xff
        dec edx
        jg loop

• For LEA instructions with three source operands and some specific situations, instruction latency has increased to 3 cycles, and the instruction must dispatch via port 1:
  — LEA that has all three source operands: base, index, and offset.
  — LEA that uses base and index registers where the base is EBP, RBP, or R13.
  — LEA that uses RIP-relative addressing mode.
  — LEA that uses 16-bit addressing mode.
Example 3-19. Alternative to Three-Operand LEA

3 operand LEA is slower
    #define K 1
    uint32 an = 0;
    uint32 N= mi_N;
        mov ecx, N
        xor esi, esi;
        xor edx, edx;
        cmp ecx, 2;
        jb finished;
        dec ecx;
    loop1:
        mov edi, esi;
        lea esi, [K+esi+edx];
        and esi, 0xFF;
        mov edx, edi;
        dec ecx;
        jnz loop1;
    finished:
        mov [an] ,esi;

Two-operand LEA alternative
    #define K 1
    uint32 an = 0;
    uint32 N= mi_N;
        mov ecx, N
        xor esi, esi;
        xor edx, edx;
        cmp ecx, 2;
        jb finished;
        dec ecx;
    loop1:
        mov edi, esi;
        lea esi, [K+edx];
        lea esi, [esi+edi];       ; adds the previous esi (saved in edi) to K+edx
        and esi, 0xFF;
        mov edx, edi;
        dec ecx;
        jnz loop1;
    finished:
        mov [an] ,esi;

Alternative 2
    #define K 1
    uint32 an = 0;
    uint32 N= mi_N;
        mov ecx, N
        xor esi, esi;
        mov edx, K;
        cmp ecx, 2;
        jb finished;
        mov eax, 2
        dec ecx;
    loop1:
        mov edi, esi;
        lea esi, [esi+edx];
        and esi, 0xFF;
        lea edx, [edi +K];
        dec ecx;
        jnz loop1;
    finished:
        mov [an] ,esi;
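The rewrite in "Alternative 2" works because the constant K is folded into the value carried in EDX, so each step needs only two-operand additions. The C sketch below restates the three-operand loop and Alternative 2 with variables named after the registers and checks that they agree; the function names are hypothetical and the code is a model of the arithmetic, not of the generated instructions.

```c
#include <assert.h>
#include <stdint.h>

#define K 1u

/* Three-operand form: esi = K + esi + edx in a single step. */
uint32_t recur_three_operand(uint32_t n)
{
    uint32_t esi = 0, edx = 0;
    if (n < 2)
        return esi;
    for (uint32_t ecx = n - 1; ecx != 0; ecx--) {
        uint32_t edi = esi;
        esi = (K + esi + edx) & 0xFFu;
        edx = edi;
    }
    return esi;
}

/* Alternative 2: edx carries (previous value + K), so the update of esi
   and the update of edx each add only two operands. */
uint32_t recur_two_operand(uint32_t n)
{
    uint32_t esi = 0, edx = K;
    if (n < 2)
        return esi;
    for (uint32_t ecx = n - 1; ecx != 0; ecx--) {
        uint32_t edi = esi;
        esi = (esi + edx) & 0xFFu;
        edx = edi + K;
    }
    return esi;
}

/* Compares both forms for every iteration count up to max_n. */
void check_variants(uint32_t max_n)
{
    for (uint32_t n = 0; n <= max_n; n++)
        assert(recur_three_operand(n) == recur_two_operand(n));
}
```

For N = 5 both forms produce 7 (the sequence of results is 1, 2, 4, 7), and check_variants(1000) confirms agreement for all smaller iteration counts.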
In some cases with processors based on Intel NetBurst microarchitecture, the LEA instruction or a sequence of LEA, ADD, SUB and SHIFT instructions can replace constant multiply instructions. The LEA instruction can also be used as a multiple operand addition instruction, for example:
    LEA ECX, [EAX + EBX + 4 + A]
Using LEA in this way may avoid register usage by not tying up registers for operands of arithmetic instructions. This use may also save code space.
If the LEA instruction uses a shift by a constant amount then the latency of the sequence of µops is shorter if adds are used instead of a shift, and the LEA instruction may be replaced with an appropriate sequence of µops. This, however, increases the total number of µops, leading to a trade-off.
Assembly/Compiler Coding Rule 34. (ML impact, L generality) If an LEA instruction using the scaled index is on the critical path, a sequence with ADDs may be better. If code density and bandwidth out of the trace cache are the critical factor, then use the LEA instruction.
3.5.1.4
ADC and SBB in Intel® Microarchitecture Code Name Sandy Bridge
The throughput of ADC and SBB in Intel microarchitecture code name Sandy Bridge is 1 cycle, compared to 1.5-2 cycles in the prior generation. These two instructions are useful in numeric handling of integer data types that are wider than the maximum width of native hardware.
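Arithmetic on integers wider than the native register is built by propagating a carry from one machine word to the next, which is the job ADC performs in assembly. The following C sketch adds two 512-bit values stored as eight 64-bit words; the function name and the least-significant-word-first layout are assumptions made for illustration, and a compiler may or may not map the carry logic onto ADC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMBS 8   /* 8 x 64 bits = 512 bits, least significant word first */

/* acc += step over 512 bits; returns the carry out of the top word. */
unsigned add512(uint64_t acc[LIMBS], const uint64_t step[LIMBS])
{
    assert(acc != NULL && step != NULL);
    unsigned carry = 0;
    for (size_t i = 0; i < LIMBS; i++) {
        uint64_t sum = acc[i] + step[i];
        unsigned c1 = sum < acc[i];       /* carry from word + word */
        uint64_t total = sum + carry;
        unsigned c2 = total < sum;        /* carry from adding carry-in */
        acc[i] = total;
        carry = c1 | c2;                  /* at most one of them is set */
    }
    return carry;
}
```

Adding 1 to a value whose two lowest words are all ones ripples the carry into the third word, and adding 1 to an all-ones 512-bit value wraps to zero with a carry out of 1.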
Example 3-20. Examples of 512-bit Additions

//Add 64-bit to 512 Number
        lea rsi, gLongCounter
        lea rdi, gStepValue
        mov rax, [rdi]
        xor rcx, rcx
    loop_start:
        mov r10, [rsi+rcx]
        add r10, rax
        mov [rsi+rcx], r10
        mov r10, [rsi+rcx+8]
        adc r10, 0
        mov [rsi+rcx+8], r10
        mov r10, [rsi+rcx+16]
        adc r10, 0
        mov [rsi+rcx+16], r10
        mov r10, [rsi+rcx+24]
        adc r10, 0
        mov [rsi+rcx+24], r10
        mov r10, [rsi+rcx+32]
        adc r10, 0
        mov [rsi+rcx+32], r10
        mov r10, [rsi+rcx+40]
        adc r10, 0
        mov [rsi+rcx+40], r10
        mov r10, [rsi+rcx+48]
        adc r10, 0
        mov [rsi+rcx+48], r10
        mov r10, [rsi+rcx+56]
        adc r10, 0
        mov [rsi+rcx+56], r10
        add rcx, 64
        cmp rcx, SIZE
        jnz loop_start

// 512-bit Addition
    loop1:
        mov rax, [StepValue]
        add rax, [LongCounter]
        mov LongCounter, rax
        mov rax, [StepValue+8]
        adc rax, [LongCounter+8]
        mov LongCounter+8, rax
        mov rax, [StepValue+16]
        adc rax, [LongCounter+16]
        mov LongCounter+16, rax
        mov rax, [StepValue+24]
        adc rax, [LongCounter+24]
        mov LongCounter+24, rax
        mov rax, [StepValue+32]
        adc rax, [LongCounter+32]
        mov LongCounter+32, rax
        mov rax, [StepValue+40]
        adc rax, [LongCounter+40]
        mov LongCounter+40, rax
        mov rax, [StepValue+48]
        adc rax, [LongCounter+48]
        mov LongCounter+48, rax
        mov rax, [StepValue+56]
        adc rax, [LongCounter+56]
        mov LongCounter+56, rax
        dec rcx
        jnz loop1

3.5.1.5
Bitwise Rotation
Bitwise rotation can choose between rotate with count specified in the CL register, rotate by an immediate constant, and rotate by 1 bit. Generally, the rotate by immediate and rotate by register instructions are slower than rotate by 1 bit. The rotate by 1 instruction has the same latency as a shift.
Assembly/Compiler Coding Rule 35. (ML impact, L generality) Avoid ROTATE by register or ROTATE by immediate instructions. If possible, replace with a ROTATE by 1 instruction.
In Intel microarchitecture code name Sandy Bridge, ROL/ROR by immediate has 1-cycle throughput; SHLD/SHRD using the same register as source and destination by an immediate constant has 1-cycle latency with 0.5-cycle throughput. The “ROL/ROR reg, imm8” instruction has two micro-ops, with a latency of 1 cycle for the rotate register result and 2 cycles for the flags, if used.
In Intel microarchitecture code name Ivy Bridge, the “ROL/ROR reg, imm8” instruction with immediate greater than 1 is one micro-op with one-cycle latency when the overflow flag result is used. When the immediate is one, a subsequent instruction that depends on the overflow flag result of ROL/ROR will see the ROL/ROR instruction with two-cycle latency.
3.5.1.6
Variable Bit Count Rotation and Shift
In Intel microarchitecture code name Sandy Bridge, the “ROL/ROR/SHL/SHR reg, cl” instruction has three micro-ops. When the flag result is not needed, one of these micro-ops may be discarded, providing better performance in many common usages. When these instructions update partial flag results that are subsequently used, the full three micro-ops flow must go through the execution and retirement pipeline, experiencing slower performance. In Intel microarchitecture code name Ivy Bridge, executing the full three micro-ops flow to use the updated partial flag result has additional delay. Consider the looped sequence below:
loop:
    shl eax, cl
    add ebx, eax
    dec edx ; DEC does not update carry, causing SHL to execute slower three micro-ops flow
    jnz loop
The DEC instruction does not modify the carry flag. Consequently, the SHL EAX, CL instruction needs to execute the three micro-ops flow in subsequent iterations. The SUB instruction updates all flags, so replacing DEC with SUB allows SHL EAX, CL to execute the two micro-ops flow.
3.5.1.7
Address Calculations
For computing addresses, use the addressing modes rather than general-purpose computations. Internally, memory reference instructions can have four operands:
• Relocatable load-time constant.
• Immediate constant.
• Base register.
• Scaled index register.
Note that the latency and throughput of LEA with more than two operands are slower (see Section 3.5.1.3) in Intel microarchitecture code name Sandy Bridge. Addressing modes that use both base and index registers consume more read port resources in the execution engine and may experience more stalls due to the availability of read port resources. Software should take care to select the fastest form of address calculation.
In the segmented model, a segment register may constitute an additional operand in the linear address calculation. In many cases, several integer instructions can be eliminated by fully using the operands of memory references.
3.5.1.8
Clearing Registers and Dependency Breaking Idioms
Code sequences that modify a partial register can experience some delay in their dependency chain, but this can be avoided by using dependency-breaking idioms.
In processors based on Intel Core microarchitecture, a number of instructions can help clear execution dependency when software uses these instructions to clear register content to zero. The instructions include:
    XOR REG, REG
    SUB REG, REG
    XORPS/PD XMMREG, XMMREG
    PXOR XMMREG, XMMREG
    SUBPS/PD XMMREG, XMMREG
    PSUBB/W/D/Q XMMREG, XMMREG
In processors based on Intel microarchitecture code name Sandy Bridge, the instructions listed above plus their equivalent AVX counterparts are also zero idioms that can be used to break dependency chains. Furthermore, they do not consume an issue port or an execution unit, so using zero idioms is preferable to moving 0 into the register. The AVX equivalent zero idioms are:
    VXORPS/PD XMMREG, XMMREG
    VXORPS/PD YMMREG, YMMREG
    VPXOR XMMREG, XMMREG
    VSUBPS/PD XMMREG, XMMREG
    VSUBPS/PD YMMREG, YMMREG
    VPSUBB/W/D/Q XMMREG, XMMREG
In Intel Core Solo and Intel Core Duo processors, the XOR, SUB, XORPS, or PXOR instructions can be used to clear execution dependencies on the zero evaluation of the destination register.
The Pentium 4 processor provides special support for XOR, SUB, and PXOR operations when executed within the same register. This recognizes that clearing a register does not depend on the old value of the register. The XORPS and XORPD instructions do not have this special support. They cannot be used to break dependence chains.
Assembly/Compiler Coding Rule 36. (M impact, ML generality) Use dependency-breaking-idiom instructions to set a register to 0, or to break a false dependence chain resulting from re-use of registers. In contexts where the condition codes must be preserved, move 0 into the register instead.
This requires more code space than using XOR and SUB, but avoids setting the condition codes.
Example 3-21 shows the use of PXOR to break a dependency on an XMM register when negating the elements of an array:
    int a[4096], b[4096], c[4096];
    for ( int i = 0; i < 4096; i++ )
        c[i] = - ( a[i] + b[i] );
Example 3-21. Clearing Register to Break Dependency While Negating Array Elements
Negation (-x = (x XOR (-1)) - (-1)) without breaking dependency
    lea eax, a
    lea ecx, b
    lea edi, c
    xor edx, edx
    movdqa xmm7, allone
lp:
    movdqa xmm0, [eax + edx]
    paddd xmm0, [ecx + edx]
    pxor xmm0, xmm7
    psubd xmm0, xmm7
    movdqa [edi + edx], xmm0
    add edx, 16
    cmp edx, 16384 ; byte offset limit: 4096 ints * 4 bytes
    jl lp
Negation (-x = 0 - x) using PXOR reg, reg breaks dependency
    lea eax, a
    lea ecx, b
    lea edi, c
    xor edx, edx
lp:
    movdqa xmm0, [eax + edx]
    paddd xmm0, [ecx + edx]
    pxor xmm7, xmm7
    psubd xmm7, xmm0
    movdqa [edi + edx], xmm7
    add edx, 16
    cmp edx, 16384 ; byte offset limit: 4096 ints * 4 bytes
    jl lp
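The two negation identities used in Example 3-21 can be checked at source level. The following is a scalar C sketch, not the SIMD code itself; the function names are hypothetical, and unsigned 32-bit arithmetic is used so that wraparound matches the two's complement behavior of PADDD and PSUBD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* c[i] = -(a[i] + b[i]) computed as (x XOR -1) - (-1), as in the first
 * sequence of Example 3-21. Function name is hypothetical. */
void negate_sum_xor(const uint32_t *a, const uint32_t *b, uint32_t *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    const uint32_t all_ones = UINT32_MAX;      /* the "allone" constant */
    for (size_t i = 0; i < n; i++) {
        uint32_t x = a[i] + b[i];
        c[i] = (x ^ all_ones) - all_ones;
    }
}

/* c[i] = -(a[i] + b[i]) computed as 0 - x, as in the second sequence.
 * The zero is re-created in every iteration, like PXOR xmm7, xmm7. */
void negate_sum_zero(const uint32_t *a, const uint32_t *b, uint32_t *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t zero = 0;
        c[i] = zero - (a[i] + b[i]);
    }
}
```

Both functions produce the same values; the difference the manual describes is in the dependency chain of the generated instructions, which this scalar sketch does not model.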
Assembly/Compiler Coding Rule 37. (M impact, MH generality) Break dependences on portions of registers between instructions by operating on 32-bit registers instead of partial registers. For moves, this can be accomplished with 32-bit moves or by using MOVZX.
Sometimes sign-extended semantics can be maintained by zero-extending operands. For example, the C code in the following statements does not need sign extension, nor does it need prefixes for operand size overrides:
    static short int a, b;
    if (a == b) {
        ...
    }
Code for comparing these 16-bit operands might be:
    MOVZX EAX, WORD PTR [a]
    MOVZX EBX, WORD PTR [b]
    CMP EAX, EBX
These circumstances tend to be common. However, the technique will not work if the compare is for greater than, less than, greater than or equal, and so on, or if the values in EAX or EBX are to be used in another operation where sign extension is required.
Assembly/Compiler Coding Rule 38. (M impact, M generality) Try to use zero extension or operate on 32-bit operands instead of using moves with sign extension.
The trace cache can be packed more tightly when instructions with operands that can only be represented as 32 bits are not adjacent.
Assembly/Compiler Coding Rule 39. (ML impact, L generality) Avoid placing instructions that use 32-bit immediates which cannot be encoded as sign-extended 16-bit immediates near each other. Try to schedule µops that have no immediate immediately before or after µops with 32-bit immediates.
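The limits of the zero-extension technique for 16-bit compares can be sketched in C. The function names below are hypothetical. The sketch shows that an equality test on zero-extended values agrees with the signed test, while an ordering test does not when the signs differ.

```c
#include <assert.h>
#include <stdint.h>

/* Equality after zero extension (what MOVZX followed by CMP computes). */
int equal_zero_extended(int16_t a, int16_t b)
{
    uint32_t ea = (uint16_t)a;   /* zero-extend the low 16 bits */
    uint32_t eb = (uint16_t)b;
    return ea == eb;
}

/* Ordering after zero extension: not equivalent to signed ordering. */
int less_zero_extended(int16_t a, int16_t b)
{
    uint32_t ea = (uint16_t)a;
    uint32_t eb = (uint16_t)b;
    return ea < eb;
}

/* Checks one pair: the zero-extended equality test must match a == b. */
void check_pair(int16_t a, int16_t b)
{
    assert(equal_zero_extended(a, b) == (a == b));
}
```

For example, with a = -1 and b = 1 the signed comparison a < b is true, but the zero-extended values are 0xFFFF and 1, so less_zero_extended returns 0.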
3.5.1.9
Compares
Use TEST when comparing a value in a register with zero. TEST essentially ANDs operands together without writing to a destination register. TEST is preferred over AND because AND produces an extra result register. TEST is better than CMP ..., 0 because the instruction size is smaller.
Use TEST when comparing the result of a logical AND with an immediate constant for equality or inequality if the register is EAX for cases such as:
    if ( AVAR & 8 ) { }
The TEST instruction can also be used to detect rollover of modulo of a power of 2. For example, the C code:
    if ( (AVAR % 16) == 0 ) { }
can be implemented using:
    TEST EAX, 0x0F
    JNZ AfterIf
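The equivalence behind the TEST EAX, 0x0F sequence above can be written out in C. This is a minimal sketch for unsigned values; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero when `value` is a multiple of `pow2`.
 * The mask test corresponds to TEST value, pow2 - 1 followed by a
 * branch on the zero flag. `pow2` must be a power of two. */
int is_multiple_of_pow2(uint32_t value, uint32_t pow2)
{
    assert(pow2 != 0 && (pow2 & (pow2 - 1)) == 0);
    return (value & (pow2 - 1)) == 0;
}
```

For pow2 equal to 16 the mask is 0x0F, so the result matches (value % 16) == 0 without a division.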
Using the TEST instruction between the instruction that may modify part of the flag register and the instruction that uses the flag register can also help prevent partial flag register stall.
Assembly/Compiler Coding Rule 40. (ML impact, M generality) Use the TEST instruction instead of AND when the result of the logical AND is not used. This saves µops in execution. Use a TEST of a register with itself instead of a CMP of the register to zero; this saves the need to encode the zero and saves encoding space. Avoid comparing a constant to a memory operand. It is preferable to load the memory operand and compare the constant to a register.
Often a produced value must be compared with zero, and then used in a branch. Because most Intel architecture instructions set the condition codes as part of their execution, the compare instruction may be eliminated. Thus the operation can be tested directly by a JCC instruction. The notable exceptions are MOV and LEA. In these cases, use TEST.
Assembly/Compiler Coding Rule 41. (ML impact, M generality) Eliminate unnecessary compare with zero instructions by using the appropriate conditional jump instruction when the flags are already set by a preceding arithmetic instruction. If necessary, use a TEST instruction instead of a compare. Be certain that any code transformations made do not introduce problems with overflow.
3.5.1.10
Using NOPs
Code generators generate a no-operation (NOP) to align instructions. Examples of NOPs of different lengths in 32-bit mode are shown below:
    1-byte: XCHG EAX, EAX
    2-byte: 66 NOP
    3-byte: LEA REG, 0 (REG) (8-bit displacement)
    4-byte: NOP DWORD PTR [EAX + 0] (8-bit displacement)
    5-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (8-bit displacement)
    6-byte: LEA REG, 0 (REG) (32-bit displacement)
    7-byte: NOP DWORD PTR [EAX + 0] (32-bit displacement)
    8-byte: NOP DWORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
    9-byte: NOP WORD PTR [EAX + EAX*1 + 0] (32-bit displacement)
These are all true NOPs, having no effect on the state of the machine except to advance the EIP. Because NOPs require hardware resources to decode and execute, use the fewest number to achieve the desired padding.
The one-byte NOP, [XCHG EAX,EAX], has special hardware support. Although it still consumes a µop and its accompanying resources, the dependence upon the old value of EAX is removed. This µop can be executed at the earliest possible opportunity, reducing the number of outstanding instructions, and is the lowest cost NOP.
The other NOPs have no special hardware support. Their input and output registers are interpreted by the hardware. Therefore, a code generator should arrange to use the register containing the oldest value as input, so that the NOP will dispatch and release RS resources at the earliest possible opportunity.
Try to observe the following NOP generation priority:
• Select the smallest number of NOPs and pseudo-NOPs to provide the desired padding.
• Select NOPs that are least likely to execute on slower execution unit clusters.
• Select the register arguments of NOPs to reduce dependencies.
3.5.1.11
Mixing SIMD Data Types
Previous microarchitectures (before Intel Core microarchitecture) do not have explicit restrictions on mixing integer and floating-point (FP) operations on XMM registers. For Intel Core microarchitecture, mixing integer and floating-point operations on the content of an XMM register can degrade performance. Software should avoid mixed use of integer/FP operations on XMM registers. Specifically:
• Use SIMD integer operations to feed SIMD integer operations. Use PXOR for idiom.
• Use SIMD floating-point operations to feed SIMD floating-point operations. Use XORPS for idiom.
• When floating-point operations are bitwise equivalent, use PS data type instead of PD data type. MOVAPS and MOVAPD do the same thing, but MOVAPS takes one less byte to encode the instruction.
3.5.1.12
Spill Scheduling
The spill scheduling algorithm used by a code generator will be impacted by the memory subsystem. A spill scheduling algorithm is an algorithm that selects what values to spill to memory when there are too many live values to fit in registers. Consider the code in Example 3-22, where it is necessary to spill either A, B, or C.
Example 3-22. Spill Scheduling Code
    LOOP
        C := ...
        B := ...
        A := A + ...
For modern microarchitectures, using dependence depth information in spill scheduling is even more important than in previous processors. The loop-carried dependence in A makes it especially important that A not be spilled. Not only would a store/load be placed in the dependence chain, but there would also be a data-not-ready stall of the load, costing further cycles.
Assembly/Compiler Coding Rule 42. (H impact, MH generality) For small loops, placing loop invariants in memory is better than spilling loop-carried dependencies.
A possibly counter-intuitive result is that in such a situation it is better to put loop invariants in memory than in registers, since loop invariants never have a load blocked by store data that is not ready.
3.5.1.13
Zero-Latency MOV Instructions
In processors based on Intel microarchitecture code name Ivy Bridge, a subset of register-to-register move operations are executed in the front end (similar to zero-idioms, see Section 3.5.1.8). This conserves scheduling/execution resources in the out-of-order engine. Most forms of register-to-register MOV instructions can benefit from zero-latency MOV. Example 3-23 lists the details of those forms that qualify and a small set that do not.
Example 3-23. Zero-Latency MOV Instructions
MOV instructions latency that can be eliminated
    MOV reg32, reg32
    MOV reg64, reg64
    MOVUPD/MOVAPD xmm, xmm
    MOVUPD/MOVAPD ymm, ymm
    MOVUPS/MOVAPS xmm, xmm
    MOVUPS/MOVAPS ymm, ymm
    MOVDQA/MOVDQU xmm, xmm
    MOVDQA/MOVDQU ymm, ymm
    MOVZX reg32, reg8 (if not AH/BH/CH/DH)
    MOVZX reg64, reg8 (if not AH/BH/CH/DH)
MOV instructions latency that cannot be eliminated
    MOV reg8, reg8
    MOV reg16, reg16
    MOVZX reg32, reg8 (if AH/BH/CH/DH)
    MOVZX reg64, reg8 (if AH/BH/CH/DH)
    MOVSX
Example 3-24 shows how to process 8-bit integers using MOVZX to take advantage of the zero-latency MOV enhancement. Consider
    X = (X * 3^N) MOD 256;
    Y = (Y * 3^N) MOD 256;
When “MOD 256” is implemented using the “AND 0xff” technique, its latency is exposed in the result dependency chain. Using a form of MOVZX on a truncated byte input, it can take advantage of the zero-latency MOV enhancement and gain about 45% in speed.
Example 3-24. Byte-Granular Data Computation Technique
Use AND Reg32, 0xff
    mov rsi, N
    mov rax, X
    mov rcx, Y
loop:
    lea rcx, [rcx+rcx*2]
    lea rax, [rax+rax*4]
    and rcx, 0xff
    and rax, 0xff
    lea rcx, [rcx+rcx*2]
    lea rax, [rax+rax*4]
    and rcx, 0xff
    and rax, 0xff
    sub rsi, 2
    jg loop
Use MOVZX
    mov rsi, N
    mov rax, X
    mov rcx, Y
loop:
    lea rbx, [rcx+rcx*2]
    movzx rcx, bl
    lea rbx, [rcx+rcx*2]
    movzx rcx, bl
    lea rdx, [rax+rax*4]
    movzx rax, dl
    lea rdx, [rax+rax*4]
    movzx rax, dl
    sub rsi, 2
    jg loop
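The equivalence that Example 3-24 relies on, masking with 0xff versus truncating to a byte and zero-extending, can be sketched in C. The function names are hypothetical, and whether a compiler emits AND or MOVZX for either form is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Computes (x * 3^n) mod 256 with an explicit mask at every step,
 * like the AND reg, 0xff sequence. The input must already fit in a byte. */
uint32_t pow3_mod256_mask(uint32_t x, unsigned n)
{
    assert(x <= 0xffu);
    for (unsigned i = 0; i < n; i++)
        x = (x * 3u) & 0xffu;
    return x;
}

/* Same computation, truncating through a byte and zero-extending,
 * like the LEA followed by MOVZX reg, reg8 sequence. */
uint32_t pow3_mod256_byte(uint32_t x, unsigned n)
{
    assert(x <= 0xffu);
    for (unsigned i = 0; i < n; i++) {
        uint8_t low = (uint8_t)(x * 3u);   /* keep only the low byte */
        x = low;                           /* zero-extend back to 32 bits */
    }
    return x;
}
```

Both functions return identical results for every byte input; the speed difference described above comes from how the hardware handles the MOVZX form, not from the arithmetic.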
The effectiveness of coding a dense sequence of instructions to rely on a zero-latency MOV instruction must also consider internal resource constraints in the microarchitecture.
Example 3-25. Re-ordering Sequence to Improve Effectiveness of Zero-Latency MOV Instructions
Needing more internal resource for zero-latency MOVs
    mov rsi, N
    mov rax, X
    mov rcx, Y
loop:
    lea rbx, [rcx+rcx*2]
    movzx rcx, bl
    lea rdx, [rax+rax*4]
    movzx rax, dl
    lea rbx, [rcx+rcx*2]
    movzx rcx, bl
    lea rdx, [rax+rax*4]
    movzx rax, dl
    sub rsi, 2
    jg loop
Needing less internal resource for zero-latency MOVs
    mov rsi, N
    mov rax, X
    mov rcx, Y
loop:
    lea rbx, [rcx+rcx*2]
    movzx rcx, bl
    lea rbx, [rcx+rcx*2]
    movzx rcx, bl
    lea rdx, [rax+rax*4]
    movzx rax, dl
    lea rdx, [rax+rax*4]
    movzx rax, dl
    sub rsi, 2
    jg loop
In Example 3-25, RBX/RCX and RDX/RAX are pairs of registers that are shared and continuously overwritten. In the second sequence, registers are overwritten with new results immediately, consuming fewer internal resources provided by the underlying microarchitecture. As a result, it is about 8% faster than the first sequence, where internal resources could only support 50% of the attempts to take advantage of zero-latency MOV instructions.
3.5.2
Avoiding Stalls in Execution Core
Although the design of the execution core is optimized to make common cases execute quickly, a micro-op may encounter various hazards, delays, or stalls while making forward progress from the front end to the ROB and RS. The significant cases are:
• ROB Read Port Stalls.
• Partial Register Reference Stalls.
• Partial Updates to XMM Register Stalls.
• Partial Flag Register Reference Stalls.
3.5.2.1
ROB Read Port Stalls
As a micro-op is renamed, it determines whether its source operands have executed and been written to the reorder buffer (ROB), or whether they will be captured “in flight” in the RS or in the bypass network. Typically, the great majority of source operands are found to be “in flight” during renaming. Those that have been written back to the ROB are read through a set of read ports.
Since the Intel Core microarchitecture is optimized for the common case where the operands are “in flight”, it does not provide a full set of read ports to enable all renamed micro-ops to read all sources from the ROB in the same cycle. When not all sources can be read, a micro-op can stall in the rename stage until it can get access to enough ROB read ports to complete renaming the micro-op. This stall is usually short-lived. Typically, a micro-op will complete renaming in the next cycle, but it appears to the application as a loss of rename bandwidth.
Some of the software-visible situations that can cause ROB read port stalls include:
• Registers that have become cold and require a ROB read port because execution units are doing other independent calculations.
• Constants inside registers.
• Pointer and index registers.
In rare cases, ROB read port stalls may lead to more significant performance degradations. There are a couple of heuristics that can help prevent over-subscribing the ROB read ports:
• Keep common register usage clustered together. Multiple references to the same written-back register can be “folded” inside the out-of-order execution core.
• Keep short dependency chains intact. This practice ensures that the registers will not have been written back when the new micro-ops are written to the RS.
These two scheduling heuristics may conflict with other more common scheduling heuristics. To reduce demand on the ROB read port, use these two heuristics only if both of the following conditions are met:
• Short latency operations.
• Indications of actual ROB read port stalls can be confirmed by measurements of the performance event (the relevant event is RAT_STALLS.ROB_READ_PORT, see Chapter 19 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B).
If the code has a long dependency chain, these two heuristics should not be used because they can cause the RS to fill, causing damage that outweighs the positive effects of reducing demands on the ROB read port.
Starting with Intel microarchitecture code name Sandy Bridge, ROB port stall no longer applies because data is read from the physical register file.
3.5.2.2
Writeback Bus Conflicts
The writeback bus inside the execution engine is a common resource needed to facilitate out-of-order execution of micro-ops in flight. When the writeback bus is needed at the same time by two micro-ops executing in the same stack of execution units (see Table 2-15), the younger micro-op will have to wait for the writeback bus to be available. This situation is more likely for short-latency instructions, which experience a delay when they might otherwise have been ready for dispatching into the execution engine.
Consider a repeating sequence of independent floating-point ADDs with a single-cycle MOV bound to the same dispatch port. When the MOV finds the dispatch port available, the writeback bus can be occupied by the ADD. This delays the MOV operation.
If this problem is detected, you can sometimes change the instruction selection to use a different dispatch port and reduce the writeback contention.
3.5.2.3
Bypass between Execution Domains
Floating-point (FP) loads have an extra cycle of latency. Moves between FP and SIMD stacks have another additional cycle of latency.
Example:
    ADDPS XMM0, XMM1
    PAND XMM0, XMM3
    ADDPS XMM2, XMM0
The overall latency for the above calculation is 9 cycles:
• 3 cycles for each ADDPS instruction.
• 1 cycle for the PAND instruction.
• 1 cycle to bypass between the ADDPS floating-point domain to the PAND integer domain.
• 1 cycle to move the data from the PAND integer to the second floating-point ADDPS domain.
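The cycle count above (3 + 1 + 1 + 1 + 3 = 9) can be reproduced with a small C sketch that sums instruction latencies along a serial dependency chain and charges a bypass delay at every domain change. The type and function names are hypothetical, and the latencies are taken from the list above rather than from any measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one instruction in a serial dependency chain. */
enum domain { DOMAIN_FP, DOMAIN_INT };

struct op {
    enum domain dom;     /* execution domain of the instruction */
    unsigned latency;    /* latency in cycles */
};

/* Sums the latencies of the chain and adds `bypass_cycles` each time
 * two consecutive instructions execute in different domains. */
unsigned chain_latency(const struct op *ops, size_t n, unsigned bypass_cycles)
{
    assert(ops != NULL || n == 0);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && ops[i].dom != ops[i - 1].dom)
            total += bypass_cycles;
        total += ops[i].latency;
    }
    return total;
}
```

Applied to the ADDPS, PAND, ADDPS sequence with a 1-cycle bypass, the function returns 9; keeping all three instructions in one domain would remove the two bypass cycles from the model.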
To avoid this penalty, you should organize code to minimize domain changes. Sometimes you cannot avoid bypasses.
Account for bypass cycles when counting the overall latency of your code. If your calculation is latency-bound, you can execute more instructions in parallel or break dependency chains to reduce total latency.
Code that has many bypass domains and is completely latency-bound may run slower on the Intel Core microarchitecture than it did on previous microarchitectures.
3.5.2.4
Partial Register Stalls
General purpose registers can be accessed in granularities of bytes, words, doublewords; 64-bit mode also supports quadword granularity. Referencing a portion of a register is referred to as a partial register reference.
A partial register stall happens when an instruction refers to a register, portions of which were previously modified by other instructions. For example, partial register stalls occur with a read to AX while previous instructions stored AL and AH, or a read to EAX while a previous instruction modified AX.
The delay of a partial register stall is small in processors based on Intel Core and NetBurst microarchitectures, and in Pentium M processor (with CPUID signature family 6, model 13), Intel Core Solo, and Intel Core Duo processors. Pentium M processors (CPUID signature with family 6, model 9) and the P6 family incur a large penalty.
Note that in Intel 64 architecture, an update to the lower 32 bits of a 64-bit integer register is architecturally defined to zero extend the upper 32 bits. While this action may be logically viewed as a 32-bit update, it is really a 64-bit update (and therefore does not cause a partial stall).
Referencing partial registers frequently produces code sequences with either false or real dependencies. Example 3-18 demonstrates a series of false and real dependencies caused by referencing partial registers.
If instructions 4 and 6 (in Example 3-18) are changed to use a MOVZX instruction instead of a MOV, then the dependences of instruction 4 on 2 (and transitively 1 before it), and instruction 6 on 5 are broken. This creates two independent chains of computation instead of one serial one.
Example 3-26 illustrates the use of MOVZX to avoid a partial register stall when packing three byte values into a register.
Example 3-26. Avoiding Partial Register Stalls in Integer Code
A Sequence Causing Partial Register Stall
    mov al, byte ptr a[2]
    shl eax, 16
    mov ax, word ptr a
    movd mm0, eax
    ret
Alternate Sequence Using MOVZX to Avoid Delay
    movzx eax, byte ptr a[2]
    shl eax, 16
    movzx ecx, word ptr a
    or eax, ecx
    movd mm0, eax
    ret
In Intel microarchitecture code name Sandy Bridge, partial register access is handled in hardware by inserting a micro-op that merges the partial register with the full register in the following cases:
• After a write to one of the registers AH, BH, CH or DH and before a following read of the 2-, 4- or 8-byte form of the same register. In these cases a merge micro-op is inserted. The insertion consumes a full allocation cycle in which other micro-ops cannot be allocated.
• After a micro-op with a destination register of 1 or 2 bytes, which is not a source of the instruction (or the register's bigger form), and before a following read of a 2-, 4- or 8-byte form of the same register. In these cases the merge micro-op is part of the flow. For example:
    • MOV AX, [BX]
      When you want to load from memory to a partial register, consider using MOVZX or MOVSX to avoid the additional merge micro-op penalty.
    • LEA AX, [BX+CX]
For optimal performance, using zero idioms before the use of the register eliminates the need for partial register merge micro-ops.
3.5.2.5
Partial XMM Register Stalls
Partial register stalls can also apply to XMM registers. The following SSE and SSE2 instructions update only part of the destination register:
    MOVL/HPD XMM, MEM64
    MOVL/HPS XMM, MEM64
    MOVSS/SD between registers
Using these instructions creates a dependency chain between the unmodified part of the register and the modified part of the register. This dependency chain can cause performance loss.
Example 3-27 illustrates the use of MOVSD for memory loads and MOVAPD for register copies to avoid partial register stalls in SIMD code.
Follow these recommendations to avoid stalls from partial updates to XMM registers:
• Avoid using instructions which update only part of the XMM register.
• If a 64-bit load is needed, use the MOVSD or MOVQ instruction.
• If two 64-bit loads are required to the same register from non-contiguous locations, use MOVSD/MOVHPD instead of MOVLPD/MOVHPD.
• When copying the XMM register, use the following instructions for full register copy, even if you only want to copy some of the source register data:
    MOVAPS
    MOVAPD
    MOVDQA
Example 3-27. Avoiding Partial Register Stalls in SIMD Code
Using movlpd for memory transactions and movsd between register copies Causing Partial Register Stall
    mov edx, x
    mov ecx, count
    movlpd xmm3, _1_
    movlpd xmm2, _1pt5_
    align 16
lp:
    movlpd xmm0, [edx]
    addsd xmm0, xmm3
    movsd xmm1, xmm2
    subsd xmm1, [edx]
    mulsd xmm0, xmm1
    movsd [edx], xmm0
    add edx, 8
    dec ecx
    jnz lp
Using movsd for memory and movapd between register copies Avoid Delay
    mov edx, x
    mov ecx, count
    movsd xmm3, _1_
    movsd xmm2, _1pt5_
    align 16
lp:
    movsd xmm0, [edx]
    addsd xmm0, xmm3
    movapd xmm1, xmm2
    subsd xmm1, [edx]
    mulsd xmm0, xmm1
    movsd [edx], xmm0
    add edx, 8
    dec ecx
    jnz lp
3.5.2.6
Partial Flag Register Stalls
A “partial flag register stall” occurs when an instruction modifies a part of the flag register and the following instruction is dependent on the outcome of the flags. This happens most often with shift instructions (SAR, SAL, SHR, SHL). The flags are not modified in the case of a zero shift count, but the shift count is usually known only at execution time. The front end stalls until the instruction is retired.
Other instructions that can modify some part of the flag register include CMPXCHG8B, various rotate instructions, STC, and STD. An example of assembly with a partial flag register stall and alternative code without the stall is shown in Example 3-28.
In processors based on Intel Core microarchitecture, shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.
Example 3-28. Avoiding Partial Flag Register Stalls
Partial Flag Register Stall
    xor eax, eax
    mov ecx, a
    sar ecx, 2
    setz al ; SAR can update carry causing a stall
Avoiding Partial Flag Register Stall
    xor eax, eax
    mov ecx, a
    sar ecx, 2
    test ecx, ecx ; test always updates all flags
    setz al ; No partial reg or flag stall
In Intel microarchitecture code name Sandy Bridge, the cost of partial flag access is replaced by the insertion of a micro-op instead of a stall. However, it is still recommended to use fewer instructions that write only to some of the flags (such as INC, DEC, SET CL) before instructions that can write flags conditionally (such as SHIFT CL).
Example 3-29 compares two techniques to implement the addition of very large integers (e.g. 1024 bits). The simplified sequence in Example 3-29 will be faster than the flag-saving sequence on Intel microarchitecture code name Sandy Bridge, but it will experience partial flag stalls on prior microarchitectures.
Example 3-29. Partial Flag Register Accesses in Intel Microarchitecture Code Name Sandy Bridge
Save partial flag register to avoid stall
    lea rsi, [A]
    lea rdi, [B]
    xor rax, rax
    mov rcx, 16 ; 16*64 = 1024 bit
lp_64bit:
    add rax, [rsi]
    adc rax, [rdi]
    mov [rdi], rax
    setc al ; save carry for next iteration
    movzx rax, al
    add rsi, 8
    add rdi, 8
    dec rcx
    jnz lp_64bit
Simplified code sequence
    lea rsi, [A]
    lea rdi, [B]
    xor rax, rax
    mov rcx, 16
lp_64bit:
    add rax, [rsi]
    adc rax, [rdi]
    mov [rdi], rax
    lea rsi, [rsi+8]
    lea rdi, [rdi+8]
    dec rcx
    jnz lp_64bit
3.5.2.7
Floating-Point/SIMD Operands
Moves that write a portion of a register can introduce unwanted dependences. The MOVSD REG, REG instruction writes only the bottom 64 bits of a register, not all 128 bits. This introduces a dependence on the preceding instruction that produces the upper 64 bits (even if those bits are no longer wanted). The dependence inhibits register renaming, and thereby reduces parallelism.
Use MOVAPD as an alternative; it writes all 128 bits. Even though this instruction has a longer latency, the µops for MOVAPD use a different execution port and this port is more likely to be free. The change can impact performance. There may be exceptional cases where the latency matters more than the dependence or the execution port.
Assembly/Compiler Coding Rule 43. (M impact, ML generality) Avoid introducing dependences with partial floating-point register writes, e.g. from the MOVSD XMMREG1, XMMREG2 instruction. Use the MOVAPD XMMREG1, XMMREG2 instruction instead.
The MOVSD XMMREG, MEM instruction writes all 128 bits and breaks a dependence.
The MOVUPD from memory instruction performs two 64-bit loads, but requires additional µops to adjust the address and combine the loads into a single register. This same functionality can be obtained using MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2, which uses fewer µops and can be packed into the trace cache more effectively. The latter alternative has been found to provide a several percent performance improvement in some cases. Its encoding requires more instruction bytes, but this is seldom an issue for the Pentium 4 processor. The store version of MOVUPD is complex and slow, so much so that the sequence with two MOVSD and a UNPCKHPD should always be used.
Assembly/Compiler Coding Rule 44. (ML impact, L generality) Instead of using MOVUPD XMMREG1, MEM for an unaligned 128-bit load, use MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2. If the additional register is not available, then use MOVSD XMMREG1, MEM; MOVHPD XMMREG1, MEM+8.
Assembly/Compiler Coding Rule 45. (M impact, ML generality) Instead of using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1; UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1.
3.5.3
Vectorization
This section provides a brief summary of optimization issues related to vectorization. There is more detail in the chapters that follow.
Vectorization is a program transformation that allows special hardware to perform the same operation on multiple data elements at the same time. Successive processor generations have provided vector support through the MMX technology, Streaming SIMD Extensions (SSE), Streaming SIMD Extensions 2 (SSE2), Streaming SIMD Extensions 3 (SSE3) and Supplemental Streaming SIMD Extensions 3 (SSSE3).
Vectorization is a special case of SIMD, a term defined in Flynn’s architecture taxonomy to denote a single instruction stream capable of operating on multiple data elements in parallel. The number of elements which can be operated on in parallel ranges from four single-precision floating-point data elements in Streaming SIMD Extensions and two double-precision floating-point data elements in Streaming SIMD Extensions 2 to sixteen byte operations in a 128-bit register in Streaming SIMD Extensions 2. Thus, vector length ranges from 2 to 16, depending on the instruction extensions used and on the data type.
The Intel C++ Compiler supports vectorization in three ways:
• The compiler may be able to generate SIMD code without intervention from the user.
• The user can insert pragmas to help the compiler realize that it can vectorize the code.
• The user can write SIMD code explicitly using intrinsics and C++ classes.
To help enable the compiler to generate SIMD code, avoid global pointers and global variables. These issues may be less troublesome if all modules are compiled simultaneously, and whole-program optimization is used.
User/Source Coding Rule 2. (H impact, M generality) Use the smallest possible floating-point or SIMD data type, to enable more parallelism with the use of a (longer) SIMD vector. For example, use single precision instead of double precision where possible.
User/Source Coding Rule 3. (M impact, ML generality) Arrange the nesting of loops so that the innermost nesting level is free of inter-iteration dependencies. Especially avoid the case where the store of data in an earlier iteration happens lexically after the load of that data in a future iteration, something which is called a lexically backward dependence.
The integer part of the SIMD instruction set extensions covers 8-bit, 16-bit and 32-bit operands. Not all SIMD operations are supported for 32 bits, meaning that some source code cannot be vectorized at all unless smaller operands are used.
User/Source Coding Rule 4. (M impact, ML generality) Avoid the use of conditional branches inside loops and consider using SSE instructions to eliminate branches.
User/Source Coding Rule 5. (M impact, ML generality) Keep induction (loop) variable expressions simple.
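As an illustration of User/Source Coding Rule 3 above, the following C sketch shows the same computation with two loop nestings. The array shape and function names are hypothetical and chosen only for this example. In the first version the innermost loop carries a dependence from one iteration to the next; in the second the loops are interchanged so that the innermost iterations are independent of one another, which is the form a vectorizer can usually handle.

```c
#include <stddef.h>

#define ROWS 4
#define COLS 8

/* Innermost loop walks i: iteration i reads the value that iteration
 * i-1 has just stored, so the innermost level carries a dependence. */
void column_prefix_inner_dependent(float a[ROWS][COLS])
{
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 1; i < ROWS; i++)
            a[i][j] += a[i - 1][j];
}

/* Loops interchanged: for a fixed i, the COLS updates do not depend
 * on one another, so the innermost level is dependence-free. */
void column_prefix_inner_independent(float a[ROWS][COLS])
{
    for (size_t i = 1; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] += a[i - 1][j];
}
```

Both functions compute the same column-wise running sums; only the nesting order differs. Whether a given compiler actually vectorizes the second form depends on the compiler and its options.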
3.5.4
Optimization of Partially Vectorizable Code
Frequently, a program contains a mixture of vectorizable code and some routines that are non-vectorizable. A common situation of partially vectorizable code involves a loop structure which includes mixtures of vectorized code and unvectorizable code. This situation is depicted in Figure 3-1.
[Figure: flow from Packed SIMD Instruction, through Unpacking, Unvectorizable Code (Serial Routine) and Packing, back to Packed SIMD Instruction]
Figure 3-1. Generic Program Flow of Partially Vectorized Code
It generally consists of five stages within the loop:
• Prolog.
• Unpacking vectorized data structure into individual elements.
• Calling a non-vectorizable routine to process each element serially.
• Packing individual result into vectorized data structure.
• Epilog.
This section discusses techniques that can reduce the cost and bottleneck associated with the packing/unpacking stages in partially vectorized code. Example 3-30 shows a reference code template that is representative of partially vectorizable coding situations that also experience performance issues. The unvectorizable portion of code is represented generically by a sequence of calls to a serial function named “foo”. This generic example is referred to as “shuffle with store forwarding”, because the problem generally involves an unpacking stage that shuffles data elements between register and memory, followed by a packing stage that can experience a store-forwarding issue.
There is more than one useful technique that can reduce the store-forwarding bottleneck between the serialized portion and the packing stage. The following subsections present alternate techniques to deal with the packing, unpacking, and parameter passing to serialized function calls.
Example 3-30. Reference Code Template for Partially Vectorizable Program
// Prolog ///////////////////////////////
    push ebp
    mov ebp, esp
// Unpacking ////////////////////////////
    sub ebp, 32
    and ebp, 0xfffffff0
    movaps [ebp], xmm0
// Serial operations on components ///////
    sub ebp, 4
    mov eax, [ebp+4]
    mov [ebp], eax
    call foo
    mov [ebp+16+4], eax
    mov eax, [ebp+8]
    mov [ebp], eax
    call foo
    mov [ebp+16+4+4], eax
    mov eax, [ebp+12]
    mov [ebp], eax
    call foo
    mov [ebp+16+8+4], eax
    mov eax, [ebp+12+4]
    mov [ebp], eax
    call foo
    mov [ebp+16+12+4], eax
// Packing ///////////////////////////////
    movaps xmm0, [ebp+16+4]
// Epilog ////////////////////////////////
    pop ebp
    ret
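For readers who prefer a higher-level view, the unpack / serial call / pack structure of the reference template can be sketched in portable C. This is only an illustration of the program flow, not a translation of the assembly: the `vec4i` type stands in for an XMM register, and the body of `foo` is a hypothetical placeholder for any non-vectorizable `int -> int` routine.

```c
#include <stdint.h>

/* Hypothetical four-lane vector type standing in for an XMM register. */
typedef struct { int32_t lane[4]; } vec4i;

/* Hypothetical serial routine; any int -> int function would do. */
static int32_t foo(int32_t a) { return a + 1; }

vec4i apply_serial(vec4i v)
{
    int32_t in[4], out[4];
    vec4i r;

    /* Unpacking: spill the vector into individual elements. */
    for (int i = 0; i < 4; i++)
        in[i] = v.lane[i];

    /* Serial operations: one call per component. */
    for (int i = 0; i < 4; i++)
        out[i] = foo(in[i]);

    /* Packing: gather the four results back into one vector. */
    for (int i = 0; i < 4; i++)
        r.lane[i] = out[i];

    return r;
}
```

The packing loop corresponds to the step that can run into store-forwarding restrictions when four narrow stores are followed by one wide load.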
3.5.4.1
Alternate Packing Techniques
The packing method implemented in the reference code of Example 3-30 will experience delay as it assembles four doubleword results from memory into an XMM register, due to store-forwarding restrictions.
Three alternate techniques for packing, using different SIMD instructions to assemble contents in XMM registers, are shown in Example 3-31. All three techniques avoid store-forwarding delay by satisfying the restrictions on data sizes between a preceding store and subsequent load operations.
Example 3-31. Three Alternate Packing Methods for Avoiding Store Forwarding Difficulty
Packing Method 1
    movd xmm0, [ebp+16+4]
    movd xmm1, [ebp+16+8]
    movd xmm2, [ebp+16+12]
    movd xmm3, [ebp+12+16+4]
    punpckldq xmm0, xmm1
    punpckldq xmm2, xmm3
    punpckldq xmm0, xmm2
Packing Method 2
    movd xmm0, [ebp+16+4]
    movd xmm1, [ebp+16+8]
    movd xmm2, [ebp+16+12]
    movd xmm3, [ebp+12+16+4]
    psllq xmm3, 32
    orps xmm2, xmm3
    psllq xmm1, 32
    orps xmm0, xmm1
    movlhps xmm0, xmm2
Packing Method 3
    movd xmm0, [ebp+16+4]
    movd xmm1, [ebp+16+8]
    movd xmm2, [ebp+16+12]
    movd xmm3, [ebp+12+16+4]
    movlhps xmm1, xmm3
    psllq xmm1, 32
    movlhps xmm0, xmm2
    orps xmm0, xmm1
3.5.4.2
Simplifying Result Passing
In Example 3-30, individual results were passed to the packing stage by storing to contiguous memory locations. Instead of using memory spills to pass four results, result passing may be accomplished by using one or more registers. Using registers to simplify result passing and reduce memory spills can improve performance by varying degrees depending on the register pressure at runtime.
Example 3-32 shows the coding sequence that uses four extra XMM registers to remove the memory spills used for passing results back to the parent routine. However, software must observe the following conditions when using this technique:
• There is no register shortage.
• If the loop does not have many stores or loads but has many computations, this technique does not help performance. This technique adds work to the computational units, while the store and load ports are idle.
Example 3-32. Using Four Registers to Reduce Memory Spills and Simplify Result Passing
    mov eax, [ebp+4]
    mov [ebp], eax
    call foo
    movd xmm0, eax
    mov eax, [ebp+8]
    mov [ebp], eax
    call foo
    movd xmm1, eax
    mov eax, [ebp+12]
    mov [ebp], eax
    call foo
    movd xmm2, eax
    mov eax, [ebp+12+4]
    mov [ebp], eax
    call foo
    movd xmm3, eax
3.5.4.3
Stack Optimization
In Example 3-30, an input parameter was copied in turn onto the stack and passed to the non-vectorizable routine for processing. The parameter passing from consecutive memory locations can be simplified by a technique shown in Example 3-33.
Example 3-33. Stack Optimization Technique to Simplify Parameter Passing
    call foo
    mov [ebp+16], eax
    add ebp, 4
    call foo
    mov [ebp+16], eax
    add ebp, 4
    call foo
    mov [ebp+16], eax
    add ebp, 4
    call foo
Stack optimization can only be used when:
• The serial operations are function calls. The function “foo” is declared as: INT FOO(INT A). The parameter is passed on the stack.
• The order of operation on the components is from last to first.
Note the call to FOO and the advance of EBP when passing the vector elements to FOO one by one from last to first.
3.5.4.4
Tuning Considerations
Tuning considerations for situations represented by the loop in Example 3-30 include:
• Applying one or more of the following combinations:
  — Choose an alternate packing technique.
  — Consider a technique to simplify result passing.
  — Consider the stack optimization technique to simplify parameter passing.
• Minimizing the average number of cycles to execute one iteration of the loop.
• Minimizing the per-iteration cost of the unpacking and packing operations.
The speed improvement from using the techniques discussed in this section will vary, depending on the choice of combinations implemented and the characteristics of the non-vectorizable routine. For example, if the routine “foo” is short (representative of tight, short loops), the per-iteration cost of unpacking/packing tends to be smaller than in situations where the non-vectorizable code contains longer operations or many dependencies. This is because many iterations of a short, tight loop can be in flight in the execution core, so the per-iteration cost of packing and unpacking is only partially exposed and appears to cause very little performance degradation.
Evaluation of the per-iteration cost of packing/unpacking should be carried out in a methodical manner over a selected number of test cases, where each case may implement some combination of the techniques discussed in this section. The per-iteration cost can be estimated by:
• Evaluating the average cycles to execute one iteration of the test case.
• Evaluating the average cycles to execute one iteration of a base line loop sequence of non-vectorizable code.
Example 3-34 shows the base line code sequence that can be used to estimate the average cost of a loop that executes non-vectorizable routines.
Example 3-34. Base Line Code Sequence to Estimate Loop Overhead
    push ebp
    mov ebp, esp
    sub ebp, 4
    mov [ebp], edi
    call foo
    mov [ebp], edi
    call foo
    mov [ebp], edi
    call foo
    mov [ebp], edi
    call foo
    add ebp, 4
    pop ebp
    ret
The average per-iteration cost of packing/unpacking can be derived from measuring the execution times of a large number of iterations by:
((Cycles to run TestCase) - (Cycles to run equivalent baseline sequence)) / (Iteration count).
For example, using a simple function that returns an input parameter (representative of tight, short loops), the per-iteration cost of packing/unpacking may range from slightly more than 7 cycles (the shuffle with store forwarding case, Example 3-30) to ~0.9 cycles (accomplished by several test cases). Across 27 test cases (consisting of one of the alternate packing methods, no result-simplification/simplification of either 1 or 4 results, no stack optimization or with stack optimization), the average per-iteration cost of packing/unpacking is about 1.7 cycles.
Generally speaking, packing methods 2 and 3 (see Example 3-31) tend to be more robust than packing method 1; the optimal choice of simplifying 1 or 4 results will be affected by register pressure at runtime and other relevant microarchitectural conditions.
Note that the numeric discussion of per-iteration cost of packing/unpacking is illustrative only. It will vary with test cases using a different base line code sequence and will generally increase if the non-vectorizable routine requires longer time to execute because the number of loop iterations that can reside in flight in the execution core decreases.
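The per-iteration cost formula above is simple arithmetic, and a small helper makes the bookkeeping explicit. The function name and the zero-iteration guard are choices made for this sketch; how the cycle counts are obtained (for example, from a time-stamp counter) is outside its scope.

```c
#include <stdint.h>

/* Average per-iteration packing/unpacking cost, following
 * ((cycles for test case) - (cycles for baseline)) / (iteration count).
 * Returns 0.0 when no iterations were measured. */
double per_iteration_cost(uint64_t test_cycles,
                          uint64_t baseline_cycles,
                          uint64_t iterations)
{
    if (iterations == 0)
        return 0.0;
    return ((double)test_cycles - (double)baseline_cycles) / (double)iterations;
}
```

For instance, a test case measured at 1700 cycles against a 1000-cycle baseline over 100 iterations gives a cost of 7 cycles per iteration. These numbers are made up for the illustration and are not measurements.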
3.6
OPTIMIZING MEMORY ACCESSES
This section discusses guidelines for optimizing code and data memory accesses. The most important recommendations are:
• Execute load and store operations within available execution bandwidth.
• Enable forward progress of speculative execution.
• Enable store forwarding to proceed.
• Align data, paying attention to data layout and stack alignment.
• Place code and data on separate pages.
• Enhance data locality.
• Use prefetching and cacheability control instructions.
• Enhance code locality and align branch targets.
• Take advantage of write combining.
Alignment and forwarding problems are among the most common sources of large delays on processors based on Intel NetBurst microarchitecture.
3.6.1
Load and Store Execution Bandwidth
Typically, loads and stores are the most frequent operations in a workload; it is not uncommon for up to 40% of the instructions in a workload to carry load or store intent. Each generation of microarchitecture provides multiple buffers to support executing load and store operations while there are instructions in flight.
Software can maximize memory performance by not exceeding the issue or buffering limitations of the machine. In the Intel Core microarchitecture, only 20 stores and 32 loads may be in flight at once. In Intel microarchitecture code name Nehalem, there are 32 store buffers and 48 load buffers. Since only one load can issue per cycle, algorithms which operate on two arrays are constrained to one operation every other cycle unless you use programming tricks to reduce the amount of memory usage.
Intel Core Duo and Intel Core Solo processors have fewer buffers. Nevertheless the general heuristic applies to all of them.
3.6.1.1
Make Use of Load Bandwidth in Intel® Microarchitecture Code Name Sandy Bridge
While prior microarchitectures have one load port (port 2), Intel microarchitecture code name Sandy Bridge can load from port 2 and port 3. Thus two load operations can be performed every cycle, doubling the load throughput of the code. This improves code that reads a lot of data and does not need to write out results to memory very often (Port 3 also handles store-address operations). To exploit this bandwidth, the data has to stay in the L1 data cache or it should be accessed sequentially, enabling the hardware prefetchers to bring the data to the L1 data cache in time.
Consider the following C code example of adding all the elements of an array:
    int buff[BUFF_SIZE];
    int sum = 0;
    for (i = 0; i < BUFF_SIZE; i++) {
        sum += buff[i];
    }
Alternative 1 is the assembly code generated by the Intel compiler for this C code, using the optimization flag for Intel microarchitecture code name Nehalem. The compiler vectorizes execution using Intel SSE instructions. In this code, each ADD operation uses the result of the previous ADD operation. This limits the throughput to one load and ADD operation per cycle. Alternative 2 is optimized for Intel microarchitecture code name Sandy Bridge by enabling it to use the additional load bandwidth. The code removes the dependency among ADD operations, by using two registers to sum the array values. Two load and two ADD operations can be executed every cycle.
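The idea of splitting one accumulator into two can also be expressed at the C source level. The sketch below is a scalar illustration of the dependency structure only, with hypothetical function names. It does not by itself guarantee the SSE code of either alternative, and a compiler may already perform a similar transformation on the simple loop.

```c
#include <stddef.h>

/* Single accumulator: every addition depends on the previous one. */
int sum_one_accumulator(const int *buff, size_t n)
{
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += buff[i];
    return sum;
}

/* Two accumulators: the two chains of additions are independent,
 * so two loads and two adds can proceed side by side. */
int sum_two_accumulators(const int *buff, size_t n)
{
    int sum0 = 0, sum1 = 0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        sum0 += buff[i];
        sum1 += buff[i + 1];
    }
    if (i < n)                 /* odd element count: one value left over */
        sum0 += buff[i];
    return sum0 + sum1;        /* combine the partial sums once at the end */
}
```

Both functions return the same value as long as the sum does not overflow; the second only changes the order in which the additions are grouped.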
Example 3-35. Optimize for Load Port Bandwidth in Intel Microarchitecture Code Name Sandy Bridge
Alternative 1: Register dependency inhibits PADD execution
    xor eax, eax
    pxor xmm0, xmm0
    lea rsi, buff
loop_start:
    paddd xmm0, [rsi+4*rax]
    paddd xmm0, [rsi+4*rax+16]
    paddd xmm0, [rsi+4*rax+32]
    paddd xmm0, [rsi+4*rax+48]
    paddd xmm0, [rsi+4*rax+64]
    paddd xmm0, [rsi+4*rax+80]
    paddd xmm0, [rsi+4*rax+96]
    paddd xmm0, [rsi+4*rax+112]
    add eax, 32
    cmp eax, BUFF_SIZE
    jl loop_start
sum_partials:
    movdqa xmm1, xmm0
    psrldq xmm1, 8
    paddd xmm0, xmm1
    movdqa xmm2, xmm0
    psrldq xmm2, 4
    paddd xmm0, xmm2
    movd [sum], xmm0
Alternative 2: Reduced register dependency allows two load ports to supply PADD execution
    xor eax, eax
    pxor xmm0, xmm0
    pxor xmm1, xmm1
    lea rsi, buff
loop_start:
    paddd xmm0, [rsi+4*rax]
    paddd xmm1, [rsi+4*rax+16]
    paddd xmm0, [rsi+4*rax+32]
    paddd xmm1, [rsi+4*rax+48]
    paddd xmm0, [rsi+4*rax+64]
    paddd xmm1, [rsi+4*rax+80]
    paddd xmm0, [rsi+4*rax+96]
    paddd xmm1, [rsi+4*rax+112]
    add eax, 32
    cmp eax, BUFF_SIZE
    jl loop_start
sum_partials:
    paddd xmm0, xmm1
    movdqa xmm1, xmm0
    psrldq xmm1, 8
    paddd xmm0, xmm1
    movdqa xmm2, xmm0
    psrldq xmm2, 4
    paddd xmm0, xmm2
    movd [sum], xmm0
3.6.1.2
L1D Cache Latency in Intel® Microarchitecture Code Name Sandy Bridge
Load latency from L1D cache may vary (see Table 2-19). The best case is 4 cycles, which applies to load operations to general purpose registers using one of the following:
• One register.
• A base register plus an offset that is smaller than 2048.
Consider the pointer-chasing code example in Example 3-36.
Example 3-36. Index versus Pointers in Pointer-Chasing Code
Traversing through indexes
// C code example
    index = buffer.m_buff[index].next_index;
// ASM example
loop:
    shl rbx, 6
    mov rbx, 0x20(rbx+rcx)
    dec rax
    cmp rax, -1
    jne loop
Traversing through pointers
// C code example
    node = node->pNext;
// ASM example
loop:
    mov rdx, [rdx]
    dec rax
    cmp rax, -1
    jne loop
The left side (traversing through indexes) implements pointer chasing via traversing an index. The compiler then generates the assembly shown for it, addressing memory using base+index with an offset. The right side (traversing through pointers) shows compiler generated code from pointer de-referencing code and uses only a base register. The code on the right side is faster than the left side across Intel microarchitecture code name Sandy Bridge and prior microarchitectures. However, the code that traverses an index will be slower on Intel microarchitecture code name Sandy Bridge relative to prior microarchitectures.
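The two traversal styles can be written out as complete C functions. The structure layouts below are assumptions made for this sketch: `index_node` is padded to 64 bytes with `next_index` at offset 0x20 so that it is consistent with the `shl rbx, 6` and `0x20` displacement in the example, and the member names other than `next_index` and `pNext` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed 64-byte element with the link stored at offset 0x20. */
struct index_node {
    uint64_t payload[4];
    uint64_t next_index;
    uint64_t pad[3];
};

struct ptr_node {
    struct ptr_node *pNext;
};

/* Index traversal: each step needs base + index*64 + 0x20. */
uint64_t chase_index(const struct index_node *buff, uint64_t index, size_t steps)
{
    while (steps--)
        index = buff[index].next_index;
    return index;
}

/* Pointer traversal: each step is a load through a single base register. */
const struct ptr_node *chase_pointer(const struct ptr_node *node, size_t steps)
{
    while (steps--)
        node = node->pNext;
    return node;
}
```

In the pointer version the link sits at offset 0 of the node, so the load address is just the register holding the current node, which is the addressing form that qualifies for the best-case latency described above.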
3.6.1.3
Handling L1D Cache Bank Conflict
In Intel microarchitecture code name Sandy Bridge, the internal organization of the L1D cache may manifest a situation in which two load micro-ops have addresses with a bank conflict. When a bank conflict is present between two load operations, the more recent one will be delayed until the conflict is resolved. A bank conflict happens when two simultaneous load operations have the same bits 2-5 of their linear address but they are not from the same set in the cache (bits 6-12).
Bank conflicts should be handled only if the code is bound by load bandwidth. Some bank conflicts do not cause any performance degradation since they are hidden by other performance limiters. Eliminating such bank conflicts does not improve performance.
The following example demonstrates bank conflicts and how to modify the code to avoid them. It uses two source arrays with a size that is a multiple of cache line size. When loading an element from A and the counterpart element from B the elements have the same offset in their cache lines and therefore a bank conflict may happen. With the Haswell microarchitecture, the L1 DCache bank conflict issue does not apply.
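The bank-conflict condition stated above (equal bits 2-5, different bits 6-12) can be checked for a pair of addresses with a few shifts and masks. The helper below is a sketch with a hypothetical name that encodes only that stated rule; it says nothing about whether two loads actually issue in the same cycle.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when two load addresses satisfy the stated bank-conflict rule:
 * same bits 2-5 of the linear address, different bits 6-12. */
bool may_bank_conflict(uintptr_t a, uintptr_t b)
{
    uintptr_t bank_a = (a >> 2) & 0xF;    /* bits 2-5  */
    uintptr_t bank_b = (b >> 2) & 0xF;
    uintptr_t set_a  = (a >> 6) & 0x7F;   /* bits 6-12 */
    uintptr_t set_b  = (b >> 6) & 0x7F;
    return bank_a == bank_b && set_a != set_b;
}
```

For two 512-byte arrays placed back to back, for example at the made-up addresses 0x1000 and 0x1200, element i of one and element i of the other always satisfy the condition. Offsetting one of the two accesses by four bytes changes bits 2-5 and removes it.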
Example 3-37. Example of Bank Conflicts in L1D Cache and Remedy
int A[128]; int B[128]; int C[128]; for (i=0;i
Array” does not change during the loop. Therefore, the compiler cannot keep “Ptr->Array” in a register as an invariant and must read it again in every iteration. Although this situation can be fixed in software by rewriting the code so that the address of the pointer is invariant, memory disambiguation provides performance gain without rewriting the code.
Example 3-39. Loads Blocked by Stores of Unknown Address
C code
    struct AA {
        AA ** Array;
    };
    void nullify_array ( AA *Ptr, DWORD Index, AA *ThisPtr )
    {
        while ( Ptr->Array[--Index] != ThisPtr )
        {
            Ptr->Array[Index] = NULL ;
        };
    };
Assembly sequence
nullify_loop:
    mov dword ptr [eax], 0
    mov edx, dword ptr [edi]
    sub ecx, 4
    cmp dword ptr [ecx+edx], esi
    lea eax, [ecx+edx]
    jne nullify_loop
3.6.4
Alignment
Alignment of data concerns all kinds of variables:
• Dynamically allocated variables.
• Members of a data structure.
• Global or local variables.
• Parameters passed on the stack.
Misaligned data access can incur significant performance penalties. This is particularly true for cache line splits. The size of a cache line is 64 bytes in the Pentium 4 and other recent Intel processors, including processors based on Intel Core microarchitecture.
An access to data unaligned on a 64-byte boundary leads to two memory accesses and requires several µops to be executed (instead of one). Accesses that span 64-byte boundaries are likely to incur a large performance penalty; the cost of each stall is generally greater on machines with longer pipelines.
Double-precision floating-point operands that are eight-byte aligned have better performance than operands that are not eight-byte aligned, since they are less likely to incur penalties for cache and MOB splits. A floating-point operation on a memory operand requires that the operand be loaded from memory. This incurs an additional µop, which can have a minor negative impact on front end bandwidth. Additionally, memory operands may cause a data cache miss, causing a penalty.
Assembly/Compiler Coding Rule 46. (H impact, H generality) Align data on natural operand size address boundaries. If the data will be accessed with vector instruction loads and stores, align the data on 16-byte boundaries.
For best performance, align data as follows:
• Align 8-bit data at any address.
• Align 16-bit data to be contained within an aligned 4-byte word.
• Align 32-bit data so that its base address is a multiple of four.
• Align 64-bit data so that its base address is a multiple of eight.
• Align 80-bit data so that its base address is a multiple of sixteen.
• Align 128-bit data so that its base address is a multiple of sixteen.
A 64-byte or greater data structure or array should be aligned so that its base address is a multiple of 64. Sorting data in decreasing size order is one heuristic for assisting with natural alignment. As long as 16-byte boundaries (and cache lines) are never crossed, natural alignment is not strictly necessary (though it is an easy way to enforce this).
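In C source, two of the recommendations above can be expressed directly: ordering structure members by decreasing size, and requesting cache-line alignment for a large array. The sketch below assumes a C11 compiler (`alignas` from `<stdalign.h>`); the type and variable names are hypothetical, and exact structure sizes depend on the target ABI.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Members in arbitrary order: padding is typically inserted
 * before d and before i to keep each member naturally aligned. */
struct unsorted { char c; double d; short s; int i; };

/* Same members sorted by decreasing size: naturally aligned
 * with little or no internal padding. */
struct sorted { double d; int i; short s; char c; };

/* An array larger than 64 bytes, aligned to a 64-byte boundary. */
static alignas(64) float samples[256];

/* Returns nonzero when p is a multiple of alignment. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

On common ABIs `struct sorted` is smaller than `struct unsorted`, but the language only guarantees that each member is suitably aligned, not a particular size.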
Example 3-40 shows the type of code that can cause a cache line split. The code loads the addresses of two DWORD arrays. 029E70FEH is not a 4-byte-aligned address, so a 4-byte access at this address will get 2 bytes from the cache line this address is contained in, and 2 bytes from the cache line that starts at 029E7100H. On processors with 64-byte cache lines, a similar cache line split will occur every 8 iterations.
Example 3-40. Code That Causes Cache Line Split
    mov esi, 029e70feh
    mov edi, 05be5260h
Blockmove:
    mov eax, DWORD PTR [esi]
    mov ebx, DWORD PTR [esi+4]
    mov DWORD PTR [edi], eax
    mov DWORD PTR [edi+4], ebx
    add esi, 8
    add edi, 8
    sub edx, 1
    jnz Blockmove
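The "every 8 iterations" figure can be checked with a short calculation. The sketch below uses hypothetical helper names and models only the two source loads of the Blockmove loop (a 4-byte load at ESI and at ESI+4, with ESI advancing by 8), counting how many of them straddle a 64-byte line.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE 64u

/* True when an access of `size` bytes (size >= 1) starting at `addr`
 * touches two different cache lines. */
bool splits_cache_line(uint32_t addr, uint32_t size)
{
    return (addr / CACHE_LINE) != ((addr + size - 1) / CACHE_LINE);
}

/* Counts split source loads over `iterations` passes of a loop that
 * reads two DWORDs per iteration and advances the pointer by 8 bytes. */
unsigned count_split_loads(uint32_t base, unsigned iterations)
{
    unsigned splits = 0;
    for (unsigned k = 0; k < iterations; k++) {
        uint32_t esi = base + 8u * k;
        splits += splits_cache_line(esi, 4);       /* load from [esi]   */
        splits += splits_cache_line(esi + 4, 4);   /* load from [esi+4] */
    }
    return splits;
}
```

Starting at 029E70FEH, the load from [esi] splits in iteration 0 and again in iteration 8, so 16 iterations see two splits. Starting at the aligned address 029E7100H, none do.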
Figure 3-2 illustrates the situation of accessing a data element that spans cache line boundaries.
[Figure: an array starting at address 029e70c1h laid over cache lines 029e70c0h, 029e7100h and 029e7140h; the elements at index 0 (ending at address 029e70feh), index 16 and index 32 each continue into the next cache line]
Figure 3-2. Cache Line Split in Accessing Elements in an Array
Alignment of code is less important for processors based on Intel NetBurst microarchitecture. Alignment of branch targets to maximize bandwidth of fetching cached instructions is an issue only when not executing out of the trace cache.
Alignment of code can be an issue for the Pentium M, Intel Core Duo and Intel Core 2 Duo processors. Alignment of branch targets will improve decoder throughput.
3.6.5
Store Forwarding
The processor’s memory system only sends stores to memory (including cache) after store retirement. However, store data can be forwarded from a store to a subsequent load from the same address to give a much shorter store-load latency.
There are two kinds of requirements for store forwarding. If these requirements are violated, store forwarding cannot occur and the load must get its data from the cache (so the store must write its data back to the cache first). This incurs a penalty that is largely related to pipeline depth of the underlying micro-architecture.
The first requirement pertains to the size and alignment of the store-forwarding data. This restriction is likely to have high impact on overall application performance. Typically, a performance penalty due to violating this restriction can be prevented. The store-to-load forwarding restrictions vary from one microarchitecture to another. Several examples of coding pitfalls that cause store-forwarding stalls and solutions to these pitfalls are discussed in detail in Section 3.6.5.1, “Store-to-Load-Forwarding Restriction on Size and Alignment.” The second requirement is the availability of data, discussed in Section 3.6.5.2, “Store-forwarding Restriction on Data Availability.” A good practice is to eliminate redundant load operations.
It may be possible to keep a temporary scalar variable in a register and never write it to memory. Generally, such a variable must not be accessible using indirect pointers. Moving a variable to a register eliminates all loads and stores of that variable and eliminates potential problems associated with store forwarding. However, it also increases register pressure.
Load instructions tend to start chains of computation. Since the out-of-order engine is based on data dependence, load instructions play a significant role in the engine’s ability to execute at a high rate. Eliminating loads should be given a high priority.
If a variable does not change between the time when it is stored and the time when it is used again, the register that was stored can be copied or used directly. If register pressure is too high, or an unseen function is called between the store and the second load, it may not be possible to eliminate the second load.
Assembly/Compiler Coding Rule 47. (H impact, M generality) Pass parameters in registers instead of on the stack where possible.
Passing arguments on the stack requires a store followed by a reload. While this sequence is optimized in hardware by providing the value to the load directly from the memory order buffer without the need to access the data cache if permitted by store-forwarding restrictions, floating-point values incur a significant latency in forwarding. Passing floating-point arguments in (preferably XMM) registers should save this long latency operation.
Parameter passing conventions may limit the choice of which parameters are passed in registers and which are passed on the stack. However, these limitations may be overcome if the compiler has control of the compilation of the whole binary (using whole-program optimization).
3.6.5.1
Store-to-Load-Forwarding Restriction on Size and Alignment
Data size and alignment restrictions for store-forwarding apply to processors based on Intel NetBurst microarchitecture, Intel Core microarchitecture, Intel Core 2 Duo, Intel Core Solo and Pentium M processors. The performance penalty for violating store-forwarding restrictions is less for shorter-pipelined machines than for Intel NetBurst microarchitecture.
Store-forwarding restrictions vary with each microarchitecture. Intel NetBurst microarchitecture places more constraints than Intel Core microarchitecture on code generation to enable store-forwarding to make progress instead of experiencing stalls. Fixing store-forwarding problems for Intel NetBurst microarchitecture generally also avoids problems on Pentium M, Intel Core Duo and Intel Core 2 Duo processors. The size and alignment restrictions for store-forwarding in processors based on Intel NetBurst microarchitecture are illustrated in Figure 3-3.
[Figure: for each case, a store/load pair in which the load is aligned with the store and will forward, beside a non-forwarding store/load pair that incurs a penalty. Panels: (a) Small load after Large Store; (b) Size of Load >= Store; (c) Size of Load >= Store(s); (d) 128-bit Forward Must Be 16-Byte Aligned, drawn against a 16-byte boundary]
Figure 3-3. Size and Alignment Restrictions in Store Forwarding
The following rules help satisfy size and alignment restrictions for store forwarding:
Assembly/Compiler Coding Rule 48. (H impact, M generality) A load that forwards from a store must have the same address start point and therefore the same alignment as the store data.
Assembly/Compiler Coding Rule 49. (H impact, M generality) The data of a load which is forwarded from a store must be completely contained within the store data.
A load that forwards from a store must wait for the store’s data to be written to the store buffer before proceeding, but other, unrelated loads need not wait.
Assembly/Compiler Coding Rule 50. (H impact, ML generality) If it is necessary to extract a non-aligned portion of stored data, read out the smallest aligned portion that completely contains the data and shift/mask the data as necessary. This is better than incurring the penalties of a failed store-forward.
Assembly/Compiler Coding Rule 51. (MH impact, ML generality) Avoid several small loads after large stores to the same area of memory by using a single large read and register copies as needed.
Example 3-41 depicts several store-forwarding situations in which small loads follow large stores. The three blocked load operations (into BL, CL and DL) illustrate the situations described in Rule 51. However, the last load operation gets data from store-forwarding without problem.
Example 3-41. Situations Showing Small Loads After Large Store
    mov [EBP], 'abcd'
    mov AL, [EBP]        ; Not blocked - same alignment
    mov BL, [EBP + 1]    ; Blocked
    mov CL, [EBP + 2]    ; Blocked
    mov DL, [EBP + 3]    ; Blocked
    mov AL, [EBP]        ; Not blocked - same alignment
                         ; n.b. passes older blocked loads
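Rules 50 and 51 amount to reading the stored value back once at its full width and then extracting the pieces in registers. A C-level sketch of that pattern follows; the function name is hypothetical, and whether a compiler keeps it as a single 32-bit load depends on the compiler and its options.

```c
#include <stdint.h>

/* Reads the stored DWORD once (same size and start address as the
 * store), then separates the four bytes with shifts in registers
 * instead of issuing four byte loads at offsets 0..3. */
void extract_bytes(const uint32_t *stored, uint8_t out[4])
{
    uint32_t v = *stored;          /* single full-width load */
    out[0] = (uint8_t)(v);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}
```

The shifts operate on the value rather than on memory, so the result does not depend on byte order. On a little-endian x86 target, `out[k]` corresponds to the byte at offset k of the original store.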
Example 3-42 illustrates a store-forwarding situation in which a large load follows several small stores. The data needed by the load operation cannot be forwarded because not all of the data that needs to be forwarded is contained in the store buffer. Avoid large loads after small stores to the same area of memory.
Example 3-42. Non-forwarding Example of Large Load After Small Store
    mov [EBP], 'a'
    mov [EBP + 1], 'b'
    mov [EBP + 2], 'c'
    mov [EBP + 3], 'd'
    mov EAX, [EBP]       ; Blocked
    ; The first four small stores can be consolidated into
    ; a single DWORD store to prevent this non-forwarding
    ; situation.
Example 3-43 illustrates a stalled store-forwarding situation that may appear in compiler generated code. Sometimes a compiler generates code similar to that shown in Example 3-43 to handle a spilled byte to the stack and convert the byte to an integer value.
Example 3-43. A Non-forwarding Situation in Compiler Generated Code
    mov DWORD PTR [esp+10h], 00000000h
    mov BYTE PTR [esp+10h], bl
    mov eax, DWORD PTR [esp+10h]   ; Stall
    and eax, 0xff                  ; Converting back to byte value
Example 3-44 offers two alternatives to avoid the non-forwarding situation shown in Example 3-43.
Example 3-44. Two Ways to Avoid Non-forwarding Situation in Example 3-43
    ; A. Use MOVZX instruction to avoid large load after small
    ; store, when spills are ignored.
    movzx eax, bl                  ; Replaces the last three instructions

    ; B. Use MOVZX instruction and handle spills to the stack
    mov DWORD PTR [esp+10h], 00000000h
    mov BYTE PTR [esp+10h], bl
    movzx eax, BYTE PTR [esp+10h]  ; Not blocked
When moving data that is smaller than 64 bits between memory locations, 64-bit or 128-bit SIMD register moves are more efficient (if aligned) and can be used to avoid unaligned loads. Although floating-point registers allow the movement of 64 bits at a time, floating-point instructions should not be used for this purpose, as data may be inadvertently modified.
As an additional example, consider the cases in Example 3-45.
Example 3-45. Large and Small Load Stalls
    ; A. Large load stall
    mov mem, eax         ; Store dword to address "MEM"
    mov mem + 4, ebx     ; Store dword to address "MEM + 4"
    fld mem              ; Load qword at address "MEM", stalls

    ; B. Small Load stall
    fstp mem             ; Store qword to address "MEM"
    mov bx, mem+2        ; Load word at address "MEM + 2", stalls
    mov cx, mem+4        ; Load word at address "MEM + 4", stalls
In the first case (A), there is a large load after a series of small stores to the same area of memory (beginning at memory address MEM). The large load will stall. The FLD must wait for the stores to write to memory before it can access all the data it requires. This stall can also occur with other data types (for example, when bytes or words are stored and then words or doublewords are read from the same area of memory).
In the second case (B), there is a series of small loads after a large store to the same area of memory (beginning at memory address MEM). The small loads will stall. The word loads must wait for the quadword store to write to memory before they can access the data they require. This stall can also occur with other data types (for example, when doublewords or words are stored and then words or bytes are read from the same area of memory). This can be avoided by moving the store as far from the loads as possible.
Store forwarding restrictions for processors based on Intel Core microarchitecture are listed in Table 3-3.
Table 3-3. Store Forwarding Restrictions of Processors Based on Intel Core Microarchitecture
Store Alignment          Width of Store (bits)  Load Alignment (byte)  Width of Load (bits)  Store Forwarding Restriction
To Natural size          16                     word aligned           8, 16                 not stalled
To Natural size          16                     not word aligned       8                     stalled
To Natural size          32                     dword aligned          8, 32                 not stalled
To Natural size          32                     not dword aligned      8                     stalled
To Natural size          32                     word aligned           16                    not stalled
To Natural size          32                     not word aligned       16                    stalled
To Natural size          64                     qword aligned          8, 16, 64             not stalled
To Natural size          64                     not qword aligned      8, 16                 stalled
To Natural size          64                     dword aligned          32                    not stalled
To Natural size          64                     not dword aligned      32                    stalled
To Natural size          128                    dqword aligned         8, 16, 128            not stalled
To Natural size          128                    not dqword aligned     8, 16                 stalled
To Natural size          128                    dword aligned          32                    not stalled
To Natural size          128                    not dword aligned      32                    stalled
To Natural size          128                    qword aligned          64                    not stalled
To Natural size          128                    not qword aligned      64                    stalled
Unaligned, start byte 1  32                     byte 0 of store        8, 16, 32             not stalled
Unaligned, start byte 1  32                     not byte 0 of store    8, 16                 stalled
Unaligned, start byte 1  64                     byte 0 of store        8, 16, 32             not stalled
Unaligned, start byte 1  64                     not byte 0 of store    8, 16, 32             stalled
Unaligned, start byte 1  64                     byte 0 of store        64                    stalled
Unaligned, start byte 7  32                     byte 0 of store        8                     not stalled
Unaligned, start byte 7  32                     not byte 0 of store    8                     not stalled
Unaligned, start byte 7  32                     don't care             16, 32                stalled
Unaligned, start byte 7  64                     don't care             16, 32, 64            stalled
3.6.5.2 Store-forwarding Restriction on Data Availability
The value to be stored must be available before the load operation can be completed. If this restriction is violated, the execution of the load will be delayed until the data is available. This delay causes some execution resources to be used unnecessarily, and that can lead to sizable but non-deterministic delays. However, the overall impact of this problem is much smaller than that from violating size and alignment requirements.
In modern microarchitectures, hardware predicts when loads are dependent on and get their data forwarded from preceding stores. These predictions can significantly improve performance. However, if a load is scheduled too soon after the store it depends on or if the generation of the data to be stored is delayed, there can be a significant penalty.
There are several cases in which data is passed through memory, and the store may need to be separated from the load:
• Spills, save and restore registers in a stack frame.
• Parameter passing.
• Global and volatile variables.
• Type conversion between integer and floating-point.
• When compilers do not analyze code that is inlined, forcing variables that are involved in the interface with inlined code to be in memory, creating more memory variables and preventing the elimination of redundant loads.
Assembly/Compiler Coding Rule 52. (H impact, MH generality) Where it is possible to do so without incurring other penalties, prioritize the allocation of variables to registers, as in register allocation and for parameter passing, to minimize the likelihood and impact of store-forwarding problems. Try not to store-forward data generated from a long latency instruction, for example MUL or DIV. Avoid store-forwarding data for variables with the shortest store-load distance. Avoid store-forwarding data for variables with many and/or long dependence chains, and especially avoid including a store forward on a loop-carried dependence chain.
Example 3-46 shows an example of a loop-carried dependence chain.
Example 3-46. Loop-carried Dependence Chain
    for ( i = 0; i < MAX; i++ ) {
        a[i] = b[i] * foo;
        foo = a[i] / 3;
    } // foo is a loop-carried dependence.
Assembly/Compiler Coding Rule 53. (M impact, MH generality) Calculate store addresses as early as possible to avoid having stores block loads.
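One way to keep a store forward off the loop-carried chain of Example 3-46 is to hold the product in a local variable, so the division consumes a register value rather than reloading a[i]. The sketch below is illustrative only; the function name scale_chain and its signature are assumptions, and an optimizing compiler may already perform this transformation on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Same arithmetic as Example 3-46, with the product kept in a local so
 * the next value of foo is computed from a register, not from memory. */
static int scale_chain(int *a, const int *b, size_t n, int foo)
{
    for (size_t i = 0; i < n; i++) {
        int t = b[i] * foo;  /* value stays in a register */
        a[i] = t;            /* the store is no longer on the dependence chain */
        foo = t / 3;         /* loop-carried value derived from t directly */
    }
    return foo;
}
```

The store to a[i] still happens, but nothing in the next iteration has to wait for it to be forwarded.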
3.6.6 Data Layout Optimizations
User/Source Coding Rule 6. (H impact, M generality) Pad data structures defined in the source code so that every data element is aligned to a natural operand size address boundary.
If the operands are packed in a SIMD instruction, align to the packed element size (64-bit or 128-bit).
Align data by providing padding inside structures and arrays. Programmers can reorganize structures and arrays to minimize the amount of memory wasted by padding. However, compilers might not have this freedom. The C programming language, for example, specifies the order in which structure elements are allocated in memory. For more information, see Section 4.4, "Stack and Data Alignment".
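The effect of member order on padding can be checked directly with sizeof and offsetof. The following is a minimal sketch with two hypothetical layouts (loose and tight) of the same five members; the exact sizes depend on the ABI, and the comments assume a 4-byte int with 4-byte alignment.

```c
#include <assert.h>
#include <stddef.h>

struct loose { int a; char b; int c; char d; int e; };  /* chars interleaved with ints */
struct tight { int a; int c; int e; char b; char d; };  /* ints first, chars grouped last */

/* Bytes saved per element by grouping the small members together.
 * With 4-byte int this is 20 - 16 = 4; other ABIs may differ. */
static size_t padding_saved(void)
{
    return sizeof(struct loose) - sizeof(struct tight);
}
```

Because C fixes member order, this reordering has to be done in the source; the compiler will not do it.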
Example 3-47 shows how a data structure could be rearranged to reduce its size.
Example 3-47. Rearranging a Data Structure
    struct unpacked { /* Fits in 20 bytes due to padding */
        int a;
        char b;
        int c;
        char d;
        int e;
    };
    struct packed { /* Fits in 16 bytes */
        int a;
        int c;
        int e;
        char b;
        char d;
    };
A cache line size of 64 bytes can impact streaming applications (for example, multimedia). These reference and use data only once before discarding it. Data accesses which sparsely utilize the data within a cache line can result in less efficient utilization of system memory bandwidth. For example, arrays of structures can be decomposed into several arrays to achieve better packing, as shown in Example 3-48.
Example 3-48. Decomposing an Array
    struct { /* 1600 bytes */
        int a, c, e;
        char b, d;
    } array_of_struct [100];
    struct { /* 1400 bytes */
        int a[100], c[100], e[100];
        char b[100], d[100];
    } struct_of_array;
    struct { /* 1200 bytes */
        int a, c, e;
    } hybrid_struct_of_array_ace[100];
    struct { /* 200 bytes */
        char b, d;
    } hybrid_struct_of_array_bd[100];
The efficiency of such optimizations depends on usage patterns. If the elements of the structure are all accessed together but the access pattern of the array is random, then ARRAY_OF_STRUCT avoids unnecessary prefetch even though it wastes memory.
However, if the access pattern of the array exhibits locality (for example, if the array index is being swept through) then processors with hardware prefetchers will prefetch data from STRUCT_OF_ARRAY, even if the elements of the structure are accessed together.
When the elements of the structure are not accessed with equal frequency, such as when element A is accessed ten times more often than the other entries, then STRUCT_OF_ARRAY not only saves memory, but it also prevents fetching unnecessary data items B, C, D, and E.
Using STRUCT_OF_ARRAY also enables the use of the SIMD data types by the programmer and the compiler.
Note that STRUCT_OF_ARRAY can have the disadvantage of requiring more independent memory stream references. This can require the use of more prefetches and additional address generation calculations. It can also have an impact on DRAM page access efficiency. An alternative, HYBRID_STRUCT_OF_ARRAY, blends the two approaches. In this case, only 2 separate address streams are generated and referenced: 1 for HYBRID_STRUCT_OF_ARRAY_ACE and 1 for HYBRID_STRUCT_OF_ARRAY_BD. The second alternative also prevents fetching unnecessary data, assuming that (1) the variables A, C and E are always used together, and (2) the variables B and D are always used together, but not at the same time as A, C and E.
The hybrid approach ensures:
• Simpler/fewer address generations than STRUCT_OF_ARRAY.
• Fewer streams, which reduces DRAM page misses.
• Fewer prefetches due to fewer streams.
• Efficient cache line packing of data elements that are used concurrently.
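The difference between the layouts shows up in how a loop that touches only one field walks memory. The sketch below sums field a under both layouts; the type names elem and soa and the two function names are illustrative assumptions, with member names borrowed from Example 3-48.

```c
#include <assert.h>
#include <stddef.h>

#define N 100

struct elem { int a, c, e; char b, d; };                /* array-of-structures element */
struct soa  { int a[N], c[N], e[N]; char b[N], d[N]; }; /* structure of arrays */

/* Strided walk: every step skips over c, e, b and d to reach the next a. */
static long sum_a_aos(const struct elem *v)
{
    long s = 0;
    for (size_t i = 0; i < N; i++)
        s += v[i].a;
    return s;
}

/* Unit-stride walk: every byte fetched belongs to field a. */
static long sum_a_soa(const struct soa *v)
{
    long s = 0;
    for (size_t i = 0; i < N; i++)
        s += v->a[i];
    return s;
}
```

Both functions return the same value; only the number of cache lines touched, and how amenable the loop is to SIMD, differs.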
Assembly/Compiler Coding Rule 54. (H impact, M generality) Try to arrange data structures such that they permit sequential access.
If the data is arranged into a set of streams, the automatic hardware prefetcher can prefetch data that will be needed by the application, reducing the effective memory latency. If the data is accessed in a nonsequential manner, the automatic hardware prefetcher cannot prefetch the data. The prefetcher can recognize up to eight concurrent streams. See Chapter 7, "Optimizing Cache Usage," for more information on the hardware prefetcher.
User/Source Coding Rule 7. (M impact, L generality) Beware of false sharing within a cache line (64 bytes).
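A common way to avoid false sharing is to give each thread's frequently written data its own cache line. The following C11 sketch is an assumption-laden illustration: the names padded_counter, counters and bump are hypothetical, the array size of 4 is arbitrary, and the 64-byte figure is the line size quoted in Rule 7.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64  /* line size quoted in the text */

/* One counter per thread; the alignment pads each element to a full line. */
struct padded_counter {
    alignas(CACHE_LINE) uint64_t value;
};

static struct padded_counter counters[4];

static void bump(unsigned thread_id)
{
    counters[thread_id].value++;  /* writes only this thread's cache line */
}
```

The cost is memory: each 8-byte counter now occupies 64 bytes, which is the trade-off against the padding advice given earlier in this section.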
3.6.7 Stack Alignment
A performance penalty for unaligned access to the stack occurs when a memory reference splits a cache line. This means that one out of eight spatially consecutive unaligned quadword accesses is always penalized, and similarly one out of four consecutive, non-aligned double-quadword accesses, etc.
Aligning the stack may be beneficial any time there are data objects that exceed the default stack alignment of the system. For example, on 32/64-bit Linux and 64-bit Windows, the default stack alignment is 16 bytes, while on 32-bit Windows it is 4 bytes.
Assembly/Compiler Coding Rule 55. (H impact, M generality) Make sure that the stack is aligned at the largest multi-byte granular data type boundary matching the register width.
Aligning the stack typically requires the use of an additional register to track across a padded area of unknown amount. There is a trade-off between causing unaligned memory references that span a cache line and causing extra general purpose register spills.
The assembly level technique to implement dynamic stack alignment may depend on compilers and the specific OS environment. The reader may wish to study the assembly output from a compiler of interest.
Example 3-49. Examples of Dynamical Stack Alignment
    // 32-bit environment
    push ebp             ; save ebp
    mov ebp, esp         ; ebp now points to incoming parameters
    andl esp, $-         ; align esp to N byte boundary
    sub esp, $           ; reserve space for new stack frame
    .                    ; parameters must be referenced off of ebp
    mov esp, ebp         ; restore esp
    pop ebp              ; restore ebp

    // 64-bit environment
    sub esp, $
    mov r13, $
    andl r13, $-         ; r13 point to aligned section in stack
    .                    ; use r13 as base for aligned data
If for some reason it is not possible to align the stack for 64-bits, the routine should access the parameter and save it into a register or known aligned storage, thus incurring the penalty only once.
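In C, the request for a more strictly aligned stack object can be left to the compiler, which then emits its own form of dynamic stack alignment. The sketch below assumes a C11 compiler that supports over-aligned automatic objects (GCC and Clang on x86 do); the function name local_is_aligned32 and the 32-byte figure are illustrative choices, not values from the text.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Ask for a 32-byte-aligned stack buffer and report whether the address
 * the compiler produced actually honours the request. */
static int local_is_aligned32(void)
{
    alignas(32) unsigned char buf[32];
    buf[0] = 0;                        /* keep the object live */
    return ((uintptr_t)buf % 32) == 0;
}
```

Inspecting the assembly generated for such a function is one way to see the alignment sequence a particular compiler uses.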
3.6.8 Capacity Limits and Aliasing in Caches
There are cases in which addresses with a given stride will compete for some resource in the memory hierarchy.
Typically, caches are implemented to have multiple ways of set associativity, with each way consisting of multiple sets of cache lines (or sectors in some cases). Multiple memory references that compete for the same set of each way in a cache can cause a capacity issue. There are aliasing conditions that apply to specific microarchitectures. Note that first-level cache lines are 64 bytes. Thus, the least significant 6 bits are not considered in alias comparisons. For processors based on Intel NetBurst microarchitecture, data is loaded into the second level cache in a sector of 128 bytes, so the least significant 7 bits are not considered in alias comparisons.
3.6.8.1 Capacity Limits in Set-Associative Caches
Capacity limits may be reached if the number of outstanding memory references that are mapped to the same set in each way of a given cache exceeds the number of ways of that cache. The conditions that apply to the first-level data cache and second level cache are listed below:
• L1 Set Conflicts: Multiple references map to the same first-level cache set. The conflicting condition is a stride determined by the size of the cache in bytes, divided by the number of ways. These competing memory references can cause excessive cache misses only if the number of outstanding memory references exceeds the number of ways in the working set:
  - On Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding of 0, 1, or 2; there will be an excess of first-level cache misses for more than 4 simultaneous competing memory references to addresses with 2-KByte modulus.
  - On Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3; there will be an excess of first-level cache misses for more than 8 simultaneous competing references to addresses that are apart by 2-KByte modulus.
  - On Intel Core 2 Duo, Intel Core Duo, Intel Core Solo, and Pentium M processors, there will be an excess of first-level cache misses for more than 8 simultaneous references to addresses that are apart by 4-KByte modulus.
• L2 Set Conflicts: Multiple references map to the same second-level cache set. The conflicting condition is also determined by the size of the cache or the number of ways:
  - On Pentium 4 and Intel Xeon processors, there will be an excess of second-level cache misses for more than 8 simultaneous competing references. The stride sizes that can cause capacity issues are 32 KBytes, 64 KBytes, or 128 KBytes, depending on the size of the second level cache.
  - On Pentium M processors, the stride sizes that can cause capacity issues are 128 KBytes or 256 KBytes, depending on the size of the second level cache.
  - On Intel Core 2 Duo, Intel Core Duo, and Intel Core Solo processors, a stride size of 256 KBytes can cause capacity issues if the number of simultaneous accesses exceeds the way associativity of the L2 cache.
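The conflicting stride described above is simple arithmetic: cache size divided by the number of ways. The following sketch makes that explicit. The function names are hypothetical, the 64-byte line size is the one stated in the text, and the cache geometries used in the usage note are illustrative inputs chosen to reproduce the 2-KByte and 4-KByte moduli quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Addresses this many bytes apart map to the same cache set. */
static uint32_t conflict_stride(uint32_t cache_bytes, uint32_t ways)
{
    return cache_bytes / ways;
}

/* Set index of an address, assuming 64-byte cache lines. */
static uint32_t set_index(uintptr_t addr, uint32_t cache_bytes, uint32_t ways)
{
    uint32_t sets = cache_bytes / (ways * 64u);
    return (uint32_t)((addr / 64u) % sets);
}
```

For example, conflict_stride(8 * 1024, 4) yields 2048 and conflict_stride(32 * 1024, 8) yields 4096. More references than there are ways, all sharing one set index, are what trigger the excess misses.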
3.6.8.2 Aliasing Cases in the Pentium® M, Intel® Core™ Solo, Intel® Core™ Duo and Intel® Core™ 2 Duo Processors
Pentium M, Intel Core Solo, Intel Core Duo and Intel Core 2 Duo processors have the following aliasing case:
• Store forwarding: If a store to an address is followed by a load from the same address, the load will not proceed until the store data is available. If a store is followed by a load and their addresses differ by a multiple of 4 KBytes, the load stalls until the store operation completes.
Assembly/Compiler Coding Rule 56. (H impact, M generality) Avoid having a store followed by a non-dependent load with addresses that differ by a multiple of 4 KBytes. Also, lay out data or order computation to avoid having cache lines that have linear addresses that are a multiple of 64 KBytes apart in the same working set. Avoid having more than 4 cache lines that are some multiple of 2 KBytes apart in the same first-level cache working set, and avoid having more than 8 cache lines that are some multiple of 4 KBytes apart in the same first-level cache working set.
When declaring multiple arrays that are referenced with the same index and are each a multiple of 64 KBytes (as can happen with STRUCT_OF_ARRAY data layouts), pad them to avoid declaring them contiguously. Padding can be accomplished by either intervening declarations of other variables or by artificially increasing the dimension.
User/Source Coding Rule 8. (H impact, ML generality) Consider using a special memory allocation library with address offset capability to avoid aliasing. One way to implement a memory allocator to avoid aliasing is to allocate more than enough space and pad. For example, allocate structures that are 68 KBytes instead of 64 KBytes to avoid the 64-KByte aliasing, or have the allocator pad and return random offsets that are a multiple of 128 Bytes (the size of a cache line).
User/Source Coding Rule 9. (M impact, M generality) When padding variable declarations to avoid aliasing, the greatest benefit comes from avoiding aliasing on second-level cache lines, suggesting an offset of 128 bytes or more.
4-KByte memory aliasing occurs when the code accesses two different memory locations with a 4-KByte offset between them.
The 4-KByte aliasing situation can manifest in a memory copy routine where the addresses of the source buffer and destination buffer maintain a constant offset and the constant offset happens to be a multiple of the byte increment from one iteration to the next.
Example 3-50 shows a routine that copies 16 bytes of memory in each iteration of a loop. If the offset (modulo 4096) between the source buffer (EAX) and the destination buffer (EDX) is 16, 32, 48, 64, or 80, loads have to wait until stores have been retired before they can continue. For example, at offset 16, the load of the next iteration is 4-KByte aliased with the current iteration's store, therefore the loop must wait until the store operation completes, making the entire loop serialized. The amount of time needed to wait decreases with larger offsets until an offset of 96 resolves the issue (as there are no pending stores by the time of the load with the same address).
The Intel Core microarchitecture provides a performance monitoring event (see LOAD_BLOCK.OVERLAP_STORE in Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B) that allows software tuning effort to detect the occurrence of aliasing conditions.
Example 3-50. Aliasing Between Loads and Stores Across Loop Iterations
    LP:
    movaps xmm0, [eax+ecx]
    movaps [edx+ecx], xmm0
    add ecx, 16
    jnz lp
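A copy routine can check its buffers for the problematic offsets before entering a loop like the one in Example 3-50. The sketch below is a hedged illustration: both function names are hypothetical, and the test encodes only the offsets the text lists (16 through 80 in steps of 16, modulo 4096) for a copy that advances 16 bytes per iteration.

```c
#include <assert.h>
#include <stdint.h>

/* Distance from src to dst, reduced modulo 4096. */
static unsigned offset_mod_4k(const void *src, const void *dst)
{
    return (unsigned)(((uintptr_t)dst - (uintptr_t)src) & 0xFFFu);
}

/* Nonzero when the offset is one of the values the text lists as causing
 * a later load to alias a still-pending store (16, 32, 48, 64, 80). */
static int copy_may_alias_4k(const void *src, const void *dst)
{
    unsigned off = offset_mod_4k(src, dst);
    return off >= 16 && off <= 80 && off % 16 == 0;
}
```

A caller that gets a nonzero result might shift one buffer by a cache line or two, in the spirit of the padding advice in Rule 56.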
3.6.9 Mixing Code and Data
The aggressive prefetching and pre-decoding of instructions by Intel processors have two related effects:
• Self-modifying code works correctly, according to the Intel architecture processor requirements, but incurs a significant performance penalty. Avoid self-modifying code if possible.
• Placing writable data in the code segment might be impossible to distinguish from self-modifying code. Writable data in the code segment might suffer the same performance penalty as self-modifying code.
Assembly/Compiler Coding Rule 57. (M impact, L generality) If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its most likely target, and place the data after an unconditional branch.
Tuning Suggestion 1. In rare cases, a performance problem may be caused by executing data on a code page as instructions. This is very likely to happen when execution is following an indirect branch that is not resident in the trace cache. If this is clearly causing a performance problem, try moving the data elsewhere, or inserting an illegal opcode or a PAUSE instruction immediately after the indirect branch. Note that the latter two alternatives may degrade performance in some circumstances.
Assembly/Compiler Coding Rule 58. (H impact, L generality) Always put code and data on separate pages. Avoid self-modifying code wherever possible. If code is to be modified, try to do it all at once and make sure the code that performs the modifications and the code being modified are on separate 4-KByte pages or on separate aligned 1-KByte subpages.
3.6.9.1 Self-modifying Code
Self-modifying code (SMC) that ran correctly on Pentium III processors and prior implementations will run correctly on subsequent implementations. SMC and cross-modifying code (when multiple processors in a multiprocessor system are writing to a code page) should be avoided when high performance is desired.
Software should avoid writing to a code page in the same 1-KByte subpage that is being executed or fetching code in the same 2-KByte subpage that is being written. In addition, sharing a page containing directly or speculatively executed code with another processor as a data page can trigger an SMC condition that causes the entire pipeline of the machine and the trace cache to be cleared. This is due to the self-modifying code condition.
Dynamic code need not cause the SMC condition if the code written fills up a data page before that page is accessed as code. Dynamically-modified code (for example, from target fix-ups) is likely to suffer from the SMC condition and should be avoided where possible. Avoid the condition by introducing indirect branches and using data tables on data pages (not code pages) using register-indirect calls.
3.6.9.2 Position Independent Code
Position independent code often needs to obtain the value of the instruction pointer. Example 3-51a shows one technique to put the value of IP into the ECX register by issuing a CALL without a matching RET. Example 3-51b shows an alternative technique to put the value of IP into the ECX register using a matched pair of CALL/RET.
Example 3-51. Instruction Pointer Query Techniques
    a) Using call without return to obtain IP does not corrupt the RSB
    call _label          ; return address pushed is the IP of next instruction
    _label:
    pop ECX              ; IP of this instruction is now put into ECX

    b) Using matched call/ret pair
    call _lblcx
    ...                  ; ECX now contains IP of this instruction
    ...
    _lblcx:
    mov ecx, [esp]
    ret
3.6.10 Write Combining
Write combining (WC) improves performance in two ways:
• On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of the line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.
• Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached memory.
There are six write-combining buffers (on Pentium 4 and Intel Xeon processors with a CPUID signature of family encoding 15, model encoding 3; there are 8 write-combining buffers). Two of these buffers may be written out to higher cache levels and freed up for use on other write misses. Only four write-combining buffers are guaranteed to be available for simultaneous use. Write combining applies to memory type WC; it does not apply to memory type UC.
There are six write-combining buffers in each processor core in Intel Core Duo and Intel Core Solo processors. Processors based on Intel Core microarchitecture have eight write-combining buffers in each core. Starting with Intel microarchitecture code name Nehalem, there are 10 buffers available for write-combining.
Assembly/Compiler Coding Rule 59. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the body of the loop such that only four arrays are being written to in each iteration of each of the resulting loops.
Write combining buffers are used for stores of all memory types. They are particularly important for writes to uncached memory: writes to different parts of the same cache line can be grouped into a single, full-cache-line bus transaction instead of going across the bus (since they are not cached) as several partial writes. Avoiding partial writes can have a significant impact on bus bandwidth-bound graphics applications, where graphics buffers are in uncached memory. Separating writes to uncached memory and writes to writeback memory into separate phases can assure that the write combining buffers can fill before getting evicted by other write traffic. Eliminating partial write transactions has been found to have performance impact on the order of 20% for some applications. Because the cache lines are 64 bytes, a write to the bus for 63 bytes will result in 8 partial bus transactions.
When coding functions that execute simultaneously on two threads, reducing the number of writes that are allowed in an inner loop will help take full advantage of write-combining store buffers. For write-combining buffer recommendations for Hyper-Threading Technology, see Chapter 8, "Multicore and Hyper-Threading Technology."
Store ordering and visibility are also important issues for write combining. When a write to a write-combining buffer for a previously-unwritten cache line occurs, there will be a read-for-ownership (RFO). If a subsequent write happens to another write-combining buffer, a separate RFO may be caused for that cache line. Subsequent writes to the first cache line and write-combining buffer will be delayed until the second RFO has been serviced to guarantee properly ordered visibility of the writes. If the memory type for the writes is write-combining, there will be no RFO since the line is not cached, and there is no such delay. For details on write-combining, see Chapter 7, "Optimizing Cache Usage."
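The loop fission recommended by Rule 59 can be sketched as follows. The example is illustrative: the function name, the six output arrays and the arithmetic are all hypothetical, and it assumes the output arrays do not overlap the source or each other.

```c
#include <assert.h>
#include <stddef.h>

/* A loop that would write six output streams is split into two loops
 * that each write three, staying under the four-stream guideline. */
static void fill_fissioned(size_t n, const int *src,
                           int *o0, int *o1, int *o2,
                           int *o3, int *o4, int *o5)
{
    for (size_t i = 0; i < n; i++) {  /* first pass: three written arrays */
        o0[i] = src[i];
        o1[i] = src[i] + 1;
        o2[i] = src[i] + 2;
    }
    for (size_t i = 0; i < n; i++) {  /* second pass: three written arrays */
        o3[i] = src[i] + 3;
        o4[i] = src[i] + 4;
        o5[i] = src[i] + 5;
    }
}
```

The source array is read twice, so fission trades extra read traffic for fewer concurrently open write-combining buffers; whether that is a net gain depends on the workload.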
3.6.11
Locality Enhancement
Localit y enhancem ent can reduce dat a t raffic originat ing from an out er- level sub- syst em in t he cache/ m em ory hierarchy. This is t o address t he fact t hat t he access- cost in t erm s of cycle- count from an out er level will be m ore expensive t han from an inner level. Typically, t he cycle- cost of accessing a given cache level ( or m em ory syst em ) varies across different m icroarchit ect ures, processor im plem ent at ions, and plat form com ponent s. I t m ay be sufficient t o recognize t he relat ive dat a access cost t rend by localit y rat her t han t o follow a large t able of num eric values of cycle- cost s, list ed per localit y, per processor/ platform im plem ent at ions, et c. The general t rend is t ypically t hat access cost from an out er sub- syst em m ay be approxim at ely 3- 10X m ore expensive t han accessing dat a from t he im m ediat e inner level in t he cache/ m em ory hierarchy, assum ing sim ilar degrees of dat a access parallelism . Thus localit y enhancem ent should st art wit h charact erizing t he dom inant dat a t raffic localit y. Sect ion A, “Applicat ion Perform ance Tools,” describes som e t echniques t hat can be used t o det erm ine t he dom inant dat a t raffic localit y for any workload. Even if cache m iss rat es of t he last level cache m ay be low relat ive t o t he num ber of cache references, processors t ypically spend a sizable port ion of t heir execut ion t im e wait ing for cache m isses t o be serviced. Reducing cache m isses by enhancing a program ’s localit y is a key opt im izat ion. This can t ake several form s:
•
• •
Blocking t o it erat e over a port ion of an array t hat will fit in t he cache ( wit h t he purpose t hat subsequent references t o t he dat a- block [ or t ile] will be cache hit references) . Loop int erchange t o avoid crossing cache lines or page boundaries. Loop skewing t o m ake accesses cont iguous.
Localit y enhancem ent t o t he last level cache can be accom plished wit h sequencing t he dat a access pat t ern t o t ake advant age of hardware prefet ching. This can also t ake several form s:
• •
Transform at ion of a sparsely populat ed m ult i- dim ensional array int o a one- dim ension array such t hat m em ory references occur in a sequent ial, sm all- st ride pat t ern t hat is friendly t o t he hardware prefet ch ( see Sect ion 2.3.5.4, “ Dat a Prefet ching” ) . Opt im al t ile size and shape select ion can furt her im prove t em poral dat a localit y by increasing hit rat es int o t he last level cache and reduce m em ory t raffic result ing from t he act ions of hardware prefet ching ( see Sect ion 7.5.11, “ Hardware Prefet ching and Cache Blocking Techniques” ) .
It is important to avoid operations that work against locality-enhancing techniques. Using the lock prefix heavily can incur large delays when accessing memory, regardless of whether the data is in the cache or in system memory.
User/Source Coding Rule 10. (H impact, H generality) Optimization techniques such as blocking, loop interchange, loop skewing, and packing are best done by the compiler. Optimize data structures either to fit in one-half of the first-level cache or in the second-level cache; turn on loop optimizations in the compiler to enhance locality for nested loops.
GENERAL OPTIMIZATION GUIDELINES
Optimizing for one-half of the first-level cache will bring the greatest performance benefit in terms of cycle cost per data access. If one-half of the first-level cache is too small to be practical, optimize for the second-level cache. Optimizing for a point in between (for example, for the entire first-level cache) will likely not bring a substantial improvement over optimizing for the second-level cache.
3.6.12 Minimizing Bus Latency
Each bus transaction includes the overhead of making requests and arbitrations. The average latency of bus read and bus write transactions will be longer if reads and writes alternate. Segmenting reads and writes into phases can reduce the average latency of bus transactions, because it reduces the number of successive transactions in which a read follows a write or a write follows a read.
User/Source Coding Rule 11. (M impact, ML generality) If there is a blend of reads and writes on the bus, changing the code to separate these bus transactions into read phases and write phases can help performance. Note, however, that the order of read and write operations on the bus is not the same as it appears in the program.
Bus latency for fetching a cache line of data can vary as a function of the access stride of data references. In general, bus latency will increase in response to increasing values of the stride of successive cache misses. Independently, bus latency will also increase as a function of increasing bus queue depths (the number of outstanding bus requests of a given transaction type). The combination of these two trends can be highly non-linear: in large-stride, bandwidth-sensitive situations, the effective throughput of the bus system for data-parallel accesses can be significantly less than the effective throughput in small-stride, bandwidth-sensitive situations.
To minimize the per-access cost of memory traffic or amortize raw memory latency effectively, software should control its cache miss pattern to favor a higher concentration of smaller-stride cache misses.
User/Source Coding Rule 12. (H impact, H generality) To achieve effective amortization of bus latency, software should favor data access patterns that result in higher concentrations of cache miss patterns, with cache miss strides that are significantly smaller than half the hardware prefetch trigger threshold.
3.6.13 Non-Temporal Store Bus Traffic
Peak system bus bandwidth is shared by several types of bus activities, including reads (from memory), reads for ownership (of a cache line), and writes. The data transfer rate for bus write transactions is higher if 64 bytes are written out to the bus at a time.
Typically, bus writes to Writeback (WB) memory must share the system bus bandwidth with read-for-ownership (RFO) traffic. Non-temporal stores do not require RFO traffic; they do require care in managing the access patterns in order to ensure 64 bytes are evicted at once (rather than evicting several 8-byte chunks).
Although the data bandwidth of full 64-byte bus writes due to non-temporal stores is twice that of bus writes to WB memory, transferring 8-byte chunks wastes bus request bandwidth and delivers significantly lower data bandwidth. This difference is depicted in Examples 3-52 and 3-53.
Example 3-52. Using Non-temporal Stores and 64-byte Bus Write Transactions
    #define STRIDESIZE 256
        lea ecx, p64byte_Aligned
        mov edx, ARRAY_LEN
        xor eax, eax
    slloop:
        movntps XMMWORD ptr [ecx + eax], xmm0
        movntps XMMWORD ptr [ecx + eax+16], xmm0
        movntps XMMWORD ptr [ecx + eax+32], xmm0
        movntps XMMWORD ptr [ecx + eax+48], xmm0
        ; 64 bytes is written in one bus transaction
        add eax, STRIDESIZE
        cmp eax, edx
        jl slloop
Example 3-53. Non-temporal Stores and Partial Bus Write Transactions
    #define STRIDESIZE 256
        lea ecx, p64byte_Aligned
        mov edx, ARRAY_LEN
        xor eax, eax
    slloop:
        movntps XMMWORD ptr [ecx + eax], xmm0
        movntps XMMWORD ptr [ecx + eax+16], xmm0
        movntps XMMWORD ptr [ecx + eax+32], xmm0
        ; Storing 48 bytes results in 6 bus partial transactions
        add eax, STRIDESIZE
        cmp eax, edx
        jl slloop
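The full-line pattern of Example 3-52 can also be sketched in C with compiler intrinsics. This is a sketch under stated assumptions: the function name fill_lines_nt is hypothetical, the _mm_stream_ps and _mm_sfence intrinsics are used only when the compiler defines __SSE2__, and an ordinary-store fallback is used otherwise. Unlike the assembly examples, it writes consecutive cache lines instead of striding by 256 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* also provides _mm_set1_ps, _mm_stream_ps, _mm_sfence */
#define HAVE_STREAMING_STORES 1
#else
#define HAVE_STREAMING_STORES 0
#endif

/* Fill `lines` consecutive 64-byte cache lines at dst with the value v.
 * dst must be 64-byte aligned so that the four 16-byte stores issued per
 * iteration cover exactly one cache line. */
void fill_lines_nt(float *dst, size_t lines, float v)
{
    assert(((uintptr_t)dst & 63) == 0);
#if HAVE_STREAMING_STORES
    __m128 x = _mm_set1_ps(v);
    for (size_t i = 0; i < lines; i++) {
        float *p = dst + i * 16;      /* 16 floats = 64 bytes */
        _mm_stream_ps(p,      x);
        _mm_stream_ps(p + 4,  x);
        _mm_stream_ps(p + 8,  x);
        _mm_stream_ps(p + 12, x);     /* all 64 bytes of the line written */
    }
    _mm_sfence();   /* order the streaming stores before later stores */
#else
    /* Fallback: ordinary stores, no non-temporal hint. */
    for (size_t i = 0; i < lines * 16; i++)
        dst[i] = v;
#endif
}
```

Dropping any one of the four stores in the loop body would reproduce the partial-write pattern of Example 3-53.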
3.7 PREFETCHING
Recent Intel processor families employ several prefetching mechanisms to accelerate the movement of data or code and improve performance:
• Hardware instruction prefetcher.
• Software prefetch for data.
• Hardware prefetch for cache lines of data or instructions.
3.7.1 Hardware Instruction Fetching and Software Prefetching
Software prefetching requires a programmer to use PREFETCH hint instructions and anticipate some suitable timing and location of cache misses.
Software PREFETCH operations work the same way as do load from memory operations, with the following exceptions:
• Software PREFETCH instructions retire after virtual to physical address translation is completed.
• If an exception, such as page fault, is required to prefetch the data, then the software prefetch instruction retires without prefetching data.
• Avoid specifying a NULL address for software prefetches.
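A software prefetch can be issued from C through a compiler builtin. The sketch below assumes GCC or Clang, whose __builtin_prefetch typically compiles to a PREFETCH hint instruction on x86; the function name sum_indirect and the PREFETCH_AHEAD distance are hypothetical. The index check keeps every prefetched address inside the array, in the spirit of the guidance above about NULL addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in elements. A suitable value depends on
 * memory latency and on the amount of work done per element. */
#define PREFETCH_AHEAD 64

/* Sum the elements of data[] selected by idx[]. The indirect access
 * pattern is assumed to be irregular, so each future access is hinted
 * explicitly a fixed number of iterations ahead of its use. */
long sum_indirect(const long *data, const size_t *idx, size_t n)
{
    assert(n == 0 || (data != NULL && idx != NULL));
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* arguments: address, 0 = prefetch for read, 3 = keep in all
             * cache levels */
            __builtin_prefetch(&data[idx[i + PREFETCH_AHEAD]], 0, 3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The prefetch is only a hint: the result is the same with or without it, and only the timing of the cache misses changes.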
3.7.2 Hardware Prefetching for First-Level Data Cache
The hardware prefetching mechanism for L1 in Intel Core microarchitecture is discussed in Section 2.4.4.2. Example 3-54 depicts a technique to trigger hardware prefetch. The code demonstrates traversing a linked list and performing some computational work on 2 members of each element that reside in 2 different cache lines. Each element is of size 192 bytes. The total size of all elements is larger than can be fitted in the L2 cache.
Example 3-54. Using DCU Hardware Prefetch
Original code:
        mov ebx, DWORD PTR [First]
        xor eax, eax
    scan_list:
        mov eax, [ebx+4]
        mov ecx, 60
    do_some_work_1:
        add eax, eax
        and eax, 6
        sub ecx, 1
        jnz do_some_work_1
        mov eax, [ebx+64]
        mov ecx, 30
    do_some_work_2:
        add eax, eax
        and eax, 6
        sub ecx, 1
        jnz do_some_work_2
        mov ebx, [ebx]
        test ebx, ebx
        jnz scan_list
Modified sequence benefit from prefetch:
        mov ebx, DWORD PTR [First]
        xor eax, eax
    scan_list:
        mov eax, [ebx+4]
        mov eax, [ebx+4]
        mov eax, [ebx+4]
        mov ecx, 60
    do_some_work_1:
        add eax, eax
        and eax, 6
        sub ecx, 1
        jnz do_some_work_1
        mov eax, [ebx+64]
        mov ecx, 30
    do_some_work_2:
        add eax, eax
        and eax, 6
        sub ecx, 1
        jnz do_some_work_2
        mov ebx, [ebx]
        test ebx, ebx
        jnz scan_list
The additional instructions to load data from one member in the modified sequence can trigger the DCU hardware prefetch mechanisms to prefetch data in the next cache line, enabling the work on the second member to complete sooner.
Software can gain from the first-level data cache prefetchers in two cases:
• If data is not in the second-level cache, the first-level data cache prefetcher enables early trigger of the second-level cache prefetcher.
• If data is in the second-level cache and not in the first-level data cache, then the first-level data cache prefetcher triggers earlier data bring-up of sequential cache line to the first-level data cache.
There are situations in which software should pay attention to a potential side effect of triggering unnecessary DCU hardware prefetches. If a large data structure with many members spanning many cache lines is accessed in ways that only a few of its members are actually referenced, but there are multiple pair accesses to the same cache line, the DCU hardware prefetcher can trigger fetching of cache lines that are not needed. In Example 3-55, references to the “Pts” array and the “AuxPts” array will trigger DCU prefetch to fetch additional cache lines that won’t be needed. If significant negative performance impact is detected due to DCU hardware prefetch on a portion of the code, software can try to reduce the size of that contemporaneous working set to be less than half of the L2 cache.
Example 3-55. Avoid Causing DCU Hardware Prefetch to Fetch Un-needed Lines
    while ( CurrBond != NULL )
    {
        MyATOM *a1 = CurrBond->At1 ;
        MyATOM *a2 = CurrBond->At2 ;
        if ( a1->CurrStep <= a1->LastStep && a2->CurrStep <= a2->LastStep )
        {
            a1->CurrStep++ ;
            a2->CurrStep++ ;
            double ux = a1->Pts[0].x - a2->Pts[0].x ;
            double uy = a1->Pts[0].y - a2->Pts[0].y ;
            double uz = a1->Pts[0].z - a2->Pts[0].z ;
            a1->AuxPts[0].x += ux ;
            a1->AuxPts[0].y += uy ;
            a1->AuxPts[0].z += uz ;
            a2->AuxPts[0].x += ux ;
            a2->AuxPts[0].y += uy ;
            a2->AuxPts[0].z += uz ;
        };
        CurrBond = CurrBond->Next ;
    };
To fully benefit from these prefetchers, organize and access the data using one of the following methods:
Method 1:
• Organize the data so consecutive accesses can usually be found in the same 4-KByte page.
• Access the data in constant strides, forward or backward, to engage the IP prefetcher.
Method 2:
• Organize the data in consecutive lines.
• Access the data in increasing addresses, in sequential cache lines.
Example 3-56 demonstrates accesses to sequential cache lines that can benefit from the first-level cache prefetcher.
Example 3-56. Technique For Using L1 Hardware Prefetch
    unsigned int *p1, j, a, b;
    for (j = 0; j < num; j += 16)
    {
        a = p1[j];
        b = p1[j+1];
        // Use these two values
    }
By elevating the load operations from memory to the beginning of each iteration, it is likely that a significant part of the latency of the pair cache line transfer from memory to the second-level cache will be in parallel with the transfer of the first cache line.
The IP prefetcher uses only the lower 8 bits of the address to distinguish a specific address. If the code size of a loop is bigger than 256 bytes, two loads may appear similar in the lowest 8 bits and the IP prefetcher will be restricted. Therefore, if you have a loop bigger than 256 bytes, make sure that no two loads have the same lowest 8 bits in order to use the IP prefetcher.
3.7.3 Hardware Prefetching for Second-Level Cache
The Intel Core microarchitecture contains two second-level cache prefetchers:
• Streamer: Loads data or instructions from memory to the second-level cache. To use the streamer, organize the data or instructions in blocks of 128 bytes, aligned on 128 bytes. The first access to one of the two cache lines in this block while it is in memory triggers the streamer to prefetch the pair line. To software, the L2 streamer’s functionality is similar to the adjacent cache line prefetch mechanism found in processors based on Intel NetBurst microarchitecture.
• Data prefetch logic (DPL): DPL and L2 Streamer are triggered only by writeback memory type. They prefetch only inside page boundary (4 KBytes). Both L2 prefetchers can be triggered by software prefetch instructions and by prefetch request from DCU prefetchers. DPL can also be triggered by read for ownership (RFO) operations. The L2 Streamer can also be triggered by DPL requests for L2 cache misses.
Software can gain from organizing data both according to the instruction pointer and according to line strides. For example, for matrix calculations, columns can be prefetched by IP-based prefetches, and rows can be prefetched by DPL and the L2 streamer.
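The 128-byte organization that the streamer description calls for can be set up at allocation time. The following is a minimal sketch, assuming a C11 library that provides aligned_alloc; the helper names alloc_line_pairs and same_line_pair are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAIR_BYTES 128u   /* two adjacent 64-byte cache lines */

/* Allocate `count` blocks of 128 bytes aligned on a 128-byte boundary, so
 * that each block maps onto exactly one cache-line pair. The caller
 * releases the memory with free(). */
void *alloc_line_pairs(size_t count)
{
    /* C11 aligned_alloc: the requested size is a multiple of the
     * alignment by construction. */
    void *p = aligned_alloc(PAIR_BYTES, count * PAIR_BYTES);
    assert(p == NULL || ((uintptr_t)p % PAIR_BYTES) == 0);
    return p;
}

/* Return nonzero if a and b fall inside the same aligned 128-byte block,
 * i.e. if an access to one is in the line pair of the other. */
int same_line_pair(const void *a, const void *b)
{
    return ((uintptr_t)a / PAIR_BYTES) == ((uintptr_t)b / PAIR_BYTES);
}
```

With this layout, two fields that are used together can be placed 64 bytes apart inside one block, so that the first miss on either line lets the streamer bring in its pair.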
3.7.4 Cacheability Instructions
SSE2 provides additional cacheability instructions that extend those provided in SSE. The new cacheability instructions include:
• New streaming store instructions.
• New cache line flush instruction.
• New memory fencing instructions.
For more information, see Chapter 7, “Optimizing Cache Usage.”
3.7.5 REP Prefix and Data Movement
The REP prefix is commonly used with string move instructions for memory related library functions such as MEMCPY (using REP MOVSD) or MEMSET (using REP STOS). These STRING/MOV instructions with the REP prefixes are implemented in MS-ROM and have several implementation variants with different performance levels.
The specific variant of the implementation is chosen at execution time based on data layout, alignment and the counter (ECX) value. For example, MOVSB/STOSB with the REP prefix should be used with counter value less than or equal to three for best performance.
String MOVE/STORE instructions have multiple data granularities. For efficient data movement, larger data granularities are preferable. This means better efficiency can be achieved by decomposing an arbitrary counter value into a number of doublewords plus single byte moves with a count value less than or equal to 3.
Because software can use SIMD data movement instructions to move 16 bytes at a time, the following paragraphs discuss general guidelines for designing and implementing high-performance library functions such as MEMCPY(), MEMSET(), and MEMMOVE(). Four factors are to be considered:
• Throughput per iteration: If two pieces of code have approximately identical path lengths, efficiency favors choosing the instruction that moves larger pieces of data per iteration. Also, smaller code size per iteration will in general reduce overhead and improve throughput. Sometimes, this may involve a comparison of the relative overhead of an iterative loop structure versus using REP prefix for iteration.
• Address alignment: Data movement instructions with highest throughput usually have alignment restrictions, or they operate more efficiently if the destination address is aligned to its natural data size. Specifically, 16-byte moves need to ensure the destination address is aligned to 16-byte boundaries, and 8-byte moves perform better if the destination address is aligned to 8-byte boundaries. Frequently, moving at doubleword granularity performs better with addresses that are 8-byte aligned.
• REP string move vs. SIMD move: Implementing general-purpose memory functions using SIMD extensions usually requires adding some prolog code to ensure the availability of SIMD instructions, and preamble code to meet aligned data movement requirements at run time. Throughput comparison must also take into consideration the overhead of the prolog when considering a REP string implementation versus a SIMD approach.
• Cache eviction: If the amount of data to be processed by a memory routine approaches half the size of the last level on-die cache, temporal locality of the cache may suffer. Using streaming store instructions (for example: MOVNTQ, MOVNTDQ) can minimize the effect of flushing the cache. The threshold to start using a streaming store depends on the size of the last level cache. Determine the size using the deterministic cache parameter leaf of CPUID.
Techniques for using streaming stores for implementing a MEMSET()-type library must also consider that the application can benefit from this technique only if it has no immediate need to reference the target addresses. This assumption is easily upheld when testing a streaming-store implementation on a micro-benchmark configuration, but may be violated in a full-scale application.
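The cache-size lookup mentioned under cache eviction can be sketched as follows. The field layout of the deterministic cache parameters leaf (CPUID leaf 4) is architectural; everything else is an assumption of this sketch: the helper names are hypothetical, __get_cpuid_count is the GCC/Clang helper from cpuid.h, and the "more than half the last level cache" policy simply restates the guideline in this section. On builds that are not x86, or where leaf 4 reports no caches, the lookup returns 0.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper: __get_cpuid_count */
#endif

/* Decode a cache size in bytes from the EBX and ECX values returned by
 * CPUID leaf 4. Each field is encoded as "value minus one". */
uint64_t cache_bytes(uint32_t ebx, uint32_t ecx)
{
    uint64_t ways       = ((ebx >> 22) & 0x3FF) + 1;
    uint64_t partitions = ((ebx >> 12) & 0x3FF) + 1;
    uint64_t line_size  = (ebx & 0xFFF) + 1;
    uint64_t sets       = (uint64_t)ecx + 1;
    return ways * partitions * line_size * sets;
}

/* Return the size of the highest-level data or unified cache reported by
 * leaf 4, or 0 if no such information is available. */
uint64_t last_level_cache_bytes(void)
{
    uint64_t best = 0;
#if defined(__x86_64__) || defined(__i386__)
    unsigned best_level = 0;
    for (unsigned sub = 0; sub < 32; sub++) {
        unsigned eax, ebx, ecx, edx;
        if (!__get_cpuid_count(4, sub, &eax, &ebx, &ecx, &edx))
            break;
        unsigned type  = eax & 0x1F;         /* 0 means no more caches */
        unsigned level = (eax >> 5) & 0x7;
        if (type == 0)
            break;
        if (type != 2 && level >= best_level) {   /* 2 = instruction cache */
            best_level = level;
            best = cache_bytes(ebx, ecx);
        }
    }
#endif
    return best;
}

/* Hypothetical policy: prefer streaming stores once the amount of data
 * exceeds half of the last level cache. */
int use_streaming_stores(uint64_t bytes, uint64_t llc_bytes)
{
    return llc_bytes != 0 && bytes > llc_bytes / 2;
}
```

A library would typically run the lookup once at start-up and cache the result, since CPUID is a comparatively slow, serializing instruction.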
When applying general heuristics to the design of general-purpose, high-performance library routines, the following guidelines are useful when optimizing for an arbitrary counter value N and address alignment. Different techniques may be necessary for optimal performance, depending on the magnitude of N:
• When N is less than some small count (where the small count threshold will vary between microarchitectures; empirically, 8 may be a good value when optimizing for Intel NetBurst microarchitecture), each case can be coded directly without the overhead of a looping structure. For example, 11 bytes can be processed using two MOVSD instructions explicitly and a MOVSB with REP counter equaling 3.
• When N is not small but still less than some threshold value (which may vary for different microarchitectures, but can be determined empirically), an SIMD implementation using run-time CPUID and alignment prolog will likely deliver less throughput due to the overhead of the prolog. A REP string implementation should favor using a REP string of doublewords. To improve address alignment, a small piece of prolog code using MOVSB/STOSB with a count less than 4 can be used to peel off the non-aligned data moves before starting to use MOVSD/STOSD.
• When N is less than half the size of last level cache, throughput consideration may favor either:
  - An approach using a REP string with the largest data granularity because a REP string has little overhead for loop iteration, and the branch misprediction overhead in the prolog/epilogue code to handle address alignment is amortized over many iterations.
  - An iterative approach using the instruction with largest data granularity, where the overhead for SIMD feature detection, iteration overhead, and prolog/epilogue for alignment control can be minimized. The trade-off between these approaches may depend on the microarchitecture.
  An example of MEMSET() implemented using stosd for arbitrary counter value with the destination address aligned to doubleword boundary in 32-bit mode is shown in Example 3-57.
• When N is larger than half the size of the last level cache, using 16-byte granularity streaming stores with prolog/epilog for address alignment will likely be more efficient, if the destination addresses will not be referenced immediately afterwards.
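The decomposition described in the guidelines above (a byte prolog of fewer than 4 bytes, a run of doublewords, and a byte tail of at most 3) can be sketched in portable C. This is an illustration of the arithmetic only, not the REP STOSD implementation of the manual's own example; the function name memset_dwords is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n bytes at dst with the low byte of c. Byte stores are used only to
 * reach a 4-byte boundary and to finish a tail of at most 3 bytes. */
void *memset_dwords(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    unsigned char b = (unsigned char)c;
    uint32_t pattern = 0x01010101u * b;   /* replicate the byte 4 times */

    /* Prolog: peel off up to 3 bytes until d is doubleword aligned. */
    while (n > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = b;
        n--;
    }
    /* Body: n / 4 doubleword stores (the part REP STOSD would cover). */
    size_t dwords = n / 4;
    for (size_t i = 0; i < dwords; i++) {
        memcpy(d, &pattern, 4);   /* typically compiled to one 32-bit store */
        d += 4;
    }
    /* Epilog: n % 4 trailing bytes (the part REP STOSB would cover). */
    size_t tail = n % 4;
    assert(tail <= 3);
    for (size_t i = 0; i < tail; i++)
        *d++ = b;
    return dst;
}
```

For an aligned destination and n = 11, the body performs two doubleword stores and the epilog three byte stores, matching the 11-byte case described above.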
Example 3-57. REP STOSD with Arbitrary Count Size and 4-Byte-Aligned Destination
A ‘C’ example of Memset()
Equivalent Implementation Using REP STOSD
    void memset(void *dst,int c,size_t size)
    {
        char *d = (char *)dst;
        size_t i;
        for (i=0;i batchsize) mode1 = 0;
        }
    }
    void consumer_thread()
    {
        int mode2 = 0;
        int iter_num = workamount - batchsize;
        while (iter_num--) {
            WaitForSignal(&signal1);
            consume(buffs[mode2],count); // placeholder function
            Signal(&end1,1);
            mode2++;
            if (mode2 > batchsize) mode2 = 0;
        }
        for (i=0;i batchsize) mode2 = 0;
        }
    }
MULTICORE AND HYPER-THREADING TECHNOLOGY
8.6.3 Eliminate 64-KByte Aliased Data Accesses
The 64-KByte aliasing condition is discussed in Chapter 3. Memory accesses that satisfy the 64-KByte aliasing condition can cause excessive evictions of the first-level data cache. Eliminating 64-KByte aliased data accesses originating from each thread helps improve frequency scaling in general. Furthermore, it enables the first-level data cache to perform efficiently when HT Technology is fully utilized by software applications.
User/Source Coding Rule 31. (H impact, H generality) Minimize data access patterns that are offset by multiples of 64 KBytes in each thread.
The presence of 64-KByte aliased data access can be detected using Pentium 4 processor performance monitoring events. Appendix B includes an updated list of Pentium 4 processor performance metrics. These metrics are based on events accessed using the Intel VTune Performance Analyzer.
Performance penalties associated with 64-KByte aliasing are applicable mainly to current processor implementations of HT Technology or Intel NetBurst microarchitecture. The next section discusses memory optimization techniques that are applicable to multithreaded applications running on processors supporting HT Technology.
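Coding Rule 31 can be checked and applied with a little address arithmetic. The sketch below is an illustration under stated assumptions: it treats two accesses as aliased when their cache lines are offset by a multiple of 64 KBytes, it assumes a 64-byte cache line, and the helper names is_64k_aliased and unaliased_stride are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALIAS_SPAN 65536u   /* 64 KBytes */
#define LINE_BYTES 64u      /* assumed cache-line size */

/* Return nonzero if a and b lie in different cache lines that are offset
 * from each other by a multiple of 64 KBytes. */
int is_64k_aliased(const void *a, const void *b)
{
    uintptr_t la = (uintptr_t)a / LINE_BYTES;
    uintptr_t lb = (uintptr_t)b / LINE_BYTES;
    uintptr_t lines_per_span = ALIAS_SPAN / LINE_BYTES;
    return la != lb && (la % lines_per_span) == (lb % lines_per_span);
}

/* Given the size of a per-thread buffer, return a stride for laying out
 * consecutive buffers so that buffer k and buffer k+1 do not start a
 * multiple of 64 KBytes apart. Only neighbouring buffers are covered. */
size_t unaliased_stride(size_t size)
{
    assert(size > 0);
    /* Round up to a whole number of cache lines. */
    size_t stride = (size + LINE_BYTES - 1) / LINE_BYTES * LINE_BYTES;
    if (stride % ALIAS_SPAN == 0)
        stride += LINE_BYTES;   /* hypothetical padding of one line */
    return stride;
}
```

A typical use is to carve per-thread work areas out of one allocation at multiples of unaliased_stride(size) instead of at multiples of size.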
8.7 FRONT END OPTIMIZATION
For dual-core processors where the second-level unified cache is shared by two processor cores (Intel Core Duo processor and processors based on Intel Core microarchitecture), multi-threaded software should consider the increase in code working set due to two threads fetching code from the unified cache as part of front end and cache optimization. For quad-core processors based on Intel Core microarchitecture, the considerations that apply to Intel Core 2 Duo processors also apply.
8.7.1 Avoid Excessive Loop Unrolling
Unrolling loops can reduce the number of branches and improve the branch predictability of application code. Loop unrolling is discussed in detail in Chapter 3. Loop unrolling must be used judiciously. Be sure to consider the benefit of improved branch predictability and the cost of under-utilization of the loop stream detector (LSD).
User/Source Coding Rule 32. (M impact, L generality) Avoid excessive loop unrolling to ensure the LSD is operating efficiently.
8.8 AFFINITIES AND MANAGING SHARED PLATFORM RESOURCES
Modern OSes provide APIs and/or data constructs (e.g., affinity masks) that allow applications to manage certain shared resources, e.g., logical processors and Non-Uniform Memory Access (NUMA) memory sub-systems.
Before multithreaded software considers using affinity APIs, it should consider the recommendations in Table 8-2.
Table 8-2. Design-Time Resource Management Choices
Runtime Environment: A single-threaded application.
  Thread Scheduling/Processor Affinity Consideration: Support OS scheduler objectives on system response and throughput by letting OS scheduler manage scheduling. OS provides facilities for end user to optimize runtime specific environment.
  Memory Affinity Consideration: Not relevant; let OS do its job.
Runtime Environment: A multi-threaded application requiring: i) less than all processor resource in the system, ii) share system resource with other concurrent applications, iii) other concurrent applications may have higher priority.
  Thread Scheduling/Processor Affinity Consideration: Rely on OS default scheduler policy. Hard-coded affinity-binding will likely harm system response and throughput, and/or in some cases hurt application performance.
  Memory Affinity Consideration: Rely on OS default scheduler policy. Use API that could provide transparent NUMA benefit without managing NUMA explicitly.
Runtime Environment: A multi-threaded application requiring: i) foreground and higher priority, ii) uses less than all processor resource in the system, iii) share system resource with other concurrent applications, iv) but other concurrent applications have lower priority.
  Thread Scheduling/Processor Affinity Consideration: If application-customized thread binding policy is considered, a cooperative approach with OS scheduler should be taken instead of hard-coded thread affinity binding policy. For example, the use of SetThreadIdealProcessor() can provide a floating base to anchor a next-free-core binding policy for locality-optimized application binding policy, and cooperate with default OS policy.
  Memory Affinity Consideration: Use API that could provide transparent NUMA benefit without managing NUMA explicitly. Use performance event to diagnose non-local memory access issue if default OS policy causes performance issue.
Runtime Environment: A multi-threaded application runs in foreground, requiring all processor resource in the system and not sharing system resource with concurrent applications; MPI-based multi-threading.
  Thread Scheduling/Processor Affinity Consideration: Application-customized thread binding policy can be more efficient than default OS policy. Use performance event to help optimize locality and cache transfer opportunities. A multi-threaded application that employs its own explicit thread affinity-binding policy should deploy with some form of opt-in choice granted by the end-user or administrator. For example, permission to deploy explicit thread affinity-binding policy can be activated after permission is granted after installation.
  Memory Affinity Consideration: Application-customized memory affinity binding policy can be more efficient than default OS policy. Use performance event to diagnose non-local memory access issues related to either OS or custom policy.
8.8.1 Topology Enumeration of Shared Resources
Whether multithreaded software rides on OS scheduling policy or needs to use affinity APIs for customized resource management, understanding the topology of the shared platform resource is essential. The processor topology of logical processors (SMT), processor cores, and physical processors in the platform can be enumerated using information provided by CPUID. This is discussed in Chapter 8, “Multiple-Processor Management” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A. A white paper and reference code are also available from Intel.
8.8.2 Non-Uniform Memory Access
Platforms using two or more Intel Xeon processors based on Intel microarchitecture code name Nehalem support non-uniform memory access (NUMA) topology because each physical processor provides its own local memory controller. NUMA offers system memory bandwidth that can scale with the number of physical processors. System memory latency will exhibit asymmetric behavior depending on whether the memory transaction occurs locally in the same socket or remotely from another socket. Additionally, OS-specific constructs and/or implementation behavior may present additional complexity at the API level, so multi-threaded software may need to pay attention to memory allocation/initialization in a NUMA environment.
Generally, latency-sensitive workloads would favor memory traffic that stays local over remote. If multiple threads share a buffer, the programmer will need to pay attention to OS-specific behavior of memory allocation/initialization on a NUMA system.
Bandwidth-sensitive workloads will find it convenient to employ a data composition threading model and aggregate application threads executing in each socket to favor local traffic on a per-socket basis, to achieve overall bandwidth scalable with the number of physical processors.
The OS construct that provides the programming interface to manage local/remote NUMA traffic is referred to as memory affinity. Because the OS manages the mapping between physical addresses (populated by system RAM) and linear addresses (accessed by application software), and paging allows a physical page to be reassigned dynamically to a different linear address, proper use of memory affinity will require a great deal of OS-specific knowledge.
To simplify application programming, the OS may implement certain APIs and physical/linear address mapping to take advantage of NUMA characteristics transparently in certain situations. One common technique is for the OS to delay commit of physical memory page assignment until the first memory reference on that physical page is accessed in the linear address space by an application thread. This means that the allocation of a memory buffer in the linear address space by an application thread does not necessarily determine which socket will service local memory traffic when the memory allocation API returns to the program. However, the memory allocation API that supports this level of NUMA transparency varies across different OSes. For example, the portable C-language API “malloc” provides some degree of transparency on Linux*, whereas the API “VirtualAlloc” behaves similarly on Windows*. Different OSes may also provide memory allocation APIs that require explicit NUMA information, such that the mapping between linear address and local/remote memory traffic is fixed at allocation.
Example 8-9 shows how a multi-threaded application could undertake the least amount of effort dealing with OS-specific APIs and still take advantage of NUMA hardware capability. This parallel approach to memory buffer initialization is conducive to having each worker thread keep memory traffic local on NUMA systems.
Example 8-9. Parallel Memory Initialization Technique Using OpenMP and NUMA
    #ifdef _LINUX
    // Linux implements malloc to commit physical page at first touch/access
    buf1 = (char *) malloc(DIM*(sizeof (double))+1024);
    buf2 = (char *) malloc(DIM*(sizeof (double))+1024);
    buf3 = (char *) malloc(DIM*(sizeof (double))+1024);
    #endif
    #ifdef windows
    // Windows implements malloc to commit physical page at allocation, so use VirtualAlloc
    buf1 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
    buf2 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
    buf3 = (char *) VirtualAlloc(NULL, DIM*(sizeof (double))+1024, fAllocType, fProtect);
    #endif
    (continue)
    a = (double *) buf1;
    b = (double *) buf2;
    c = (double *) buf3;
    #pragma omp parallel
    {
        // use OpenMP threads to execute each iteration of the loop
        // number of OpenMP threads can be specified by default or via environment variable
        #pragma omp for private(num)
        // each loop iteration is dispatched to execute in different OpenMP threads using private iterator
        for(num=0;num
64-BIT MODE CODING GUIDELINES
> N) + ((R * C2/Divisor) >> N).
If “divisor” is known at compile time, (C2/Divisor) can be pre-computed into a congruent constant Cx = Ceil(C2/divisor); the quotient can then be computed by an integer multiply, followed by a shift:
    Q = (Dividend * Cx) >> N;
    R = Dividend - ((Dividend * Cx) >> N) * divisor;
The 128-bit IDIV/DIV instructions restrict the range of divisor, quotient, and remainder to be within 64 bits to avoid causing numerical exceptions. This presents a challenge for situations with any of the three having a value near the upper bounds of 64 bits and for dividend values nearing the upper bound of 128 bits.
This challenge can be overcome by choosing a larger shift count N, and extending the (Dividend * Cx) operation from the 128-bit range to the next computing-efficient range. For example, if (Dividend * Cx) is greater than 128 bits and N is greater than 63 bits, one can take advantage of computing bits 191:64 of the 192-bit result using 128-bit MUL without implementing a full 192-bit multiply.
A convenient way to choose the congruent constant Cx is as follows:
• If the range of dividend is within 64 bits: Nmin ~ BSR(Divisor) + 63.
• In situations of disparate dynamic range of quotient/remainder relative to the range of divisor, raise N accordingly so that quotient/remainder can be computed efficiently.
Consider the computation of quotient/remainder for the divisor 10^16 on unsigned dividends near the range of 64 bits. Example 9-1 illustrates using the "MUL r64" instruction to handle a 64-bit dividend with 64-bit divisors.
Example 9-1. Compute 64-bit Quotient and Remainder with 64-bit Divisor
    _Cx10to16:                    ; Congruent constant for 10^16 with shift count 'N' = 117
        DD 0c44de15ch             ; floor ( (2^117 / 10^16) + 1)
        DD 0e69594beh             ; Optimize length of Cx to reduce # of 128-bit multiplies
    _tento16:                     ; 10^16
        DD 6fc10000h
        DD 002386f2h

        mov r9, qword ptr [rcx]   ; load 64-bit dividend value
        mov rax, r9
        mov rsi, _Cx10to16        ; Congruent Constant for 10^16 with shift count 117
        mul [rsi]                 ; 128-bit multiply
        mov r10, qword ptr 8[rsi] ; load divisor 10^16
        shr rdx, 53
        mov r8, rdx
        mov rax, r8
        mul r10                   ; 128-bit multiply
        sub r9, rax
        jae remain
        sub r8, 1                 ; this may be off by one due to round up
        mov rax, r8
        mul r10                   ; 128-bit multiply
        sub r9, rax
    remain:
        mov rdx, r8               ; quotient
        mov rax, r9               ; remainder
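The multiply-and-shift sequence of Example 9-1 can also be sketched in C. The following is a minimal sketch rather than part of the original example: it assumes a compiler that provides the `unsigned __int128` extension (for example GCC or Clang), and the names `divmod_10to16`, `CX_10TO16` and `TEN_TO_16` are hypothetical. It uses the same congruent constant (shift count N = 117) and the same off-by-one correction.

```c
#include <stdint.h>

/* Congruent constant from Example 9-1: floor(2^117 / 10^16) + 1. */
#define CX_10TO16  0xe69594bec44de15cULL
#define TEN_TO_16  10000000000000000ULL   /* 10^16 = 002386f26fc10000h */

/* Hypothetical helper: quotient and remainder of a 64-bit dividend by 10^16. */
static void divmod_10to16(uint64_t dividend, uint64_t *quot, uint64_t *rem)
{
    /* 64x64 -> 128-bit multiply, then shift right by N = 117. */
    unsigned __int128 prod = (unsigned __int128)dividend * CX_10TO16;
    uint64_t q = (uint64_t)(prod >> 117);

    /* Cx is rounded up, so the estimate may be one too large.
       The check is done in 128 bits so that q * 10^16 cannot wrap. */
    unsigned __int128 back = (unsigned __int128)q * TEN_TO_16;
    if (back > dividend) {
        q -= 1;
        back -= TEN_TO_16;
    }
    *quot = q;
    *rem = dividend - (uint64_t)back;
}
```

Comparing the results against the compiler's own `/` and `%` operators over a range of dividends is a simple way to validate a constant and shift count chosen for a different divisor.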
Example 9-2 shows a similar technique to handle a 128-bit dividend with 64-bit divisors.
Example 9-2. Quotient and Remainder of 128-bit Dividend with 64-bit Divisor
        mov rax, qword ptr [rcx]   ; load bits 63:0 of 128-bit dividend from memory
        mov rsi, _Cx10to16         ; Congruent Constant for 10^16 with shift count 117
        mov r9, qword ptr [rsi]    ; load Congruent Constant
        mul r9                     ; 128-bit multiply
        xor r11, r11               ; clear accumulator
        mov rax, qword ptr 8[rcx]  ; load bits 127:64 of 128-bit dividend
        shr rdx, 53
        mov r10, rdx               ; initialize bits 127:64 of 192-bit result
        mul r9                     ; Accumulate to bits 191:128
        add rax, r10
        adc rdx, r11
        shr rax, 53
        shl rdx, 11
        or rdx, rax
        mov r8, qword ptr 8[rsi]   ; load Divisor 10^16
        mov r9, rdx                ; approximate quotient, may be off by 1
        mov rax, r8
        mul r9                     ; will quotient * divisor > dividend?
        sub rdx, qword ptr 8[rcx]
        sbb rax, qword ptr [rcx]
        jb remain
        sub r9, 1                  ; this may be off by one due to round up
        mov rax, r8                ; retrieve Divisor 10^16
        mul r9                     ; final quotient * divisor
        sub rax, qword ptr [rcx]
        sbb rdx, qword ptr 8[rcx]
    remain:
        mov rdx, r9                ; quotient
        neg rax                    ; remainder
The techniques illustrated in Example 9-1 and Example 9-2 can increase the speed of remainder/quotient calculation of 128-bit dividends to at or below the cost of a 32-bit integer division. Extending the technique above to deal with a divisor greater than 64 bits is relatively straightforward. One optimization worth considering is to choose shift count N > 128 bits. This can reduce the number of 128-bit MUL needed to compute the relevant upper bits of (Dividend * Cx).
9.2.5
Sign Extension to Full 64-Bits
When in 64-bit mode, processors based on Intel NetBurst microarchitecture can sign-extend to 64 bits in a single micro-op. In 64-bit mode, when the destination is 32 bits, the upper 32 bits must be zeroed. Zeroing the upper 32 bits requires an extra micro-op and is less optimal than sign extending to 64 bits. While sign extending to 64 bits makes the instruction one byte longer, it reduces the number of micro-ops that the trace cache has to store, improving performance. For example, to sign-extend a byte into ESI, use:
    movsx rsi, BYTE PTR[rax]
instead of:
    movsx esi, BYTE PTR[rax]
If the next instruction uses the 32-bit form of the esi register, the result will be the same. This optimization can also be used to break an unintended dependency. For example, if a program writes a 16-bit value to a register and then writes the register with an 8-bit value, if bits 15:8 of the destination are not needed, use the sign-extended version of writes when available. For example:
    mov r8w, r9w       ; Requires a merge to preserve bits 63:16.
    mov r8b, r10b      ; Requires a merge to preserve bits 63:8
Can be replaced with:
    movsx r8, r9w      ; If bits 63:8 do not need to be preserved.
    movsx r8, r10b     ; If bits 63:8 do not need to be preserved.
In the above example, the moves to R8W and R8B both require a merge to preserve the rest of the bits in the register. There is an implicit real dependency on R8 between the 'MOV R8W, R9W' and 'MOV R8B, R10B'. Using MOVSX breaks the real dependency and leaves only the output dependency, which the processor can eliminate through renaming.
For processors based on Intel Core microarchitecture, zeroing the upper 32 bits is faster than sign-extending to 64 bits. For processors based on Intel microarchitecture code name Nehalem, zeroing or sign-extending the upper bits is a single micro-op.
9.3
ALTERNATE CODING RULES FOR 64-BIT MODE
9.3.1
Use 64-Bit Registers Instead of Two 32-Bit Registers for 64-Bit Arithmetic Result
Legacy 32-bit mode offers the ability to support extended precision integer arithmetic (such as 64-bit arithmetic). However, 64-bit mode offers native support for 64-bit arithmetic. When 64-bit integers are desired, use the 64-bit forms of arithmetic instructions.
In 32-bit legacy mode, getting a 64-bit result from a 32-bit by 32-bit integer multiply requires three registers; the result is stored in 32-bit chunks in the EDX:EAX pair. When the instruction is available in 64-bit mode, using the 32-bit version of the instruction is not the optimal implementation if a 64-bit result is desired. Use the extended registers. For example, the following code sequence loads the 32-bit values sign-extended into the 64-bit registers and performs a multiply:
    movsx rax, DWORD PTR[x]
    movsx rcx, DWORD PTR[y]
    imul rax, rcx
The 64-bit version above is more efficient than using the following 32-bit version:
    mov eax, DWORD PTR[x]
    mov ecx, DWORD PTR[y]
    imul ecx
In the 32-bit case above, EAX is required to be a source. The result ends up in the EDX:EAX pair instead of in a single 64-bit register.
Assembly/Compiler Coding Rule 68. (ML impact, M generality) Use the 64-bit versions of multiply for 32-bit integer multiplies that require a 64-bit result.
To add two 64-bit numbers in 32-bit legacy mode, the ADD instruction followed by the ADC instruction is used. For example, to add two 64-bit variables (X and Y), the following four instructions could be used:
    mov eax, DWORD PTR[X]
    mov edx, DWORD PTR[X+4]
    add eax, DWORD PTR[Y]
    adc edx, DWORD PTR[Y+4]
The result will end up in the two-register pair EDX:EAX. In 64-bit mode, the above sequence can be reduced to the following:
    mov rax, QWORD PTR[X]
    add rax, QWORD PTR[Y]
The result is stored in rax. One register is required instead of two.
Assembly/Compiler Coding Rule 69. (ML impact, M generality) Use the 64-bit versions of add for 64-bit adds.
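At the source level, the two rules above correspond to expressing the arithmetic directly in 64-bit types. The following is a minimal sketch with hypothetical function names; which instructions are actually emitted depends on the compiler and its options, so the comments describe the intent rather than a guaranteed result.

```c
#include <stdint.h>

/* 32-bit by 32-bit signed multiply with a 64-bit result. Widening both
   operands first allows a compiler targeting 64-bit mode to use a single
   64-bit multiply instead of producing the result in EDX:EAX. */
int64_t mul32_to_64(int32_t x, int32_t y)
{
    return (int64_t)x * (int64_t)y;
}

/* 64-bit add: in 64-bit mode this needs one register and one ADD,
   instead of the ADD/ADC pair required in 32-bit legacy mode. */
uint64_t add64(uint64_t x, uint64_t y)
{
    return x + y;
}
```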
9.3.2
CVTSI2SS and CVTSI2SD
In processors based on Intel Core microarchitecture and later, CVTSI2SS and CVTSI2SD are improved significantly over those in Intel NetBurst microarchitecture, in terms of latency and throughput. The improvements apply equally to 64-bit and 32-bit versions.
9.3.3
Using Software Prefetch
Intel recommends that software developers follow the recommendations in Chapter 3 and Chapter 7 when considering the choice of organizing data access patterns to take advantage of the hardware prefetcher (versus using software prefetch).
Assembly/Compiler Coding Rule 70. (L impact, L generality) If software prefetch instructions are necessary, use the prefetch instructions provided by SSE.
CHAPTER 10
SSE4.2 AND SIMD PROGRAMMING FOR TEXT-PROCESSING/LEXING/PARSING
String/text processing spans a discipline that often employs techniques different from traditional SIMD integer vector processing. Many of the traditional string/text algorithms are character based, where characters may be represented by encodings (or code points) of fixed or variable byte sizes. Textual data represents a vast amount of raw data and often carries contextual information. The contextual information embedded in raw textual data often requires algorithmic processing dealing with a wide range of attributes, such as character values, character positions, character encoding formats, subsetting of character sets, strings of explicit or implicit lengths, tokens, and delimiters; contextual objects may be represented by sequential characters within a pre-defined character subset (e.g. decimal-valued strings); textual streams may contain embedded state transitions separating objects of different contexts (e.g. tag-delimited fields).
Traditional integer SIMD vector instructions may, in some simpler situations, succeed in speeding up simple string processing functions. SSE4.2 includes four new instructions that offer advances to computational algorithms targeting string/text processing, lexing and parsing of either unstructured or structured textual data.
10.1
SSE4.2 STRING AND TEXT INSTRUCTIONS
SSE4.2 provides four instructions, PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM, that can accelerate string and text processing by combining the efficiency of SIMD programming techniques and the lexical primitives that are embedded in these 4 instructions. Simple examples of these instructions include string length determination, direct string comparison, string case handling, delimiter/token processing, locating word boundaries, and locating sub-string matches in large text blocks. Sophisticated application of SSE4.2 can accelerate XML parsing and Schema validation.
Processor support for SSE4.2 is indicated by the feature flag value returned in ECX [bit 20] after executing the CPUID instruction with an EAX input value of 1 (i.e. SSE4.2 is supported if CPUID.01H:ECX.SSE4_2 [bit 20] = 1). Therefore, software must verify that CPUID.01H:ECX.SSE4_2 [bit 20] is set before using these 4 instructions. (Verifying CPUID.01H:ECX.SSE4_2 = 1 is also required before using PCMPGTQ or CRC32. Verifying CPUID.01H:ECX.POPCNT [bit 23] = 1 is required before using the POPCNT instruction.)
These string/text processing instructions work by performing up to 256 comparison operations on text fragments. Each text fragment can be 16 bytes. They can handle fragments of different formats: either byte or word elements. Each of these four instructions can be configured to perform four types of parallel comparison operation on two text fragments.
The aggregated intermediate result of a parallel comparison of two text fragments becomes a bit pattern: 16 bits for processing byte elements or 8 bits for word elements. These instructions provide additional flexibility, using bit fields in the immediate operand of the instruction syntax, to configure a unary transformation (polarity) on the first intermediate result.
Lastly, the instruction's immediate operand offers an output selection control to further configure the final result produced by the instruction. The rich configurability of these instructions is summarized in Figure 10-1.
[Figure 10-1: block diagram of PCMPxSTRy XMM1, XMM2/M128, imm. Imm[1:0] selects the data format of Fragment1 and Fragment2 (00b: unsigned bytes, 01b: unsigned words, 10b: signed bytes, 11b: signed words); Imm[3:2] selects the compare operation that produces IntRes1; Imm[5:4] selects the polarity that produces IntRes2; Imm[6] selects the output, which is either an index result in ECX or a mask result in XMM0.]
Figure 10-1. SSE4.2 String/Text Instruction Immediate Operand Control
The PCMPxSTRI instructions produce the final result as an integer index in ECX; the PCMPxSTRM instructions produce the final result as a bit mask in the XMM0 register. The PCMPISTRy instructions support processing string/text fragments using implicit length control via null termination for handling string/text of unknown size. The PCMPESTRy instructions support explicit length control via the EDX:EAX register pair to specify the length of the text fragments in the source operands.
The first intermediate result, IntRes1, is an aggregated result of bit patterns from parallel comparison operations done on pairs of data elements from each text fragment, according to the imm[3:2] bit field encoding, see Table 10-1.

Table 10-1. SSE4.2 String/Text Instructions Compare Operation on N-elements
| Imm[3:2] | Name          | IntRes1[i] is TRUE if                                                                                                            | Potential Usage                                         |
|----------|---------------|----------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------|
| 00B      | Equal Any     | Element i in fragment2 matches any element j in fragment1                                                                        | Tokenization, XML parser                                |
| 01B      | Ranges        | Element i in fragment2 is within any range pairs specified in fragment1                                                          | Subsetting, Case handling, XML parser, Schema validation |
| 10B      | Equal Each    | Element i in fragment2 matches element i in fragment1                                                                            | Strcmp()                                                |
| 11B      | Equal Ordered | Element i and subsequent, consecutive valid elements in fragment2 match fully or partially with fragment1 starting from element 0 | Substring Searches, KMP, Strstr()                       |
Input data element format selection using imm[1:0] can support signed or unsigned byte/word elements. The bit field imm[5:4] allows applying a unary transformation on IntRes1, see Table 10-2.
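Table 10-2. SSE4.2 String/Text Instructions Unary Transformation on IntRes1
| Imm[5:4] | Name          | IntRes2[i] =                                                               | Potential Usage |
|----------|---------------|----------------------------------------------------------------------------|-----------------|
| 00B      | No Change     | IntRes1[i]                                                                 |                 |
| 01B      | Invert        | ~IntRes1[i]                                                                |                 |
| 10B      | No Change     | IntRes1[i]                                                                 |                 |
| 11B      | Mask Negative | IntRes1[i] if element i of fragment2 is invalid, otherwise ~IntRes1[i]     |                 |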
The output selection field, imm[6], is described in Table 10-3.

Table 10-3. SSE4.2 String/Text Instructions Output Selection Imm[6]
| Imm[6] | Instruction | Final Result                                                                                                              | Potential Usage |
|--------|-------------|---------------------------------------------------------------------------------------------------------------------------|-----------------|
| 0B     | PCMPxSTRI   | ECX = offset of least significant bit set in IntRes2 if IntRes2 != 0, otherwise ECX = number of data elements per 16 bytes |                 |
| 0B     | PCMPxSTRM   | XMM0 = ZeroExtend(IntRes2);                                                                                               |                 |
| 1B     | PCMPxSTRI   | ECX = offset of most significant bit set in IntRes2 if IntRes2 != 0, otherwise ECX = number of data elements per 16 bytes  |                 |
| 1B     | PCMPxSTRM   | Data element i of XMM0 = SignExtend(IntRes2[i]);                                                                          |                 |
The comparison operation on each data element pair is defined in Table 10-4. Table 10-4 defines the type of comparison operation between valid data elements (last row of Table 10-4) and boundary conditions when the fragment in a source operand may contain invalid data elements (rows 1 through 3 of Table 10-4). Arithmetic comparisons are performed only if both data elements are valid elements in fragment1 and fragment2, as shown in row 4 of Table 10-4.

Table 10-4. SSE4.2 String/Text Instructions Element-Pair Comparison Definition
| Fragment1 Element | Fragment2 Element | Imm[3:2]= 00B, Equal Any | Imm[3:2]= 01B, Ranges | Imm[3:2]= 10B, Equal Each | Imm[3:2]= 11B, Equal Ordered |
|-------------------|-------------------|--------------------------|-----------------------|---------------------------|------------------------------|
| Invalid           | Invalid           | Force False              | Force False           | Force True                | Force True                   |
| Invalid           | Valid             | Force False              | Force False           | Force False               | Force True                   |
| Valid             | Invalid           | Force False              | Force False           | Force False               | Force False                  |
| Valid             | Valid             | Compare                  | Compare               | Compare                   | Compare                      |
The string and text processing instructions provide several aids to handle end-of-string situations, see Table 10-5. Additionally, the PCMPxSTRy instructions are designed to not require 16-byte alignment to simplify text processing requirements.

Table 10-5. SSE4.2 String/Text Instructions Eflags Behavior
| EFLAGs | Description                                  | Potential Usage                                  |
|--------|----------------------------------------------|--------------------------------------------------|
| CF     | Reset if IntRes2 = 0; Otherwise set          | When CF=0, ECX= # of data elements to scan next  |
| ZF     | Reset if entire 16-byte fragment2 is valid   | likely end-of-string                             |
| SF     | Reset if entire 16-byte fragment1 is valid   |                                                  |
| OF     | IntRes2[0];                                  |                                                  |
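How the immediate fields of Tables 10-1 through 10-4 combine can be illustrated with a small scalar model. The following is a hypothetical sketch, not a definition of the instruction: the names `valid_len` and `pcmpistri_model` are invented for illustration, it models only PCMPISTRI on unsigned bytes, and of the four aggregation modes it covers only Equal Any and Ranges.

```c
#include <stdint.h>

/* Valid elements of an implicit-length byte fragment: those before the first null. */
static int valid_len(const uint8_t frag[16])
{
    int n = 0;
    while (n < 16 && frag[n] != 0)
        n++;
    return n;
}

/* Hypothetical scalar model of PCMPISTRI on unsigned bytes (imm[1:0] = 00B).
   Models only Equal Any (imm[3:2] = 00B) and Ranges (imm[3:2] = 01B);
   returns -1 for the aggregation modes it does not model. */
int pcmpistri_model(const uint8_t frag1[16], const uint8_t frag2[16], unsigned imm8)
{
    int len1 = valid_len(frag1);
    int len2 = valid_len(frag2);
    unsigned cmp = (imm8 >> 2) & 3;
    unsigned pol = (imm8 >> 4) & 3;
    unsigned msb = (imm8 >> 6) & 1;
    uint16_t res1 = 0, res2;
    int i, j;

    if (cmp > 1)
        return -1;

    /* IntRes1: for these two modes, an invalid element on either side
       forces FALSE (Table 10-4), so only valid elements are visited. */
    for (i = 0; i < len2; i++) {
        int hit = 0;
        if (cmp == 0) {
            for (j = 0; j < len1; j++)
                hit |= (frag2[i] == frag1[j]);
        } else {
            for (j = 0; j + 1 < len1; j += 2)
                hit |= (frag1[j] <= frag2[i] && frag2[i] <= frag1[j + 1]);
        }
        if (hit)
            res1 |= (uint16_t)(1u << i);
    }

    /* IntRes2: polarity (Table 10-2). */
    if (pol == 1)
        res2 = (uint16_t)~res1;                          /* invert all bits */
    else if (pol == 3)
        res2 = (uint16_t)(res1 ^ ((1u << len2) - 1u));   /* invert valid bits only */
    else
        res2 = res1;

    /* Index output (Table 10-3): 16 when no bit of IntRes2 is set. */
    if (res2 == 0)
        return 16;
    if (msb) {
        for (i = 15; i >= 0; i--)
            if (res2 & (1u << i))
                return i;
    } else {
        for (i = 0; i < 16; i++)
            if (res2 & (1u << i))
                return i;
    }
    return 16;
}
```

For instance, with fragment1 holding the range pair (0x01, 0xff) and an immediate of 14h (unsigned bytes, Ranges, Invert, least significant index), the model returns the offset of the first null byte in fragment2, or 16 when the fragment holds no null.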
10.1.1
CRC32
The CRC32 instruction computes the 32-bit cyclic redundancy checksum signature for a byte/word/dword or qword stream of data. It can also be used as a hash function. For example, a dictionary uses hash indices to de-reference strings. The CRC32 instruction can be easily adapted for use in this situation.
Example 10-1 shows a straightforward hash function that can be used to evaluate the hash index of a string to populate a hash table. Typically, the hash index is derived from the hash value by taking the remainder of the hash value modulo the size of a hash table.
Example 10-1. A Hash Function Example
    unsigned int hash_str(unsigned char* pStr)
    {
        unsigned int hVal = (unsigned int)(*pStr++);
        while (*pStr)
        {
            hVal = (hVal * CONST_A) + (hVal >> 24) + (unsigned int)(*pStr++);
        }
        return hVal;
    }
The CRC32 instruction can be used to derive an alternate hash function. Example 10-2 takes advantage of the 32-bit granular CRC32 instruction to update the signature value of the input data stream. For strings of small to moderate sizes, using the hardware accelerated CRC32 can be twice as fast as Example 10-1.
Example 10-2. Hash Function Using CRC32
    static unsigned cn_7e = 0x7efefeff, Cn_81 = 0x81010100;

    unsigned int hash_str_32_crc32x(unsigned char* pStr)
    {
        unsigned *pDW = (unsigned *) &pStr[1];
        unsigned short *pWd = (unsigned short *) &pStr[1];
        unsigned int tmp, hVal = (unsigned int)(*pStr);
        if( !pStr[1]) ;
        else {
            tmp = ((pDW[0] +cn_7e ) ^(pDW[0]^ -1)) & Cn_81;
            while ( !tmp ) // loop until there is byte in *pDW had 0x00
            {
                hVal = _mm_crc32_u32 (hVal, *pDW ++);
                tmp = ((pDW[0] +cn_7e ) ^(pDW[0]^ -1)) & Cn_81;
            };
            if(!pDW[0]);
            else if(pDW[0] < 0x100)
            {   // finish last byte that's non-zero
                hVal = _mm_crc32_u8 (hVal, pDW[0]);
            }
            else if(pDW[0] < 0x10000)
            {   // finish last two byte that's non-zero
                hVal = _mm_crc32_u16 (hVal, pDW[0]);
            }
            else
            {   // finish last three byte that's non-zero
                hVal = _mm_crc32_u32 (hVal, pDW[0]);
            }
        }
        return hVal;
    }
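To make the role of the CRC32 instruction in such a hash function concrete, the following portable sketch models one byte-granular accumulation step in software. It is an illustration under stated assumptions, not the hardware path: the function names are hypothetical, and the model relies on the CRC32 instruction using the CRC-32C (Castagnoli) polynomial, whose bit-reflected form is 0x82F63B78. On a processor with SSE4.2, the `_mm_crc32_u8` intrinsic used in Example 10-2 should produce the same value as `crc32c_step_u8` for the same inputs.

```c
#include <stdint.h>

/* Software model of one byte step of the CRC32 instruction.
   0x82F63B78 is the CRC-32C polynomial in bit-reflected form. */
static uint32_t crc32c_step_u8(uint32_t crc, uint8_t byte)
{
    int k;
    crc ^= byte;
    for (k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

/* Hypothetical byte-granular string hash built on the step above.
   The caller supplies the initial signature value. */
uint32_t hash_str_crc32c(const unsigned char *str, uint32_t seed)
{
    uint32_t h = seed;
    while (*str)
        h = crc32c_step_u8(h, *str++);
    return h;
}
```

A byte-granular loop like this trades speed for simplicity; it never reads past the terminating null, whereas the dword-granular approach of Example 10-2 processes four bytes per step.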
10.2
USING SSE4.2 STRING AND TEXT INSTRUCTIONS
String libraries provided by high-level languages or as part of a system library are used in a wide range of situations across applications and privileged system software. These situations can be accelerated using a replacement string library that implements PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM.
Although a system-provided string library provides standardized string handling functionality and interfaces, most situations dealing with structured document processing require considerably more sophistication, optimization, and services not available from system-provided string libraries. For example, structured document processing software often architects different class objects to provide building block functionality to service specific needs of the application. Often an application may choose to disperse equivalent string library services into separate classes (string, lexer, parser) or integrate memory management capability into string handling/lexing/parsing objects.
PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM instructions are general-purpose primitives that software can use to build replacement string libraries or build a class hierarchy to provide lexing/parsing services for structured document processing. XML parsing and schema validation are examples of the latter situations.
Unstructured, raw text/string data consists of characters, and has no natural alignment preferences. Therefore, PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM instructions are architected to not require the 16-byte alignment restrictions of other 128-bit SIMD integer vector processing instructions.
With respect to memory alignment, PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM support unaligned memory loads like other unaligned 128-bit memory access instructions, e.g. MOVDQU.
Unaligned memory accesses may encounter special situations that require additional coding techniques, depending on whether the code runs in ring 3 application space or in privileged space. Specifically, an unaligned 16-byte load may cross a page boundary. Section 10.2.1 discusses a technique that application code can use. Section 10.2.2 discusses the situation that string library functions need to deal with. Section 10.3 gives detailed examples of using PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM instructions to implement equivalent functionality of several string library functions in situations where application code has control over memory buffer allocation.
10.2.1
Unaligned Memory Access and Buffer Size Management
In application code, the size requirements for memory buffer allocation should consider unaligned SIMD memory semantics and application usage. For certain types of application usage, it may be desirable to make distinctions between the valid buffer range limit and the valid application data size (e.g. a video frame). The former must be greater than or equal to the latter.
To support algorithms requiring unaligned 128-bit SIMD memory accesses, memory buffer allocation by a caller function should consider adding some pad space so that a callee function can safely use the address pointer with unaligned 128-bit SIMD memory operations. The minimal padding size should be the width of the SIMD register that might be used in conjunction with unaligned SIMD memory access.
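The padding recommendation can be sketched as a small allocation helper. This is a minimal sketch with hypothetical names (`alloc_padded_text`, `SIMD_PAD`), assuming 128-bit (16-byte) SIMD loads; a real application would fold the padding into its own buffer management.

```c
#include <stdlib.h>
#include <string.h>

#define SIMD_PAD 16  /* width in bytes of an XMM register */

/* Hypothetical helper: copy len bytes of text into a buffer padded so that
   an unaligned 16-byte load starting at any valid data byte stays within
   the allocation. Returns NULL if the allocation fails. */
char *alloc_padded_text(const char *src, size_t len)
{
    char *buf = malloc(len + SIMD_PAD);
    if (buf == NULL)
        return NULL;
    memcpy(buf, src, len);
    memset(buf + len, 0, SIMD_PAD);   /* zeroed pad also null-terminates the text */
    return buf;
}
```

Here the valid application data size is `len`, while the valid buffer range is `len + SIMD_PAD`, which keeps the distinction described above explicit.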
10.2.2
Unaligned Memory Access and String Library
String library functions may be used by application code or privileged code. String library functions must be careful not to violate memory access rights. Therefore, a replacement string library that employs SIMD unaligned access must employ special techniques to ensure that no memory access violation occurs.
Section 10.3.6 provides an example of a replacement string library function implemented with SSE4.2 and demonstrates a technique to use 128-bit unaligned memory access without unintentionally crossing a page boundary.
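One simple building block for such a technique is a test of whether a 16-byte load at a given address would span two pages. The sketch below is an illustration only and is not taken from Section 10.3.6; the function name is hypothetical and a 4-KByte page size is assumed.

```c
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096u  /* assumption: 4-KByte pages */

/* Returns nonzero if a 16-byte load starting at addr would span two pages,
   i.e. if fewer than 16 bytes remain between addr and the end of its page. */
int load16_crosses_page(uintptr_t addr)
{
    return (addr & (PAGE_SIZE_ASSUMED - 1u)) > (PAGE_SIZE_ASSUMED - 16u);
}
```

A library function could use a check of this kind to fall back to byte-granular processing for the few bytes before a page boundary and use 16-byte loads elsewhere.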
10.3
SSE4.2 APPLICATION CODING GUIDELINE AND EXAMPLES
Software using SSE4.2 instructions must use the CPUID feature flag mechanism to verify the processor's support for SSE4.2. Details can be found in CHAPTER 12 of Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 1 and in CPUID of CHAPTER 3 in Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2A.
In the following sections, we use several examples in string/text processing of progressive complexity to illustrate the basic techniques of adapting the SIMD approach to implement string/text processing using PCMPxSTRy instructions in SSE4.2. For simplicity, we will consider string/text in byte data format in situations where caller functions have allocated sufficient buffer size to support unaligned 128-bit SIMD loads from memory without encountering side-effects of crossing page boundaries.
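The CPUID check can be sketched as follows. This is a minimal sketch that assumes a GCC- or Clang-compatible compiler on an x86 target, where the `<cpuid.h>` header provides the `__get_cpuid` helper; the function name `cpu_has_sse42` is hypothetical. Other toolchains expose CPUID through different interfaces.

```c
#include <cpuid.h>   /* GCC/Clang helper header; an assumption of this sketch */

/* Returns 1 if CPUID.01H:ECX.SSE4_2[bit 20] is set, 0 otherwise. */
int cpu_has_sse42(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                     /* CPUID leaf 1 is not available */
    return (int)((ecx >> 20) & 1u);
}
```

Software would typically perform this check once at start-up and select between an SSE4.2 code path and a fallback path based on the result.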
10.3.1
Null Character Identification (Strlen equivalent)
The most widely used string function is probably strlen(). One can view the lexing requirement of strlen() as identifying the null character in a text block of unknown size (end of string condition). A brute-force, byte-granular implementation fetches data inefficiently by loading one byte at a time.
An optimized implementation using general-purpose instructions can take advantage of dword operations in a 32-bit environment (and qword operations in a 64-bit environment) to reduce the number of iterations.
A 32-bit assembly implementation of strlen() is shown in Example 10-3. The peak execution throughput of handling the EOS condition is determined by eight ALU instructions in the main loop.
Example 10-3. Strlen() Using General-Purpose Instructions
    int strlen_asm(const char* s1)
    {int len = 0;
        _asm{
            mov ecx, s1
            test ecx, 3             ; test addr aligned to dword
            je short _main_loop1    ; dword aligned loads would be faster
        _malign_str1:
            mov al, byte ptr [ecx]  ; read one byte at a time
            add ecx, 1
            test al, al             ; if we find a null, go calculate the length
            je short _byte3a
            test ecx, 3             ; test if addr is now aligned to dword
            jne short _malign_str1  ; if not, repeat
            align 16
        _main_loop1:                ; read each 4-byte block and check for a NULL char in the dword
            mov eax, [ecx]          ; read 4 byte to reduce loop count
            mov edx, 7efefeffh
            add edx, eax
            xor eax, -1
            xor eax, edx
            add ecx, 4              ; increment address pointer by 4
            test eax, 81010100h     ; if no null code in 4-byte stream, do the next 4 bytes
            je short _main_loop1
            ; there is a null char in the dword we just read,
            ; since we already advanced pointer ecx by 4, and the dword is lost
            mov eax, [ecx-4]        ; re-read the dword that contain at least a null char
            test al, al             ; if byte0 is null
            je short _byte0a        ; the least significant byte is null
            test ah, ah             ; if byte1 is null
            je short _byte1a
            test eax, 00ff0000h     ; if byte2 is null
            je short _byte2a
            test eax, 0ff000000h    ; if byte3 is null
            je short _byte3a
            jmp short _main_loop1
        _byte3a:
            ; we already found the null, but pointer already advanced by 1
            lea eax, [ecx-1]        ; load effective address corresponding to null code
            mov ecx, s1
            sub eax, ecx            ; difference between null code and start address
            jmp short _resulta
        _byte2a:
            lea eax, [ecx-2]
            mov ecx, s1
            sub eax, ecx
            jmp short _resulta
        _byte1a:
            lea eax, [ecx-3]
            mov ecx, s1
            sub eax, ecx
            jmp short _resulta
        _byte0a:
            lea eax, [ecx-4]
            mov ecx, s1
            sub eax, ecx
        _resulta:
            mov len, eax            ; store result
        }
        return len;
    }

The equivalent functionality of EOS identification can be implemented using PCMPISTRI. Example 10-4 shows a simplistic SSE4.2 implementation to scan a text block by loading 16-byte text fragments and locating the null termination character. Example 10-5 shows the optimized SSE4.2 implementation that demonstrates the importance of using memory disambiguation to improve instruction-level parallelism.
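The dword-granular null test in the main loop of Example 10-3 can also be written in C. The following is a minimal sketch with a hypothetical function name. It assumes that the buffer is padded as discussed in Section 10.2.1, because the aligned 4-byte load may read up to three bytes past the null terminator. The constants are those of Example 10-3: adding 7efefeffh and masking with 81010100h flags a dword that may contain a zero byte. A byte value of 80h in the most significant position can raise a false alarm, which is why each byte is re-checked before a length is returned.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C rendering of the technique in Example 10-3.
   Assumes the string buffer is padded so that 4-byte loads stay in bounds. */
size_t strlen_dword(const char *s)
{
    const char *p = s;

    /* Byte-granular scan until the address is dword aligned. */
    while (((uintptr_t)p & 3u) != 0) {
        if (*p == 0)
            return (size_t)(p - s);
        p++;
    }

    for (;;) {
        uint32_t x, t;
        int i;

        memcpy(&x, p, 4);                       /* read 4 bytes at a time */
        t = ((x + 0x7efefeffu) ^ ~x) & 0x81010100u;
        if (t != 0) {
            /* Possible null in this dword: confirm byte by byte. */
            for (i = 0; i < 4; i++)
                if (p[i] == 0)
                    return (size_t)(p - s) + (size_t)i;
        }
        p += 4;                                 /* no null: next dword */
    }
}
```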
Example 10-4. Sub-optimal PCMPISTRI Implementation of EOS handling
    static char ssch2[16]= {0x1, 0xff, 0x00, }; // range values for non-null characters

    int strlen_un_optimized(const char* s1)
    {int len = 0;
        _asm{
            mov eax, s1
            movdqu xmm2, ssch2      ; load character pair as range (0x01 to 0xff)
            xor ecx, ecx            ; initial offset to 0
        _loopc:
            add eax, ecx            ; update addr pointer to start of text fragment
            pcmpistri xmm2, [eax], 14h ; unsigned bytes, ranges, invert, lsb index returned to ecx
            ; if there is a null char in the 16Byte fragment at [eax], zf will be set.
            ; if all 16 bytes of the fragment are non-null characters, ECX will return 16,
            jnz short _loopc        ; xmm1 has no null code, ecx has 16, continue search
            ; we have a null code in xmm1, ecx has the offset of the null code
            add eax, ecx            ; add ecx to the address of the last fragment2/xmm1
            mov edx, s1             ; retrieve effective address of the input string
            sub eax, edx            ; the string length
            mov len, eax            ; store result
        }
        return len;
    }
The code sequence shown in Example 10-4 has a loop consisting of three instructions. From a performance tuning perspective, the loop iteration has a loop-carry dependency because the address update is done using the result (ECX value) of a previous loop iteration. This loop-carry dependency deprives the out-of-order engine of the capability to have multiple iterations of the instruction sequence making forward progress. The latency of memory loads, the latency of these instructions, and any bypass delay cannot be amortized by OOO execution in the presence of a loop-carry dependency.
A simple optimization technique to eliminate the loop-carry dependency is shown in Example 10-5.
Using the memory disambiguation technique to eliminate the loop-carry dependency, the cumulative latency exposure of the 3-instruction sequence of Example 10-5 is amortized over multiple iterations, and the net cost of executing each iteration (handling 16 bytes) is less than 3 cycles. In contrast, handling 4 bytes of string data using 8 ALU instructions in Example 10-3 will also take a little less than 3 cycles per iteration, whereas each iteration of the code sequence in Example 10-4 will take more than 10 cycles because of the loop-carry dependency.
Example 10-5. Strlen() Using PCMPISTRI without Loop-Carry Dependency
    int strlen_sse4_2(const char* s1)
    {int len = 0;
        _asm{
            mov eax, s1
            movdqu xmm2, ssch2      ; load character pair as range (0x01 to 0xff)
            xor ecx, ecx            ; initial offset to 0
            sub eax, 16             ; address arithmetic to eliminate extra instruction and a branch
        _loopc:
            add eax, 16             ; adjust address pointer and disambiguate load address for each iteration
            pcmpistri xmm2, [eax], 14h ; unsigned bytes, ranges, invert, lsb index returned to ecx
            ; if there is a null char in [eax] fragment, zf will be set.
            ; if all 16 bytes of the fragment are non-null characters, ECX will return 16,
            jnz short _loopc        ; ECX will be 16 if there is no null byte in [eax], so we disambiguate
        _endofstring:
            add eax, ecx            ; add ecx to the address of the last fragment
            mov edx, s1             ; retrieve effective address of the input string
            sub eax, edx            ; the string length
            mov len, eax            ; store result
        }
        return len;
    }
SSE4.2 Coding Rule 5. (H impact, H generality) Loop-carry dependency that depends on the ECX result of PCMPESTRI/PCMPESTRM/PCMPISTRI/PCMPISTRM for address adjustment must be minimized. Isolate code paths that expect the ECX result to be 16 (bytes) or 8 (words), and replace these values of ECX with constants in address adjustment expressions to take advantage of memory disambiguation hardware.
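The same structure can be expressed with compiler intrinsics. The following is a minimal sketch, not the manual's implementation: it assumes a GCC- or Clang-compatible compiler (for the `target` attribute), a processor that supports SSE4.2, and a caller that pads the buffer as described in Section 10.2.1; the function name is hypothetical. The address advances by the constant 16 on every iteration, so the next load address never depends on the index returned by PCMPISTRI.

```c
#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stddef.h>

/* imm8 = 14h: unsigned bytes, Ranges, Invert, least significant index */
#define STRLEN_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_RANGES | \
                     _SIDD_NEGATIVE_POLARITY | _SIDD_LEAST_SIGNIFICANT)

/* Hypothetical intrinsic version of the technique in Example 10-5.
   Assumes the buffer is padded so that 16-byte loads stay in bounds. */
__attribute__((target("sse4.2")))
size_t strlen_sse42_intrin(const char *s)
{
    /* Range pair (0x01, 0xff): every non-null byte is inside the range. */
    const __m128i range = _mm_setr_epi8(0x01, (char)0xff, 0, 0, 0, 0, 0, 0,
                                        0, 0, 0, 0, 0, 0, 0, 0);
    const char *p = s;

    for (;;) {
        __m128i frag = _mm_loadu_si128((const __m128i *)p);
        /* ZF is set when the fragment contains a null byte. */
        if (_mm_cmpistrz(range, frag, STRLEN_MODE))
            return (size_t)(p - s)
                 + (size_t)_mm_cmpistri(range, frag, STRLEN_MODE);
        p += 16;   /* constant stride, independent of the ECX result */
    }
}
```

Compilers generally combine the `_mm_cmpistrz` and `_mm_cmpistri` calls on identical operands into a single PCMPISTRI instruction, but that is a property of the compiler rather than something the source code guarantees.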
10.3.2
White-Space-Like Character Identification
Character-granular-based text processing algorithms have developed techniques to handle specific tasks to remedy the efficiency issue of character-granular approaches. One such technique is using look-up tables for character subset classification. For example, some applications may need to separate alphanumeric characters from white-space-like characters. More than one character may be treated as white-space characters.
Example 10-6 illustrates a simple situation of identifying white-space-like characters for the purpose of marking the beginning and end of consecutive non-white-space characters.
Example 10-6. WordCnt() Using C and Byte-Scanning Technique
    // Counting words involves locating the boundary of contiguous non-whitespace characters.
    // Different software may choose its own mapping of white space character set.
    // This example employs a simple definition for tutorial purpose:
    // Non-whitespace character set will consider: A-Z, a-z, 0-9, and the apostrophe mark '
    // The example uses a simple technique to map characters into bit patterns of square waves
    // we can simply count the number of falling edges
    static char alphnrange[16]= {0x27, 0x27, 0x30, 0x39, 0x41, 0x5a, 0x61, 0x7a, 0x0};
    static char alp_map8[32] = {0x0, 0x0, 0x0, 0x0, 0x80,0x0,0xff, 0x3,0xfe, 0xff, 0xff, 0x7, 0xfe, 0xff, 0xff, 0x7}; // 32 byte lookup table, 1s map to bit patterns of alpha numerics in alphnrange

    int wordcnt_c(const char* s1)
    {int i, j, cnt = 0;
        char cc, cc2;
        char flg[3]; // capture a wavelet to locate a falling edge
        cc2 = cc = s1[0];
        // use the compacted bit pattern to consolidate multiple comparisons into one look up
        if( alp_map8[cc>>3] & ( 1>3] & ( 1 3) ] |= (1 >3] & (1
vmovups xmm0, mem
vinsertf128 ymm0, ymm0, mem+16, 1
Convert 32-byte stores as follows:
vmovups mem, ymm0 ->
vmovups mem, xmm0
vextractf128 mem+16, ymm0, 1
The following intrinsics are available to handle unaligned 32-byte memory operations using 16-byte memory accesses:
_mm256_loadu2_m128 (float const * addr_hi, float const * addr_lo);
_mm256_loadu2_m128d (double const * addr_hi, double const * addr_lo);
_mm256_loadu2_m128i (__m128i const * addr_hi, __m128i const * addr_lo);
_mm256_storeu2_m128 (float * addr_hi, float * addr_lo, __m256 a);
_mm256_storeu2_m128d (double * addr_hi, double * addr_lo, __m256d a);
_mm256_storeu2_m128i (__m128i * addr_hi, __m128i * addr_lo, __m256i a);
Example 11-12 shows two implementations for SAXPY with unaligned addresses. Alternative 1 uses 32-byte loads and alternative 2 uses 16-byte loads. These code samples are executed with two source buffers, src1 and src2, at a 4-byte offset from 32-byte alignment, and a destination buffer, DST, that is 32-byte aligned. Using two 16-byte memory operations in lieu of a 32-byte memory access performs faster.
Example 11-12. SAXPY Implementations for Unaligned Data Addresses

Alternative 1: AVX with 32-byte memory operation
mov rax, src1
mov rbx, src2
mov rcx, dst
mov rdx, len
xor rdi, rdi
vbroadcastss ymm0, alpha
start_loop:
vmovups ymm1, [rax + rdi]
vmulps ymm1, ymm1, ymm0
vmovups ymm2, [rbx + rdi]
vaddps ymm1, ymm1, ymm2
vmovups [rcx + rdi], ymm1
vmovups ymm1, [rax+rdi+32]
vmulps ymm1, ymm1, ymm0
vmovups ymm2, [rbx+rdi+32]
vaddps ymm1, ymm1, ymm2
vmovups [rcx+rdi+32], ymm1
add rdi, 64
cmp rdi, rdx
jl start_loop

Alternative 2: AVX using two 16-byte memory operations
mov rax, src1
mov rbx, src2
mov rcx, dst
mov rdx, len
xor rdi, rdi
vbroadcastss ymm0, alpha
start_loop:
vmovups xmm2, [rax+rdi]
vinsertf128 ymm2, ymm2, [rax+rdi+16], 1
vmulps ymm1, ymm0, ymm2
vmovups xmm2, [rbx+rdi]
vinsertf128 ymm2, ymm2, [rbx+rdi+16], 1
vaddps ymm1, ymm1, ymm2
vmovaps [rcx+rdi], ymm1
vmovups xmm2, [rax+rdi+32]
vinsertf128 ymm2, ymm2, [rax+rdi+48], 1
vmulps ymm1, ymm0, ymm2
vmovups xmm2, [rbx+rdi+32]
vinsertf128 ymm2, ymm2, [rbx+rdi+48], 1
vaddps ymm1, ymm1, ymm2
vmovups [rcx+rdi+32], ymm1
add rdi, 64
cmp rdi, rdx
jl start_loop
OPTIMIZATIONS FOR INTEL® AVX, FMA AND AVX2
Assembly/Compiler Coding Rule 74. (M impact, H generality) Align data to a 32-byte boundary when possible. Prefer store alignment over load alignment.
11.6.3 Prefer Aligned Stores Over Aligned Loads
There are cases where it is possible to align only a subset of the processed data buffers. In these cases, aligning data buffers used for store operations usually yields better performance than aligning data buffers used for load operations. Unaligned stores are likely to cause greater performance degradation than unaligned loads, since there is a very high penalty on stores to a split cache line that crosses pages. This penalty is estimated at 150 cycles. Loads that cross a page boundary are executed at retirement. In Example 11-12, with all three addresses unaligned, the unaligned store address can reduce SAXPY performance to about one quarter of the aligned case.
11.7 L1D CACHE LINE REPLACEMENTS
When a load misses the L1D Cache, a cache line with the requested data is brought from a higher memory hierarchy level. In memory intensive code where the L1 DCache is always active, replacing a cache line in the L1 DCache may delay other loads. In Intel microarchitecture code name Sandy Bridge and Ivy Bridge, the penalty for 32-byte loads may be higher than the penalty for 16-byte loads. Therefore, memory intensive Intel AVX code with 32-byte loads and with a data set larger than the L1 DCache may be slower than similar code with 16-byte loads. When Example 11-12 is run with a data set that resides in the L2 Cache, the 16-byte memory access implementation is slightly faster than the 32-byte memory operation. Be aware that the relative merit of 16-byte memory accesses versus 32-byte memory accesses is implementation specific across generations of microarchitectures. With Intel microarchitecture code name Haswell, the L1 DCache can support two 32-byte fetches each cycle, so this cache line replacement concern does not apply.
11.8 4K ALIASING
4-KByte memory aliasing occurs when the code stores to one memory location and shortly after that loads from a different memory location with a 4-KByte offset between them. For example, a load to linear address 0x400020 follows a store to linear address 0x401020. The load and store have the same value for bits 5 - 11 of their addresses and the accessed byte offsets should have partial or complete overlap. 4K aliasing may have a five-cycle penalty on the load latency. This penalty may be significant when 4K aliasing happens repeatedly and the loads are on the critical path. If the load spans two cache lines it might be delayed until the conflicting store is committed to the cache. Therefore 4K aliasing that happens on repeated unaligned Intel AVX loads incurs a higher performance penalty. To detect 4K aliasing, use the LD_BLOCKS_PARTIAL.ADDRESS_ALIAS event that counts the number of times Intel AVX loads were blocked due to 4K aliasing. To resolve 4K aliasing, try the following methods in the following order:
• Align data to 32 bytes.
• Change offsets between input and output buffers if possible.
• Use 16-byte memory accesses on memory which is not 32-byte aligned.
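The aliasing condition described in this section can be approximated in software when auditing buffer layouts. The following is a minimal sketch under stated assumptions: the function name may_4k_alias and its simplified rule (different addresses whose byte ranges overlap after reduction to the offset within a 4-KByte page) are illustrative and not part of any Intel tool or API; the hardware's exact matching logic may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified model (an assumption, not the hardware's exact rule): a load is
 * flagged as a 4K-aliasing candidate when it does not really overlap an
 * earlier store, yet the two byte ranges overlap once both addresses are
 * reduced to their offset within a 4-KByte page (address bits 0-11). */
bool may_4k_alias(uint64_t store_addr, uint32_t store_size,
                  uint64_t load_addr, uint32_t load_size)
{
    assert(store_size > 0 && load_size > 0);

    /* A genuine overlap is a store-forwarding case, not an aliasing case. */
    bool real_overlap = store_addr < load_addr + load_size &&
                        load_addr < store_addr + store_size;
    if (real_overlap)
        return false;

    uint64_t s = store_addr & 0xFFF;  /* offset of the store inside its page */
    uint64_t l = load_addr & 0xFFF;   /* offset of the load inside its page  */

    /* Ranges that wrap past the end of a page are ignored in this sketch. */
    return s < l + load_size && l < s + store_size;
}
```

With the addresses used in the text, may_4k_alias(0x401020, 32, 0x400020, 32) reports a candidate, while moving the load to 0x400040 does not. This is why changing the offset between input and output buffers is one of the listed remedies.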
11.9 CONDITIONAL SIMD PACKED LOADS AND STORES
The VMASKMOV instruction conditionally moves packed data elements to/from memory, depending on the mask bits associated with each data element. The mask bit for each data element is the most significant bit of the corresponding element in the mask register. When performing a mask load, the returned value is 0 for elements which have a corresponding mask value of 0. The mask store instruction writes to memory only the elements with a corresponding mask value of 1, while preserving memory values for elements with a corresponding mask value of 0. Faults can occur only for memory accesses that are required by the mask. Faults do not occur due to referencing any memory location if the corresponding mask bit value for that memory location is zero. For example, no faults are detected if the mask bits are all zero. The following figure shows an example for a mask load and a mask store which does not cause a fault. In this example, the mask register for the load operation is ymm1 and the mask register for the store operation is ymm2. When using masked load or store consider the following:
• The address of a VMASKMOV store is considered as resolved only after the mask is known. Loads that follow a masked store can be blocked until the mask value is known (unless relieved by the memory disambiguator).
• If the mask is not all 1 or all 0, loads that depend on the masked store have to wait until the store data is written to the cache. If the mask is all 1 the data can be forwarded from the masked store to the dependent loads. If the mask is all 0 the loads do not depend on the masked store.
• Masked loads including an illegal address range do not result in an exception if the range is under a zero mask value. However, the processor may take a multi-hundred-cycle "assist" to determine that no part of the illegal range has a one mask value. This assist may occur even when the mask is "zero" and it seems obvious to the programmer that the load should not be executed.
When using VMASKMOV, consider the following:
• Use VMASKMOV only in cases where VMOVUPS cannot be used.
• Use VMASKMOV on 32-byte aligned addresses if possible.
• If possible use a valid address range for masked loads, even if the illegal part is masked with zeros.
• Determine the mask as early as possible.
• Avoid store-forwarding issues by performing loads prior to a VMASKMOV store if possible.
• Be aware of mask values that would cause the VMASKMOV instruction to require an assist (if an assist is required, the latency of VMASKMOV to load data will increase dramatically):
  - Load data using VMASKMOV with a mask value selecting 0 elements from an illegal address will require an assist.
  - Load data using VMASKMOV with a mask value selecting 0 elements from a legal address expressed in some addressing form (e.g. [base+index], disp[base+index]) will require an assist.
With processors based on the Skylake microarchitecture, the performance characteristics of VMASKMOV instructions have the following notable items:
• Loads that follow a masked store are no longer blocked until the mask value is known.
• Store data using VMASKMOV with a mask value permitting 0 elements to be written to an illegal address will require an assist.
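The architectural behavior of the mask load and mask store described at the start of this section can be modeled in scalar C. The sketch below is illustrative only: the function names masked_load32 and masked_store32 are hypothetical, the element width is fixed at 32 bits, and the model covers the data semantics (most significant bit as mask, zero fill on load, memory preserved on store), not the fault, assist, or forwarding behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked load of n 32-bit elements. An element is read only
 * when the most significant bit of its mask element is set; otherwise the
 * destination element becomes 0 and src[i] is never accessed. */
void masked_load32(uint32_t *dst, const uint32_t *src,
                   const uint32_t *mask, size_t n)
{
    assert(dst != NULL && mask != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (mask[i] >> 31) ? src[i] : 0;
}

/* Scalar model of a masked store: only elements whose mask MSB is set are
 * written; all other memory elements keep their previous values. */
void masked_store32(uint32_t *dst, const uint32_t *src,
                    const uint32_t *mask, size_t n)
{
    assert(src != NULL && mask != NULL);
    for (size_t i = 0; i < n; i++)
        if (mask[i] >> 31)
            dst[i] = src[i];
}
```

Note that a mask element such as 0x7FFFFFFF counts as "0" in this model, because only the top bit of each element is examined.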
11.9.1 Conditional Loops
VMASKMOV enables vectorization of loops that contain conditional code. There are two main benefits in using VMASKMOV over the scalar implementation in these cases:
• VMASKMOV code is vectorized.
• Branch mispredictions are eliminated.
Below is a conditional loop C code:
Example 11-13. Loop with Conditional Expression
for(int i = 0; i < miBufferWidth; i++)
{
  if(A[i]>0)
  {
    B[i] = (E[i]*C[i]);
  }
  else
  {
    B[i] = (E[i]*D[i]);
  }
}
Example 11-14. Handling Loop Conditional with VMASKMOV

Scalar
float* pA = A;
float* pB = B;
float* pC = C;
float* pD = D;
float* pE = E;
uint64 len = (uint64) (miBufferWidth)*sizeof(float);
__asm
{
mov rax, pA
mov rbx, pB
mov rcx, pC
mov rdx, pD
mov rsi, pE
mov r8, len
//xmm8 all zeros
vxorps xmm8, xmm8, xmm8
xor r9,r9
loop1:
vmovss xmm1, [rax+r9]
vcomiss xmm1, xmm8
jbe a_le
a_gt:
vmovss xmm4, [rcx+r9]
jmp mul
a_le:
vmovss xmm4, [rdx+r9]
mul:
vmulss xmm4, xmm4, [rsi+r9]
vmovss [rbx+r9], xmm4
add r9, 4
cmp r9, r8
jl loop1
}

AVX using VMASKMOV
float* pA = A;
float* pB = B;
float* pC = C;
float* pD = D;
float* pE = E;
uint64 len = (uint64) (miBufferWidth)*sizeof(float);
__asm
{
mov rax, pA
mov rbx, pB
mov rcx, pC
mov rdx, pD
mov rsi, pE
mov r8, len
//ymm8 all zeros
vxorps ymm8, ymm8, ymm8
//ymm9 all ones
vcmpps ymm9, ymm8, ymm8, 0
xor r9,r9
loop1:
vmovups ymm1, [rax+r9]
vcmpps ymm2, ymm8, ymm1, 1
vmaskmovps ymm4, ymm2, [rcx+r9]
vxorps ymm2, ymm2, ymm9
vmaskmovps ymm5, ymm2, [rdx+r9]
vorps ymm4, ymm4, ymm5
vmulps ymm4,ymm4, [rsi+r9]
vmovups [rbx+r9], ymm4
add r9, 32
cmp r9, r8
jl loop1
}
The performance of the scalar version of Example 11-14 is sensitive to branch mispredictions and can be an order of magnitude slower than the VMASKMOV version, which has no data-dependent branches.
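The select-by-mask idea behind the VMASKMOV version can be illustrated per element in portable C. This is a sketch under stated assumptions: the names cond_mul_branchy and cond_mul_masked are hypothetical, and unlike VMASKMOV the masked variant reads both C[i] and D[i] for every element, so it models only how the compare mask, the inverted mask, and the OR combine into one result.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit-exact conversions between float and its 32-bit pattern. */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Reference version: the loop of Example 11-13, one branch per element. */
void cond_mul_branchy(float *B, const float *A, const float *C,
                      const float *D, const float *E, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (A[i] > 0)
            B[i] = E[i] * C[i];
        else
            B[i] = E[i] * D[i];
    }
}

/* Mask-based version: the comparison yields an all-ones or all-zeros mask,
 * which selects C[i] or D[i] by AND/OR before a single multiply. */
void cond_mul_masked(float *B, const float *A, const float *C,
                     const float *D, const float *E, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t m = (A[i] > 0) ? 0xFFFFFFFFu : 0u; /* role of VCMPPS           */
        uint32_t c = float_bits(C[i]) & m;          /* C where the mask is set  */
        uint32_t d = float_bits(D[i]) & ~m;         /* D under inverted mask    */
        B[i] = bits_float(c | d) * E[i];            /* role of VORPS + VMULPS   */
    }
}
```

Both functions produce identical results for every input, since exactly one of the two masked operands is nonzero in each element before the OR.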
11.10 MIXING INTEGER AND FLOATING-POINT CODE
Integer SIMD functionalities in Intel AVX instructions are limited to 128-bit. Some algorithms use mixed integer SIMD and floating-point SIMD instructions. Therefore, porting such legacy 128-bit code into 256-bit AVX code requires special attention. For example, PALIGNR (Packed Align Right) is an integer SIMD instruction that is useful for arranging data elements for integer and floating-point code. But the VPALIGNR instruction does not have a corresponding 256-bit instruction in AVX. There are two approaches to consider when porting legacy code consisting of mostly floating-point with some integer operations into 256-bit AVX code:
• Locate a 256-bit AVX alternative to replace the critical 128-bit integer SIMD instructions if such AVX instructions exist. This is more likely to be true with integer SIMD instructions that re-arrange data elements.
• Mix 128-bit AVX and 256-bit AVX instructions.
The performance gain from these two approaches may vary. Where possible, use the first method, since this method utilizes the full 256-bit vector width. In case the code is mostly integer, convert the code from 128-bit SSE to 128-bit AVX instructions and gain from the Non-destructive Source (NDS) feature.
Example 11-15. Three-Tap Filter in C Code
for(int i = 0; i < len -2; i++)
{
  pOut[i] = A[i]*coeff[0]+A[i+1]*coeff[1]+A[i+2]*coeff[2];
}
Example 11-16. Three-Tap Filter with 128-bit Mixed Integer and FP SIMD
xor ebx, ebx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
mov r15, coeffs
movss xmm2, [r15] //load coeff 0
shufps xmm2, xmm2, 0 //broadcast coeff 0
movss xmm1, [r15+4] //load coeff 1
shufps xmm1, xmm1, 0 //broadcast coeff 1
movss xmm0, [r15+8] //load coeff 2
shufps xmm0, xmm0, 0 //broadcast coeff 2
movaps xmm5, [rdi] //xmm5={A[n+3],A[n+2],A[n+1],A[n]}
loop_start:
movaps xmm6, [rdi+16] //xmm6={A[n+7],A[n+6],A[n+5],A[n+4]}
movaps xmm7, xmm6
movaps xmm8, xmm6
add rdi, 16 //inPtr+=16
add rbx, 4 //loop counter
palignr xmm7, xmm5, 4 //xmm7={A[n+4],A[n+3],A[n+2],A[n+1]}
palignr xmm8, xmm5, 8 //xmm8={A[n+5],A[n+4],A[n+3],A[n+2]}
mulps xmm5, xmm2 //xmm5={C0*A[n+3],C0*A[n+2],C0*A[n+1],C0*A[n]}
mulps xmm7, xmm1 //xmm7={C1*A[n+4],C1*A[n+3],C1*A[n+2],C1*A[n+1]}
mulps xmm8, xmm0 //xmm8={C2*A[n+5],C2*A[n+4],C2*A[n+3],C2*A[n+2]}
addps xmm7, xmm5
addps xmm7, xmm8
movaps [rsi], xmm7
movaps xmm5, xmm6
add rsi, 16 //outPtr+=16
cmp rbx, rcx
jl loop_start
Example 11-17. 256-bit AVX Three-Tap Filter Code with VSHUFPS
xor ebx, ebx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
mov r15, coeffs
vbroadcastss ymm2, [r15] //load and broadcast coeff 0
vbroadcastss ymm1, [r15+4] //load and broadcast coeff 1
vbroadcastss ymm0, [r15+8] //load and broadcast coeff 2
loop_start:
vmovaps ymm5, [rdi]
// ymm5={A[n+7],A[n+6],A[n+5],A[n+4];
//       A[n+3],A[n+2],A[n+1],A[n]}
vshufps ymm6,ymm5,[rdi+16],0x4e
// ymm6={A[n+9],A[n+8],A[n+7],A[n+6];
//       A[n+5],A[n+4],A[n+3],A[n+2]}
vshufps ymm7,ymm5,ymm6,0x99
// ymm7={A[n+8],A[n+7],A[n+6],A[n+5];
//       A[n+4],A[n+3],A[n+2],A[n+1]}
vmulps ymm3,ymm5,ymm2
// ymm3={C0*A[n+7],C0*A[n+6],C0*A[n+5],C0*A[n+4];
//       C0*A[n+3],C0*A[n+2],C0*A[n+1],C0*A[n]}
vmulps ymm9,ymm7,ymm1
// ymm9={C1*A[n+8],C1*A[n+7],C1*A[n+6],C1*A[n+5];
//       C1*A[n+4],C1*A[n+3],C1*A[n+2],C1*A[n+1]}
vmulps ymm4,ymm6,ymm0
// ymm4={C2*A[n+9],C2*A[n+8],C2*A[n+7],C2*A[n+6];
//       C2*A[n+5],C2*A[n+4],C2*A[n+3],C2*A[n+2]}
vaddps ymm8, ymm3, ymm4
vaddps ymm10, ymm8, ymm9
vmovaps [rsi], ymm10
add rdi, 32 //inPtr+=32
add rbx, 8 //loop counter
add rsi, 32 //outPtr+=32
cmp rbx, rcx
jl loop_start
Example 11-18. Three-Tap Filter Code with Mixed 256-bit AVX and 128-bit AVX Code
xor ebx, ebx
mov rcx, len
mov rdi, inPtr
mov rsi, outPtr
mov r15, coeffs
vbroadcastss ymm2, [r15] //load and broadcast coeff 0
vbroadcastss ymm1, [r15+4] //load and broadcast coeff 1
vbroadcastss ymm0, [r15+8] //load and broadcast coeff 2
vmovaps xmm3, [rdi] //xmm3={A[n+3],A[n+2],A[n+1],A[n]}
loop_start:
vmovaps xmm4, [rdi+16] //xmm4={A[n+7],A[n+6],A[n+5],A[n+4]}
vmovaps xmm5, [rdi+32] //xmm5={A[n+11],A[n+10],A[n+9],A[n+8]}
vinsertf128 ymm3, ymm3, xmm4, 1
// ymm3={A[n+7],A[n+6],A[n+5],A[n+4];
//       A[n+3],A[n+2],A[n+1],A[n]}
vpalignr xmm6, xmm4, xmm3, 4
// xmm6={A[n+4],A[n+3],A[n+2],A[n+1]}
vpalignr xmm7, xmm5, xmm4, 4
// xmm7={A[n+8],A[n+7],A[n+6],A[n+5]}
vinsertf128 ymm6,ymm6,xmm7,1
// ymm6={A[n+8],A[n+7],A[n+6],A[n+5];
//       A[n+4],A[n+3],A[n+2],A[n+1]}
vpalignr xmm8,xmm4,xmm3,8
// xmm8={A[n+5],A[n+4],A[n+3],A[n+2]}
vpalignr xmm9, xmm5, xmm4, 8
// xmm9={A[n+9],A[n+8],A[n+7],A[n+6]}
vinsertf128 ymm8, ymm8, xmm9,1
// ymm8={A[n+9],A[n+8],A[n+7],A[n+6];
//       A[n+5],A[n+4],A[n+3],A[n+2]}
vmulps ymm3,ymm3,ymm2
// ymm3={C0*A[n+7],C0*A[n+6],C0*A[n+5],C0*A[n+4];
//       C0*A[n+3],C0*A[n+2],C0*A[n+1],C0*A[n]}
vmulps ymm6,ymm6,ymm1
// ymm6={C1*A[n+8],C1*A[n+7],C1*A[n+6],C1*A[n+5];
//       C1*A[n+4],C1*A[n+3],C1*A[n+2],C1*A[n+1]}
vmulps ymm8,ymm8,ymm0
// ymm8={C2*A[n+9],C2*A[n+8],C2*A[n+7],C2*A[n+6];
//       C2*A[n+5],C2*A[n+4],C2*A[n+3],C2*A[n+2]}
vaddps ymm3, ymm3, ymm6
vaddps ymm3, ymm3, ymm8
vmovaps [rsi], ymm3
vmovaps xmm3, xmm5
add rdi, 32 //inPtr+=32
add rbx, 8 //loop counter
add rsi, 32 //outPtr+=32
cmp rbx, rcx
jl loop_start
Example 11-17 uses 256-bit VSHUFPS to replace the PALIGNR in the 128-bit mixed SSE code. This is almost 70% faster than the 128-bit mixed SSE code of Example 11-16 and slightly ahead of Example 11-18. For code that includes integer instructions and is written with 256-bit Intel AVX instructions, replace the integer instruction with floating-point instructions that have similar functionality and performance. If there is no similar floating-point instruction, consider using a 128-bit Intel AVX instruction to perform the required integer operation.
11.11 HANDLING PORT 5 PRESSURE
Port 5 in Intel microarchitecture code name Sandy Bridge includes shuffle units and it frequently becomes a performance bottleneck. Sometimes it is possible to replace shuffle instructions that dispatch only on port 5 with different instructions and improve performance by reducing port 5 pressure. For more information, see Table 2-15.
11.11.1 Replace Shuffles with Blends
There are a few cases where shuffles such as VSHUFPS or VPERM2F128 can be replaced by blend instructions. Intel AVX shuffles are executed only on port 5, while blends are also executed on port 0. Therefore, replacing shuffles with blends could reduce port 5 pressure. The following figure shows how a VSHUFPS is implemented using VBLENDPS.
The following example shows two implementations of an 8x8 matrix transpose. In both cases, the bottleneck is port 5 pressure. Alternative 1 uses 12 vshufps instructions that are executed only on port 5. Alternative 2 replaces eight of the vshufps instructions with the vblendps instruction, which can be executed on port 0.
Example 11-19. 8x8 Matrix Transpose - Replace Shuffles with Blends

Alternative 1: 256-bit AVX using VSHUFPS
mov rcx, inpBuf
mov rdx, outBuf
mov r10, NumOfLoops
mov rbx, rdx
loop1:
vmovaps ymm9, [rcx]
vmovaps ymm10, [rcx+32]
vmovaps ymm11, [rcx+64]
vmovaps ymm12, [rcx+96]
vmovaps ymm13, [rcx+128]
vmovaps ymm14, [rcx+160]
vmovaps ymm15, [rcx+192]
vmovaps ymm2, [rcx+224]
vunpcklps ymm6, ymm9, ymm10
vunpcklps ymm1, ymm11, ymm12
vunpckhps ymm8, ymm9, ymm10
vunpcklps ymm0, ymm13, ymm14
vunpcklps ymm9, ymm15, ymm2
vshufps ymm3, ymm6, ymm1, 0x4E
vshufps ymm10, ymm6, ymm3, 0xE4
vshufps ymm6, ymm0, ymm9, 0x4E
vunpckhps ymm7, ymm11, ymm12
vshufps ymm11, ymm0, ymm6, 0xE4
vshufps ymm12, ymm3, ymm1, 0xE4
vperm2f128 ymm3, ymm10, ymm11, 0x20
vmovaps [rdx], ymm3
vunpckhps ymm5, ymm13, ymm14
vshufps ymm13, ymm6, ymm9, 0xE4
vunpckhps ymm4, ymm15, ymm2
vperm2f128 ymm2, ymm12, ymm13, 0x20
vmovaps 32[rdx], ymm2
vshufps ymm14, ymm8, ymm7, 0x4E
vshufps ymm15, ymm14, ymm7, 0xE4
vshufps ymm7, ymm5, ymm4, 0x4E
vshufps ymm8, ymm8, ymm14, 0xE4
vshufps ymm5, ymm5, ymm7, 0xE4
vperm2f128 ymm6, ymm8, ymm5, 0x20
vmovaps 64[rdx], ymm6
vshufps ymm4, ymm7, ymm4, 0xE4
vperm2f128 ymm7, ymm15, ymm4, 0x20
vmovaps 96[rdx], ymm7
vperm2f128 ymm1, ymm10, ymm11, 0x31
vperm2f128 ymm0, ymm12, ymm13, 0x31
vmovaps 128[rdx], ymm1
vperm2f128 ymm5, ymm8, ymm5, 0x31
vperm2f128 ymm4, ymm15, ymm4, 0x31
vmovaps 160[rdx], ymm0
vmovaps 192[rdx], ymm5
vmovaps 224[rdx], ymm4
dec r10
jnz loop1

Alternative 2: AVX replacing VSHUFPS with VBLENDPS
mov rcx, inpBuf
mov rdx, outBuf
mov r10, NumOfLoops
mov rbx, rdx
loop1:
vmovaps ymm9, [rcx]
vmovaps ymm10, [rcx+32]
vmovaps ymm11, [rcx+64]
vmovaps ymm12, [rcx+96]
vmovaps ymm13, [rcx+128]
vmovaps ymm14, [rcx+160]
vmovaps ymm15, [rcx+192]
vmovaps ymm2, [rcx+224]
vunpcklps ymm6, ymm9, ymm10
vunpcklps ymm1, ymm11, ymm12
vunpckhps ymm8, ymm9, ymm10
vunpcklps ymm0, ymm13, ymm14
vunpcklps ymm9, ymm15, ymm2
vshufps ymm3, ymm6, ymm1, 0x4E
vblendps ymm10, ymm6, ymm3, 0xCC
vshufps ymm6, ymm0, ymm9, 0x4E
vunpckhps ymm7, ymm11, ymm12
vblendps ymm11, ymm0, ymm6, 0xCC
vblendps ymm12, ymm3, ymm1, 0xCC
vperm2f128 ymm3, ymm10, ymm11, 0x20
vmovaps [rdx], ymm3
vunpckhps ymm5, ymm13, ymm14
vblendps ymm13, ymm6, ymm9, 0xCC
vunpckhps ymm4, ymm15, ymm2
vperm2f128 ymm2, ymm12, ymm13, 0x20
vmovaps 32[rdx], ymm2
vshufps ymm14, ymm8, ymm7, 0x4E
vblendps ymm15, ymm14, ymm7, 0xCC
vshufps ymm7, ymm5, ymm4, 0x4E
vblendps ymm8, ymm8, ymm14, 0xCC
vblendps ymm5, ymm5, ymm7, 0xCC
vperm2f128 ymm6, ymm8, ymm5, 0x20
vmovaps 64[rdx], ymm6
vblendps ymm4, ymm7, ymm4, 0xCC
vperm2f128 ymm7, ymm15, ymm4, 0x20
vmovaps 96[rdx], ymm7
vperm2f128 ymm1, ymm10, ymm11, 0x31
vperm2f128 ymm0, ymm12, ymm13, 0x31
vmovaps 128[rdx], ymm1
vperm2f128 ymm5, ymm8, ymm5, 0x31
vperm2f128 ymm4, ymm15, ymm4, 0x31
vmovaps 160[rdx], ymm0
vmovaps 192[rdx], ymm5
vmovaps 224[rdx], ymm4
dec r10
jnz loop1
In Example 11-19, replacing VSHUFPS with VBLENDPS relieves port 5 pressure and can gain almost 40% speedup.
Assembly/Compiler Coding Rule 75. (M impact, M generality) Use blend instructions in lieu of shuffle instructions in AVX whenever possible.
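The substitution in Example 11-19 works because a VSHUFPS with immediate 0xE4 keeps every element in place, taking the two low elements of each 128-bit lane from the first source and the two high elements from the second, which is exactly what VBLENDPS with immediate 0xCC selects. The following scalar C sketch checks that equivalence; the function names model_vshufps, model_vblendps, and blend_matches_shuffle are hypothetical helpers written for illustration, not intrinsics.

```c
#include <stdint.h>

/* Model of 256-bit VSHUFPS dst, a, b, imm8: in each 128-bit lane the two low
 * result elements are picked from a and the two high ones from b, using the
 * four 2-bit fields of imm8 as element indices within the lane. */
void model_vshufps(float dst[8], const float a[8], const float b[8], uint8_t imm)
{
    float tmp[8];
    for (int lane = 0; lane < 8; lane += 4) {
        tmp[lane + 0] = a[lane + ((imm >> 0) & 3)];
        tmp[lane + 1] = a[lane + ((imm >> 2) & 3)];
        tmp[lane + 2] = b[lane + ((imm >> 4) & 3)];
        tmp[lane + 3] = b[lane + ((imm >> 6) & 3)];
    }
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];
}

/* Model of 256-bit VBLENDPS dst, a, b, imm8: bit i of imm8 selects b[i] when
 * set and a[i] when clear. No element changes position. */
void model_vblendps(float dst[8], const float a[8], const float b[8], uint8_t imm)
{
    for (int i = 0; i < 8; i++)
        dst[i] = ((imm >> i) & 1) ? b[i] : a[i];
}

/* Returns 1 when VSHUFPS 0xE4 and VBLENDPS 0xCC agree on the given inputs. */
int blend_matches_shuffle(const float a[8], const float b[8])
{
    float s[8], t[8];
    model_vshufps(s, a, b, 0xE4);
    model_vblendps(t, a, b, 0xCC);
    for (int i = 0; i < 8; i++)
        if (s[i] != t[i])
            return 0;
    return 1;
}
```

The 0x4E shuffles in Example 11-19 move elements across positions within a lane, so they have no blend equivalent and remain on port 5.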
11.11.2 Design Algorithm With Fewer Shuffles
In some cases you can reduce port 5 pressure by changing the algorithm to use fewer shuffles. The figure below shows that the transpose moved all the elements in rows 0-3 to the low lanes, and all the elements in rows 4-7 to the high lanes. Therefore, using 256-bit loads in the beginning of the algorithm requires using VPERM2F128 in order to swap elements between the lanes. The processor executes the VPERM2F128 instruction only on port 5. Example 11-19 used eight 256-bit loads and eight VPERM2F128 instructions. You can implement the same 8x8 matrix transpose using VINSERTF128 instead of the 256-bit loads and the eight VPERM2F128 instructions. VINSERTF128 from memory is executed in the load ports and on port 0 or 5. The original method required loads that are performed on the load ports and VPERM2F128 that is only performed on port 5. Therefore redesigning the algorithm to use VINSERTF128 reduces port 5 pressure and improves performance.
The following figure describes step 1 of the 8x8 matrix transpose with vinsertf128. Step 2 performs the same operations on different columns.
Example 11-20. 8x8 Matrix Transpose Using VINSERTF128
mov rcx, inpBuf
mov rdx, outBuf
mov r8, iLineSize
mov r10, NumOfLoops
loop1:
vmovaps xmm0, [rcx]
vinsertf128 ymm0, ymm0, [rcx + 128], 1
vmovaps xmm1, [rcx + 32]
vinsertf128 ymm1, ymm1, [rcx + 160], 1
vunpcklpd ymm8, ymm0, ymm1
vunpckhpd ymm9, ymm0, ymm1
vmovaps xmm2, [rcx+64]
vinsertf128 ymm2, ymm2, [rcx + 192], 1
vmovaps xmm3, [rcx+96]
vinsertf128 ymm3, ymm3, [rcx + 224], 1
vunpcklpd ymm10, ymm2, ymm3
vunpckhpd ymm11, ymm2, ymm3
vshufps ymm4, ymm8, ymm10, 0x88
vmovaps [rdx], ymm4
vshufps ymm5, ymm8, ymm10, 0xDD
vmovaps [rdx+32], ymm5
vshufps ymm6, ymm9, ymm11, 0x88
vmovaps [rdx+64], ymm6
vshufps ymm7, ymm9, ymm11, 0xDD
vmovaps [rdx+96], ymm7
vmovaps xmm0, [rcx+16]
vinsertf128 ymm0, ymm0, [rcx + 144], 1
vmovaps xmm1, [rcx + 48]
vinsertf128 ymm1, ymm1, [rcx + 176], 1
vunpcklpd ymm8, ymm0, ymm1
vunpckhpd ymm9, ymm0, ymm1
vmovaps xmm2, [rcx+80]
vinsertf128 ymm2, ymm2, [rcx + 208], 1
vmovaps xmm3, [rcx+112]
vinsertf128 ymm3, ymm3, [rcx + 240], 1
vunpcklpd ymm10, ymm2, ymm3
vunpckhpd ymm11, ymm2, ymm3
vshufps ymm4, ymm8, ymm10, 0x88
vmovaps [rdx+128], ymm4
vshufps ymm5, ymm8, ymm10, 0xDD
vmovaps [rdx+160], ymm5
vshufps ymm6, ymm9, ymm11, 0x88
vmovaps [rdx+192], ymm6
vshufps ymm7, ymm9, ymm11, 0xDD
vmovaps [rdx+224], ymm7
dec r10
jnz loop1
The approach in Example 11-20 reduces port 5 pressure further than the combination of VSHUFPS with VBLENDPS in Example 11-19. It can gain 70% speedup relative to relying on VSHUFPS alone in Example 11-19.
11.11.3 Perform Basic Shuffles on Load Ports
Some shuffles can be executed in the load ports (ports 2, 3) if the source is from memory. The following example shows how moving some shuffles (vmovsldup/vmovshdup) from port 5 to the load ports improves performance significantly. The following figure describes an Intel AVX implementation of the complex multiply algorithm with vmovsldup/vmovshdup on the load ports.
Example 11-21 includes two versions of the complex multiply. Both versions are unrolled twice. Alternative 1 shuffles all the data in registers. Alternative 2 shuffles data while it is loaded from memory.
Example 11-21. Port 5 versus Load Port Shuffles

Alternative 1: Shuffles data in registers
mov rax, inPtr1
mov rbx, inPtr2
mov rdx, outPtr
mov r8, len
xor rcx, rcx
loop1:
vmovaps ymm0, [rax +8*rcx]
vmovaps ymm4, [rax +8*rcx +32]
vmovaps ymm3, [rbx +8*rcx]
vmovsldup ymm2, ymm3
vmulps ymm2, ymm2, ymm0
vshufps ymm0, ymm0, ymm0, 177
vmovshdup ymm1, ymm3
vmulps ymm1, ymm1, ymm0
vmovaps ymm7, [rbx +8*rcx +32]
vmovsldup ymm6, ymm7
vmulps ymm6, ymm6, ymm4
vaddsubps ymm2, ymm2, ymm1
vmovshdup ymm5, ymm7
vmovaps [rdx+8*rcx], ymm2
vshufps ymm4, ymm4, ymm4, 177
vmulps ymm5, ymm5, ymm4
vaddsubps ymm6, ymm6, ymm5
vmovaps [rdx+8*rcx+32], ymm6
add rcx, 8
cmp rcx, r8
jl loop1

Alternative 2: Shuffling loaded data
mov rax, inPtr1
mov rbx, inPtr2
mov rdx, outPtr
mov r8, len
xor rcx, rcx
loop1:
vmovaps ymm0, [rax +8*rcx]
vmovaps ymm4, [rax +8*rcx +32]
vmovsldup ymm2, [rbx +8*rcx]
vmulps ymm2, ymm2, ymm0
vshufps ymm0, ymm0, ymm0, 177
vmovshdup ymm1, [rbx +8*rcx]
vmulps ymm1, ymm1, ymm0
vmovsldup ymm6, [rbx +8*rcx +32]
vmulps ymm6, ymm6, ymm4
vaddsubps ymm3, ymm2, ymm1
vmovshdup ymm5, [rbx +8*rcx +32]
vmovaps [rdx +8*rcx], ymm3
vshufps ymm4, ymm4, ymm4, 177
vmulps ymm5, ymm5, ymm4
vaddsubps ymm7, ymm6, ymm5
vmovaps [rdx +8*rcx +32], ymm7
add rcx, 8
cmp rcx, r8
jl loop1
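To make the data flow of Example 11-21 easier to follow, the per-element arithmetic can be written out in scalar C. This is an illustrative model only: the function name complex_mul_model is hypothetical, and it assumes the buffers hold complex numbers as interleaved (real, imaginary) float pairs, which is consistent with the duplicate-even, duplicate-odd, and add/subtract pattern of the assembly.

```c
#include <stddef.h>

/* out, a, b each hold n complex numbers as interleaved (re, im) floats.
 * Computes out = a * b using the same partial products as the SIMD code. */
void complex_mul_model(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];

        /* VMOVSLDUP duplicates the even (real) elements of b: (br, br).
         * VMULPS with a = (ar, ai) gives the first pair of products. */
        float lo0 = br * ar;
        float lo1 = br * ai;

        /* VSHUFPS with 177 (0xB1) swaps each pair of a: (ai, ar).
         * VMOVSHDUP duplicates the odd (imaginary) elements of b: (bi, bi).
         * VMULPS gives the second pair of products. */
        float hi0 = bi * ai;
        float hi1 = bi * ar;

        /* VADDSUBPS subtracts in even positions and adds in odd positions. */
        out[2 * i]     = lo0 - hi0;  /* real part:      ar*br - ai*bi */
        out[2 * i + 1] = lo1 + hi1;  /* imaginary part: ai*br + ar*bi */
    }
}
```

Only the two duplications of b differ between the alternatives: Alternative 1 performs them on registers, while Alternative 2 folds them into the loads.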
11.12 DIVIDE AND SQUARE ROOT OPERATIONS
In Intel microarchitectures prior to Skylake, the SSE divide and square root instructions DIVPS and SQRTPS have a latency of approximately 14 cycles and they are not pipelined. This means that the throughput of these instructions is one in every 14 cycles. The 256-bit Intel AVX instructions VDIVPS and VSQRTPS execute with a 128-bit data path, have a latency of 28 cycles, and are not pipelined either. Therefore, the performance of the Intel SSE divide and square root instructions is similar to that of the Intel AVX 256-bit instructions on Intel microarchitecture code name Sandy Bridge.
With the Skylake microarchitecture, the 256-bit and 128-bit versions of (V)DIVPS/(V)SQRTPS have the same latency because the 256-bit version can execute with a 256-bit data path. The latency is improved and the instructions are pipelined to execute with significantly improved throughput. See Appendix C, "IA-32 Instruction Latency and Throughput".
In microarchitectures that provide DIVPS/SQRTPS with high latency and low throughput, it is possible to speed up single-precision divide and square root calculations using the (V)RSQRTPS and (V)RCPPS instructions. For example, with 128-bit RCPPS/RSQRTPS at 5-cycle latency and 1-cycle throughput, or with the 256-bit implementation of these instructions at 7-cycle latency and 2-cycle throughput, a single Newton-Raphson iteration or Taylor approximation can achieve almost the same precision as the (V)DIVPS and (V)SQRTPS instructions. See Intel® 64 and IA-32 Architectures Software Developer's Manual for more information on these instructions.
In some cases, when the divide or square root operations are part of a larger algorithm that hides some of the latency of these operations, the approximation with Newton-Raphson can slow down execution, because more micro-ops, coming from the additional instructions, fill the pipe.
With the Skylake microarchitecture, choosing between the approximate reciprocal instruction alternative and DIVPS/SQRTPS for optimal performance of simple algebraic computations depends on a number of factors. Table 11-5 compares the throughput of implementations of several algebraic formulas at different numeric accuracy tolerances. In each row, the 24-bit accurate implementations are IEEE-compliant and use the respective instructions of the 128-bit or 256-bit ISA. The 22-bit and 11-bit accurate implementations use the approximate reciprocal instructions of the respective instruction set.
Table 11-5. Comparison of Numeric Alternatives of Selected Linear Algebra in Skylake Microarchitecture
Algorithm | Instruction Type | 24-bit Accurate | 22-bit Accurate | 11-bit Accurate
Z = X/Y | SSE | 1X | 0.9X | 1.3X
Z = X/Y | 256-bit AVX | 1X | 1.5X | 2.6X
Z = X^0.5 | SSE | 1X | 0.7X | 2X
Z = X^0.5 | 256-bit AVX | 1X | 1.4X | 3.4X
Z = X^-0.5 | SSE | 1X | 1.7X | 4.3X
Z = X^-0.5 | 256-bit AVX | 1X | 3X | 7.7X
Z = (X*Y + Y*Y)^0.5 | SSE | 1X | 0.75X | 0.85X
Z = (X*Y + Y*Y)^0.5 | 256-bit AVX | 1X | 1.1X | 1.6X
Z = (X+2Y+3)/(Z-2Y-3) | SSE | 1X | 0.85X | 1X
Z = (X+2Y+3)/(Z-2Y-3) | 256-bit AVX | 1X | 0.8X | 1X
If targeting processors based on the Skylake microarchitecture, Table 11-5 can be summarized as:
• For 256-bit AVX code, Newton-Raphson approximation can be beneficial on Skylake microarchitecture when the algorithm contains only operations executed on the divide unit. However, when single-precision divide or square root operations are part of a longer computation, the lower latency of the DIVPS or SQRTPS instructions can lead to better overall performance.
• For SSE or 128-bit AVX implementation, consider use of approximation for divide and square root instructions only for algorithms that do not require precision higher than 11-bit or algorithms that contain multiple operations executed on the divide unit.
Table 11-6 summarizes recommended calculation methods of divisions or square root when using single-precision instructions, based on the desired accuracy level across recent generations of Intel microarchitectures.
Table 11-6. Single-Precision Divide and Square Root Alternatives
Operation | Accuracy Tolerance | Recommendation
Divide | 24 bits (IEEE) | DIVPS
Divide | ~ 22 bits | Skylake: Consult Table 11-5. Prior uarch: RCPPS + 1 Newton-Raphson Iteration + MULPS
Divide | ~ 11 bits | RCPPS + MULPS
Reciprocal square root | 24 bits (IEEE) | SQRTPS + DIVPS
Reciprocal square root | ~ 22 bits | RSQRTPS + 1 Newton-Raphson Iteration
Reciprocal square root | ~ 11 bits | RSQRTPS
Square root | 24 bits (IEEE) | SQRTPS
Square root | ~ 22 bits | Skylake: Consult Table 11-5. Prior uarch: RSQRTPS + 1 Newton-Raphson Iteration + MULPS
Square root | ~ 11 bits | RSQRTPS + RCPPS
11.12.1 Single-Precision Divide
To compute Z[i] = A[i] / B[i] on a large vector of single-precision numbers, Z[i] can be calculated by a divide operation, or by multiplying 1/B[i] by A[i]. Denoting B[i] by N, it is possible to calculate 1/N using the (V)RCPPS instruction, achieving approximately 11-bit precision. For better accuracy, you can use one Newton-Raphson iteration:
$$X_0 \approx 1/N \quad \text{(initial estimate, rcp}(N)\text{)}$$
$$X_0 = \frac{1}{N}(1 - E), \qquad E = 1 - N X_0, \qquad E \approx 2^{-11}$$
$$X_1 = X_0 (1 + E) = \frac{1}{N}(1 - E^2), \qquad E^2 \approx 2^{-22}$$
$$X_1 = X_0 (1 + 1 - N X_0) = 2 X_0 - N X_0^2$$
X_1 is an approximation of 1/N with approximately 22-bit precision.
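The refinement step above can be checked with a small scalar model. This is only a sketch: `rcp_estimate` is a hypothetical stand-in for (V)RCPPS that truncates an exact reciprocal to 11 fraction bits, and the helper names are illustrative, not part of any Intel API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for (V)RCPPS: 1/n truncated to 11 fraction bits,
 * which leaves a relative error below 2^-11. */
static float rcp_estimate(float n)
{
    float x = 1.0f / n;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFF000u;               /* keep sign, exponent, top 11 fraction bits */
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* One Newton-Raphson step: X_1 = 2*X_0 - N*X_0^2. */
static float rcp_refine(float n, float x0)
{
    return 2.0f * x0 - n * x0 * x0;
}

/* |1 - n*x|: relative error of x as an approximation of 1/n. */
static double rcp_rel_error(float n, float x)
{
    double e = 1.0 - (double)n * (double)x;
    return e < 0.0 ? -e : e;
}
```

For N = 3 the estimate has E = 2^-12, and one step brings the error down to roughly E^2, which is what the derivation predicts.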
Example 11-22. Divide Using DIVPS for 24-bit Accuracy
SSE code using DIVPS:
    mov rax, pIn1
    mov rbx, pIn2
    mov rcx, pOut
    mov rsi, iLen
    xor rdx, rdx
loop1:
    movups xmm0, [rax+rdx*1]
    movups xmm1, [rbx+rdx*1]
    divps xmm0, xmm1
    movups [rcx+rdx*1], xmm0
    add rdx, 0x10
    cmp rdx, rsi
    jl loop1
Using VDIVPS:
    mov rax, pIn1
    mov rbx, pIn2
    mov rcx, pOut
    mov rsi, iLen
    xor rdx, rdx
loop1:
    vmovups ymm0, [rax+rdx*1]
    vmovups ymm1, [rbx+rdx*1]
    vdivps ymm0, ymm0, ymm1
    vmovups [rcx+rdx*1], ymm0
    add rdx, 0x20
    cmp rdx, rsi
    jl loop1
Example 11-23. Divide Using RCPPS 11-bit Approximation
SSE code using RCPPS:
    mov rax, pIn1
    mov rbx, pIn2
    mov rcx, pOut
    mov rsi, iLen
    xor rdx, rdx
loop1:
    movups xmm0, [rax+rdx*1]
    movups xmm1, [rbx+rdx*1]
    rcpps xmm1, xmm1
    mulps xmm0, xmm1
    movups [rcx+rdx*1], xmm0
    add rdx, 16
    cmp rdx, rsi
    jl loop1
Using VRCPPS:
    mov rax, pIn1
    mov rbx, pIn2
    mov rcx, pOut
    mov rsi, iLen
    xor rdx, rdx
loop1:
    vmovups ymm0, [rax+rdx]
    vmovups ymm1, [rbx+rdx]
    vrcpps ymm1, ymm1
    vmulps ymm0, ymm0, ymm1
    vmovups [rcx+rdx], ymm0
    add rdx, 32
    cmp rdx, rsi
    jl loop1
Example 11-24. Divide Using RCPPS and Newton-Raphson Iteration
RCPPS + MULPS ~ 22 bit accuracy:
    mov rax, pIn1
    mov rbx, pIn2
    mov rcx, pOut
    mov rsi, iLen
    xor rdx, rdx
loop1:
    movups xmm0, [rax+rdx*1]
    movups xmm1, [rbx+rdx*1]
    rcpps xmm3, xmm1            ; X0 ~= 1/N
    movaps xmm2, xmm3
    addps xmm3, xmm2            ; 2*X0
    mulps xmm2, xmm2            ; X0^2
    mulps xmm2, xmm1            ; N*X0^2
    subps xmm3, xmm2            ; X1 = 2*X0 - N*X0^2
    mulps xmm0, xmm3            ; A * X1
    movups xmmword ptr [rcx+rdx*1], xmm0
    add rdx, 0x10
    cmp rdx, rsi
    jl loop1
VRCPPS + VMULPS ~ 22 bit accuracy:
    mov rax, pIn1
    mov rbx, pIn2
    mov rcx, pOut
    mov rsi, iLen
    xor rdx, rdx
loop1:
    vmovups ymm0, [rax+rdx]
    vmovups ymm1, [rbx+rdx]
    vrcpps ymm3, ymm1           ; X0 ~= 1/N
    vaddps ymm2, ymm3, ymm3     ; 2*X0
    vmulps ymm3, ymm3, ymm3     ; X0^2
    vmulps ymm3, ymm3, ymm1     ; N*X0^2
    vsubps ymm2, ymm2, ymm3     ; X1 = 2*X0 - N*X0^2
    vmulps ymm0, ymm0, ymm2     ; A * X1
    vmovups [rcx+rdx], ymm0
    add rdx, 32
    cmp rdx, rsi
    jl loop1
Table 11-7. Comparison of Single-Precision Divide Alternatives
| Accuracy | Method | SSE Performance | AVX Performance |
|---|---|---|---|
| 24 bits | (V)DIVPS | Baseline | 1X |
| ~ 22 bits | (V)RCPPS + Newton-Raphson | 2.7X | 4.5X |
| ~ 11 bits | (V)RCPPS | 6X | 8X |
11.12.2 Single-Precision Reciprocal Square Root
To compute Z[i] = 1/(A[i])^0.5 on a large vector of single-precision numbers, denoting A[i] by N, it is possible to calculate (1/N)^0.5 using the (V)RSQRTPS instruction. For better accuracy, you can use one Newton-Raphson iteration:
$$X_0 \approx (1/N)^{0.5} \quad \text{(initial estimate, rsqrt}(N)\text{)}$$
$$E = 1 - N X_0^2$$
$$X_0 = (1/N)^{0.5} (1 - E)^{0.5} \approx (1/N)^{0.5} (1 - E/2), \qquad E/2 \approx 2^{-11}$$
$$X_1 = X_0 (1 + E/2) \approx (1/N)^{0.5} (1 - E^2/4), \qquad E^2/4 \approx 2^{-22}$$
$$X_1 = X_0 \left(1 + \tfrac{1}{2} - \tfrac{1}{2} N X_0^2\right) = \tfrac{1}{2} X_0 (3 - N X_0^2)$$
X_1 is an approximation of (1/N)^0.5 with approximately 22-bit precision.
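A scalar sketch of this refinement step follows. It assumes the caller supplies an estimate `x0` of (1/N)^0.5 with about 11 bits of accuracy (in SIMD code this would come from (V)RSQRTPS); the function names are illustrative.

```c
/* One Newton-Raphson step for the reciprocal square root:
 * X_1 = 1/2 * X_0 * (3 - N*X_0^2). */
static float rsqrt_refine(float n, float x0)
{
    return 0.5f * x0 * (3.0f - n * x0 * x0);
}

/* |1 - n*x^2|: zero when x is exactly 1/sqrt(n). */
static double rsqrt_residual(float n, float x)
{
    double e = 1.0 - (double)n * (double)x * (double)x;
    return e < 0.0 ? -e : e;
}
```

Starting from an estimate of 1/sqrt(4) that is off by one part in 2^11, the residual drops from about 1e-3 to below 1e-6 after a single step.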
Example 11-25. Reciprocal Square Root Using DIVPS+SQRTPS for 24-bit Accuracy
Using SQRTPS, DIVPS:
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
loop1:
    movups xmm1, [rax+rdx]
    sqrtps xmm0, xmm1
    divps xmm0, xmm1
    movups [rbx+rdx], xmm0
    add rdx, 16
    cmp rdx, rcx
    jl loop1
Using VSQRTPS, VDIVPS:
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
loop1:
    vmovups ymm1, [rax+rdx]
    vsqrtps ymm0, ymm1
    vdivps ymm0, ymm0, ymm1
    vmovups [rbx+rdx], ymm0
    add rdx, 32
    cmp rdx, rcx
    jl loop1
Example 11-26. Reciprocal Square Root Using RSQRTPS 11-bit Approximation
SSE code using RSQRTPS:
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
loop1:
    rsqrtps xmm0, [rax+rdx]
    movups [rbx+rdx], xmm0
    add rdx, 16
    cmp rdx, rcx
    jl loop1
Using VRSQRTPS:
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
loop1:
    vrsqrtps ymm0, [rax+rdx]
    vmovups [rbx+rdx], ymm0
    add rdx, 32
    cmp rdx, rcx
    jl loop1
Example 11-27. Reciprocal Square Root Using RSQRTPS and Newton-Raphson Iteration
RSQRTPS + MULPS ~ 22 bit accuracy:
__declspec(align(16)) float minus_half[4] = {-0.5, -0.5, -0.5, -0.5};
__declspec(align(16)) float three[4] = {3.0, 3.0, 3.0, 3.0};
__asm {
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
    movups xmm3, [three]
    movups xmm4, [minus_half]
loop1:
    movups xmm5, [rax+rdx]
    rsqrtps xmm0, xmm5          ; X0 ~= (1/N)^0.5
    movaps xmm2, xmm0
    mulps xmm0, xmm0            ; X0^2
    mulps xmm0, xmm5            ; N*X0^2
    subps xmm0, xmm3            ; N*X0^2 - 3
    mulps xmm0, xmm2            ; X0*(N*X0^2 - 3)
    mulps xmm0, xmm4            ; X1 = -0.5*X0*(N*X0^2 - 3)
    movups [rbx+rdx], xmm0
    add rdx, 16
    cmp rdx, rcx
    jl loop1
}
VRSQRTPS + VMULPS ~ 22 bit accuracy:
__declspec(align(32)) float half[8] = {0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5};
__declspec(align(32)) float three[8] = {3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0};
__asm {
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
    vmovups ymm3, [three]
    vmovups ymm4, [half]
loop1:
    vmovups ymm5, [rax+rdx]
    vrsqrtps ymm0, ymm5         ; X0 ~= (1/N)^0.5
    vmulps ymm2, ymm0, ymm0     ; X0^2
    vmulps ymm2, ymm2, ymm5     ; N*X0^2
    vsubps ymm2, ymm3, ymm2     ; 3 - N*X0^2
    vmulps ymm0, ymm0, ymm2     ; X0*(3 - N*X0^2)
    vmulps ymm0, ymm0, ymm4     ; X1 = 0.5*X0*(3 - N*X0^2)
    vmovups [rbx+rdx], ymm0
    add rdx, 32
    cmp rdx, rcx
    jl loop1
}
Table 11-8. Comparison of Single-Precision Reciprocal Square Root Operation
| Accuracy | Method | SSE Performance | AVX Performance |
|---|---|---|---|
| 24 bits | (V)SQRTPS + (V)DIVPS | Baseline | 1X |
| ~ 22 bits | (V)RSQRTPS + Newton-Raphson | 5.2X | 9.1X |
| ~ 11 bits | (V)RSQRTPS | 13.5X | 17.5X |
11.12.3 Single-Precision Square Root
To compute Z[i] = (A[i])^0.5 on a large vector of single-precision numbers, denoting A[i] by N, the approximation for N^0.5 is N multiplied by (1/N)^0.5, where the approximation for (1/N)^0.5 is described in the previous section. To get approximately 22-bit precision of N^0.5, use the following calculation:
$$N^{0.5} \approx X_1 N = \tfrac{1}{2} N X_0 (3 - N X_0^2)$$
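The formula can be modeled in scalar C as below. This is a sketch under two assumptions: `x0` is an externally supplied estimate of (1/N)^0.5 (the role RSQRTPS plays in the listings), and the estimate is forced to zero when N is zero, which is what the compare-and-mask instructions in the SIMD listings of this section do to keep 0 * infinity from producing NaN.

```c
/* Approximate sqrt(n) as 1/2 * n * x0 * (3 - n*x0^2), where x0 ~= 1/sqrt(n).
 * The estimate is masked to 0 for n == 0 so the result is 0 rather than NaN. */
static float sqrt_from_rsqrt(float n, float x0)
{
    float x = (n != 0.0f) ? x0 : 0.0f;
    return 0.5f * n * x * (3.0f - n * x * x);
}
```

With N = 4 and an estimate of 0.5 that is off by one part in 2^11, the result lands within about 1e-6 of 2.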
Example 11-28. Square Root Using SQRTPS for 24-bit Accuracy
Using SQRTPS:
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
loop1:
    movups xmm1, [rax+rdx]
    sqrtps xmm1, xmm1
    movups [rbx+rdx], xmm1
    add rdx, 16
    cmp rdx, rcx
    jl loop1
Using VSQRTPS:
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
loop1:
    vmovups ymm1, [rax+rdx]
    vsqrtps ymm1, ymm1
    vmovups [rbx+rdx], ymm1
    add rdx, 32
    cmp rdx, rcx
    jl loop1
Example 11-29. Square Root Using RSQRTPS and RCPPS 11-bit Approximation
SSE code using RSQRTPS + RCPPS:
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
loop1:
    movups xmm1, [rax+rdx]
    xorps xmm8, xmm8
    cmpneqps xmm8, xmm1         ; mask of non-zero inputs
    rsqrtps xmm1, xmm1
    rcpps xmm1, xmm1
    andps xmm1, xmm8            ; force result to 0 for zero inputs
    movups [rbx+rdx], xmm1
    add rdx, 16
    cmp rdx, rcx
    jl loop1
Using VRSQRTPS + VRCPPS:
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
    vxorps ymm8, ymm8, ymm8
loop1:
    vmovups ymm1, [rax+rdx]
    vcmpneqps ymm9, ymm8, ymm1  ; mask of non-zero inputs
    vrsqrtps ymm1, ymm1
    vrcpps ymm1, ymm1
    vandps ymm1, ymm1, ymm9     ; force result to 0 for zero inputs
    vmovups [rbx+rdx], ymm1
    add rdx, 32
    cmp rdx, rcx
    jl loop1
Example 11-30. Square Root Using RSQRTPS and One Taylor Series Expansion
RSQRTPS + Taylor ~ 22 bit accuracy:
__declspec(align(16)) float minus_half[4] = {-0.5, -0.5, -0.5, -0.5};
__declspec(align(16)) float three[4] = {3.0, 3.0, 3.0, 3.0};
__asm {
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
    movups xmm6, [three]
    movups xmm7, [minus_half]
loop1:
    movups xmm3, [rax+rdx]
    rsqrtps xmm1, xmm3          ; X0 ~= (1/N)^0.5
    xorps xmm8, xmm8
    cmpneqps xmm8, xmm3         ; mask of non-zero inputs
    andps xmm1, xmm8
    movaps xmm4, xmm1
    mulps xmm1, xmm3            ; N*X0
    movaps xmm5, xmm1
    mulps xmm1, xmm4            ; N*X0^2
    subps xmm1, xmm6            ; N*X0^2 - 3
    mulps xmm1, xmm5            ; N*X0*(N*X0^2 - 3)
    mulps xmm1, xmm7            ; -0.5*N*X0*(N*X0^2 - 3)
    movups [rbx+rdx], xmm1
    add rdx, 16
    cmp rdx, rcx
    jl loop1
}
VRSQRTPS + Taylor ~ 22 bit accuracy:
__declspec(align(32)) float three[8] = {3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0};
__declspec(align(32)) float minus_half[8] = {-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5};
__asm {
    mov rax, pIn
    mov rbx, pOut
    mov rcx, iLen
    xor rdx, rdx
    vmovups ymm6, [three]
    vmovups ymm7, [minus_half]
    vxorps ymm8, ymm8, ymm8
loop1:
    vmovups ymm3, [rax+rdx]
    vrsqrtps ymm4, ymm3         ; X0 ~= (1/N)^0.5
    vcmpneqps ymm9, ymm8, ymm3  ; mask of non-zero inputs
    vandps ymm4, ymm4, ymm9
    vmulps ymm1, ymm4, ymm3     ; N*X0
    vmulps ymm2, ymm1, ymm4     ; N*X0^2
    vsubps ymm2, ymm2, ymm6     ; N*X0^2 - 3
    vmulps ymm1, ymm1, ymm2     ; N*X0*(N*X0^2 - 3)
    vmulps ymm1, ymm1, ymm7     ; -0.5*N*X0*(N*X0^2 - 3)
    vmovups [rbx+rdx], ymm1
    add rdx, 32
    cmp rdx, rcx
    jl loop1
}
Table 11-9. Comparison of Single-Precision Square Root Operation
| Accuracy | Method | SSE Performance | AVX Performance |
|---|---|---|---|
| 24 bits | (V)SQRTPS | Baseline | 1X |
| ~ 22 bits | (V)RSQRTPS + Taylor-Expansion | 2.3X | 4.3X |
| ~ 11 bits | (V)RSQRTPS + (V)RCPPS | 4.7X | 5.9X |
11.13 OPTIMIZATION OF ARRAY SUB SUM EXAMPLE
This section shows the transformation of an SSE implementation of the Array Sub Sums algorithm to an Intel AVX implementation. The Array Sub Sums algorithm is:
Y[i] = Sum of k from 0 to i (X[k]) = X[0] + X[1] + .. + X[i]
The following figure describes the SSE implementation.
The figure below describes the Intel AVX implementation of the Array Sub Sums algorithm. PSLLDQ is an integer SIMD instruction which does not have a 256-bit equivalent. It is replaced by VSHUFPS.
Example 11-31. Array Sub Sums Algorithm
SSE code:
    mov rax, InBuff
    mov rbx, OutBuff
    mov rdx, len
    xor rcx, rcx
    xorps xmm0, xmm0
loop1:
    movaps xmm2, [rax+4*rcx]
    movaps xmm3, [rax+4*rcx]
    movaps xmm4, [rax+4*rcx]
    movaps xmm5, [rax+4*rcx]
    pslldq xmm3, 4
    pslldq xmm4, 8
    pslldq xmm5, 12
    addps xmm2, xmm3
    addps xmm4, xmm5
    addps xmm2, xmm4
    addps xmm2, xmm0
    movaps xmm0, xmm2
    shufps xmm0, xmm2, 0xFF
    movaps [rbx+4*rcx], xmm2
    add rcx, 4
    cmp rcx, rdx
    jl loop1
AVX code:
    mov rax, InBuff
    mov rbx, OutBuff
    mov rdx, len
    xor rcx, rcx
    vxorps ymm0, ymm0, ymm0
    vxorps ymm1, ymm1, ymm1
loop1:
    vmovaps ymm2, [rax+4*rcx]
    vshufps ymm4, ymm0, ymm2, 0x40
    vshufps ymm3, ymm4, ymm2, 0x99
    vshufps ymm5, ymm0, ymm4, 0x80
    vaddps ymm6, ymm2, ymm3
    vaddps ymm7, ymm4, ymm5
    vaddps ymm9, ymm6, ymm7
    vaddps ymm1, ymm9, ymm1
    vshufps ymm8, ymm9, ymm9, 0xff
    vperm2f128 ymm10, ymm8, ymm0, 0x2
    vaddps ymm12, ymm1, ymm10
    vshufps ymm11, ymm12, ymm12, 0xff
    vperm2f128 ymm1, ymm11, ymm11, 0x11
    vmovaps [rbx+4*rcx], ymm12
    add rcx, 8
    cmp rcx, rdx
    jl loop1
Example 11-31 shows the SSE implementation of Array Sub Sums alongside the AVX implementation. The AVX code is about 40% faster.
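As a reading aid for the SSE listing, the following scalar C sketch mirrors its structure: each group of four inputs is combined with three shifted copies of itself, and the last running sum of the previous group is carried in. The function name and the assumption that the length is a multiple of 4 are illustrative.

```c
#include <stddef.h>

/* Running sums: y[i] = x[0] + x[1] + ... + x[i].
 * Works on groups of 4 like the SSE code: lane k receives
 * (x[k] + x[k-1]) + (x[k-2] + x[k-3]) with out-of-group terms treated as 0,
 * then the carry from the previous group is added.
 * Assumes n is a multiple of 4. */
void array_sub_sums(const float *x, float *y, size_t n)
{
    float carry = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        for (size_t k = 0; k < 4; k++) {
            float a = x[i + k] + (k >= 1 ? x[i + k - 1] : 0.0f);
            float b = (k >= 2 ? x[i + k - 2] : 0.0f) +
                      (k >= 3 ? x[i + k - 3] : 0.0f);
            y[i + k] = (a + b) + carry;
        }
        carry = y[i + 3];   /* broadcast of the last lane in the SIMD code */
    }
}
```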
11.14 HALF-PRECISION FLOATING-POINT CONVERSIONS
In applications that use floating-point and require only the dynamic range and precision offered by the 16-bit floating-point format, storing persistent floating-point data encoded in 16 bits has strong advantages in memory footprint and bandwidth conservation. These situations are encountered in some graphics and imaging workloads.
The encoding format of half-precision floating-point numbers can be found in Chapter 4, “Data Types” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1. Instructions to convert between packed half-precision floating-point numbers and packed single-precision floating-point numbers are described in Chapter 14, “Programming with AVX, FMA and AVX2” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 1 and in the reference pages of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2B.
To perform computations on half-precision floating-point data, packed 16-bit FP data elements must be converted to single-precision format first, and the single-precision results converted back to half-precision format, if necessary. These conversions of 8 data elements using 256-bit instructions are very fast and handle the special cases of denormal numbers, infinity, zero and NaNs properly.
11.14.1 Packed Single-Precision to Half-Precision Conversion
To convert data in single-precision floating-point format to half-precision format without special hardware support like VCVTPS2PH, a programmer needs to do the following:
• Correct the exponent bias to the permitted range for each data element.
• Shift and round the significand of each data element.
• Copy the sign bit to bit 15 of each element.
• Take care of numbers outside the half-precision range.
• Pack each data element to a register of half size.
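The steps above can be sketched for a single element in scalar C. This is an illustrative model, not the manual's AVX-128 sequence: it assumes round-to-nearest-even (VCVTPS2PH takes the rounding mode as an operand) and maps every NaN to one quiet NaN encoding, and the function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Convert one IEEE single to IEEE half (binary16) bits, round-to-nearest-even. */
uint16_t float_to_half(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint16_t sign = (uint16_t)((x >> 16) & 0x8000u);  /* sign bit -> bit 15 */
    uint32_t mag = x & 0x7FFFFFFFu;

    /* Numbers outside the half-precision range. */
    if (mag >= 0x7F800000u)                            /* Inf or NaN */
        return (uint16_t)(sign | (mag > 0x7F800000u ? 0x7E00u : 0x7C00u));
    if (mag >= 0x47800000u)                            /* >= 65536: overflow */
        return (uint16_t)(sign | 0x7C00u);
    if (mag < 0x33000000u)                             /* < 2^-25: rounds to 0 */
        return sign;

    uint32_t h, rem, half;
    if (mag >= 0x38800000u) {                          /* normal half result */
        h = (mag - 0x38000000u) >> 13;                 /* rebias 127 -> 15, shift */
        rem = mag & 0x1FFFu;
        half = 0x1000u;
    } else {                                           /* subnormal half result */
        uint32_t mant = (mag & 0x007FFFFFu) | 0x00800000u;
        uint32_t shift = 126u - (mag >> 23);           /* 14..24 */
        h = mant >> shift;
        rem = mant & ((1u << shift) - 1u);
        half = 1u << (shift - 1u);
    }
    if (rem > half || (rem == half && (h & 1u)))       /* round to nearest even */
        h++;                                           /* a carry may yield Inf */
    return (uint16_t)(sign | h);
}
```

The amount of branching and bit manipulation per element gives a sense of why a single hardware conversion instruction is attractive.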
Example 11-32 compares two implementations of floating-point conversion from single precision to half precision. The code on the left uses packed integer shift instructions that are limited to the 128-bit SIMD instruction set. The code on the right is unrolled twice and uses the VCVTPS2PH instruction.
Example 11-32. Single-Precision to Half-Precision Conversion
AVX-128 code:
__asm {
    mov rax, pIn
    mov rbx, pOut
    mov rcx, bufferSize
    add rcx, rax
    vmovdqu xmm0, SignMask16
    vmovdqu xmm1, ExpBiasFixAndRound
    vmovdqu xmm4, SignMaskNot32
    vmovdqu xmm5, MaxConvertibleFloat
    vmovdqu xmm6, MinFloat
loop:
    vmovdqu xmm2, [rax]
    vmovdqu xmm3, [rax+16]
    vpaddd xmm7, xmm2, xmm1
    vpaddd xmm9, xmm3, xmm1
    vpand xmm7, xmm7, xmm4
    vpand xmm9, xmm9, xmm4
    add rax, 32
    vminps xmm7, xmm7, xmm5
    vminps xmm9, xmm9, xmm5
    vpcmpgtd xmm8, xmm7, xmm6
    vpcmpgtd xmm10, xmm9, xmm6
    vpand xmm7, xmm8, xmm7
    vpand xmm9, xmm10, xmm9
    vpackssdw xmm2, xmm3, xmm2
    vpsrad xmm7, xmm7, 13
    vpsrad xmm8, xmm9, 13
    vpand xmm2, xmm2, xmm0
    vpackssdw xmm3, xmm7, xmm9
    vpaddw xmm3, xmm3, xmm2
    vmovdqu [rbx], xmm3
    add rbx, 16
    cmp rax, rcx
    jl loop
}
VCVTPS2PH code:
__asm {
    mov rax, pIn
    mov rbx, pOut
    mov rcx, bufferSize
    add rcx, rax
loop:
    vmovups ymm0, [rax]
    vmovups ymm1, [rax+32]
    add rax, 64
    vcvtps2ph [rbx], ymm0, roundingCtrl
    vcvtps2ph [rbx+16], ymm1, roundingCtrl
    add rbx, 32
    cmp rax, rcx
    jl loop
}
The code using VCVTPS2PH is approximately four times faster than the AVX-128 sequence. Although it is possible to load 8 data elements at once with 256-bit AVX, most of the per-element conversion operations require packed integer instructions which do not have 256-bit extensions yet. Using VCVTPS2PH is not only faster but also provides handling of special cases that do not encode to normal half-precision floating-point values.
11.14.2 Packed Half-Precision to Single-Precision Conversion
Example 11-33 compares two implementations, one using AVX-128 code and one using VCVTPH2PS. Conversion from half-precision to single-precision floating-point format is easier to implement, yet the VCVTPH2PS instruction performs about 2.5 times faster than the alternative AVX-128 code.
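For reference, a scalar C sketch of the widening conversion is shown below. It is an illustrative model with a hypothetical function name, not the manual's AVX-128 sequence; every half-precision value is exactly representable in single precision, so no rounding is involved.

```c
#include <stdint.h>
#include <string.h>

/* Convert IEEE half (binary16) bits to an IEEE single. */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;     /* bit 15 -> bit 31 */
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0x1Fu) {                                /* Inf or NaN */
        bits = sign | 0x7F800000u | (mant << 13);
    } else if (exp != 0) {                             /* normal: rebias 15 -> 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    } else if (mant == 0) {                            /* signed zero */
        bits = sign;
    } else {                                           /* subnormal: normalize */
        uint32_t e = 113u;
        while ((mant & 0x400u) == 0) {
            mant <<= 1;
            e--;
        }
        bits = sign | (e << 23) | ((mant & 0x3FFu) << 13);
    }

    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```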
Example 11-33. Half-Precision to Single-Precision Conversion
AVX-128 code:
__asm {
    mov rax, pIn
    mov rbx, pOut
    mov rcx, bufferSize
    add rcx, rax
    vmovdqu xmm0, SignMask16
    vmovdqu xmm1, ExpBiasFix16
    vmovdqu xmm2, ExpMaskMarker
loop:
    vmovdqu xmm3, [rax]
    add rax, 16
    vpandn xmm4, xmm0, xmm3
    vpand xmm5, xmm3, xmm0
    vpsrlw xmm4, xmm4, 3
    vpaddw xmm6, xmm4, xmm1
    vpcmpgtw xmm7, xmm6, xmm2
    vpand xmm6, xmm6, xmm7
    vpand xmm8, xmm3, xmm7
    vpor xmm6, xmm6, xmm5
    vpsllw xmm8, xmm8, 13
    vpunpcklwd xmm3, xmm8, xmm6
    vpunpckhwd xmm4, xmm8, xmm6
    vmovdqu [rbx], xmm3
    vmovdqu [rbx+16], xmm4
    add rbx, 32
    cmp rax, rcx
    jl loop
}
VCVTPH2PS code:
__asm {
    mov rax, pIn
    mov rbx, pOut
    mov rcx, bufferSize
    add rcx, rax
loop:
    vcvtph2ps ymm0, [rax]
    vcvtph2ps ymm1, [rax+16]
    add rax, 32
    vmovups [rbx], ymm0
    vmovups [rbx+32], ymm1
    add rbx, 64
    cmp rax, rcx
    jl loop
}
11.14.3 Locality Consideration for using Half-Precision FP to Conserve Bandwidth
Example 11-32 and Example 11-33 demonstrate the performance advantage of using FP16C instructions when software needs to convert between half-precision and single-precision data. Half-precision FP format is more compact and consumes less bandwidth than single-precision FP format, but sacrifices dynamic range and precision, and incurs conversion overhead if arithmetic computation is required. Whether it is profitable for software to use half-precision data will be highly dependent on locality considerations of the workload.
This section uses an example based on the horizontal median filtering algorithm, “Median3”. The Median3 algorithm calculates the median of every three consecutive elements in a vector:
Y[i] = Median3( X[i], X[i+1], X[i+2] )
where Y is the output vector and X is the input vector.
Example 11-34 shows two implementations of the Median3 algorithm; one uses single-precision format without conversion, the other uses half-precision format and requires conversion. Alternative 1 on the left works with single-precision format using 256-bit load/store operations, each of which loads/stores eight 32-bit numbers. Alternative 2 uses 128-bit load/store operations to load/store eight 16-bit numbers in half-precision format and VCVTPH2PS/VCVTPS2PH instructions to convert them to/from single-precision floating-point format.
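A scalar C sketch of Median3 is given below as a reference for the SIMD listings. It uses the same min/max network as the vector code (two minimums and two maximums per output); the helper names are illustrative, and the input is assumed to hold two elements more than the output.

```c
#include <stddef.h>

static float min2(float a, float b) { return a < b ? a : b; }
static float max2(float a, float b) { return a > b ? a : b; }

/* y[i] = median(x[i], x[i+1], x[i+2]) for i in [0, n).
 * x must contain at least n + 2 elements. */
void median3(const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float lo = min2(x[i], x[i + 1]);
        float hi = max2(x[i], x[i + 1]);
        /* median = max(min(a,b), min(max(a,b), c)) */
        y[i] = max2(lo, min2(hi, x[i + 2]));
    }
}
```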
Example 11-34. Performance Comparison of Median3 using Half-Precision vs. Single-Precision
Single-Precision code w/o Conversion:
__asm {
    xor rbx, rbx
    mov rcx, len
    mov rdi, inPtr
    mov rsi, outPtr
    vmovaps ymm0, [rdi]
loop:
    add rdi, 32
    vmovaps ymm6, [rdi]
    vperm2f128 ymm1, ymm0, ymm6, 0x21
    vshufps ymm3, ymm0, ymm1, 0x4E
    vshufps ymm2, ymm0, ymm3, 0x99
    vminps ymm5, ymm0, ymm2
    vmaxps ymm0, ymm0, ymm2
    vminps ymm4, ymm0, ymm3
    vmaxps ymm7, ymm4, ymm5
    vmovaps ymm0, ymm6
    vmovaps [rsi], ymm7
    add rsi, 32
    add rbx, 8
    cmp rbx, rcx
    jl loop
}
Half-Precision code w/ Conversion:
__asm {
    xor rbx, rbx
    mov rcx, len
    mov rdi, inPtr
    mov rsi, outPtr
    vcvtph2ps ymm0, [rdi]
loop:
    add rdi, 16
    vcvtph2ps ymm6, [rdi]
    vperm2f128 ymm1, ymm0, ymm6, 0x21
    vshufps ymm3, ymm0, ymm1, 0x4E
    vshufps ymm2, ymm0, ymm3, 0x99
    vminps ymm5, ymm0, ymm2
    vmaxps ymm0, ymm0, ymm2
    vminps ymm4, ymm0, ymm3
    vmaxps ymm7, ymm4, ymm5
    vmovaps ymm0, ymm6
    vcvtps2ph [rsi], ymm7, roundingCtrl
    add rsi, 16
    add rbx, 8
    cmp rbx, rcx
    jl loop
}
When the locality of the working set resides in memory, using half-precision format with processors based on Intel microarchitecture code name Ivy Bridge is about 30% faster than single-precision format, despite the conversion overhead. When the locality resides in L3, using half-precision format is still ~15% faster. When the locality resides in L1, using single-precision format is faster because the cache bandwidth of the L1 data cache is much higher than the rest of the cache/memory hierarchy and the overhead of the conversion becomes a performance consideration.
11.15 FUSED MULTIPLY-ADD (FMA) INSTRUCTIONS GUIDELINES
FMA instructions perform vectored operations of “a * b + c” on IEEE-754-2008 floating-point values, where the multiplication operations “a * b” are performed with infinite precision, and the final results of the addition are rounded to produce the desired precision. Details of FMA rounding behavior and special case handling can be found in section 2.3 of Intel® Architecture Instruction Set Extensions Programming Reference.
FMA instructions can speed up and improve the accuracy of many FP calculations. Intel microarchitecture code name Haswell implements FMA instructions with execution units on port 0 and port 1 and 256-bit data paths. Dot product, matrix multiplication and polynomial evaluations are expected to benefit from the use of FMA, the 256-bit data path and the independent executions on two ports. The peak throughput of FMA from each processor core is 16 single-precision and 8 double-precision results each cycle.
Algorithms designed to use FMA instructions should take into consideration that a non-FMA sequence of MULPD/PS and ADDPD/PS will likely produce slightly different results compared to using FMA. For numerical computations involving a convergence criterion, the difference in the precision of intermediate results must be factored into the numeric formalism to avoid surprises in completion time due to rounding issues.
User/Source Coding Rule 33. Factor in precision and rounding characteristics of FMA instructions when replacing multiply/add operations executing non-FMA instructions.
FMA improves performance when an algorithm is execution-port throughput limited, like DGEMM. There may be situations where using FMA might not deliver better performance.
Consider the vectored operation of “a * b + c * d” where the data are ready at the same time:
In the three-instruction sequence of
VADDPS ( VMULPS (a,b) , VMULPS (c,d) );
the two VMULPS instructions can be dispatched in the same cycle and execute in parallel, leaving the latency of VADDPS (3 cycles) exposed. With unrolling, the exposure of VADDPS latency may be further amortized.
When using the two-instruction sequence of
VFMADD213PS ( c, d, VMULPS (a,b) );
the latency of FMA (5 cycles) is exposed for producing each vector result.
User/Source Coding Rule 34. Factor in result-dependency and the latency of FP add vs. FMA instructions when replacing FP add operations with FMA instructions.
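The rounding difference that User/Source Coding Rule 33 refers to can be reproduced with a small scalar experiment. The sketch below is an assumption-laden model rather than real FMA hardware: `fused_model` emulates a single final rounding by forming the product in double precision (where the product of two floats is exact), and the function names are illustrative.

```c
/* Multiply, round to float, then add: two roundings (MULPS + ADDPS model). */
float mul_then_add(float a, float b, float c)
{
    volatile float p = a * b;   /* volatile forces the product to be rounded to float */
    return p + c;
}

/* Model of a fused multiply-add: the float-by-float product is exact in
 * double (24 + 24 significand bits fit in 53), so for the inputs used here
 * only the final conversion to float rounds. In general the double addition
 * can itself round, so this is an illustration, not an exact FMA. */
float fused_model(float a, float b, float c)
{
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}
```

With a = b = 1 + 2^-12 and c = -(1 + 2^-11), the exact product is 1 + 2^-11 + 2^-24. Rounding the product to single precision discards the 2^-24 term, so the two-step sequence returns 0, while the fused model returns 2^-24. An iteration that tests such a residual against a tolerance can therefore terminate at a different step depending on which sequence is used.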
11.15.1 Optimizing Throughput with FMA and Floating-Point Add/MUL
In the Skylake microarchitecture, there are two pipes of execution supporting FMA, vector FP Multiply, and FP ADD instructions. All three categories of instructions have a latency of 4 cycles and can dispatch to either port 0 or port 1 to execute every cycle.
The arrangement of identical latency and number of pipes allows software to increase the performance of situations where floating-point calculations are limited by the floating-point add operations that follow FP multiplies. Consider a situation of vector operation A(n) = C1 + C2 * A(n-1):
Example 11-35. FP Mul/FP Add Versus FMA
FP Mul/FP Add Sequence:
    mov eax, NumOfIterations
    mov rbx, pA
    mov rcx, pC1
    mov rdx, pC2
    vmovups ymm0, ymmword ptr [rbx]     // A
    vmovups ymm1, ymmword ptr [rcx]     // C1
    vmovups ymm2, ymmword ptr [rdx]     // C2
loop:
    vmulps ymm4, ymm0, ymm2             // A * C2
    vaddps ymm0, ymm1, ymm4             // C1 + A * C2
    dec eax
    jnz loop
    vmovups ymmword ptr [rbx], ymm0     // store An
Cost per iteration: ~ fp mul latency + fp add latency
FMA Sequence:
    mov eax, NumOfIterations
    mov rbx, pA
    mov rcx, pC1
    mov rdx, pC2
    vmovups ymm0, ymmword ptr [rbx]     // A
    vmovups ymm1, ymmword ptr [rcx]     // C1
    vmovups ymm2, ymmword ptr [rdx]     // C2
loop:
    vfmadd132ps ymm0, ymm1, ymm2        // C1 + A * C2
    dec eax
    jnz loop
    vmovups ymmword ptr [rbx], ymm0     // store An
Cost per iteration: ~ fma latency
The overall throughput of the code sequence on the LHS is limited by the combined latency of the FP MUL and FP ADD instructions of the specific microarchitecture. The overall throughput of the code sequence on the RHS is limited by the latency of the FMA instruction of the corresponding microarchitecture, because each iteration depends on the result of the previous one.
A common situation where the latency of the FP ADD operation dominates performance is the following C code:
for ( int i = 0; i < arrLength; i++) result += arrToSum[i];
Example 11-36 shows two implementations, with and without unrolling.
Example 11-36. Unrolling to Hide Dependent FP Add Latency
No Unroll:
    mov eax, arrLength
    mov rbx, arrToSum
    vmovups ymm0, ymmword ptr [rbx]
    sub eax, 8
loop:
    add rbx, 32
    vaddps ymm0, ymm0, ymmword ptr [rbx]
    sub eax, 8
    jnz loop
    vextractf128 xmm1, ymm0, 1
    vaddps xmm0, xmm0, xmm1
    vpermilps xmm1, xmm0, 0xe
    vaddps xmm0, xmm0, xmm1
    vpermilps xmm1, xmm0, 0x1
    vaddss xmm0, xmm0, xmm1
    vmovss result, xmm0
Unroll 8 times:
    mov eax, arrLength
    mov rbx, arrToSum
    vmovups ymm0, ymmword ptr [rbx]
    vmovups ymm1, ymmword ptr 32[rbx]
    vmovups ymm2, ymmword ptr 64[rbx]
    vmovups ymm3, ymmword ptr 96[rbx]
    vmovups ymm4, ymmword ptr 128[rbx]
    vmovups ymm5, ymmword ptr 160[rbx]
    vmovups ymm6, ymmword ptr 192[rbx]
    vmovups ymm7, ymmword ptr 224[rbx]
    sub eax, 64
loop:
    add rbx, 256
    vaddps ymm0, ymm0, ymmword ptr [rbx]
    vaddps ymm1, ymm1, ymmword ptr 32[rbx]
    vaddps ymm2, ymm2, ymmword ptr 64[rbx]
    vaddps ymm3, ymm3, ymmword ptr 96[rbx]
    vaddps ymm4, ymm4, ymmword ptr 128[rbx]
    vaddps ymm5, ymm5, ymmword ptr 160[rbx]
    vaddps ymm6, ymm6, ymmword ptr 192[rbx]
    vaddps ymm7, ymm7, ymmword ptr 224[rbx]
    sub eax, 64
    jnz loop
    vaddps ymm0, ymm0, ymm1
    vaddps ymm2, ymm2, ymm3
    vaddps ymm4, ymm4, ymm5
    vaddps ymm6, ymm6, ymm7
    vaddps ymm0, ymm0, ymm2
    vaddps ymm4, ymm4, ymm6
    vaddps ymm0, ymm0, ymm4
    vextractf128 xmm1, ymm0, 1
    vaddps xmm0, xmm0, xmm1
    vpermilps xmm1, xmm0, 0xe
    vaddps xmm0, xmm0, xmm1
    vpermilps xmm1, xmm0, 0x1
    vaddss xmm0, xmm0, xmm1
    vmovss result, xmm0
Without unrolling (LHS of Example 11-36), the cost of summing every 8 array elements is about proportional to the latency of the FP ADD instruction, assuming the working set fits in L1. To use unrolling effectively, the number of unrolled operations should be at least “latency of the critical operation” * “number of pipes”. The performance gain of optimized unrolling versus no unrolling, for a given microarchitecture, can approach “number of pipes” * “latency of FP ADD”.
User/Source Coding Rule 35. Consider using the unrolling technique for loops containing back-to-back dependent FMA, FP Add or Vector MUL operations. The unrolling factor can be chosen by considering the latency of the critical instruction of the dependency chain and the number of pipes available to execute that instruction.
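The same idea can be expressed in scalar C. The sketch below contrasts a single dependency chain with eight independent accumulators; eight matches the rule of thumb above for the Skylake figures quoted in this section (4-cycle latency * 2 pipes). The function names and the assumption that the length is a multiple of 8 are illustrative.

```c
#include <stddef.h>

/* One accumulator: every add waits for the previous add to finish. */
float sum_serial(const float *a, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Eight independent accumulators, combined pairwise after the loop.
 * Assumes n is a multiple of 8. */
float sum_unrolled8(const float *a, size_t n)
{
    float acc[8] = {0.0f};
    for (size_t i = 0; i < n; i += 8)
        for (int k = 0; k < 8; k++)
            acc[k] += a[i + k];
    return ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
           ((acc[4] + acc[5]) + (acc[6] + acc[7]));
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits for general inputs; they agree exactly when every partial sum is representable.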
11.15.2 Optimizing Throughput with Vector Shifts
In the Skylake microarchitecture, many common vector shift instructions can dispatch into either port 0 or port 1, compared to only one port in prior generations; see Table 2-2 and Table 2-7.
A common situation where vector shift throughput affects performance is the following C code, where a, b, and c are integer arrays:
for ( int i = 0; i < len; i++) c[i] += 4 * a[i] + b[i]/2;
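In that loop, the multiplication by 4 and the division by 2 are the operations a compiler or programmer would typically map to shifts. A scalar sketch follows; it assumes unsigned 32-bit elements so that `b >> 1` is exactly `b / 2` (for signed values the two differ for negative odd inputs), and the function name is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* c[i] += 4*a[i] + b[i]/2, written with shifts.
 * Assumption: unsigned elements, so the right shift equals division by 2. */
void scale_add_shift(const uint32_t *a, const uint32_t *b, uint32_t *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] += (a[i] << 2) + (b[i] >> 1);
}
```

Each element needs two shifts and two adds, so in a vectorized version the number of ports that can execute vector shifts directly bounds throughput.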
11.16 AVX2 OPTIMIZATION GUIDELINES
AVX2 instructions promote the great majority of 128-bit SIMD integer instructions to operate on 256-bit YMM registers. AVX2 also adds a rich mix of broadcast/permute/variable-shift instructions to accelerate numerical computations. The 256-bit AVX2 instructions are supported by the Intel microarchitecture Haswell, which implements a 256-bit data path with low latency and high throughput.
Consider an intra-coding 4x4 block image transformation¹ shown in Figure 11-3. A 128-bit SIMD implementation can perform this transformation by the following technique:
• Convert 8-bit pixels into 16-bit word elements and fetch two 4x4 image blocks as 4 row vectors.
• The matrix operation 1/128 * (B x R) can be evaluated with row vectors of the image block and column vectors of the right-hand-side coefficient matrix using a sequence of SIMD instructions of PMADDWD, PHADDD, packed shift and blend instructions.
• The two 4x4 word-granular intermediate results can be re-arranged into column vectors.
• The left-hand-side coefficient matrix in row vectors and the column vectors of the intermediate block can be calculated (using PMADDWD, PHADDD, shift, blend) and written out.
1. C. Yeo, Y. H. Tan, Z. Li and S. Rahardja, “Mode-Dependent Fast Separable KLT for Block-based Intra Coding,” JCTVC-B024, Geneva, Switzerland, Jul 2010
$$\frac{1}{128}\, L \times B \times \frac{1}{128}\, R, \qquad L = \begin{bmatrix} 29 & 55 & 74 & 84 \\ 74 & 74 & 0 & -74 \\ 84 & -29 & -74 & 55 \\ 55 & -84 & 74 & -29 \end{bmatrix}, \qquad R = \begin{bmatrix} 64 & 64 & 64 & 64 \\ 84 & 35 & -35 & -84 \\ 64 & -64 & -64 & 64 \\ 35 & -84 & 84 & -35 \end{bmatrix}$$
where B is the 4x4 image block.
Figure 11-3. 4x4 Image Block Transformation
The same technique can be implemented using AVX2 instructions in a straightforward manner. The AVX2 sequence is illustrated in Example 11-38 and Example 11-39.
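Before reading the AVX2 macros, it may help to have a scalar reference for the arithmetic they vectorize. The sketch below is not taken from the manual: it assumes the coefficient matrices of Figure 11-3 (with R's columns equal to the `cst_rmc0..3` vectors of Example 11-39), inputs that are 8-bit pixel values widened to 16 bits, an arithmetic shift right by 7 for each 1/128 scaling, and an upper cap at 32767 as in the macros. The names `klt4x4_ref` and `scale_and_cap` are hypothetical.

```c
#include <stdint.h>

static const int16_t KLT_L[4][4] = {
    {29,  55,  74,  84},
    {74,  74,   0, -74},
    {84, -29, -74,  55},
    {55, -84,  74, -29}
};
static const int16_t KLT_R[4][4] = {
    {64,  64,  64,  64},
    {84,  35, -35, -84},
    {64, -64, -64,  64},
    {35, -84,  84, -35}
};

/* Floor division by 128 (what an arithmetic right shift by 7 computes),
 * followed by the upper saturation at 32767. */
static int32_t scale_and_cap(int32_t v)
{
    int32_t q = (v >= 0) ? (v >> 7) : -((-v + 127) >> 7);
    return q > 32767 ? 32767 : q;
}

/* out = 1/128 * (L x (1/128 * (B x R))) for one 4x4 block.
 * b and out are row-major arrays of 16 elements. */
void klt4x4_ref(const int16_t *b, int16_t *out)
{
    int32_t t[4][4];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += (int32_t)b[4 * i + k] * KLT_R[k][j];
            t[i][j] = scale_and_cap(acc);
        }
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += (int32_t)KLT_L[i][k] * t[k][j];
            out[4 * i + j] = (int16_t)scale_and_cap(acc);
        }
}
```

A reference like this is mainly useful for validating the element ordering of the SIMD version, since the macros produce intermediate rows in a permuted order that is undone later.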
Example 11-38. Macros for Separable KLT Intra-block Transformation Using AVX2
// b0: input row vector from 4 consecutive 4x4 image block of word pixels
// rmc0-3: columnar vector coefficient of the RHS matrix, repeated 4X for 256-bit
// min32km1: saturation constant vector to cap intermediate pixel to less than or equal to 32767
// w0: output row vector of garbled intermediate matrix, elements within each block are garbled
// e.g Low 128-bit of row 0 in descending order: y07, y05, y06, y04, y03, y01, y02, y00
#define __MyM_KIP_PxRMC_ROW_4x4Wx4(b0, w0, rmc0_256, rmc1_256, rmc2_256, rmc3_256, min32km1)\
{__m256i tt0, tt1, tt2, tt3;\
    tt0 = _mm256_madd_epi16(b0, (rmc0_256));\
    tt0 = _mm256_hadd_epi32(tt0, tt0);\
    tt1 = _mm256_madd_epi16(b0, rmc1_256);\
    tt1 = _mm256_blend_epi16(tt0, _mm256_hadd_epi32(tt1, tt1), 0xf0);\
    tt1 = _mm256_min_epi32(_mm256_srai_epi32(tt1, 7), min32km1);\
    tt1 = _mm256_shuffle_epi32(tt1, 0xd8);\
    tt2 = _mm256_madd_epi16(b0, rmc2_256);\
    tt2 = _mm256_hadd_epi32(tt2, tt2);\
    tt3 = _mm256_madd_epi16(b0, rmc3_256);\
    tt3 = _mm256_blend_epi16(tt2, _mm256_hadd_epi32(tt3, tt3), 0xf0);\
    tt3 = _mm256_min_epi32(_mm256_srai_epi32(tt3, 7), min32km1);\
    tt3 = _mm256_shuffle_epi32(tt3, 0xd8);\
    w0 = _mm256_blend_epi16(tt1, _mm256_slli_si256(tt3, 2), 0xaa);\
}

// t0-t3: 256-bit input vectors of un-garbled intermediate matrix 1/128 * (B x R)
// lmr_256: 256-bit vector of one row of LHS coefficient, repeated 4X
// min32km1: saturation constant vector to cap final pixel to less than or equal to 32767
// w0: output row vector of final result in un-garbled order
#define __MyM_KIP_LMRxP_ROW_4x4Wx4(w0, t0, t1, t2, t3, lmr_256, min32km1)\
{__m256i tb0, tb1, tb2, tb3;\
    tb0 = _mm256_madd_epi16(lmr_256, t0);\
    tb0 = _mm256_hadd_epi32(tb0, tb0);\
    tb1 = _mm256_madd_epi16(lmr_256, t1);\
    tb1 = _mm256_blend_epi16(tb0, _mm256_hadd_epi32(tb1, tb1), 0xf0);\
    tb1 = _mm256_min_epi32(_mm256_srai_epi32(tb1, 7), min32km1);\
    tb1 = _mm256_shuffle_epi32(tb1, 0xd8);\
    tb2 = _mm256_madd_epi16(lmr_256, t2);\
    tb2 = _mm256_hadd_epi32(tb2, tb2);\
    tb3 = _mm256_madd_epi16(lmr_256, t3);\
    tb3 = _mm256_blend_epi16(tb2, _mm256_hadd_epi32(tb3, tb3), 0xf0);\
    tb3 = _mm256_min_epi32(_mm256_srai_epi32(tb3, 7), min32km1);\
    tb3 = _mm256_shuffle_epi32(tb3, 0xd8);\
    tb3 = _mm256_slli_si256(tb3, 2);\
    tb3 = _mm256_blend_epi16(tb1, tb3, 0xaa);\
    w0 = _mm256_shuffle_epi8(tb3, _mm256_setr_epi32(0x5040100, 0x7060302, 0xd0c0908, 0xf0e0b0a, 0x5040100, 0x7060302, 0xd0c0908, 0xf0e0b0a));\
}
In Example 11-39, the matrix multiplication 1/128 * (B x R) is evaluated first, four blocks at a time, by fetching rows from 4 consecutive 4x4 image blocks of word pixels. The first macro shown in Example 11-38 produces an output vector in which the two middle elements of each 4x4 block's intermediate row are swapped (garbled). In Example 11-39, undoing the garbled elements and transposing the intermediate row vectors into column vectors are implemented using blend primitives instead of shuffle/unpack primitives.
In Intel microarchitecture code name Haswell, shuffle/pack/unpack primitives rely on the shuffle execution unit, and their micro-ops are dispatched to port 5. In heavy SIMD sequences, port 5 pressure may become a determining factor in performance. If 128-bit SIMD code faces port 5 pressure when running on Haswell, porting the 128-bit code to 256-bit AVX2 can improve performance and alleviate port 5 pressure.
Example 11-39. Separable KLT Intra-block Transformation Using AVX2
short __declspec(align(16)) cst_rmc0[8] = {64, 84, 64, 35, 64, 84, 64, 35};
short __declspec(align(16)) cst_rmc1[8] = {64, 35, -64, -84, 64, 35, -64, -84};
short __declspec(align(16)) cst_rmc2[8] = {64, -35, -64, 84, 64, -35, -64, 84};
short __declspec(align(16)) cst_rmc3[8] = {64, -84, 64, -35, 64, -84, 64, -35};
short __declspec(align(16)) cst_lmr0[8] = {29, 55, 74, 84, 29, 55, 74, 84};
short __declspec(align(16)) cst_lmr1[8] = {74, 74, 0, -74, 74, 74, 0, -74};
short __declspec(align(16)) cst_lmr2[8] = {84, -29, -74, 55, 84, -29, -74, 55};
short __declspec(align(16)) cst_lmr3[8] = {55, -84, 74, -29, 55, -84, 74, -29};

void Klt_256_d(short * Input, short * Output, int iWidth, int iHeight)
{int iX, iY;
  __m256i rmc0 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *) &cst_rmc0[0]));
  __m256i rmc1 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc1[0]));
  __m256i rmc2 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc2[0]));
  __m256i rmc3 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_rmc3[0]));
  __m256i lmr0 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr0[0]));
  __m256i lmr1 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr1[0]));
  __m256i lmr2 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr2[0]));
  __m256i lmr3 = _mm256_broadcastsi128_si256( _mm_loadu_si128((__m128i *)&cst_lmr3[0]));
  __m256i min32km1 = _mm256_broadcastsi128_si256( _mm_setr_epi32( 0x7fff7fff, 0x7fff7fff, 0x7fff7fff, 0x7fff7fff));
  __m256i b0, b1, b2, b3, t0, t1, t2, t3;
  __m256i w0, w1, w2, w3;
  short* pImage = Input;
  short* pOutImage = Output;
  int hgt = iHeight, wid= iWidth;
  // We implement 1/128 * (Mat_L x (1/128 * (Mat_B x Mat_R))) from the inner most parenthesis
  for( iY = 0; iY < hgt; iY+=4) {
    for( iX = 0; iX < wid; iX+=16) {
      // load row 0 of 4 consecutive 4x4 matrix of word pixels
      b0 = _mm256_loadu_si256( (__m256i *) (pImage + iY*wid+ iX)) ;
      // multiply row 0 with columnar vectors of the RHS matrix coefficients
      __MyM_KIP_PxRMC_ROW_4x4Wx4(b0, w0, rmc0, rmc1, rmc2, rmc3, min32km1);
      // low 128-bit of garbled row 0, from hi->lo: y07, y05, y06, y04, y03, y01, y02, y00
      b1 = _mm256_loadu_si256( (__m256i *) (pImage + (iY+1)*wid+ iX) );
      __MyM_KIP_PxRMC_ROW_4x4Wx4(b1, w1, rmc0, rmc1, rmc2, rmc3, min32km1);
      // hi->lo y17, y15, y16, y14, y13, y11, y12, y10
      b2 = _mm256_loadu_si256( (__m256i *) (pImage + (iY+2)*wid+ iX) );
      __MyM_KIP_PxRMC_ROW_4x4Wx4(b2, w2, rmc0, rmc1, rmc2, rmc3, min32km1);
      b3 = _mm256_loadu_si256( (__m256i *) (pImage + (iY+3)*wid+ iX) );
      __MyM_KIP_PxRMC_ROW_4x4Wx4(b3, w3, rmc0, rmc1, rmc2, rmc3, min32km1);
      // unscramble garbled middle 2 elements of each 4x4 block, then
      // transpose into columnar vectors: t0 has column 0 of 4 consecutive 4x4 intermediate blocks
      t0 = _mm256_blend_epi16( w0, _mm256_slli_epi64(w1, 16), 0x22);
      t0 = _mm256_blend_epi16( t0, _mm256_slli_epi64(w2, 32), 0x44);
      t0 = _mm256_blend_epi16( t0, _mm256_slli_epi64(w3, 48), 0x88);
      t1 = _mm256_blend_epi16( _mm256_srli_epi64(w0, 32), _mm256_srli_epi64(w1, 16), 0x22);
      t1 = _mm256_blend_epi16( t1, w2, 0x44);
      t1 = _mm256_blend_epi16( t1, _mm256_slli_epi64(w3, 16), 0x88); // column 1
      t2 = _mm256_blend_epi16( _mm256_srli_epi64(w0, 16), w1, 0x22);
      t2 = _mm256_blend_epi16( t2, _mm256_slli_epi64(w2, 16), 0x44);
      t2 = _mm256_blend_epi16( t2, _mm256_slli_epi64(w3, 32), 0x88); // column 2
      t3 = _mm256_blend_epi16( _mm256_srli_epi64(w0, 48), _mm256_srli_epi64(w1, 32), 0x22);
      t3 = _mm256_blend_epi16( t3, _mm256_srli_epi64(w2, 16), 0x44);
      t3 = _mm256_blend_epi16( t3, w3, 0x88); // column 3
      // multiply row 0 of the LHS coefficient with 4 columnar vectors of intermediate blocks
      // final output rows are arranged in normal order
      __MyM_KIP_LMRxP_ROW_4x4Wx4(w0, t0, t1, t2, t3, lmr0, min32km1);
      _mm256_store_si256( (__m256i *) (pOutImage+iY*wid+ iX), w0) ;
      __MyM_KIP_LMRxP_ROW_4x4Wx4(w1, t0, t1, t2, t3, lmr1, min32km1);
      _mm256_store_si256( (__m256i *) (pOutImage+(iY+1)*wid+ iX), w1) ;
      __MyM_KIP_LMRxP_ROW_4x4Wx4(w2, t0, t1, t2, t3, lmr2, min32km1);
      _mm256_store_si256( (__m256i *) (pOutImage+(iY+2)*wid+ iX), w2) ;
      __MyM_KIP_LMRxP_ROW_4x4Wx4(w3, t0, t1, t2, t3, lmr3, min32km1);
      _mm256_store_si256( (__m256i *) (pOutImage+(iY+3)*wid+ iX), w3) ;
    }
  }
}

Although a 128-bit SIMD implementation is not shown here, it can be easily derived. When the 128-bit SIMD code of this KLT intra-coding transformation runs on Intel microarchitecture code name Sandy Bridge, port 5 pressure is lower because there are two shuffle units, and the effective throughput for each 4x4 image block transformation is around 50 cycles. Its speed-up relative to an optimized scalar implementation is about 2.5X.
When the 128-bit SIMD code runs on Haswell, micro-ops issued to port 5 account for slightly less than 50% of all micro-ops, compared to about one third on the prior microarchitecture, resulting in about a 25% performance regression. The AVX2 implementation, on the other hand, can deliver an effective throughput of less than 35 cycles per 4x4 block.
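For checking the vectorized code, it can help to have a scalar reference for a single 4x4 block. The following is a minimal sketch, not taken from the manual: it evaluates 1/128 * (L x (1/128 * (B x R))) with the coefficients of Example 11-39 and the cap at 32767 described in the comments of Example 11-38. The names `klt4x4_scalar`, `scale_and_cap`, `R_COL` and `L_ROW` are hypothetical, and the block layout (16 row-major words) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Coefficients copied from Example 11-39: R_COL[j] is column j of the
   right-hand-side matrix, L_ROW[i] is row i of the left-hand-side matrix. */
static const int16_t R_COL[4][4] = {
    {64, 84, 64, 35}, {64, 35, -64, -84}, {64, -35, -64, 84}, {64, -84, 64, -35}};
static const int16_t L_ROW[4][4] = {
    {29, 55, 74, 84}, {74, 74, 0, -74}, {84, -29, -74, 55}, {55, -84, 74, -29}};

/* Scale by 1/128 and cap at 32767. Assumes an arithmetic right shift for
   negative values and inputs small enough that no lower cap is needed. */
static int16_t scale_and_cap(int32_t v)
{
    int32_t s = v >> 7;
    return (int16_t)(s > 32767 ? 32767 : s);
}

/* b and y each hold one 4x4 block of words in row-major order. */
void klt4x4_scalar(const int16_t *b, int16_t *y)
{
    int16_t t[16];
    assert(b != 0 && y != 0);
    for (int i = 0; i < 4; i++)          /* T = 1/128 * (B x R) */
        for (int j = 0; j < 4; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += b[4 * i + k] * R_COL[j][k];
            t[4 * i + j] = scale_and_cap(acc);
        }
    for (int i = 0; i < 4; i++)          /* Y = 1/128 * (L x T) */
        for (int j = 0; j < 4; j++) {
            int32_t acc = 0;
            for (int k = 0; k < 4; k++)
                acc += L_ROW[i][k] * t[4 * k + j];
            y[4 * i + j] = scale_and_cap(acc);
        }
}
```

Feeding the same block to this routine and to one lane group of the AVX2 loop gives a simple way to compare outputs element by element.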
11.16.1 Multi-Buffering and AVX2
There are many compute-intensive algorithms (e.g. hashing, encryption, etc.) which operate on a stream of data buffers. Very often, the data stream may be partitioned and treated as multiple independent buffer streams to leverage SIMD instruction sets. Detailed treatment of hashing several buffers in parallel can be found at http://www.scirp.org/journal/PaperInformation.aspx?paperID=23995 and at http://eprint.iacr.org/2012/476.pdf.
AVX2 provides a full complement of 256-bit SIMD instructions with rich functionality at multiple width granularities for logical and arithmetic operations. Algorithms that had leveraged XMM registers and prior generations of SSE instruction sets can extend those multi-buffering algorithms to use AVX2 on YMM registers and deliver even higher throughput. An optimized 256-bit AVX2 implementation may deliver up to 1.9X throughput when compared to 128-bit versions.
The image block transformation example discussed in Section 11.16 can also be construed as a multi-buffering implementation of 4x4 blocks. When the performance baseline is switched from a two-shuffle-port microarchitecture (Sandy Bridge) to a single-shuffle-port microarchitecture, the 256-bit wide AVX2 provides a speed-up of 1.9X relative to the 128-bit SIMD implementation.
Greater detail on multi-buffering can be found in the white paper at: https://wwwssl.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html.
11.16.2 Modular Multiplication and AVX2
Modular multiplication of very large integers is often used to implement efficient modular exponentiation operations, which are critical in public key cryptography such as RSA 2048. Library implementations of modular multiplication are often done with MUL/ADC chain sequences. Typically, a MUL instruction can produce a 128-bit intermediate integer output, and add-carry chains must be used at 64-bit intermediate data granularity.
In AVX2, VPMULUDQ/VPADDQ/VPSRLQ/VPSLLQ/VPBROADCASTQ/VPERMQ allow a vectorized approach to implementing efficient modular multiplication/exponentiation for key lengths corresponding to RSA1024 and RSA2048. For details of modular exponentiation/multiplication and the AVX2 implementation in OpenSSL, see http://rd.springer.com/chapter/10.1007%2F978-3-642-31662-3_9?LI=true.
The basic heuristic starts with reformulating the large integer input operands of 512/1024-bit exponentiation in redundant representations. For example, a 1024-bit integer can be represented using base 2^29 and 36 "digits", where each "digit" is less than 2^29. A digit in such a redundant representation can be placed in a dword slot of a vector register. Such a redundant representation of large integers simplifies the requirement to perform carry-add chains across the hardware granularity of the intermediate results of unsigned integer multiplications.
Each VPMULUDQ in AVX2 using the digits from a redundant representation can produce 4 separate 64-bit intermediate results with sufficient headroom (e.g. the 5 most significant bits are 0, excluding the sign bit). VPADDQ is then sufficient to implement the add-carry chain requirement without needing SIMD equivalents of ADC-like instructions.
More details are available in the reference cited in the paragraph above, including the cost of conversion to the redundant representation and the effective speedup accounting for the parallel output bandwidth of the VPMULUDQ/VPADDQ chain.
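The redundant representation described above can be illustrated with a scalar sketch. The code below is an assumption-laden illustration, not the OpenSSL implementation: `to_base_2_29` splits a little-endian array of 32-bit limbs into 29-bit digits, and `mul_columns` accumulates digit products per column with plain 64-bit additions, which is the role VPMULUDQ and VPADDQ play four lanes at a time. All function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGIT_BITS 29
#define DIGIT_MASK ((1u << DIGIT_BITS) - 1)

/* Split a little-endian array of 32-bit limbs into base-2^29 digits.
   A 1024-bit value (32 limbs) needs 36 digits, since 36 * 29 = 1044. */
void to_base_2_29(const uint32_t *limbs, size_t nlimbs,
                  uint32_t *digits, size_t ndigits)
{
    uint64_t acc = 0;   /* bit buffer, never holds more than 60 bits */
    unsigned bits = 0;  /* number of valid bits currently in acc */
    size_t li = 0;
    for (size_t d = 0; d < ndigits; d++) {
        while (bits < DIGIT_BITS && li < nlimbs) {
            acc |= (uint64_t)limbs[li++] << bits;
            bits += 32;
        }
        digits[d] = (uint32_t)(acc & DIGIT_MASK);
        acc >>= DIGIT_BITS;
        bits = bits >= DIGIT_BITS ? bits - DIGIT_BITS : 0;
    }
}

/* Schoolbook product in redundant form: col[i + j] accumulates a[i] * b[j].
   Each product is below 2^58, so up to 64 of them can be summed in a
   64-bit column without any carry handling between columns. */
void mul_columns(const uint32_t *a, const uint32_t *b, size_t n, uint64_t *col)
{
    assert(n <= 64);
    memset(col, 0, 2 * n * sizeof(col[0]));
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            col[i + j] += (uint64_t)a[i] * b[j];
}
```

Carries between columns are propagated only once, after all products have been accumulated, which is why no ADC-like step is needed inside the multiply-accumulate loop.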
11.16.3 Data Movement Considerations
Intel microarchitecture code name Haswell can support up to two 256-bit load and one 256-bit store micro-ops dispatched each cycle. Most existing binaries with heavy data-movement operations can benefit from this enhancement and the higher bandwidths of the L1 data cache and L2 without re-compilation, if the binary is already optimized for the prior generation microarchitecture. For example, 256-bit SAXPY computation was limited by the number of load/store ports available in the prior generation microarchitecture; it will benefit immediately on Intel microarchitecture code name Haswell.
In some situations, there may be intricate interactions between microarchitectural restrictions and the instruction set that are worth some discussion. We consider two commonly used library functions, memcpy() and memset(), and the optimal choice for implementing them on the new microarchitecture.
With memcpy() on Intel microarchitecture code name Haswell, using REP MOVSB to implement the memcpy operation for large copy lengths can take advantage of the 256-bit store data path and deliver throughput of more than 20 bytes per cycle. For copy lengths smaller than a few hundred bytes, the REP MOVSB approach is slower than the 128-bit SIMD technique described in Section 11.16.3.1.
11.16.3.1 SIMD Heuristics to implement Memcpy()
We start with a discussion of the general heuristic for implementing memcpy() with 128-bit SIMD instructions. It revolves around three numeric factors (destination address alignment, source address alignment, bytes to copy) relative to the register width of the desired instruction set. The data movement work of memcpy can be separated into the following phases:
• An initial unaligned copy of 16 bytes allows the looping destination address pointer to become 16-byte aligned, so that subsequent store operations can use as many 16-byte aligned stores as possible.
• The remaining bytes-left-to-copy are decomposed into (a) multiples of unrolled 16-byte copy operations, plus (b) a residual count that may include some copy operations of less than 16 bytes. For example, when unrolling eight times to amortize loop iteration overhead, the residual count must handle individual cases from 1 to 8x16-1 = 127.
• Inside an 8X16 unrolled main loop, each 16-byte copy operation may need to deal with a source pointer address that is not aligned to a 16-byte boundary and store 16 bytes of fresh data to a 16B-aligned destination address. When the iterating source pointer is not 16B-aligned, the most efficient technique is a three-instruction sequence:
  - Fetch a 16-byte chunk from a 16-byte-aligned adjusted pointer address and use a portion of this chunk with the complementary portion from the previous 16-byte-aligned fetch.
  - Use PALIGNR to stitch a portion of the current chunk with the previous chunk.
  - Store the stitched 16 bytes of fresh data to the aligned destination address, and repeat this 3-instruction sequence.
  This 3-instruction technique allows the fetch:store instruction ratio for each 16-byte copy operation to remain at 1:1.
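The phase decomposition above can be sketched as a small planning routine. This is an illustrative sketch only: the structure `copy_plan` and the function `plan_copy` are hypothetical names, the 8x16-byte unroll factor follows the example in the text, and the choice to send copies shorter than 16 bytes straight to the residual handler is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of how a copy of cnt bytes is split up. */
struct copy_plan {
    size_t head_advance; /* bytes consumed by the initial unaligned 16-byte copy */
    size_t main_iters;   /* iterations of the 8 x 16-byte unrolled main loop */
    size_t residual;     /* leftover bytes, 0..127, handled case by case */
};

struct copy_plan plan_copy(uintptr_t dst, size_t cnt)
{
    struct copy_plan p = {0, 0, 0};
    if (cnt < 16) {          /* too short for the 16-byte head copy */
        p.residual = cnt;
        return p;
    }
    /* One unaligned 16-byte copy is issued; the pointers then advance by
       1..16 bytes so that the destination becomes 16-byte aligned. */
    p.head_advance = (size_t)(16 - (dst & 15));
    size_t left = cnt - p.head_advance;
    p.main_iters = left / 128;
    p.residual = left % 128;
    assert(p.head_advance + 128 * p.main_iters + p.residual == cnt);
    return p;
}
```

For a destination address ending in 3 and a count of 1000, for example, the plan advances 13 bytes, runs the main loop 7 times, and leaves 91 residual bytes.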
While the above technique (specifically, the main loop dealing with copying thousands of bytes of data) can achieve throughput of approximately 10 bytes per cycle on Intel microarchitecture code name Sandy Bridge and Ivy Bridge with a 128-bit data path for store operations, an attempt to extend this technique to a wider data path will run into the following restrictions:
• To use 256-bit VPALIGNR with its 2X128-bit lane microarchitecture, stitching two partial chunks of the current 256-bit, 32-byte-aligned fetch requires another 256-bit fetch from an address offset by 16 bytes from the current 32-byte-aligned 256-bit fetch.
  - The fetch:store ratio for each 32-byte copy operation becomes 2:1.
  - The 32-byte-unaligned fetch (although aligned to a 16-byte boundary) will experience a cache-line split penalty once every 64 bytes of copy operation.
The net gain of this attempt to use the 256-bit ISA to take advantage of the 256-bit store data path was offset by the 4-instruction sequence and the cache-line split penalty.
11.16.3.2 Memcpy() Implementation Using Enhanced REP MOVSB
It is interesting to compare the alternate approach of using enhanced REP MOVSB to implement memcpy(). In Intel microarchitecture code name Haswell and Ivy Bridge, REP MOVSB is an optimized, hardware-provided micro-op flow.
On Intel microarchitecture code name Ivy Bridge, a REP MOVSB implementation of memcpy can achieve throughput slightly better than the 128-bit SIMD implementation when copying thousands of bytes. However, if the size of the copy operation is less than a few hundred bytes, the REP MOVSB approach is less efficient than the explicit residual copy technique described in phase 2 of Section 11.16.3.1. This is because handling a residual copy length of 1-127 bytes (via a jump table or switch/case, done before the main loop) plus one or two 8x16B iterations incurs less branching overhead than the hardware-provided micro-op flows. For the grueling details of a 128-bit SIMD implementation of memcpy(), one can consult the archived sources of open source libraries such as GLibC.
On Intel microarchitecture code name Haswell, using REP MOVSB to implement the memcpy operation for large copy lengths can take advantage of the 256-bit store data path and deliver throughput of more than 20 bytes per cycle. For copy lengths smaller than a few hundred bytes, the REP MOVSB approach is still slower than treating the copy length as the residual phase of Section 11.16.3.1.
11.16.3.3 Memset() Implementation Considerations
The interface of memset() has one address pointer as destination, which simplifies managing address alignment scenarios for 256-bit aligned store instructions. After an initial unaligned store, and after adjusting the destination pointer to be 32-byte aligned, the residual phase follows the same considerations as described in Section 11.16.3.1. It may employ a large jump table to handle each residual value scenario with minimal branching, depending on the amount of unrolled 32B-aligned stores. The main loop is a simple YMM register to 32-byte-aligned store operation, which can deliver close to 30 bytes per cycle for lengths of more than a thousand bytes. The limiting factor is that each 256-bit VMOVDQA store consists of a store_address and a store_data micro-op flow, and only port 4 is available to dispatch the store_data micro-op each cycle.
Using REP STOSB to implement memset() has a code size advantage over a SIMD implementation, like REP MOVSB for memcpy(). On Intel microarchitecture code name Haswell, a memset() routine implemented using REP STOSB will also benefit from the 256-bit data path and increased L1 data cache bandwidth, delivering up to 32 bytes per cycle for large count values.
Comparing the performance of memset() implementations using REP STOSB vs. 256-bit AVX2 requires one to consider the pattern of invocation of memset(). The invocation pattern can lead to the necessity of using different performance measurement techniques, and there may be side effects affecting the outcome of each measurement technique.
The most common measurement technique used with a simple routine like memset() is to execute memset() inside a loop with a large iteration count and wrap the loop with RDTSC invocations before and after. A slight variation of this measurement technique can apply to measuring memset() invocation patterns of multiple back-to-back calls to memset() with different count values and no other intervening instruction streams executed between the calls.
In both of the above memset() invocation scenarios, branch prediction can play a significant role in affecting the measured total cycles for executing the loop. Thus, measuring an AVX2-implemented memset() under a large loop to minimize RDTSC overhead can produce a skewed result, with the branch predictor being trained by the large loop iteration count.
In more realistic software stacks, the invocation patterns of memset() will likely have the following characteristics:
• There are intervening instruction streams being executed between invocations of memset(); the state of the branch predictor prior to a memset() invocation is not pre-trained for the branching sequence inside a memset() implementation.
• Memset() count values are likely to be uncorrelated.
The proper measurement technique for comparing memset() performance in more realistic invocation scenarios requires a per-invocation technique that wraps two RDTSC around each invocation of memset(). With the per-invocation RDTSC measurement technique, the overhead of RDTSC can be pre-calibrated and post-validated outside of a measurement loop. The per-invocation technique may also account for cache warming effects by using a loop to wrap around the per-invocation measurements.
When the relevant skew factors of the measurement techniques are taken into account, the performance of memset() using REP STOSB, for count values smaller than a few hundred bytes, is generally faster than the AVX2 version for the common memset() invocation scenarios. Only in the extreme scenario of hundreds of unrolled memset() calls, all using count values of less than a few hundred bytes and with no intervening instruction stream between each pair of memset() calls, can the AVX2 version of memset() take advantage of the training effect of the branch predictor.
11.16.3.4 Hoisting Memcpy/Memset Ahead of Consuming Code
There may be situations where the data furnished by a call to memcpy/memset and the subsequent instructions consuming the data can be re-arranged:
memcpy ( pBuf, pSrc, Cnt); // make a copy of some data with knowledge of Cnt
..... // subsequent instruction sequences are not consuming pBuf immediately
result = compute( pBuf); // memcpy result consumed here
When the count is known to be at least a thousand bytes, using enhanced REP MOVSB/STOSB can provide another advantage: it amortizes the cost of the non-consuming code. The heuristic can be understood using a value of Cnt = 4096 and memset() as an example:
• A 256-bit SIMD implementation of memset() will need to issue/execute/retire 128 instances of a 32-byte store operation with VMOVDQA before the non-consuming instruction sequences can make their way to retirement.
• An instance of enhanced REP STOSB with ECX= 4096 is decoded as a long micro-op flow provided by hardware, but retires as one instruction. There are many store_data operations that must complete before the result of memset() can be consumed. Because the completion of store data operations is de-coupled from program-order retirement, a substantial part of the non-consuming code stream can pass through issue/execute and retirement, essentially cost-free, if the non-consuming sequence does not compete for store buffer resources.
Software that uses enhanced REP MOVSB/STOSB must check its availability by verifying that CPUID.(EAX=07H, ECX=0):EBX.ERMSB (bit 9) reports 1.
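A minimal sketch of this availability check is shown below. It assumes a GCC- or Clang-compatible compiler that provides `<cpuid.h>` and its `__get_cpuid_count` helper; the function names `ebx_reports_erms` and `cpu_has_erms` are hypothetical. On other compilers or architectures the sketch conservatively reports that the feature is absent.

```c
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#endif

/* Test bit 9 (ERMSB) of the EBX value returned by CPUID leaf 07H, sub-leaf 0. */
int ebx_reports_erms(uint32_t ebx)
{
    return (int)((ebx >> 9) & 1u);
}

/* Returns 1 when enhanced REP MOVSB/STOSB is reported, 0 otherwise. */
int cpu_has_erms(void)
{
#ifdef HAVE_CPUID_H
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    /* __get_cpuid_count returns 0 when leaf 7 is not supported. */
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return ebx_reports_erms(ebx);
#else
    return 0; /* assumption: treat unknown platforms as lacking the feature */
#endif
}
```

A library would typically call such a check once at start-up and select the REP MOVSB/STOSB path or the SIMD path accordingly.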
11.16.3.5 256-bit Fetch versus Two 128-bit Fetches
On Intel microarchitecture code name Sandy Bridge and Ivy Bridge, using two 16-byte aligned loads is preferred due to the 128-bit data path limitation in the memory pipeline of the microarchitecture.
To take advantage of the 256-bit data path of Intel microarchitecture code name Haswell, the use of 256-bit loads must consider the alignment implications. Instructions that fetch 256-bit data from memory should ensure the address is 32-byte aligned. If a 32-byte unaligned fetch would span a cache line boundary, it is still preferable to fetch the data from two 16-byte aligned addresses instead.
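The alignment rule above can be expressed as a simple address test. The following sketch assumes a 64-byte cache line; the helper names `load32_crosses_cache_line` and `prefer_two_128bit_loads` are hypothetical and only illustrate the decision, not any particular library's dispatch logic.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumption: 64-byte cache lines */

/* A 32-byte load starting at addr covers bytes addr .. addr + 31. It spans
   two cache lines when its offset within the line exceeds 64 - 32 = 32. */
int load32_crosses_cache_line(uintptr_t addr)
{
    return (addr & (CACHE_LINE_BYTES - 1)) > (CACHE_LINE_BYTES - 32u);
}

/* Prefer two 128-bit loads only for 32-byte-unaligned addresses whose
   256-bit load would straddle a cache line boundary. */
int prefer_two_128bit_loads(uintptr_t addr)
{
    return (addr & 31u) != 0 && load32_crosses_cache_line(addr);
}
```

For a 16-byte-aligned pointer this means that an offset of 16 within the line still allows a single 256-bit load, while an offset of 48 calls for two 128-bit loads.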
11.16.3.6 Mixing MULX and AVX2 Instructions
Combining MULX and AVX2 instructions can further improve the performance of some common computation tasks. For example, numeric conversion of a 64-bit integer to ASCII format can benefit from the flexibility of MULX register allocation, the wider YMM registers, and the variable packed shift primitive VPSRLVD for parallel moduli/remainder calculations.
Example 11-40 shows a macro sequence of AVX2 instructions that converts one or two finite-range unsigned short integers into their respective decimal digits, featuring VPSRLVD in conjunction with the Montgomery reduction technique.
Example 11-40. Macros for Parallel Moduli/Remainder Calculation
static short quoTenThsn_mulplr_d[16] =
  { 0x199a, 0, 0x28f6, 0, 0x20c5, 0, 0x1a37, 0, 0x199a, 0, 0x28f6, 0, 0x20c5, 0, 0x1a37, 0};
static short mten_mulplr_d[16] =
  { -10, 1, -10, 1, -10, 1, -10, 1, -10, 1, -10, 1, -10, 1, -10, 1};
// macro to convert input t5 (a __m256i type) containing quotient (dword 4) and remainder
// (dword 0) into single-digit integer (between 0-9) in output y3 (a __m256i);
// both dword elements of "t5" are assumed to be less than 10^4, the rest of the dwords must be 0;
// the output is 8 single-digit integers, located in the low byte of each dword, MS digit in dword 0
#define __ParMod10to4AVX2dw4_0( y3, t5 ) \
{ __m256i x0, x2; \
  x0 = _mm256_shuffle_epi32( t5, 0); \
  x2 = _mm256_mulhi_epu16(x0, _mm256_loadu_si256( (__m256i *) quoTenThsn_mulplr_d));\
  x2 = _mm256_srlv_epi32( x2, _mm256_setr_epi32(0x0, 0x4, 0x7, 0xa, 0x0, 0x4, 0x7, 0xa) ); \
  (y3) = _mm256_or_si256(_mm256_slli_si256(x2, 6), _mm256_slli_si256(t5, 2) ); \
  (y3) = _mm256_or_si256(x2, y3);\
  (y3) = _mm256_madd_epi16(y3, _mm256_loadu_si256( (__m256i *) mten_mulplr_d) ) ;\
}

// parallel conversion of dword integer (< 10^4) to 4 single digit integer in __m128i
#define __ParMod10to4AVX2dw( x3, dw32 ) \
{ __m128i x0, x2; \
  x0 = _mm_broadcastd_epi32( _mm_cvtsi32_si128( dw32)); \
  x2 = _mm_mulhi_epu16(x0, _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d));\
  x2 = _mm_srlv_epi32( x2, _mm_setr_epi32(0x0, 0x4, 0x7, 0xa) ); \
  (x3) = _mm_or_si128(_mm_slli_si128(x2, 6), _mm_slli_si128(_mm_cvtsi32_si128( dw32), 2) ); \
  (x3) = _mm_or_si128(x2, (x3));\
  (x3) = _mm_madd_epi16((x3), _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;\
}

Example 11-41 shows a helper utility and the overall steps to reduce a 64-bit signed integer into the 63-bit unsigned range, producing reduced-range integer quotient/remainder pairs using MULX.
Example 11-41. Signed 64-bit Integer Conversion Utility
#define QWCG10to8 0xabcc77118461cefdull
static short quo4digComp_mulplr_d[8] = { 1024, 0, 64, 0, 8, 0, 0, 0};
static int pr_cg_10to4[8] = { 0x68db8db, 0 , 0, 0, 0x68db8db, 0, 0, 0};
static int pr_1_m10to4[8] = { -10000, 0 , 0, 0 , 1, 0 , 0, 0};

char * i64toa_avx2i( __int64 xx, char * p)
{int cnt;
  _mm256_zeroupper();
  if( xx < 0) cnt = avx2i_q2a_u63b(-xx, p);
  else cnt = avx2i_q2a_u63b(xx, p);
  p[cnt] = 0;
  return p;
}
// Convert unsigned short (< 10^4) to ascii
__inline int ubsAvx2_Lt10k_2s_i2(int x_Lt10k, char *ps)
{int tmp;
  __m128i x0, m0, x2, x3, x4, compv;
  if( x_Lt10k < 10) {
    *ps = '0' + x_Lt10k;
    return 1;
  }
  x0 = _mm_broadcastd_epi32( _mm_cvtsi32_si128( x_Lt10k));
  // calculate quotients of divisors 10, 100, 1000, 10000
  m0 = _mm_loadu_si128( (__m128i *) quoTenThsn_mulplr_d);
  x2 = _mm_mulhi_epu16(x0, m0);
  // u16/10, u16/100, u16/1000, u16/10000
  x2 = _mm_srlv_epi32( x2, _mm_setr_epi32(0x0, 0x4, 0x7, 0xa) );
  // 0, u16, 0, u16/10, 0, u16/100, 0, u16/1000
  x3 = _mm_insert_epi16(_mm_slli_si128(x2, 6), (int) x_Lt10k, 1);
  x4 = _mm_or_si128(x2, x3);
  // produce 4 single digits in low byte of each dword
  x4 = _mm_madd_epi16(x4, _mm_loadu_si128( (__m128i *) mten_mulplr_d) ) ;
  // add bias for ascii encoding
  x2 = _mm_add_epi32( x4, _mm_set1_epi32( 0x30303030 ) );
  // pack 4 single digit into a dword, start with most significant digit
  x3 = _mm_shuffle_epi8(x2, _mm_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
  if (x_Lt10k > 999 ) {
    *(int *) ps = _mm_cvtsi128_si32( x3);
    return 4;
  }
  else {
    tmp = _mm_cvtsi128_si32( x3);
    if (x_Lt10k > 99 ) {
      *((short *) (ps)) = (short ) (tmp >>8);
      ps[2] = (char ) (tmp >>24);
      return 3;
    }
    else if ( x_Lt10k > 9){
      *((short *) ps) = (short ) tmp;
      return 2;
    }
  }
}
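The quotient and digit extraction performed by the vector code above can be checked against a scalar model. The sketch below is an illustration only: it reuses the reciprocal multipliers and shift counts from Example 11-40, models the multiply-high step as a 16-bit right shift of the full product, and models the multiply-add with (-10, 1) as a subtraction. The names `digits_lt10k` and `count_mismatches` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reciprocal multipliers and post-shifts from Example 11-40, for the
   divisors 10, 100, 1000 and 10000. */
static const uint16_t RECIP[4] = {0x199a, 0x28f6, 0x20c5, 0x1a37};
static const unsigned SHIFT[4] = {0, 4, 7, 10};

/* digit[0] is the ones digit and digit[3] the thousands digit of x < 10^4. */
void digits_lt10k(uint32_t x, uint8_t digit[4])
{
    assert(x < 10000);
    uint32_t q_prev = x;
    for (int k = 0; k < 4; k++) {
        /* high 16 bits of the 16x16-bit product, then a per-lane shift */
        uint32_t q = ((x * RECIP[k]) >> 16) >> SHIFT[k];
        /* models the (-10, 1) multiply-add: q_prev * 1 + q * (-10) */
        digit[k] = (uint8_t)(q_prev - 10 * q);
        q_prev = q;
    }
}

/* Exhaustively compare against ordinary division over the whole input range. */
int count_mismatches(void)
{
    int bad = 0;
    for (uint32_t x = 0; x < 10000; x++) {
        uint8_t d[4];
        digits_lt10k(x, d);
        if (d[0] != x % 10 || d[1] != (x / 10) % 10 ||
            d[2] != (x / 100) % 10 || d[3] != (x / 1000) % 10)
            bad++;
    }
    return bad;
}
```

Because the input range is only 10^4 values, the exhaustive comparison is cheap and confirms that the fixed-point reciprocals never round a quotient incorrectly within that range.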
Example 11-42 shows the steps of numeric conversion of a 63-bit dynamic range into ASCII format according to a progressive range reduction technique using a vectorized Montgomery reduction scheme.
Example 11-42. Unsigned 63-bit Integer Conversion Utility
unsigned avx2i_q2a_u63b (unsigned __int64 xx, char *ps)
{ __m128i v0;
  __m256i m0, x1, x2, x3, x4, x5 ;
  unsigned __int64 xxi, xx2, lo64, hi64;
  __int64 w;
  int j, cnt, abv16, tmp, idx, u;
  // conversion of less than 4 digits
  if ( xx < 10000 ) {
    j = ubsAvx2_Lt10k_2s_i2 ( (unsigned ) xx, ps);
    return j;
  } else if (xx < 100000000 ) { // dynamic range of xx is less than 9 digits
    // conversion of 5-8 digits
    x1 = _mm256_broadcastd_epi32( _mm_cvtsi32_si128(xx)); // broadcast to every dword
    // calculate quotient and remainder, each with reduced range (< 10^4)
    x3 = _mm256_mul_epu32(x1, _mm256_loadu_si256( (__m256i *) pr_cg_10to4 ));
    x3 = _mm256_mullo_epi32(_mm256_srli_epi64(x3, 40), _mm256_loadu_si256( (__m256i *)pr_1_m10to4));
    // quotient in dw4, remainder in dw0
    m0 = _mm256_add_epi32( _mm256_castsi128_si256( _mm_cvtsi32_si128(xx)), x3);
    __ParMod10to4AVX2dw4_0( x3, m0); // 8 digit in low byte of each dw
    x3 = _mm256_add_epi32( x3, _mm256_set1_epi32( 0x30303030 ) );
    x4 = _mm256_shuffle_epi8(x3, _mm256_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080, 0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
    // pack 8 single-digit integer into first 8 bytes and set rest to zeros
    x4 = _mm256_permutevar8x32_epi32( x4, _mm256_setr_epi32(0x4, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1, 0x1) );
    tmp = _mm256_movemask_epi8( _mm256_cmpgt_epi8(x4, _mm256_set1_epi32( 0x30303030 )) );
    _BitScanForward((unsigned long *) &idx, tmp);
    cnt = 8 -idx; // actual number non-zero-leading digits to write to output
  } else { // conversion of 9-12 digits
    lo64 = _mulx_u64(xx, (unsigned __int64) QWCG10to8, &hi64);
    hi64 >>= 26;
    xxi = _mulx_u64(hi64, (unsigned __int64)100000000, &xx2);
    lo64 = (unsigned __int64)xx - xxi;
    if( hi64 < 10000) { // do digits 12-9 first
      __ParMod10to4AVX2dw(v0, hi64);
      v0 = _mm_add_epi32( v0, _mm_set1_epi32( 0x30303030 ) );
      // continue conversion of low 8 digits of a less-than 12-digit value
      x5 = _mm256_setzero_si256( );
      x5 = _mm256_castsi128_si256( _mm_cvtsi32_si128(lo64));
      x1 = _mm256_broadcastd_epi32( _mm_cvtsi32_si128(lo64)); // broadcast to every dword
      x3 = _mm256_mul_epu32(x1, _mm256_loadu_si256( (__m256i *) pr_cg_10to4 ));
      x3 = _mm256_mullo_epi32(_mm256_srli_epi64(x3, 40), _mm256_loadu_si256( (__m256i *)pr_1_m10to4));
      m0 = _mm256_add_epi32( x5, x3); // quotient in dw4, remainder in dw0
      __ParMod10to4AVX2dw4_0( x3, m0);
      x3 = _mm256_add_epi32( x3, _mm256_set1_epi32( 0x30303030 ) );
      x4 = _mm256_shuffle_epi8(x3, _mm256_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080, 0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
      x5 = _mm256_castsi128_si256( _mm_shuffle_epi8( v0, _mm_setr_epi32(0x80808080, 0x80808080, 0x0004080c, 0x80808080) ));
      x4 = _mm256_permutevar8x32_epi32( _mm256_or_si256(x4, x5), _mm256_setr_epi32(0x2, 0x4, 0x0, 0x1, 0x1, 0x1, 0x1, 0x1) );
      tmp = _mm256_movemask_epi8( _mm256_cmpgt_epi8(x4, _mm256_set1_epi32( 0x30303030 )) );
      _BitScanForward((unsigned long *) &idx, tmp);
      cnt = 12 -idx;
    } else { // handle greater than 12 digit input value
      cnt = 0;
      if ( hi64 > 100000000) { // case of input value has more than 16 digits
        xxi = _mulx_u64(hi64, (unsigned __int64) QWCG10to8, &xx2) ;
        abv16 = xx2 >>26;
        hi64 -= _mulx_u64((unsigned __int64) abv16, (unsigned __int64) 100000000, &xx2);
        __ParMod10to4AVX2dw(v0, abv16);
        v0 = _mm_add_epi32( v0, _mm_set1_epi32( 0x30303030 ) );
        v0 = _mm_shuffle_epi8(v0, _mm_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
        tmp = _mm_movemask_epi8( _mm_cmpgt_epi8(v0, _mm_set1_epi32( 0x30303030 )) );
        _BitScanForward((unsigned long *) &idx, tmp);
        cnt = 4 -idx;
      }
      // conversion of lower 16 digits
      x1 = _mm256_broadcastd_epi32( _mm_cvtsi32_si128(hi64)); // broadcast to every dword
      x3 = _mm256_mul_epu32(x1, _mm256_loadu_si256( (__m256i *) pr_cg_10to4 ));
      x3 = _mm256_mullo_epi32(_mm256_srli_epi64(x3, 40), _mm256_loadu_si256( (__m256i *)pr_1_m10to4));
      m0 = _mm256_add_epi32( _mm256_castsi128_si256( _mm_cvtsi32_si128(hi64)), x3);
      __ParMod10to4AVX2dw4_0( x3, m0);
      x3 = _mm256_add_epi32( x3, _mm256_set1_epi32( 0x30303030 ) );
      x4 = _mm256_shuffle_epi8(x3, _mm256_setr_epi32(0x0004080c, 0x80808080, 0x80808080, 0x80808080, 0x0004080c, 0x80808080, 0x80808080, 0x80808080) );
      x1 = _mm256_broadcastd_epi32( _mm_cvtsi32_si128(lo64)); // broadcast to every dword
      x3 = _mm256_mul_epu32(x1, _mm256_loadu_si256( (__m256i *) pr_cg_10to4 ));
      x3 = _mm256_mullo_epi32(_mm256_srli_epi64(x3, 40), _mm256_loadu_si256( (__m256i *)pr_1_m10to4));
      // the remainder of the low 8 digits is formed from lo64
      m0 = _mm256_add_epi32( _mm256_castsi128_si256( _mm_cvtsi32_si128(lo64)), x3);
      __ParMod10to4AVX2dw4_0( x3, m0);
      x3 = _mm256_add_epi32( x3, _mm256_set1_epi32( 0x30303030 ) );
      x5 = _mm256_shuffle_epi8(x3, _mm256_setr_epi32(0x80808080, 0x80808080, 0x0004080c, 0x80808080, 0x80808080, 0x80808080, 0x0004080c, 0x80808080) );
      x4 = _mm256_permutevar8x32_epi32( _mm256_or_si256(x4, x5), _mm256_setr_epi32(0x4, 0x0, 0x6, 0x2, 0x1, 0x1, 0x1, 0x1) );
      cnt += 16;
      if (cnt >24); *(unsigned *) ps = (w >>32); break;
      case 6: *(short *)ps = (short) (w >>16);
        *(unsigned *) (&ps[2]) = (w >>32);
        break;
      case 7: *ps = (char) (w >>8);
        *(short *) (&ps[1]) = (short) (w >>16);
        *(unsigned *) (&ps[3]) = (w >>32);
        break;
      case 8: *(long long *)ps = w;
        break;
      case 9: *ps++ = (char) (w >>24);
        *(long long *) (&ps[0]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
        break;
    case 10:
        *(short *)ps = (short) (w >>16);
        *(long long *) (&ps[2]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
        break;
    case 11:
        *ps = (char) (w >>8);
        *(short *) (&ps[1]) = (short) (w >>16);
        *(long long *) (&ps[3]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
        break;
    case 12:
        *(unsigned *)ps = w;
        *(long long *) (&ps[4]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 4));
        break;
    case 13:
        *ps++ = (char) (w >>24);
        *(unsigned *) ps = (w >>32);
        *(long long *) (&ps[4]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 8));
        break;
    case 14:
        *(short *)ps = (short) (w >>16);
        *(unsigned *) (&ps[2]) = (w >>32);
        *(long long *) (&ps[6]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 8));
        break;
    case 15:
        *ps = (char) (w >>8);
        *(short *) (&ps[1]) = (short) (w >>16);
        *(unsigned *) (&ps[3]) = (w >>32);
        *(long long *) (&ps[7]) = _mm_cvtsi128_si64( _mm_srli_si128(_mm256_castsi256_si128(x4), 8));
        break;
    case 16:
        _mm_storeu_si128( (__m128i *) ps, _mm256_castsi256_si128(x4));
        break;
    case 17:
        u = _mm_cvtsi128_si64(v0);
        *ps++ = (char) (u >>24);
        _mm_storeu_si128( (__m128i *) &ps[0], _mm256_castsi256_si128(x4));
        break;
    case 18:
        u = _mm_cvtsi128_si64(v0);
        *(short *)ps = (short) (u >>16);
        _mm_storeu_si128( (__m128i *) &ps[2], _mm256_castsi256_si128(x4));
        break;
    case 19:
        u = _mm_cvtsi128_si64(v0);
        *ps = (char) (u >>8);
        *(short *) (&ps[1]) = (short) (u >>16);
        _mm_storeu_si128( (__m128i *) &ps[3], _mm256_castsi256_si128(x4));
        break;
    case 20:
        u = _mm_cvtsi128_si64(v0);
        *(unsigned *)ps = (unsigned) (u); // store all four leading digits
        _mm_storeu_si128( (__m128i *) &ps[4], _mm256_castsi256_si128(x4));
        break;
    }
    return cnt;
}

The AVX2 version of numeric conversion across the dynamic range of 3/9/17 output digits takes approximately 23/57/54 cycles per input, compared to the standard library implementation's range of 85/260/560 cycles per input.
The techniques illustrated above can be extended to numeric conversions in other libraries, such as those for the binary-integer-decimal (BID) encoded IEEE 754-2008 decimal floating-point format. For the BID-128 format, Example 11-42 can be adapted by adding another range-reduction stage using a pre-computed 256-bit constant to perform Montgomery reduction at modulus 10^16. The technique to construct the 256-bit constant is covered in Chapter 10, "SSE4.2 and SIMD Programming For Text-Processing/Lexing/Parsing" of Intel® 64 and IA-32 Architectures Optimization Reference Manual.
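The core idea behind the range reduction in Example 11-42, replacing division by 10^4 with a multiply and a shift, can be sketched in scalar C. This is only an illustration: the constant 0xD1B71759 with a 45-bit shift is the usual exact reciprocal for 32-bit inputs and is not the `pr_cg_10to4` constant of the example, and the function names are hypothetical.

```c
#include <stdint.h>

/* Quotient of x / 10000 via multiply-and-shift (exact for any 32-bit x).
 * 0xD1B71759 = ceil(2^45 / 10000); assumed here for illustration only. */
static uint32_t div10000(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xD1B71759u) >> 45);
}

/* Write the four decimal digits of v (0..9999) as ASCII into out[0..3]. */
static void put4digits(uint32_t v, char *out)
{
    for (int i = 3; i >= 0; --i) {
        out[i] = (char)('0' + v % 10);   /* same effect as adding 0x30 per byte */
        v /= 10;
    }
}

/* Convert x (< 10^8) into exactly eight ASCII digits, leading zeros kept. */
static void u32_to_8digits(uint32_t x, char out[9])
{
    uint32_t hi = div10000(x);           /* upper 4-digit group */
    uint32_t lo = x - hi * 10000u;       /* remainder: lower 4-digit group */
    put4digits(hi, out);
    put4digits(lo, out + 4);
    out[8] = '\0';
}
```

The SIMD version performs the same quotient/remainder split on several 4-digit groups at once and strips leading zeros afterwards.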
11.16.4 Considerations for Gather Instructions
The VGATHER family of instructions fetches multiple data elements specified by a vector index register containing relative offsets from a base address. Processors based on the Haswell microarchitecture are the first implementation of the VGATHER instruction, and a single instruction results in multiple micro-ops being executed. In the Broadwell microarchitecture, the throughput of the VGATHER family of instructions has improved significantly; see Table C-5. Depending on data organization and access patterns, it is possible to create equivalent code sequences without using the VGATHER instruction that will execute faster and with fewer micro-ops than a single VGATHER instruction (e.g. see Section 11.5.1). Example 11-43 shows some of those situations where use of VGATHER on Intel microarchitecture code name Haswell is unlikely to provide a performance benefit.
Example 11-43. Access Patterns Favoring Non-VGATHER Techniques

| Access Patterns | Recommended Instruction Selection |
|---|---|
| Sequential elements | Regular SIMD loads (MOVAPS/MOVUPS, MOVDQA/MOVDQU) |
| Fewer than 4 elements | Regular SIMD load + horizontal data-movement to re-arrange slots |
| Small Strides | Load all nearby elements + shuffle/permute to collect strided elements: VMOVUPD YMM0, [sequential elements]; VPERMQ YMM1, YMM0, 0x08 // the even elements; VPERMQ YMM2, YMM0, 0x0d // the odd elements |
| Transpositions | Regular SIMD loads + shuffle/permute/blend to transpose to columns |
| Redundant elements | Load once + shuffle/blend/logical to build data vectors in register. In this case, result[i] = x[index[i]] + x[index[i+1]], the technique below may be preferable to using multiple VGATHER: ymm0 |

Base category into FP Arith. with Scalar and Vector operations distinction. For more details see Matrix-Multiply use-case2.
B.1.8 TMAM and Skylake Microarchitecture
The performance monitoring capabilities in the Skylake microarchitecture are significantly enhanced over prior generations. TMAM benefits directly from the enhancement in the breadth of available counter events and in Precise Event Based Sampling (PEBS) capabilities. Figure B-3 shows the Skylake microarchitecture's support for TMAM, where the boxes in green indicate that PEBS events are available. The Intel VTune Performance Analyzer allows the user to apply TMAM on many Intel microarchitectures. The reader may wish to consult the white paper available at https://software.intel.com/en-us/articles/how-to-tune-applications-using-a-top-down-characterization-of-microarchitectural-issues, and the use cases in the white paper, for additional details.
B.1.8.1 TMAM Examples
Section 11.15.1 describes techniques for optimizing floating-point calculations involving latency and throughput considerations of FP MUL, FP ADD and FMA instructions. There are no explicit performance counter events that can directly detect exposure to latency issues of FP_ADD and FP_MUL instructions. TMAM may be used to figure out when this performance issue is likely to be a performance limiter. If the primary bottleneck is Backend_Bound->Core_Bound->Ports_Utilization and there is a significant measure in the GFLOPS metric, it is possible that the user code is hitting this issue. The user may consider the optimizations listed in Section 11.15.1.

Section 11.3.1 describes possible performance issues of executing SSE code while the upper YMM state is dirty in the Skylake microarchitecture. To detect the performance issue associated with partial register dependence and the associated blend cost on SSE code execution, TMAM can be used to monitor the rate of mixture of SSE operations and AVX operations on performance-critical SSE code whose source code did not directly execute AVX instructions. If the primary bottleneck is Backend_Bound->Core_Bound, and there is a significant measure in the VectorMixRate metric, it is possible that the presence of vector operations with mismatched vector width was due to the extra blend operations on the upper YMM registers.

The VectorMixRate metric requires the UOPS_ISSUED.VECTOR_WIDTH_MISMATCH event that is available in the Skylake microarchitecture. This event counts the blend uops that the Resource Allocation Table (RAT) inserts at the issue stage and issues to the reservation station (RS) in order to preserve the upper bits of vector registers.
Additionally, the metric uses the UOPS_ISSUED.ANY event, which is common in recent Intel microarchitectures, as the denominator. The UOPS_ISSUED.ANY event counts the total number of uops that the RAT issues to the RS. The VectorMixRate metric gives the percentage of injected blend uops out of all uops issued. Usually a VectorMixRate over 5% is worth investigating.

VectorMixRate[%] = 100 * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY

Note that the actual penalty may vary, as it stems from the additional data dependency on the destination register that the injected blend operations add.
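The VectorMixRate computation can be sketched as a small helper. This is a minimal illustration that assumes the two raw event counts have already been collected by some tool; the function names and the zero-denominator handling are assumptions, not part of any Intel API.

```c
#include <stdint.h>

/* VectorMixRate[%] = 100 * UOPS_ISSUED.VECTOR_WIDTH_MISMATCH / UOPS_ISSUED.ANY.
 * Returns 0.0 when no uops were issued (illustrative choice). */
static double vector_mix_rate(uint64_t width_mismatch_uops, uint64_t issued_any_uops)
{
    if (issued_any_uops == 0)
        return 0.0;
    return 100.0 * (double)width_mismatch_uops / (double)issued_any_uops;
}

/* Applies the 5% rule of thumb quoted in the text. */
static int vector_mix_worth_investigating(double rate_percent)
{
    return rate_percent > 5.0;
}
```

For example, 80 mismatch uops out of 1000 issued uops gives a rate of 8%, which crosses the 5% guideline.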
[Figure: TMAM hierarchy tree. Pipeline Slots divide into Not Stalled (Retiring, Bad Speculation) and Stalled (Front End Bound, Back End Bound); Front End Bound splits into Fetch Latency and Fetch Bandwidth, and Back End Bound splits into Core Bound and Memory Bound, each with further sub-levels.]
Figure B-3. TMAM Hierarchy Supported by Skylake Microarchitecture
B.2 PERFORMANCE MONITORING AND MICROARCHITECTURE
This section provides information on performance monitoring hardware and terminology related to the Silvermont, Airmont and Goldmont microarchitectures. The features described here may be specific to an individual microarchitecture, as indicated in Table B-1.
Table B-1. Performance Monitoring Taxonomy

| Name | Description | Applicable Microarchitectures |
|---|---|---|
| L2Q, XQ | When a memory reference misses the L1 data cache, the request goes to the L2 Queue (L2Q). If the request also misses the L2 cache, the request is sent to the XQ, where it waits for an opportunity to be issued to memory across the Intra-Die Interface (IDI) link. Note that since the L2 is shared between a pair of processor cores, a single L2Q is shared between those two cores. Similarly, there is a single XQ for a pair of processor cores, situated between the L2Q and the IDI link. The XQ will fill up when the response rate from the IDI link is smaller than the rate at which new requests arrive at the XQ. The event L2_reject_XQ indicates that a request is unable to move from the L2 Queue to the XQ because the XQ is full, and thus indicates that the memory system is oversubscribed. | Silvermont, Airmont, Goldmont |
| Core Reject | The core_reject event indicates that a request from the core cannot be accepted at the L2Q. However, there are several additional reasons why a request might be rejected from the L2Q. Beyond rejecting a request because the L2Q is full, a request from one core can be rejected to maintain fairness to the other core. That is, one core is not permitted to monopolize the shared connection to the L2Q/cache/XQ/IDI links, and might have its requests rejected even when there is room available in the L2Q. In addition, if the request from the core is a dirty L1 cache eviction, the hardware must ensure that this eviction does not conflict with any pending request in the L2Q (pending requests can include an external snoop). In the event of a conflict, the dirty eviction request might be rejected even when there is room in the L2Q. Thus, while the L2_reject_XQ event indicates that the request rate to memory from both cores exceeds the response rate of the memory, the Core_reject event is more subtle. It can indicate that the request rate to the L2Q exceeds the response rate from the XQ, or it can indicate the request rate to the L2Q exceeds the response rate from the L2, or it can indicate that one core is attempting to request more than its fair share of response from the L2Q. Or, it can be an indicator of conflict between dirty evictions and other pending requests. In short, the L2_reject_XQ event indicates memory oversubscription. The Core_reject event can indicate either (1) memory oversubscription, (2) L2 oversubscription, (3) rejecting one core's requests to ensure fairness to the other core, or (4) a conflict between dirty evictions and other pending requests. | Silvermont, Airmont, Goldmont |
| Divider Busy | The divide unit is unable to accept a new divide uop when it is busy processing a previously dispatched divide uop. The "CYCLES_DIV_BUSY.ANY" event will count cycles that the divide unit is busy, irrespective of whether or not another divide uop is waiting to enter the divide unit (from the RS). The event will count cycles while a divide is in progress even if the RS is empty. | Silvermont, Airmont, Goldmont |
| BACLEAR | Shortly after decoding an instruction and recognizing a branch/call/jump/ret instruction, it is possible for a Branch Address Calculator Clear (BACLEAR) event to occur. Possible causes of a BACLEAR include predicting the wrong target of a direct branch or not predicting a branch at that instruction location. A BACLEAR causes the front end to restart fetching from a different location. While BACLEAR has similarities to a branch mispredict signaled from the execute part of the pipeline, it is not counted as a BR_MISP_RETIRED event or noted as a mispredict in the LBRs (where LBRs report mispredict). Branch mispredicts and BACLEARs are similar in that they both restart the front end to begin instruction fetch at a new target location, and they both flush some speculative work. However, a branch mispredict must flush partially completed instructions from both front end and back end. Since a BACLEAR occurs right at decode time, it flushes instruction bytes and not yet fully decoded instructions. Recovery after a BACLEAR is less complicated, and faster, than recovery after a branch mispredict. | Silvermont, Airmont, Goldmont |
| Front-end Bottleneck | The front end is responsible for fetching the instruction, decoding it into micro-ops (uops) and putting those uops into a micro-op queue to be consumed by the back end. The back end then takes these micro-ops and allocates the required resources. When all resources are ready, micro-ops are executed. A front-end bottleneck occurs when the front end of the machine is not delivering uops to the back end and the back end is not stalled. Cycles where the back end is not ready to accept micro-ops from the front end should not be counted as front-end bottlenecks, even though such back-end bottlenecks will cause allocation unit stalls, eventually forcing the front end to wait until the back end is ready to receive more uops. | Silvermont, Airmont, Goldmont |
| NO_ALLOC_CYCLES | Front-end issues can be analyzed using various sub-events within this event class. | Silvermont, Airmont |
| UOPS_NOT_DELIVERED.ANY | The UOPS_NOT_DELIVERED.ANY event is used to measure front-end inefficiencies to identify if the machine is truly front-end bound. Some examples of front-end inefficiencies are: ICache misses, ITLB misses, and decoder restrictions that limit the front-end bandwidth. | Goldmont |
| ICache | Requests to the Instruction Cache (ICache) are made in a fixed-size unit called a chunk. There are multiple chunks in a cache line, and multiple accesses might be made to a single cache line. In the Goldmont microarchitecture, the event strives to count on a cache line basis, so that multiple fetches to a single cache line count as one ICACHE.ACCESS, and either one HIT or one MISS. Specifically, the event counts when straight line code crosses the cache line boundary, or when a branch target is on a new line. This event is highly speculative in nature, with bytes being fetched in advance of being decoded, executed or retired. The speculation occurs in straight line code as well as in the presence of branches. Consequently, one cannot deduce ICACHE statistics by examining the number of instructions retired. In the Silvermont microarchitecture, ICACHE events (HIT, MISS) count at different granularity. | Goldmont |
| ICache Access | An ICache fetch accesses an aligned chunk of fixed size. A request to fetch a specific chunk from the instruction cache might occur multiple times due to speculative execution. It may be possible that the same chunk, while outstanding, is requested multiple times. However, an instruction fetch miss is counted only once and not counted every cycle it is outstanding. After an ICache miss fetches the line, another request to the same cache line is likely to be made again and will be counted as a hit. Thus, the number of "hits" plus "misses" does not equal the number of accesses. From a software perspective, to get the number of true ICache hits, one should subtract the ICache miss count from the ICache hit count. | Silvermont, Airmont, Goldmont |
| Last Level Cache References, Misses | On processors that do not have L3, L2 is the last level cache. The architectural performance events that count LLC references and misses are also known as L2_REQUESTS.ANY and L2_REQUESTS.MISS. | Silvermont, Airmont, Goldmont |
| Machine Clear | There are many conditions that might cause a machine clear (including the receipt of an interrupt, or a trap or a fault). All those conditions, including but not limited to MO (Memory Ordering), SMC (Self or Cross Modifying Code) and FP (Floating Point assist), are captured in the MACHINE_CLEAR.ANY event. In addition, some conditions can be specifically counted (i.e. SMC, MO, FP). However, the sum of SMC, MO and FP machine clears will not necessarily equal the number of ANY. | Silvermont, Airmont, Goldmont |
| MACHINE_CLEAR.FP_ASSIST | Most of the time, the floating point execute unit can properly produce the correct output bits. On rare occasions, it needs a little help. When help is needed, a machine clear is asserted against the instruction. After this machine clear (as described above), the front end of the machine begins to deliver instructions that will figure out exactly what FP operation was asked for, and they will do the extra work to produce the correct FP result (for instance, if the result was a floating point denormal, sometimes the hardware asks for help to produce the correctly rounded IEEE compliant result). | Silvermont, Airmont, Goldmont |
| MACHINE_CLEAR.SMC | Self Modifying Code (SMC) refers to a piece of code that wrote to the instruction stream ahead of where the machine will execute. In the Silvermont microarchitecture, the processor detects SMC in a 1K aligned region. A detected SMC condition causes a machine clear assist and will flush the pipeline. Writing to memory within 1K of where the processor is executing can trigger the SMC detection mechanism and cause a machine clear. Since the machine clear allows the store pipeline to drain, when front end restart occurs, the correct instructions (after the write) will be executed. | Silvermont, Airmont, Goldmont |
| MACHINE_CLEAR.MO | A memory order machine clear happens when a snoop request occurs and the machine is uncertain if memory ordering will be preserved. For instance, suppose you have two loads, one to address X followed by another to address Y, in the program order. Both loads have been issued; however, the load to Y completes first and all the dependent ops on this load continue with the data loaded by this load. The load to X is still waiting for the data. Suppose that at the same time another processor writes to the same address Y and causes a snoop to address Y. This presents a problem: the load to Y got the old value, but we have not yet finished loading X. The other processor saw the loads in a different order by not consuming the latest value from the store to address Y. We need to undo everything from the load to address Y so that we will see the post-write data. Note: we do not have to undo load Y if there were no other pending reads; the fact that the load to X is not yet finished causes this ordering problem. | Silvermont, Airmont, Goldmont |
| MACHINE_CLEAR.DISAMBIGUATION | A disambiguation machine clear is triggered by a younger load passing an older store to the same address. | Goldmont |
| Page Walk | When a translation of linear address to physical address cannot be found in the Translation Look-aside Buffer (TLB), dedicated hardware must retrieve the physical address from the page table and other paging structures if needed. After the page walk, the translation is stored in the TLB for future use. Since paging structures are stored in memory, the page walk can require multiple memory accesses. These accesses are considered part of demand data, even if the page walk is to translate an instruction reference. The number of cycles for a page walk is variable, depending on how many memory accesses are required and the cache locality of those memory accesses. The PAGE_WALKS event can be used to count page walk durations with the EDGE trigger bit cleared. Page walk duration divided by number of page walks is the average duration of page walks. In the Goldmont microarchitecture, the number of page walks can be determined by using the events MEM_UOPS_RETIRED.DTLB_MISS and ITLB.MISS. In the Silvermont microarchitecture, the combined number of page walks for data and instruction can be counted with PAGE_WALKS.WALKS. | Silvermont, Airmont, Goldmont |
| RAT | The allocation pipeline which moves uops from the front end to the back end. At the end of the allocate pipe a uop needs to be written into one of 6 reservation stations (the RS). Each RS holds uops that are to be sent to a specific execution (or memory) cluster. Each RS has a finite capacity, and it may accumulate uops when it is unable to send a uop to its execution cluster. Typical reasons why an RS may fill include, but are not limited to, execution of long latency uops like divide, or inability to schedule uops due to dependencies, or too many outstanding memory references. When the RS becomes full, it is unable to accept more uops, and it will stall the allocation pipeline. The RS_FULL_STALL.ANY event will be asserted on any cycle when the allocation is stalled for any one of the RSs being full and not for other reasons (i.e. the allocate pipeline might be stalled for some other reason, but if the RS is not full, RS_FULL_STALL.ANY will not count). The MEC sub-event allows discovery of whether the MEC RS being full prevents further allocation. | Silvermont, Airmont, Goldmont |
| REHABQ | An internal queue that holds memory reference micro-ops which cannot complete for one reason or another. The micro-ops remain in the REHABQ until they can be re-issued and successfully completed. Examples of bottlenecks that cause micro-ops to go into the REHABQ include, but are not limited to: cache line splits, blocked store forward and data not ready. There are many other conditions that might cause a load or store to be sent to the REHABQ. For instance, if an older store has an unknown address, all subsequent stores must be sent to the REHABQ until that older store's address becomes known. | Silvermont, Airmont |
| LOAD_BLOCKS | Loads can be blocked for multiple reasons including, but not limited to, UTLB misses, blocked store forwards, 4-K aliases or other conditions. When a load needs data (in whole or in part) that was produced by a previous store, forward progress of the machine will face two scenarios: (i) the machine waits until the previous store is complete (forwarding restricted, loads blocked) or (ii) data can be forwarded to the load before the previous store is complete. The restricted situations are described next. When a load is checked against previous stores, not all of its address bits are compared to the store addresses. This can cause a load to be blocked because its address is similar (LD_BLOCKS.4K_ALIAS) to a pending store, even though technically the load does not need to be blocked. When conditions do not allow the load to receive data from the in-progress store, then the load is blocked until the pending store operation is complete. LD_BLOCKS.STORE_FORWARD counts times when a load was prohibited from receiving forwarded data from the store because of address mismatch (explained below). LD_BLOCKS.DATA_UNKNOWN counts when a load is blocked from using a store forward because the store data was not available at the right time. A load block will not be counted as both LD_BLOCKS.DATA_UNKNOWN and LD_BLOCKS.STORE_FORWARD. The conditions under which a load can receive data from an older store are shown in Table 14-12. These events are precise events and thus will not count speculative loads that do not retire. | Goldmont |
| Uops Retired | The processor decodes complex macro instructions into a sequence of simpler micro-ops. Most instructions are composed of one or two micro-ops. Some instructions are decoded into longer sequences of uops; for example, floating point transcendental instructions, assists, and rep string instructions. In some cases micro-op sequences are fused, or whole instructions are fused, into one micro-op. A sub-event within UOPS_RETIRED is available for differentiating MSROM micro-ops on Goldmont. The available sub-events differ on other microarchitectures. | Silvermont, Airmont, Goldmont |
| HW_INTERRUPTS | These events provide information regarding Hardware (Vectored, Fixed) interrupts. HW_INTERRUPTS.RECEIVED provides a count of the total number of Hardware Interrupts received by the processor. This event is a straightforward count of the number of interrupts the ROB recognizes. HW_INTERRUPTS.PENDING_AND_MASKED counts the number of core cycles that an interrupt is pending but cannot be delivered due to EFLAGS.IF being 0. It will not count interrupts that are masked by TPR or ISR. These events are not precise, but collecting non-precise PEBS records on these events can help identify issues causing an unresponsive system. | Goldmont |
| MEM_UOPS_RETIRED | These events count a uop that reads (loads) or writes (stores) data if that uop retired valid. Speculative loads and stores are not counted. The sub-events can indicate conditions that generally require extra cycles to complete the operation: specifically, if the address of the memory uop misses in the Data Translation Lookaside Buffer (DTLB), the data requested spans a cache line (split), or the memory uop is a locked load. These are precise events, so the EventingRIP field in the PEBS record indicates the instruction which caused the event. | Silvermont, Airmont, Goldmont |
| MEM_LOAD_UOPS_RETIRED | These events count when an instruction produces a uop that reads (loads) data if that uop retired valid. Speculative loads are not counted. These events report the various states of the memory hierarchy for the data being requested, which helps determine the source of latency stalls in accessing data. These are precise events, so the EventingRIP field in the PEBS record indicates the instruction which caused the event. | Goldmont |
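Two of the derived quantities described in Table B-1, the average page-walk duration and the "true" ICache hit count, are simple arithmetic on raw counts. The sketch below assumes the counts were already read from the relevant events; the function names and the handling of zero or inconsistent inputs are illustrative assumptions.

```c
#include <stdint.h>

/* Average page-walk duration in cycles:
 * total page-walk duration divided by the number of page walks. */
static double avg_page_walk_cycles(uint64_t walk_duration_cycles, uint64_t walks)
{
    if (walks == 0)
        return 0.0;                       /* illustrative choice for "no walks" */
    return (double)walk_duration_cycles / (double)walks;
}

/* "True" ICache hits per the ICache Access entry: hit count minus miss count,
 * because the re-request after a miss is itself counted as a hit. */
static uint64_t true_icache_hits(uint64_t hits, uint64_t misses)
{
    return hits >= misses ? hits - misses : 0;
}
```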
B.3 INTEL® XEON® PROCESSOR 5500 SERIES
Intel Xeon processor 5500 series are based on the same microarchitecture as Intel Core i7 processors; see Section 2.5, "Intel® Microarchitecture Code Name Nehalem". In addition, Intel Xeon processor 5500 series support non-uniform memory access (NUMA) in platforms that have two physical processors; see Figure B-4. Figure B-4 illustrates 4 processor cores and an uncore sub-system in each physical processor. The uncore sub-system consists of L3, an integrated memory controller (IMC), and Intel QuickPath Interconnect (QPI) interfaces. The memory sub-system consists of three channels of DDR3 memory locally connected to each IMC. Access to physical memory connected to a non-local IMC is often described as a remote memory access.
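[Figure: two-way system. Each of the two physical processors contains four cores (Core0 to Core3), an 8MB L3, an IMC attached to DDR3 memory, and QPI interfaces; QPI links connect the two processors to each other and to the IOH/PCH.]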
Figure B-4. System Topology Supported by Intel® Xeon® Processor 5500 Series
The performance monitoring events on Intel Xeon processor 5500 series can be used to analyze the interaction between software (code and data) and microarchitectural units hierarchically:
• Per-core PMU: Each processor core provides 4 programmable counters and 3 fixed counters. The programmable per-core counters can be configured to investigate front end/micro-op flow issues and stalls inside a processor core. Additionally, a subset of per-core PMU events support precise event-based sampling (PEBS). The load latency measurement facility is new in Intel Core i7 processor and Intel Xeon processor 5500.
• Uncore PMU: The uncore PMU provides 8 programmable counters and 1 fixed counter. The programmable uncore counters can be configured to characterize L3 and Intel QPI operations, and local and remote data memory accesses.
The number and variety of performance counters and the breadth of programmable performance events available in Intel Xeon processor 5500 offer software tuning engineers the ability to analyze performance issues and achieve higher performance. Using performance events to analyze performance issues can be grouped into the following subjects:
• Cycle Accounting and Uop Flow.
• Stall Decomposition and Core Memory Access Events (non-PEBS).
• Precise Memory Access Events (PEBS).
• Precise Branch Events (PEBS, LBR).
• Core Memory Access Events (non-PEBS).
• Other Core Events (non-PEBS).
• Front End Issues.
• Uncore Events.
B.4 PERFORMANCE ANALYSIS TECHNIQUES FOR INTEL® XEON® PROCESSOR 5500 SERIES
The techniques covered in this chapter focus on identifying opportunities to remove or reduce performance bottlenecks that are measurable at runtime. Compile-time and source-code level techniques are covered in other chapters in this document. Individual sub-sections describe specific techniques to identify tuning opportunities by examining various metrics that can be measured or derived directly from performance monitoring events.
B.4.1 Cycle Accounting and Uop Flow Analysis
The objectives, performance metrics and component events of the basic cycle accounting technique are summarized in Table B-2.
Table B-2. Cycle Accounting and Micro-ops Flow Recipe

| Summary Objective | Identify code/basic block that had significant stalls |
|---|---|
| Method | Binary decomposition of cycles into "productive" and "unproductive" parts |
| PMU-Pipeline Focus | Micro-ops issued to execute |
| Event code/Umask | Event code B1H, Umask=3FH for micro-op execution; Event code 3CH, Umask=1, CMask=2 for counting total cycles |
| EvtSelc | Use CMask, Invert, Edge fields to count cycles and separate stalled vs. active cycles |
| Basic Equation | "Total Cycles" = UOPS_EXECUTED.CORE_STALLS_CYCLES + UOPS_EXECUTED.CORE_ACTIVE_CYCLES |
| Metric | UOPS_EXECUTED.CORE_STALLS_CYCLES / UOPS_EXECUTED.CORE_STALLS_COUNT |
| Drill-down scope | Counting: Workload; Sampling: basic block |
| Variations | Port 0, 1, 5 cycle counting for computational micro-ops execution. |
Cycle accounting of executed micro-ops is an effective technique to identify stalled cycles for performance tuning. Within the microarchitecture pipeline, the terms "issued", "dispatched", "executed" and "retired" each have a precise meaning for micro-ops. This is illustrated in Figure B-5. Cycles are divided into those where micro-ops are dispatched to the execution units and those where no micro-ops are dispatched, which are thought of as execution stalls.

"Total cycles" of execution for the code under test can be directly measured with CPU_CLK_UNHALTED.THREAD (event code 3CH, Umask=1) and setting CMask = 2 and INV=1 in IA32_PERFEVTSELCn.

The signals used to count the memory access uops executed (ports 2, 3 and 4) are the only core events which cannot be counted per logical processor. Thus, Event code B1H with Umask=3FH only counts on a per-core basis, and the total execution stall cycles can only be evaluated on a per-core basis. If HT is disabled, this presents no difficulty in conducting per-thread analysis of micro-op flow cycle accounting.
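The basic equation and metric of Table B-2 can be sketched as follows. This is a minimal illustration that assumes the three counts were collected over the same interval; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Raw counts, assumed to be collected over the same measurement interval. */
typedef struct {
    uint64_t stall_cycles;   /* UOPS_EXECUTED.CORE_STALLS_CYCLES */
    uint64_t stall_count;    /* UOPS_EXECUTED.CORE_STALLS_COUNT  */
    uint64_t active_cycles;  /* UOPS_EXECUTED.CORE_ACTIVE_CYCLES */
} cycle_counts;

/* Basic equation: total cycles = stalled cycles + active cycles. */
static uint64_t total_cycles(const cycle_counts *c)
{
    return c->stall_cycles + c->active_cycles;
}

/* Fraction of all cycles in which no uop was dispatched. */
static double stall_fraction(const cycle_counts *c)
{
    uint64_t total = total_cycles(c);
    return total ? (double)c->stall_cycles / (double)total : 0.0;
}

/* Metric: average length of one stall, in cycles. */
static double avg_stall_duration(const cycle_counts *c)
{
    return c->stall_count ? (double)c->stall_cycles / (double)c->stall_count : 0.0;
}
```

A long average stall duration points toward long-latency events such as cache misses, while many short stalls suggest dependency or scheduling effects; that reading is a common interpretation rather than something the table states.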
USING PERFORMANCE MONITORING EVENTS
[Figure: pipeline blocks IFetch/BPU, Decoder, Resource Allocator, RS, Dispatch, Execution Units, ROB and Retirement/Writeback, annotated with the events “UOPS_ISSUED”, “RESOURCE_STALLS”, “UOPS_EXECUTED” and “UOPS_RETIRED”.]
Figure B-5. PMU Specific Event Logic Within the Pipeline
The PMU signals to count uops_executed in port 0, 1, 5 can count on a per-thread basis even when HT is active. This provides an alternate cycle accounting technique when the workload under test interacts with HT. The alternate metric is built from UOPS_EXECUTED.PORT015_STALL_CYCLES, using appropriate CMask, Inv, and Edge settings. Details of performance events are shown in Table B-3.
Table B-3. CMask/Inv/Edge/Thread Granularity of Events for Micro-op Flow
Event Name                             Umask  Event Code  Cmask  Inv  Edge  All Thread
CPU_CLK_UNHALTED.TOTAL_CYCLES          0H     3CH         2      1    0     0
UOPS_EXECUTED.CORE_STALLS_CYCLES       3FH    B1H         1      1    0     1
UOPS_EXECUTED.CORE_STALLS_COUNT        3FH    B1H         1      1    1     1
UOPS_EXECUTED.CORE_ACTIVE_CYCLES       3FH    B1H         1      0    0     1
UOPS_EXECUTED.PORT015_STALLS_CYCLES    40H    B1H         1      1    0     0
UOPS_RETIRED.STALL_CYCLES              1H     C2H         1      1    0     0
UOPS_RETIRED.ACTIVE_CYCLES             1H     C2H         1      0    0     0
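To make the CMask/Inv/Edge/AllThread columns of Table B-3 concrete, the sketch below packs them into an event-select register value. It assumes the architectural IA32_PERFEVTSELx bit layout (event select in bits 7:0, Umask in 15:8, USR 16, OS 17, Edge 18, AnyThread 21, EN 22, INV 23, CMask in 31:24); the helper name and the choice to set USR, OS and EN by default are illustrative assumptions, not part of the recipe itself.

```python
def perfevtsel(event: int, umask: int, cmask: int = 0, inv: int = 0,
               edge: int = 0, any_thread: int = 0,
               usr: int = 1, os_mode: int = 1, enable: int = 1) -> int:
    """Pack event fields into an IA32_PERFEVTSELx value (assumed layout)."""
    return ((event & 0xFF)
            | (umask & 0xFF) << 8
            | (usr & 1) << 16         # count in user mode
            | (os_mode & 1) << 17     # count in kernel mode
            | (edge & 1) << 18        # count 0->1 transitions, not cycles
            | (any_thread & 1) << 21  # "All Thread" column
            | (enable & 1) << 22
            | (inv & 1) << 23         # invert the CMask comparison
            | (cmask & 0xFF) << 24)


# Two rows of Table B-3: stall cycles vs. stall occurrences differ only in Edge.
stalls_cycles = perfevtsel(0xB1, 0x3F, cmask=1, inv=1, edge=0, any_thread=1)
stalls_count = perfevtsel(0xB1, 0x3F, cmask=1, inv=1, edge=1, any_thread=1)
print(f"CORE_STALLS_CYCLES: {stalls_cycles:#010x}")
print(f"CORE_STALLS_COUNT:  {stalls_count:#010x}")
```

With CMask=1 and Inv=1 the counter increments in every cycle in which fewer than one uop was executed, which is how a uop-count event is turned into a stall-cycle count; adding Edge=1 counts the number of times a stall begins instead.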
B.4.1.1
Cycle Drill Down and Branch Mispredictions
While executed micro-ops are considered productive from the perspective of execution units being subscribed, not all such micro-ops contribute to forward progress of the program. Branch mispredictions can introduce execution inefficiencies in an OOO processor that are typically decomposed into three components:
• Wasted work associated with executing the uops of the incorrectly predicted path.
• Cycles lost when the pipeline is flushed of the incorrect uops.
• Cycles lost while waiting for the correct uops to arrive at the execution units.
In processors based on Intel microarchitecture code name Nehalem, there are no execution stalls associated with clearing the pipeline of mispredicted uops (component 2). These uops are simply removed from the pipeline without stalling execution or dispatch. This typically lowers the penalty for mispredicted branches. Further, the penalty associated with instruction starvation (component 3) can be measured.
The wasted work within executed uops consists of those uops that will never be retired. This is part of the cost associated with mispredicted branches. It can be found by monitoring the flow of uops through the pipeline. The uop flow can be measured at 3 points in Figure B-5: going into the RS with the event UOPS_ISSUED, going into the execution units with UOPS_EXECUTED, and at retirement with UOPS_RETIRED. The differences between the upstream measurements and the measurement at retirement give the wasted work associated with these mispredicted uops.
As UOPS_EXECUTED must be measured per core, rather than per thread, the wasted work per core is evaluated as:
Wasted Work = UOPS_EXECUTED.PORT234_CORE + UOPS_EXECUTED.PORT015_ALL_THREAD - UOPS_RETIRED.ANY_ALL_THREAD
The count above can be converted to cycles by dividing by the average issue rate of uops. The events above were designed to be used in this manner without corrections for micro fusion or macro fusion.
A “per thread” measurement can be made from the difference between the uops issued and uops retired, as both of these events can be counted per thread. It over counts slightly, by the mispredicted uops that are eliminated in the RS before they can waste cycles being executed, but this is usually a small correction:
Wasted Work/thread = (UOPS_ISSUED.ANY + UOPS_ISSUED.FUSED) - UOPS_RETIRED.ANY
Table B-4. Cycle Accounting of Wasted Work Due to Misprediction
Summary
Objective:           Evaluate uops that executed but not retired due to misprediction
Method:              Examine uop flow differences between execution and retirement
PMU-Pipeline Focus:  Micro-ops execute and retirement
Event code/Umask:    Event code B1H, Umask=3FH for micro-op execution; Event code C2H, Umask=1, AllThread=1 for per-core counting
EvtSelc:             Zero CMask, Invert, Edge fields to count uops
Basic Equation:      “Wasted work” = UOPS_EXECUTED.PORT234_CORE + UOPS_EXECUTED.PORT015_ALL_THREAD - UOPS_RETIRED.ANY_ALL_THREAD
Drill-down scope:    Counting: Branch misprediction cost
Variations:          Divide by average uop issue rate for cycle accounting. Set AllThread=0 to estimate per-thread cost.
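The wasted-work equations can be sketched as plain arithmetic on collected counts. In the sketch below the function names and sample values are hypothetical, and the average issue rate is assumed to be uops issued divided by unhalted cycles, which the text does not spell out.

```python
def wasted_work_core(port234_core: int, port015_all_thread: int,
                     retired_any_all_thread: int) -> int:
    """Per-core wasted uops: executed (all ports) minus retired."""
    return port234_core + port015_all_thread - retired_any_all_thread


def wasted_work_thread(issued_any: int, issued_fused: int, retired_any: int) -> int:
    """Per-thread estimate: issued minus retired (slightly over counts)."""
    return (issued_any + issued_fused) - retired_any


def wasted_cycles(wasted_uops: int, uops_issued: int, cycles: int) -> float:
    """Convert wasted uops to cycles by dividing by the average issue rate.

    Assumption: issue rate = uops issued per unhalted cycle.
    """
    issue_rate = uops_issued / cycles
    return wasted_uops / issue_rate


# Made-up counts, for illustration only.
core_waste = wasted_work_core(300_000, 800_000, 1_000_000)
thread_waste = wasted_work_thread(520_000, 30_000, 500_000)
cost = wasted_cycles(core_waste, uops_issued=1_000_000, cycles=500_000)
print(core_waste, thread_waste, cost)
```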
The third component of the misprediction penalty, instruction starvation, occurs when the instructions associated with the correct path are far away from the core and execution is stalled due to lack of uops in the RAT. The two primary causes of uops not being issued are front end starvation and resources not being available in the back end, so starvation can be measured explicitly at the output of the resource allocation as follows:
• Count the total number of cycles where no uops were issued to the OOO engine.
• Count the cycles where resources (RS, ROB entries, load buffer, store buffer, etc.) are not available for allocation.
If HT is not active, instruction starvation is simply the difference:
Instruction Starvation = UOPS_ISSUED.STALL_CYCLES - RESOURCE_STALLS.ANY
When HT is enabled, the uop delivery to the RS alternates between the two threads. In an ideal case the above expression would then over count, as 50% of the issuing stall cycles may be delivering uops for the other thread. We can modify the expression by subtracting the cycles in which the other thread is having uops issued:
Instruction Starvation (per thread) = UOPS_ISSUED.STALL_CYCLES - RESOURCE_STALLS.ANY - UOPS_ISSUED.ACTIVE_CYCLES_OTHER_THREAD
The per-thread expression above will over count somewhat because the resource_stall condition could exist on “this” thread while the other thread in the same core was issuing uops. An alternative might be:
CPU_CLK_UNHALTED.THREAD - UOPS_ISSUED.CORE_CYCLES_ACTIVE - RESOURCE_STALLS.ANY
The above technique is summarized in Table B-5.
Table B-5. Cycle Accounting of Instruction Starvation
Summary
Objective:           Evaluate cycles in which uops issuing is starved after misprediction
Method:              Examine cycle differences between uops issuing and resource allocation
PMU-Pipeline Focus:  Micro-ops issue and resource allocation
Event code/Umask:    Event code 0EH, Umask=1, for uops issued. Event code A2H, Umask=1, for resource allocation stall cycles
EvtSelc:             Set CMask=1, Inv=1 to count uops issue stall cycles. Set CMask=1, Inv=0 to count uops issue active cycles. Use AllThread=0 and AllThread=1 on two counters to evaluate the contribution from the other thread for UOPS_ISSUED.ACTIVE_CYCLES_OTHER_THREAD
Basic Equation:      “Instruction Starvation” (HT off) = UOPS_ISSUED.STALL_CYCLES - RESOURCE_STALLS.ANY
Drill-down scope:    Counting: Branch misprediction cost
Variations:          Evaluate per-thread contribution with Instruction Starvation = UOPS_ISSUED.STALL_CYCLES - RESOURCE_STALLS.ANY - UOPS_ISSUED.ACTIVE_CYCLES_OTHER_THREAD
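The three starvation expressions differ only in what is subtracted from the issue-stall cycles. The sketch below places them side by side; function names and counter values are hypothetical, and the other-thread active cycles are assumed to have been obtained already from the two-counter AllThread=0/AllThread=1 measurement described in the table.

```python
def starvation_ht_off(issue_stall_cycles: int, resource_stalls_any: int) -> int:
    """HT disabled: issue stalls not explained by back-end resource stalls."""
    return issue_stall_cycles - resource_stalls_any


def starvation_per_thread(issue_stall_cycles: int, resource_stalls_any: int,
                          other_thread_active_cycles: int) -> int:
    """HT enabled: also remove cycles spent issuing for the sibling thread."""
    return issue_stall_cycles - resource_stalls_any - other_thread_active_cycles


def starvation_alternative(clk_unhalted_thread: int, core_cycles_active: int,
                           resource_stalls_any: int) -> int:
    """Alternative form based on unhalted cycles and core-wide issue activity."""
    return clk_unhalted_thread - core_cycles_active - resource_stalls_any


# Made-up counts, for illustration only.
ht_off = starvation_ht_off(300_000, 120_000)
per_thread = starvation_per_thread(300_000, 120_000, 100_000)
alt = starvation_alternative(1_000_000, 800_000, 120_000)
print(ht_off, per_thread, alt)
```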
Details of performance events are shown in Table B-6.
Table B-6. CMask/Inv/Edge/Thread Granularity of Events for Micro-op Flow
Event Name                             Umask  Event Code  Cmask  Inv  Edge  All Thread
UOPS_EXECUTED.PORT234_CORE             80H    B1H         0      0    0     1
UOPS_EXECUTED.PORT015_ALL_THREAD       40H    B1H         0      0    0     1
UOPS_RETIRED.ANY_ALL_THREAD            1H     C2H         0      0    0     1
RESOURCE_STALLS.ANY                    1H     A2H         0      0    0     0
UOPS_ISSUED.ANY                        1H     0EH         0      0    0     0
UOPS_ISSUED.STALL_CYCLES               1H     0EH         1      1    0     0
UOPS_ISSUED.ACTIVE_CYCLES              1H     0EH         1      0    0     0
UOPS_ISSUED.CORE_CYCLES_ACTIVE         1H     0EH         1      0    0     1
B.4.1.2
Basic Block Drill Down
The event INST_RETIRED.ANY (instructions retired) is commonly used to evaluate a cycles/instruction ratio (CPI). Another important usage is to determine the performance-critical basic blocks by evaluating basic block execution counts.
In a sampling tool (such as VTune Analyzer), the samples tend to cluster around certain IP values. This is true when using INST_RETIRED.ANY or cycle counting events. A disassembly listing based on the hot samples may associate some instructions with high sample counts and adjacent instructions with no samples. Because all instructions within a basic block are retired exactly the same number of times, by the very definition of a basic block, drilling down the hot basic blocks will be more accurate when the sample counts are averaged over the instructions of the basic block:
Basic Block Execution Count = Sum (Sample counts of instructions within basic block) * Sample_after_value / (number of instructions in basic block)
Inspection of the disassembly listing to determine whether the basic blocks associated with a loop structure form a hot loop can be done systematically by adapting the technique above to evaluate the trip count of each loop construct. For a simple loop with no conditional branches, the trip count ends up being the ratio of the basic block execution count of the loop block to the basic block execution count of the block immediately before and/or after the loop block. Judicious use of averaging over multiple blocks can be used to improve the accuracy.
This will allow the user to identify loops with high trip counts on which to focus tuning efforts. This technique can be implemented using fixed counters.
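The averaging and trip-count estimate described above can be sketched as follows. The per-instruction sample counts and the sample-after value are made-up numbers, and the function names are hypothetical; a real tool would read these from its sampling database.

```python
def basic_block_exec_count(sample_counts: list[int], sample_after_value: int) -> float:
    """Average the samples over all instructions of one basic block."""
    return sum(sample_counts) * sample_after_value / len(sample_counts)


def trip_count(loop_block_count: float, neighbor_block_count: float) -> float:
    """Ratio of loop-body executions to executions of the block before/after it."""
    return loop_block_count / neighbor_block_count


SAV = 100_000  # assumed sample-after value for INST_RETIRED.ANY

# Samples cluster on a few IPs; instructions with zero samples still count.
loop_samples = [0, 12, 0, 28]   # 4-instruction loop body
entry_samples = [2, 0, 0, 0]    # 4-instruction block just before the loop

loop_count = basic_block_exec_count(loop_samples, SAV)
entry_count = basic_block_exec_count(entry_samples, SAV)
print(loop_count, entry_count, trip_count(loop_count, entry_count))
```

Including the zero-sample instructions in the divisor is the point of the technique: it removes the clustering bias that would otherwise make a single hot IP look like the whole block.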
Chains of dependent long-latency instructions (fmul, fadd, imul, etc.) can result in the dispatch being stalled while the outputs of the long latency instructions become available. In general there are no events that assist in counting such stalls, with the exception of instructions using the divide/sqrt execution unit. In such cases, the event ARITH can be used to count both the occurrences of these instructions and the duration in cycles that they kept their execution units occupied. The event ARITH.CYCLES_DIV_BUSY counts the cycles that the divide/sqrt execution unit was occupied.
B.4.2
Stall Cycle Decomposition and Core Memory Accesses
The decomposition of the stall cycles is accomplished through a standard approximation. It is assumed that the penalties occur sequentially for each performance impacting event. Consequently, the total loss of cycles available for useful work is then the number of events, Ni, times the average penalty for each type of event, Pi:
Counted_Stall_Cycles = Sum (Ni * Pi)
This only accounts for the performance impacting events that are or can be counted with a PMU event. Ultimately there will be several sources of stalls that cannot be counted; however, their total contribution can be estimated:
Unaccounted stall cycles = Stall_Cycles - Counted_Stall_Cycles = UOPS_EXECUTED.CORE_STALLS_CYCLES - Sum (Ni * Pi)_both_threads
The unaccounted component can become negative, as the sequential penalty model is overly simple and usually over counts the contributions of the individual microarchitectural issues.
As noted in Section B.4.1.1, UOPS_EXECUTED.CORE_STALLS_CYCLES counts on a per core basis rather than on a per thread basis, so the over counting can become severe. In such cases it may be preferable to use the port 0,1,5 uop stalls, as that can be done on a per thread basis:
Unaccounted stall cycles (per thread) = UOPS_EXECUTED.PORT015_THREADED_STALLS_CYCLES - Sum (Ni * Pi)
This unaccounted component is meant to represent the components that were either not counted due to lack of performance events or simply neglected during the data collection.
One can also choose to use the “retirement” point as the basis for stalls. The PEBS event, UOPS_RETIRED.STALL_CYCLES, has the advantage of being evaluated on a per thread basis and of having the hardware capture the IP associated with the retiring uop. This means that the IP distribution will not be affected by STI/CLI deferral of interrupts in critical sections of OS kernels, thus producing a more accurate profile of OS activity.
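The sequential penalty model can be sketched in a few lines. The event labels, counts and penalties below are hypothetical placeholders chosen only to show the bookkeeping, including how the unaccounted term can go negative when penalties overlap.

```python
def counted_stall_cycles(events: dict[str, tuple[int, float]]) -> float:
    """Sum(Ni * Pi) over all counted event types (count, average penalty)."""
    return sum(n * p for n, p in events.values())


def unaccounted_stall_cycles(stall_cycles: int,
                             events: dict[str, tuple[int, float]]) -> float:
    """Measured stall cycles minus the modeled, sequential penalties."""
    return stall_cycles - counted_stall_cycles(events)


# Hypothetical event classes: (number of events Ni, average penalty Pi in cycles).
events = {
    "l3_hit": (10_000, 40.0),
    "local_dram": (2_000, 180.0),
    "divide_busy": (5_000, 20.0),
}

counted = counted_stall_cycles(events)
print(counted)                                     # modeled stall cycles
print(unaccounted_stall_cycles(1_000_000, events)) # positive: stalls left unexplained
print(unaccounted_stall_cycles(800_000, events))   # negative: the model over counts
```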
B.4.2.1
Measuring Costs of Microarchitectural Conditions
Decomposition of stalled cycles in this manner should start by first focusing on conditions that carry a large performance penalty, for example, events with penalties of greater than 10 cycles. Short penalty events (P < 5 cycles) can frequently be hidden by the combined actions of the OOO execution and the compiler. The OOO engine manages both types of situations in the instruction stream and strives to keep the execution units busy during stalls of either type due to instruction dependencies. Usually, the large penalty operations are dominated by memory access and the very long latency instructions for divide and sqrt.
The largest penalty events are associated with load operations that require a cacheline which is not in L1 or L2 of the cache hierarchy. Not only must we count how many occur, but we need to know what penalty to assign.
The standard approach to measuring latency is to measure the average number of cycles a request is in a queue:
Latency = Sum (CYCLES_Queue_entries_outstanding) / Queue_inserts
where “queue_inserts” refers to the total number of entries that caused the outstanding cycles in that queue. However, the penalty associated with each queue insert (i.e. cache miss) is the latency divided by the average queue occupancy. This correction is needed to avoid over counting associated with overlapping penalties.
Avg_Queue_Depth = Sum (CYCLES_Queue_entries_outstanding) / Cycles_Queue_not_empty
Then the penalty (cost) of each occurrence is
Penalty = Latency / Avg_Queue_Depth = Cycles_Queue_not_empty / Queue_inserts
An alternative way of thinking about this is to realize that the sum of all the penalties, for an event that occupies a queue for its duration, cannot exceed the time that the queue is not empty:
Cycles_Queue_not_empty = Events * <Penalty>
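The relationship between latency, average queue depth and penalty can be checked on a small synthetic trace. The sketch below models a queue by its occupancy in each cycle; the trace (two 100-cycle misses that overlap for 50 cycles) and the function name are illustrative assumptions.

```python
def queue_penalty(occupancy_per_cycle: list[int], queue_inserts: int):
    """Return (latency, average queue depth, penalty per insert) in cycles."""
    outstanding = sum(occupancy_per_cycle)            # Sum(CYCLES_Queue_entries_outstanding)
    not_empty = sum(1 for occ in occupancy_per_cycle if occ > 0)
    latency = outstanding / queue_inserts
    avg_depth = outstanding / not_empty
    penalty = not_empty / queue_inserts               # equals latency / avg_depth
    return latency, avg_depth, penalty


# Two misses of 100 cycles each, overlapping for 50 cycles.
trace = [1] * 50 + [2] * 50 + [1] * 50
latency, avg_depth, penalty = queue_penalty(trace, queue_inserts=2)
print(latency, avg_depth, penalty)
```

Each miss takes 100 cycles, but because the two overlap, the program is only exposed to 150 cycles in total, so charging the full latency per miss would over count by 50 cycles.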
The standard techniques described above are simple conceptually. In practice, the large amount of memory references in the workload and the wide range of varying state/location-specific latencies make standard sampling techniques less practical. Using precise-event-based sampling (PEBS) is the preferred technique on processors based on Intel microarchitecture code name Nehalem.
Profiling the penalty by sampling (to localize the measurement in IP) is likely to have accuracy difficulties. Since the latencies for L2 misses can vary from 40 to 400 cycles, collecting the number of required samples will tend to be invasive.
The use of the precise latency event, which will be discussed later, provides a more accurate and flexible measurement technique when sampling is used. As each sample records both a load to use latency and a data source, the average latency per data source can be evaluated. Further, as the PEBS hardware supports buffering the events without generating a PMI until the buffer is full, it is possible to make such an evaluation efficient without perturbing the workload intrusively.
A number of performance events in the core PMU can be used to measure the costs of memory accesses that originated in the core and experienced delays due to various conditions, locality, or traffic due to cache coherence requirements. The latency of memory accesses varies, depending on locality of L3, DRAM attached to the local memory controller or remote controller, and cache coherency factors. Some examples of the approximate latency values are shown in Table B-7.
Table B-7. Approximate Latency of L2 Misses of Intel Xeon Processor 5500
Data Source                         Latency
L3 hit, Line exclusive              ~ 42 cycles
L3 Hit, Line shared                 ~ 63 cycles
L3 Hit, modified in another core    ~ 73 cycles
Remote L3                           100 - 150 cycles
Local DRAM                          ~ 50 ns
Remote DRAM                         ~ 90 ns
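Table B-7 mixes cycles and nanoseconds, so using the DRAM rows as penalties in a cycle accounting requires a conversion. The sketch below shows that conversion; the 3.0 GHz core frequency is purely an assumed example value, not a figure from the table.

```python
def ns_to_cycles(latency_ns: float, core_ghz: float) -> float:
    """Convert a latency in nanoseconds to core cycles (GHz = cycles per ns)."""
    return latency_ns * core_ghz


ASSUMED_CORE_GHZ = 3.0  # hypothetical core frequency, for illustration only

local_dram_cycles = ns_to_cycles(50, ASSUMED_CORE_GHZ)
remote_dram_cycles = ns_to_cycles(90, ASSUMED_CORE_GHZ)
print(local_dram_cycles, remote_dram_cycles)
```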
B.4.3
Core PMU Precise Events
The Precise Event Based Sampling (PEBS) mechanism enables the PMU to capture the architectural state and IP at the completion of the instruction that caused the event. This provides two significant benefits for profiling and tuning:
• The location of the eventing condition in the instruction space can be accurately profiled.
• Instruction arguments can be reconstructed in a post processing phase, using captured PEBS records of the register states.
The PEBS capability has been greatly expanded in processors based on Intel microarchitecture code name Nehalem, covering a larger number and more types of precise events. The mechanism works by using the counter overflow to arm the PEBS data acquisition. Then on the next event, the data is captured and the interrupt is raised. The captured IP value is sometimes referred to as IP + 1, because at the completion of the instruction, the IP value is that of the next instruction. By their very nature precise events must be “at-retirement” events. For the purposes of this discussion the precise events are divided into Memory Access events, associated with the retirement of loads and stores, and Execution Events, associated with the retirement of all instructions or specific non memory instructions (branches, FP assists, SSE uops).
B.4.3.1
Precise Memory Access Events
There are two important properties common to all precise memory access events:
• The exact instruction can be identified because the hardware captures the IP of the offending instruction. Of course the captured IP is that of the following instruction, but one simply moves the samples up one instruction. This works even when the recorded IP points to the first instruction of a basic block, because in such a case the offending instruction has to be the last instruction of the previous basic block, as branch instructions never load or store data.
• Instruction arguments can be reconstructed in a post processing phase, using captured PEBS records of the register states. The PEBS buffer contains the values of all 16 general registers, R1-R16, where R1 is also called RAX. When coupled with the disassembly, the address of the load or store can be reconstructed and used for data access profiling. The Intel® Performance Tuning Utility does exactly this, providing a wide variety of powerful analysis techniques.
Precise memory access events mainly focus on loads, as those are the events typically responsible for the very long duration execution stalls. They are broken down by the data source, thereby indicating the typical latency and the data locality in the intrinsically NUMA configurations. These precise load events are the only L2, L3 and DRAM access events that only count loads. All others will also include the L1D and/or L2 hardware prefetch requests. Many will also include RFO requests, both due to stores and to the hardware prefetchers.
All four general counters can be programmed to collect data for precise events. The ability to reconstruct the virtual addresses of the load and store instructions allows an analysis of the cacheline and page usage efficiency. Even though cachelines and pages are defined by physical address, the lower order bits are identical, so the virtual address can be used.
As the PEBS mechanism captures the values of the registers at completion of the instruction, one should be aware that a pointer-chasing type of load operation will not be captured, because it is not possible to infer the load instruction from the dereferenced address.
The basic PEBS memory access events fall into the following categories:
• MEM_INST_RETIRED: This category counts instructions retired which contain a load operation. It is selected by event code 0BH.
• MEM_LOAD_RETIRED: This category counts retired load instructions that experienced a specific condition selected by the Umask value. The event code is 0CBH.
• MEM_UNCORE_RETIRED: This category counts memory instructions retired that received data from the uncore sub-system. It is selected by event code 0FH.
• MEM_STORE_RETIRED: This category counts instructions retired which contain a store operation. It is selected by event code 0CH.
• ITLB_MISS_RETIRED: This counts instructions retired which missed the ITLB. It is selected by event code 0C8H.
Umask values and associated name suffixes for the above PEBS memory events are listed in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
The precise events listed above allow load driven cache misses to be identified by data source. This does not identify the “home” location of the cachelines with respect to the NUMA configuration. The exceptions to this statement are the events MEM_UNCORE_RETIRED.LOCAL_DRAM and MEM_UNCORE_RETIRED.NON_LOCAL_DRAM. These can be used in conjunction with instrumented malloc invocations to identify the NUMA “home” for the critical contiguous buffers used in an application.
The sum of all the MEM_LOAD_RETIRED events will equal the MEM_INST_RETIRED.LOADS count.
A count of L1D misses can be achieved with the use of all the MEM_LOAD_RETIRED events, except MEM_LOAD_RETIRED.L1D_HIT. It is better to use all of the individual MEM_LOAD_RETIRED events to do this, rather than the difference MEM_INST_RETIRED.LOADS - MEM_LOAD_RETIRED.L1D_HIT, because while the total counts of precise events will be correct, and they will correctly identify instructions that caused the event in question, the distribution of the events may not be correct due to PEBS SHADOWING, discussed later in this section.
L1D_MISSES = MEM_LOAD_RETIRED.HIT_LFB + MEM_LOAD_RETIRED.L2_HIT + MEM_LOAD_RETIRED.L3_UNSHARED_HIT + MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM + MEM_LOAD_RETIRED.L3_MISS
The MEM_LOAD_RETIRED.L3_UNSHARED_HIT event merits some explanation. The inclusive L3 has a bit pattern to identify which core has a copy of the line. If the only bit set is for the requesting core (unshared hit) then the line can be returned from the L3 with no snooping of the other cores. If multiple bits are set, then the line is in a shared state and the copy in the L3 is current and can also be returned without snooping the other cores.
If the line is read for ownership (RFO) by another core, this will put the copy in the L3 into an exclusive state. If the line is then modified by that core and later evicted, the written back copy in the L3 will be in a modified state and snooping will not be required. MEM_LOAD_RETIRED.L3_UNSHARED_HIT counts all of these. The event should really have been called MEM_LOAD_RETIRED.L3_HIT_NO_SNOOP.
The event MEM_LOAD_RETIRED.L3_HIT_OTHER_CORE_HIT_HITM could, for similar reasons, more intuitively have been named MEM_LOAD_RETIRED.L3_HIT_SNOOP.
When a modified line is retrieved from another socket it is also written back to memory. This causes remote HITM accesses to appear as coming from the home DRAM. The MEM_UNCORE_RETIRED.LOCAL_DRAM and MEM_UNCORE_RETIRED.REMOTE_DRAM events thus also count the L3 misses satisfied by modified lines in the caches of the remote socket.
There is a difference in the behavior of MEM_LOAD_RETIRED.DTLB_MISSES with respect to that on Intel® Core™2 processors. Previously the event only counted the first miss to the page, as do the imprecise events. The event now counts all loads that result in a miss, thus it includes the secondary misses as well.
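The L1D_MISSES sum given earlier in this section can be sketched as a small helper over a dictionary of collected counts. The event names are those used in that equation; the helper name and the count values are made up for illustration.

```python
# Component events of the L1D_MISSES equation (every MEM_LOAD_RETIRED
# sub-event named there, i.e. excluding L1D_HIT).
L1D_MISS_COMPONENTS = (
    "MEM_LOAD_RETIRED.HIT_LFB",
    "MEM_LOAD_RETIRED.L2_HIT",
    "MEM_LOAD_RETIRED.L3_UNSHARED_HIT",
    "MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM",
    "MEM_LOAD_RETIRED.L3_MISS",
)


def l1d_misses(counts: dict[str, int]) -> int:
    """Sum the individual precise events rather than subtracting L1D_HIT."""
    return sum(counts[name] for name in L1D_MISS_COMPONENTS)


# Made-up counts, for illustration only.
counts = {
    "MEM_LOAD_RETIRED.L1D_HIT": 9_000_000,
    "MEM_LOAD_RETIRED.HIT_LFB": 300_000,
    "MEM_LOAD_RETIRED.L2_HIT": 500_000,
    "MEM_LOAD_RETIRED.L3_UNSHARED_HIT": 120_000,
    "MEM_LOAD_RETIRED.OTHER_CORE_HIT_HITM": 30_000,
    "MEM_LOAD_RETIRED.L3_MISS": 50_000,
}

total_misses = l1d_misses(counts)
print(total_misses)
```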
B.4.3.2
Load Latency Event
Intel processors based on the Intel microarchitecture code name Nehalem provide support for the “load latency event”, MEM_INST_RETIRED with event code 0BH and Umask value of 10H (LATENCY_ABOVE_THRESHOLD). This event samples loads, recording the number of cycles between the execution of the instruction and actual delivery of the data. If the measured latency is larger than the minimum latency programmed into MSR 0x3f6, bits 15:0, then the counter is incremented.
Counter overflow arms the PEBS mechanism and on the next event satisfying the latency threshold, the PMU writes the measured latency, the virtual or linear address, and the data source into a PEBS record format in the PEBS buffer. Because the virtual address is captured into a known location, the sampling driver could also execute a virtual to physical translation and capture the physical address. The physical address identifies the NUMA home location and in principle allows an analysis of the details of the cache occupancies.
Further, as the address is captured before retirement, even pointer chasing encodings such as “MOV RAX, [RAX+const]” have their addresses captured. Because the MSR_PEBS_LD_LAT_THRESHOLD MSR is required to specify the latency threshold value, only one minimum latency value can be sampled on a core during a given period. To enable this, the Intel performance tools restrict the programming of this event to counter 4 to simplify the scheduling.
Table B-8 lists a few examples of event programming configurations used by the Intel® PTU and VTune™ Performance Analyzer for the load latency events. Different threshold values for the minimum latencies are specified in MSR_PEBS_LD_LAT_THRESHOLD (address 0x3f6).
Table B-8. Load Latency Event Programming
Load Latency Precise Events                       MSR 0x3F6  Umask  Event Code
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_4        4          10H    0BH
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_8        8          10H    0BH
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_10       16         10H    0BH
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_20       32         10H    0BH
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_40       64         10H    0BH
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_80       128        10H    0BH
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_100      256        10H    0BH
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_200      512        10H    0BH
MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_8000     32768      10H    0BH
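In Table B-8 the suffix of each event name appears to be the threshold written in hexadecimal, while the MSR 0x3F6 column gives the same value in decimal (for example, the suffix 10 pairs with 16). The sketch below builds one table row from a threshold under that reading; the function and dictionary keys are hypothetical, and the sketch only prepares values and does not write any MSR.

```python
def load_latency_config(threshold_cycles: int) -> dict:
    """Build the programming values for one load latency threshold.

    The threshold must fit in bits 15:0 of MSR_PEBS_LD_LAT_THRESHOLD (0x3F6).
    """
    if not 0 < threshold_cycles <= 0xFFFF:
        raise ValueError("threshold must fit in bits 15:0 of MSR 0x3F6")
    return {
        # Name suffix is the threshold in hex, as the table rows suggest.
        "event_name": f"MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_{threshold_cycles:X}",
        "event_code": 0x0B,
        "umask": 0x10,
        "msr_0x3f6": threshold_cycles,
    }


for cycles in (4, 16, 32, 32768):
    cfg = load_latency_config(cycles)
    print(cfg["event_name"], cfg["msr_0x3f6"])
```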
One of the three fields written to each PEBS record by the PEBS assist mechanism of the load latency event encodes the data source locality information.
Table B-9. Data Source Encoding for Load Latency PEBS Record
Encoding  Description
0x0       Unknown L3 cache miss.
0x1       Minimal latency core cache hit. This request was satisfied by the L1 data cache.
0x2       Pending core cache HIT. Outstanding core cache miss to same cache-line address was already underway. The data is not yet in the data cache, but is located in a fill buffer that will soon be committed to cache.
0x3       This data request was satisfied by the L2.
0x4       L3 HIT. Local or Remote home requests that hit L3 cache in the uncore with no coherency actions required (snooping).
0x5       L3 HIT (other core hit snoop). Local or Remote home requests that hit the L3 cache and was serviced by another processor core with a cross core snoop where no modified copies were found. (Clean).
0x6       L3 HIT (other core HITM). Local or Remote home requests that hit the L3 cache and was serviced by another processor core with a cross core snoop where modified copies were found. (HITM).
0x7       Reserved.
0x8       L3 MISS (remote cache forwarding). Local homed requests that missed the L3 cache and was serviced by forwarded data following a cross package snoop where no modified copies found. (Remote home requests are not counted).
0x9       Reserved.
0xA       L3 MISS (local DRAM go to S). Local home requests that missed the L3 cache and was serviced by local DRAM (go to shared state).
0xB       L3 MISS (remote DRAM go to S). Remote home requests that missed the L3 cache and was serviced by remote DRAM (go to shared state).
0xC       L3 MISS (local DRAM go to E). Local home requests that missed the L3 cache and was serviced by local DRAM (go to exclusive state).
0xD       L3 MISS (remote DRAM go to E). Remote home requests that missed the L3 cache and was serviced by remote DRAM (go to exclusive state).
0xE       I/O, Request of input/output operation.
0xF       The request was to un-cacheable memory.
The latency event is the recommended method to measure the penalties for a cycle accounting decomposition. Each time a PMI is raised by this PEBS event, a load to use latency and a data source for the cacheline are recorded in the PEBS buffer. The data source for the cacheline can be deduced from the low order 4 bits of the data source field and the table shown above. Thus an average latency for each of the 16 sources can be evaluated from the collected data. As only one minimum latency at a time can be collected, it may be awkward to evaluate the latency for an MLC hit and a remote socket DRAM access. A minimum latency of 32 cycles should, however, give a reasonable distribution for all the offcore sources. The Intel® PTU version 3.2 performance tool can display the latency distribution in the data profiling mode and allows sophisticated event filtering capabilities for this event.
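The per-source averaging described above can be sketched as a post-processing step over decoded samples. The sketch assumes each sample has already been reduced to a (latency, data source field) pair; the record values are made up, the label strings are shortened paraphrases of the Table B-9 descriptions, and the function name is hypothetical.

```python
from collections import defaultdict

# Short labels for the Table B-9 encodings (low-order 4 bits of the field).
DATA_SOURCE = {
    0x0: "Unknown L3 cache miss", 0x1: "L1 data cache hit",
    0x2: "Pending core cache hit (fill buffer)", 0x3: "L2 hit",
    0x4: "L3 hit, no snoop", 0x5: "L3 hit, other core hit snoop",
    0x6: "L3 hit, other core HITM", 0x7: "Reserved",
    0x8: "L3 miss, remote cache forwarding", 0x9: "Reserved",
    0xA: "L3 miss, local DRAM (S)", 0xB: "L3 miss, remote DRAM (S)",
    0xC: "L3 miss, local DRAM (E)", 0xD: "L3 miss, remote DRAM (E)",
    0xE: "I/O", 0xF: "Un-cacheable memory",
}


def average_latency_by_source(records):
    """Average load-to-use latency per data source encoding."""
    totals = defaultdict(lambda: [0, 0])  # encoding -> [latency sum, sample count]
    for latency, data_source_field in records:
        entry = totals[data_source_field & 0xF]  # only the low 4 bits encode the source
        entry[0] += latency
        entry[1] += 1
    return {DATA_SOURCE[src]: total / n for src, (total, n) in sorted(totals.items())}


# Made-up (latency in cycles, data source field) samples, for illustration only.
records = [(45, 0x4), (39, 0x4), (180, 0xA), (220, 0xA), (75, 0x6)]
averages = average_latency_by_source(records)
for source, avg in averages.items():
    print(f"{source}: {avg:.1f} cycles")
```

The resulting averages are the kind of values that could serve as the per-event penalties Pi in the stall cycle decomposition.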
B.4.3.3
Precise Execution Events
PEBS capabilit y in core PMU goes beyond load and st ore inst ruct ions. Branches, near calls and condit ional branches can all be count ed wit h precise event s, for bot h ret ired and m ispredict ed ( and ret ired) branches of t he t ype select ed. For t hese event s, t he PEBS buffer will cont ain t he t arget of t he branch. I f t he Last Branch Record ( LBR) is also capt ured t hen t he locat ion of t he branch inst ruct ion can also be det erm ined. When t he branch is t aken t he I P value in t he PEBS buffer will also appear as t he last t arget in t he LBR. I f t he branch was not t aken ( condit ional branches only) t hen it won’t and t he branch t hat was not t aken and ret ired is t he inst ruct ion before t he I P in t he PEBS buffer. I n t he case of near calls ret ired, t his m eans t hat Event Based Sam pling ( EBS) can be used t o collect accurat e funct ion call count s. As t his is t he prim ary m easurem ent for driving t he decision t o inline a funct ion, t his is an im port ant im provem ent . I n order t o m easure call count s, you m ust sam ple on calls. Any ot her t rigger int roduces a bias t hat cannot be guarant eed t o be correct ed properly. The precise branch event s can be found under event code C4H in Chapt er 19, “ Perform ance Monit oring Event s” of I nt el® 64 and I A- 32 Archit ect ures Soft ware Developer’s Manual, Volum e 3B. There is one source of sam pling art ifact associat ed wit h precise event s. I t is due t o t he t im e delay bet ween t he PMU count er overflow and t he arm ing of t he PEBS hardware. During t his period event s cannot be det ect ed due t o t he t im ing shadow. To illust rat e t he effect , consider a funct ion call chain where a long durat ion funct ion, “ foo”, which calls a chain of 3 very short durat ion funct ions, “ foo1” calling “ foo2” which calls “ foo3”, followed by a long durat ion funct ion “ foo4”. 
If the durations of foo1, foo2 and foo3 are less than the shadow period, the distribution of PEBS sampled calls will be severely distorted. For example:
• If the overflow occurs on the call to foo, the PEBS mechanism is armed by the time the call to foo1 is executed and samples will be taken showing the call to foo1 from foo.
• If the overflow occurs due to the call to foo1, foo2 or foo3 however, the PEBS mechanism will not be armed until execution is in the body of foo4. Thus the calls to foo2, foo3 and foo4 cannot appear as PEBS sampled calls.
Shadowing can affect the distribution of all PEBS events. It will also affect the distribution of basic block execution counts identified by using the combination of a branch retired event (PEBS or not) and the last entry in the LBR. If there were no delay between the PMU counter overflow and the LBR freeze, the last LBR entry could be used to sample taken retired branches and, from that, the basic block execution counts. All the instructions between the last taken branch and the previous target are executed once. Such a sampling could be used to generate a “software” instruction retired event with uniform sampling, which in turn can be used to identify basic block execution counts, since all the instructions in a basic block are by definition executed the same number of times. Unfortunately the shadowing causes the branches at the end of short basic blocks to not be the last entry in the LBR, distorting the measurement.
USING PERFORMANCE MONITORING EVENTS
The shadowing effect on call counts and basic block execution counts can be alleviated to a large degree by averaging over the entries in the LBR. This will be discussed in the section on LBRs.
Typically, branches account for more than 10% of all instructions in a workload, and loop optimization needs to focus on those loops with high tripcounts. For counted loops, it is very common for the induction variable to be compared to the tripcount in the termination condition evaluation. This is particularly true if the induction variable is used within the body of the loop, even in the face of heavy optimization. Thus a loop sequence unrolled eight times may resemble:
    add   rcx, 8
    cmp   rcx, rax
    jnge  triad+0x27
In this case the two registers, rax and rcx, are the tripcount and induction variable. If the PEBS buffer is captured for the conditional branches retired event, the average values of the two registers in the compare can be evaluated. The one with the larger average will be the tripcount. Thus the average, RMS, min and max of the tripcount can be evaluated, and even a distribution of the recorded values.
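The register comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name tripcount_stats and the sample values are hypothetical, and the input is assumed to be a list of (rax, rcx) pairs already extracted from PEBS records by some collection tool.

```python
import math

def tripcount_stats(samples):
    """Summarize the tripcount from (rax, rcx) pairs taken from PEBS records.

    The register with the larger average is treated as the tripcount,
    following the rule stated in the text.
    """
    samples = list(samples)
    if not samples:
        raise ValueError("no PEBS samples")
    n = len(samples)
    mean_rax = sum(rax for rax, _ in samples) / n
    mean_rcx = sum(rcx for _, rcx in samples) / n
    idx = 0 if mean_rax >= mean_rcx else 1
    trip = [s[idx] for s in samples]
    return {
        "register": "rax" if idx == 0 else "rcx",
        "mean": sum(trip) / n,
        "rms": math.sqrt(sum(t * t for t in trip) / n),
        "min": min(trip),
        "max": max(trip),
    }

# Hypothetical register values from three sampled back-edge branches.
stats = tripcount_stats([(1000, 8), (1000, 504), (2000, 1200)])
print(stats["register"], stats["min"], stats["max"])
```

A histogram of the selected register's values would give the distribution mentioned in the text.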
B.4.3.4
Last Branch Record (LBR)
The LBR captures the source and target of each retired taken branch. Processors based on Intel microarchitecture code name Nehalem can track 16 pairs of source/target addresses in a rotating buffer. Filtering of the branch instructions by type and privilege level is permitted using a dedicated facility, MSR_LBR_SELECT. This means that the LBR mechanism can be programmed to capture branches occurring at ring 0 or ring 3 or both (default) privilege levels. Further, the types of taken branches that are recorded can also be filtered. The list of filtering options that can be specified using MSR_LBR_SELECT is described in Chapter 17, “Debug, Branch Profile, TSC, and Quality of Service” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
The default is to capture all branches at all privilege levels (all bits zero). Another reasonable programming would set all bits to 1 except bit 1 (capture ring 3), bit 3 (capture near calls), and bits 6 and 7. This would leave only ring 3 calls and unconditional jumps in the LBR. Such a programming would result in the LBR having the last 16 taken calls and unconditional jumps retired and their targets in the buffer. A PMU sampling driver could then capture this restricted “call chain” with any event, thereby providing a “call tree” context. The inclusion of the unconditional jumps will unfortunately cause problems, particularly when there are if-else structures within loops. In the case of frequent function calls at all levels, returns could be included to clarify the context. However, this would reduce the call chain depth that could be captured.
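As a rough illustration of the "all bits set except bits 1, 3, 6 and 7" programming, the following Python sketch computes the resulting MSR_LBR_SELECT value. The bit names and the assumption that nine filter bits (0 through 8) are defined reflect one reading of the Software Developer's Manual and should be checked against Chapter 17 before use; the helper lbr_select is hypothetical.

```python
# Assumed MSR_LBR_SELECT layout (verify against the SDM): a set bit
# suppresses capture of the corresponding class of branches.
LBR_SELECT_BITS = {
    "CPL_EQ_0": 0,       # branches at ring 0
    "CPL_NEQ_0": 1,      # branches at ring 3
    "JCC": 2,            # conditional branches
    "NEAR_REL_CALL": 3,  # near relative calls
    "NEAR_IND_CALL": 4,  # near indirect calls
    "NEAR_RET": 5,       # near returns
    "NEAR_IND_JMP": 6,   # near indirect jumps
    "NEAR_REL_JMP": 7,   # near relative jumps
    "FAR_BRANCH": 8,     # far branches
}

def lbr_select(keep_bits):
    """Set every defined filter bit except the positions in keep_bits."""
    value = 0
    for pos in LBR_SELECT_BITS.values():
        if pos not in keep_bits:
            value |= 1 << pos
    return value

# Leave bits 1, 3, 6 and 7 clear, as described in the text.
call_chain_filter = lbr_select({1, 3, 6, 7})
print(hex(call_chain_filter))  # 0x135 under the assumed 9-bit layout
```

Passing every defined bit position in keep_bits yields 0, which corresponds to the default of capturing all branches at all privilege levels.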
A fairly obvious usage would be to trigger the sampling on extremely long latency loads, to enrich the sample with accesses to heavily contended locked variables, and then capture the call chain to identify the context of the lock usage.
Call Counts and Function Arguments
If the LBRs are captured for PMIs triggered by the BR_INST_RETIRED.NEAR_CALL event, then the call count per calling function can be determined by simply using the last entry in the LBR. As the PEBS IP will equal the last target IP in the LBR, it is the entry point of the called function. Similarly, the last source in the LBR buffer was the call site from within the calling function. If the full PEBS record is captured as well, then for functions with limited numbers of arguments on 64-bit OSs, you can sample both the call counts and the function arguments.
LBRs and Basic Block Execution Counts
Another interesting usage is to use the BR_INST_RETIRED.ALL_BRANCHES event and the LBRs with no filter to evaluate the execution rate of basic blocks. As the LBRs capture all taken branches, all the basic blocks between a branch IP (source) and the previous target in the LBR buffer were executed one time. Thus a simple way to evaluate the basic block execution counts for a given load module is to make a map of the starting locations of every basic block. Then for each sample triggered by the PEBS collection of BR_INST_RETIRED.ALL_BRANCHES, starting from the PEBS address (a target but perhaps for a not taken branch and thus not necessarily in the LBR buffer) and walking backwards through the LBRs until finding an address not corresponding to the load module of interest, count all the basic blocks that were executed. Calling this value “number_of_basic_blocks”, increment the execution counts for all of those blocks by 1/(number_of_basic_blocks). This technique also yields the taken and not taken rates for the active branches. All branch instructions between a source IP and the previous target IP (within the same module) were not taken, while the branches listed in the LBR were taken. This is illustrated in the graphics below.
[Figure: two consecutive LBR records, each with a “From” and a “To” field. The first record holds Branch_0 -> Target_0 and the second holds Branch_1 -> Target_1. Annotations: “All instructions between Target_0 and Branch_1 are retired 1 time for each event count”; “All basic blocks between Target_0 and Branch_1 are executed 1 time for each event count”; “All branch instructions between Target_0 and Branch_1 are not taken”.]
Figure B-6. LBR Records and Basic Blocks
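The weighting scheme for basic block execution counts can be sketched as follows. This is a simplified illustration rather than a full implementation: it assumes the LBR entries are given oldest to newest as (source, target) pairs, that all of them already lie within the module of interest, and it ignores the extra step of starting from the PEBS address. The names blocks_in_range and accumulate_sample and all addresses are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def blocks_in_range(block_starts, first_addr, last_addr):
    """Basic blocks whose start address lies in [first_addr, last_addr]."""
    lo = bisect_left(block_starts, first_addr)
    hi = bisect_right(block_starts, last_addr)
    return block_starts[lo:hi]

def accumulate_sample(counts, block_starts, lbr):
    """Add one sample's LBR trajectory to the per-block execution counts.

    block_starts: sorted start addresses of every basic block in the module.
    lbr: list of (source, target) pairs, oldest first.
    """
    executed = []
    for (_, target), (next_source, _) in zip(lbr, lbr[1:]):
        if next_source < target:
            continue  # not a straight-line run; skip this pair
        # Everything from a target up to the next taken branch ran once.
        executed.extend(blocks_in_range(block_starts, target, next_source))
    if not executed:
        return
    weight = 1.0 / len(executed)  # 1/(number_of_basic_blocks)
    for start in executed:
        counts[start] += weight

# Hypothetical module with four basic blocks and a three-entry trajectory.
block_starts = [0x100, 0x110, 0x120, 0x140]
counts = defaultdict(float)
accumulate_sample(counts, block_starts,
                  [(0x14F, 0x100), (0x11C, 0x140), (0x14F, 0x100)])
```

Each sample contributes a total weight of 1 spread over the blocks on its trajectory, so summing the counts over many samples gives relative execution rates.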
The 16 sets of LBR records can help rectify the artifact of PEBS samples aggregating disproportionately to certain instructions in the sampling process. The situation of a skewed distribution of PEBS samples is illustrated below in Figure B-7. Consider a number of basic blocks in the flow of normal execution, where some basic blocks take 20 cycles to execute, others take 2 cycles, and shadowing takes 10 cycles. Each time an overflow condition occurs, the delay before PEBS is armed is at least 10 cycles. Once PEBS is armed, a PEBS record is captured on the next eventing condition. The resulting distribution of sampled instruction addresses using PEBS records will be skewed, as shown in the middle of Figure B-7. In this conceptual example, we assume every branch is taken in these basic blocks. In the skewed distribution of PEBS samples, the branch IP of the last basic block will be recorded 5 times as much as the least sampled branch IP address (the 2nd basic block).
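[Figure: three panels. “Cycle Flow” shows a sequence of basic blocks taking 20, 2, 2, 2, 2, 20 and 20 cycles, annotated with O, P and C markers. “PEBS Sample Distribution” shows sample counts of 0, N, 0, 0, 0, 0 and 5N. “Branch IP Distribution in LBR Trajectory” shows counts of 16N, 16N, 16N, 17N, 18N, 19N and 20N.]
O: overflow; P: PEBS armed; C: interrupt occurs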
Figure B-7. Using LBR Records to Rectify Skewed Sample Distribution
In this situation some basic blocks would appear to never get samples and some would have many times too many. Weighting each entry by 1/(number of basic blocks in the LBR trajectory) would, in this example, result in dividing the numbers in the rightmost table by 16. Thus we end up with far more accurate execution counts ((1.25 -> 1.0) * N) in all of the basic blocks, even those that never directly caused a PEBS sample.
As on Intel® Core™2 processors, there is a precise instructions retired event that can be used in a wide variety of ways. In addition there are precise events for uops_retired, various SSE instruction classes, and FP assists. It should be noted that the FP assist events only detect x87 FP assists, not those involving SSE FP instructions. Detecting all assists will be discussed in the section on the pipeline Front End.
The instructions retired event has a few special uses. While its distribution is not uniform, the totals are correct. If the values recorded for all the instructions in a basic block are averaged, a measure of the basic block execution count can be extracted. The ratios of basic block executions can be used to estimate loop tripcounts when the counted loop technique discussed above cannot be applied.
The PEBS version (general counter) of the instructions retired event can further be used to profile OS execution accurately even in the face of STI/CLI semantics, because the PEBS interrupt then occurs after the critical section has completed, but the data was frozen correctly. If the cmask value is set to some very high value and the invert condition is applied, the result is always true, and the event will count core cycles (halted + unhalted). Consequently both cycles and instructions retired can be accurately profiled.
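The cmask-plus-invert trick can be illustrated by composing an IA32_PERFEVTSELx value in Python. The field positions used here (event select in bits 7:0, umask in bits 15:8, USR bit 16, OS bit 17, INT bit 20, EN bit 22, INV bit 23, cmask in bits 31:24) follow the architectural layout in the Software Developer's Manual. The helper name perfevtsel, the umask of 01H for the instructions retired event, and the choice of FFH as the "very high" cmask are assumptions made for this sketch.

```python
def perfevtsel(event, umask, *, user=True, kernel=True, int_en=True,
               enable=True, inv=False, cmask=0):
    """Compose an IA32_PERFEVTSELx value from its fields."""
    if not (0 <= event <= 0xFF and 0 <= umask <= 0xFF and 0 <= cmask <= 0xFF):
        raise ValueError("event, umask and cmask are 8-bit fields")
    value = event | (umask << 8) | (cmask << 24)
    value |= (int(user) << 16) | (int(kernel) << 17) | (int(int_en) << 20)
    value |= (int(enable) << 22) | (int(inv) << 23)
    return value

# Instructions retired (event code C0H; umask 01H assumed) with a large
# cmask and the invert bit set: "fewer than 255 events this cycle" holds
# on every cycle, so the counter advances once per cycle.
cycles_like = perfevtsel(0xC0, 0x01, inv=True, cmask=0xFF)
print(hex(cycles_like))
```

The same helper with inv=False and cmask=0 gives the plain instructions retired programming, so both profiles can be collected with one event code.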
The UOPS_RETIRED.ANY event, which is also precise, can also be used to profile Ring 0 execution and really gives a more accurate display of execution. The precise events available for this purpose are listed under event codes C0H, C2H, C7H, F7H in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
B.4.3.5
Measuring Core Memory Access Latency
Drilling down performance issues associated with locality or cache coherence will require using performance monitoring events. In each processor core, there is a super queue that allocates entries to buffer requests of memory access traffic due to an L2 miss to the uncore sub-system. Table B-10 lists various performance events available in the core PMU that can drill down performance issues related to L2 misses.
Table B-10. Core PMU Events to Drill Down L2 Misses
Core PMU Events                          Umask   Event Code
OFFCORE_REQUESTS.DEMAND.READ_DATA (1)    01H     B0H
OFFCORE_REQUESTS.DEMAND.READ_CODE (1)    02H     B0H
OFFCORE_REQUESTS.DEMAND.RFO (1)          04H     B0H
OFFCORE_REQUESTS.ANY.READ                08H     B0H
OFFCORE_REQUESTS.ANY.RFO                 10H     B0H
OFFCORE_REQUESTS.UNCACHED_MEM            20H     B0H
OFFCORE_REQUESTS.L1D.WRITEBACK           40H     B0H
OFFCORE_REQUESTS.ANY                     80H     B0H
NOTES:
1. The *DEMAND* events also include any requests made by the L1D cache hardware prefetchers.
Table B-11 lists various performance events available in the core PMU that can drill down performance issues related to super queue operation.
Table B-11. Core PMU Events for Super Queue Operation
Core PMU Events                 Umask   Event Code
OFFCORE_REQUESTS_BUFFER_FULL    01H     B2H
Additionally, L2 misses can be drilled down further by data origin attributes and response attributes. The matrix of data origin and response type attributes is specified through a dedicated MSR, OFFCORE_RSP_0, at address 1A6H. See Table B-12 and Table B-13.
Table B-12. Core PMU Event to Drill Down OFFCore Responses
Core PMU Events     OFFCORE_RSP_0 MSR   Umask   Event Code
OFFCORE_RESPONSE    See Table B-13      01H     B7H
Table B-13. OFFCORE_RSP_0 MSR Programming
                Position   Description                                                     Note
Request type    0          Demand Data Rd = DCU reads (includes partials, DCU Prefetch)
                1          Demand RFO = DCU RFOs
                2          Demand Ifetch = IFU Fetches
                3          Writeback = L2_EVICT/DCUWB
                4          PF Data Rd = L2 Prefetcher Reads
                5          PF RFO = L2 Prefetcher RFO
                6          PF Ifetch = L2 Prefetcher Instruction fetches
                7          Other                                                           Include non-temporal stores
Response type   8          L3_HIT_UNCORE_HIT                                               exclusive line
                9          L3_HIT_OTHER_CORE_HIT_SNP                                       clean line
                10         L3_HIT_OTHER_CORE_HITM                                          modified line
                11         L3_MISS_REMOTE_HIT_SCRUB                                        Used by multiple cores
                12         L3_MISS_REMOTE_FWD                                              Clean line used by one core
                13         L3_MISS_REMOTE_DRAM
                14         L3_MISS_LOCAL_DRAM
                15         Non-DRAM                                                        Non-DRAM requests
Although Table B-13 allows 2^16 combinations of settings in MSR_OFFCORE_RSP_0 in theory, it is more useful to consider combining the subsets of 8-bit values to specify “Request type” and “Response type”. The more common 8-bit mask values are listed in Table B-14.
Table B-14. Common Request and Response Types for OFFCORE_RSP_0 MSR
Request Type      Mask     Response Type            Mask
ANY_DATA          xx11H    ANY_CACHE_DRAM           7FxxH
ANY_IFETCH        xx44H    ANY_DRAM                 60xxH
ANY_REQUEST       xxFFH    ANY_L3_MISS              F8xxH
ANY_RFO           xx22H    ANY_LOCATION             FFxxH
CORE_WB           xx08H    IO                       80xxH
DATA_IFETCH       xx77H    L3_HIT_NO_OTHER_CORE     01xxH
DATA_IN           xx33H    L3_OTHER_CORE_HIT        02xxH
DEMAND_DATA       xx03H    L3_OTHER_CORE_HITM       04xxH
DEMAND_DATA_RD    xx01H    LOCAL_CACHE              07xxH
DEMAND_IFETCH     xx04H    LOCAL_CACHE_DRAM         47xxH
DEMAND_RFO        xx02H    LOCAL_DRAM               40xxH
OTHER (1)         xx80H    REMOTE_CACHE             18xxH
PF_DATA           xx30H    REMOTE_CACHE_DRAM        38xxH
PF_DATA_RD        xx10H    REMOTE_CACHE_HIT         10xxH
PF_IFETCH         xx40H    REMOTE_CACHE_HITM        08xxH
PF_RFO            xx20H    REMOTE_DRAM              20xxH
PREFETCH          xx70H
NOTES:
1. The PMU may report incorrect counts when MSR_OFFCORE_RSP_0 is set to the value 4080H. Non-temporal stores to the local DRAM are not reported in the count.
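Combining a request mask and a response mask from Table B-14 into a single MSR value is a simple shift-and-OR, sketched below in Python. Only a few of the table's entries are reproduced, and the function name offcore_rsp0 is hypothetical.

```python
# A subset of the 8-bit masks from Table B-14.
REQUEST = {
    "DEMAND_DATA_RD": 0x01,
    "ANY_DATA": 0x11,
    "DATA_IFETCH": 0x77,
    "OTHER": 0x80,
}
RESPONSE = {
    "LOCAL_CACHE": 0x07,
    "LOCAL_DRAM": 0x40,
    "LOCAL_CACHE_DRAM": 0x47,
    "ANY_DRAM": 0x60,
}

def offcore_rsp0(request, response):
    """Request mask goes in bits 7:0, response mask in bits 15:8."""
    return (RESPONSE[response] << 8) | REQUEST[request]

print(hex(offcore_rsp0("DATA_IFETCH", "LOCAL_DRAM")))   # 0x4077
print(hex(offcore_rsp0("OTHER", "LOCAL_CACHE_DRAM")))   # 0x4780
```

Note that offcore_rsp0("OTHER", "LOCAL_DRAM") produces 4080H, the value flagged in the table note, which is why the LOCAL_CACHE_DRAM response is the safer pairing for the OTHER request type.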
B.4.3.6
Measuring Per-Core Bandwidth
Measuring the bandwidth of all memory traffic for an individual core is complicated, but the core PMU and uncore PMU do provide the capability to measure the important components of per-core bandwidth. At the microarchitectural level, the L3 buffers writebacks/evictions from the L2 (and, to some degree, non-temporal writes). The eviction of modified lines from the L2 causes a write of the line back to the L3. The line in the L3 is only written to memory when it is evicted from the L3 some time later (if at all). The L3 is part of the uncore sub-system, not part of the core. The writebacks to memory due to eviction of modified lines from the L3 cannot be associated with an individual core in the uncore PMU logic. The net result of this is that the total write bandwidth for all the cores can be measured with events in the uncore PMU. The read bandwidth and the non-temporal write bandwidth can be measured on a per-core basis. In a system populated with two physical processors, the NUMA nature of memory bandwidth implies the measurement for those 2 components has to be divided into bandwidths for the core on a per-socket basis.
The per-socket read bandwidth can be measured with the events:
• OFFCORE_RESPONSE_0.DATA_IFETCH.L3_MISS_LOCAL_DRAM.
• OFFCORE_RESPONSE_0.DATA_IFETCH.L3_MISS_REMOTE_DRAM.
The total read bandwidth for all sockets can be measured with the event:
• OFFCORE_RESPONSE_0.DATA_IFETCH.ANY_DRAM.
The per-socket non-temporal store bandwidth can be measured with the events:
• OFFCORE_RESPONSE_0.OTHER.L3_MISS_LOCAL_CACHE_DRAM.
• OFFCORE_RESPONSE_0.OTHER.L3_MISS_REMOTE_DRAM.
The total non-temporal store bandwidth can be measured with the event:
• OFFCORE_RESPONSE_0.OTHER.ANY_CACHE_DRAM.
The “CACHE_DRAM” encoding is used to work around the defect described in the footnote of Table B-14. Note that none of the above includes the bandwidth associated with writebacks of modified cacheable lines.
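Turning these event counts into bandwidth figures is a matter of scaling by the transfer size and the measurement interval. The Python sketch below assumes that each counted response corresponds to one 64-byte cache line; the function name and the counts are made up for illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: one counted response moves one 64-byte line

def bandwidth_bytes_per_sec(event_count, elapsed_seconds,
                            line_bytes=CACHE_LINE_BYTES):
    """Convert an OFFCORE_RESPONSE_0 count over an interval into bytes/s."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return event_count * line_bytes / elapsed_seconds

# Hypothetical counts collected over a 2-second interval on one core.
local_read = bandwidth_bytes_per_sec(50_000_000, 2.0)   # ...L3_MISS_LOCAL_DRAM
remote_read = bandwidth_bytes_per_sec(12_500_000, 2.0)  # ...L3_MISS_REMOTE_DRAM
total_read = local_read + remote_read
print(local_read / 1e9, remote_read / 1e9, total_read / 1e9)  # in GB/s
```

Because the OFFCORE_RESPONSE_0 event is limited to one counter, the local and remote components would in practice come from separate runs or multiplexed intervals.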
B.4.3.7
Miscellaneous L1 and L2 Events for Cache Misses
In addition to the OFFCORE_RESPONSE_0 event and the precise events that will be discussed later, there are several other events that can be used as well. These additional events can supplement the OFFCORE_RESPONSE_0 events, because the OFFCORE_RESPONSE_0 event code is supported on counter 0 only. L2 misses can also be counted with the architecturally defined event LONGEST_LAT_CACHE_ACCESS; however, as this event also includes requests due to the L1D and L2 hardware prefetchers, its utility may be limited. Some of the L2 access events can be used for drilling down both L2 accesses and L2 misses by type, in addition to the OFFCORE_REQUESTS events discussed earlier. The L2_RQSTS and L2_DATA_RQSTS events can be used to discern assorted access types. In all of the L2 access events the designation PREFETCH only refers to the L2 hardware prefetch. The designation DEMAND includes loads and requests due to the L1D hardware prefetchers.
The L2_LINES_IN and L2_LINES_OUT events have been arranged slightly differently than the equivalent events on Intel® Core™2 processors. The L2_LINES_OUT event can now be used to decompose the evicted lines by clean and dirty (i.e. a Writeback) and by whether they were evicted by an L1D request or an L2 HW prefetch.
The event L2_TRANSACTIONS counts all interactions with the L2. Writes and locked writes are counted with a combined event, L2_WRITE.
The details of the numerous derivatives of L2_RQSTS, L2_DATA_RQSTS, L2_LINES_IN, L2_LINES_OUT, L2_TRANSACTIONS, L2_WRITE can be found under event codes 24H, 26H, F1H, F2H, F0H, and 27H in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
B.4.3.8
TLB Misses
The next largest set of memory access delays is associated with the TLBs, where linear-to-physical address translation is mapped with a finite number of entries in the TLBs. A miss in the first level TLBs results in a very small penalty that can usually be hidden by the OOO execution and the compiler's scheduling. A miss in the shared TLB results in the Page Walker being invoked, and this penalty can be noticeable in the execution. The (non-PEBS) TLB miss events break down into three sets:
• DTLB misses and their derivatives are programmed with event code 49H.
• Load DTLB misses and their derivatives are programmed with event code 08H.
• ITLB misses and their derivatives are programmed with event code 85H.
Store DTLB misses can be evaluated from the difference of the DTLB misses and the Load DTLB misses. Each then has a set of sub-events programmed with the Umask value. The Umask details of the numerous derivatives of the above events are listed in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
B.4.3.9
L1 Data Cache
There are PMU events that can be used to analyze L1 data cache operations. These events can only be counted with the first 2 of the 4 general counters, i.e. IA32_PMC0 and IA32_PMC1. Most of the L1D events are self-explanatory.
The total number of references to the L1D can be counted with L1D_ALL_REF, either just cacheable references or all. The cacheable references can be divided into loads and stores with L1D_CACHE_LOAD and L1D_CACHE.STORE. These events are further subdivided by MESI states through their Umask values, with the I state references indicating the cache misses.
The evictions of modified lines in the L1D result in writebacks to the L2. These are counted with the L1D_WB_L2 events. The Umask values break these down by the MESI state of the version of the line in the L2.
The locked references can also be counted with the L1D_CACHE_LOCK events. Again these are broken down by MES states for the lines in L1D.
The total number of lines brought into L1D, the number that arrived in an M state and the number of modified lines that get evicted due to receiving a snoop are counted with the L1D event and its Umask variations.
The L1D events are listed under event codes 28H, 40H, 41H, 42H, 43H, 48H, 4EH, 51H, 52H, 53H, 80H, and 83H in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
There are a few cases of loads not being able to forward from active store buffers. The predominant situations have to do with larger loads overlapping smaller stores. There is no event that detects when this occurs. There is also a “false store forwarding” case where the addresses only match in the lower 12 address bits. This is sometimes referred to as 4K aliasing. This can be detected with the event “PARTIAL_ADDRESS_ALIAS”, which has event code 07H and Umask 01H.
B.4.4
Front End Monitoring Events
Branch misprediction effects can sometimes be reduced through code changes and enhanced inlining. Most other front end performance limitations have to be dealt with by the code generation. The analysis of such issues is mostly of use to compiler developers.
B.4.4.1
Branch Mispredictions
Branch retired events were discussed in conjunction with PEBS in Section B.4.3.3. These are enhanced by use of the LBR to identify the branch location to go along with the target location captured in the PEBS buffer. Aside from those usages, many other PMU events (event codes E6, E5, E0, 68, 69) associated with branch predictions are more relevant to hardware design than performance tuning.
Branch mispredictions are not in and of themselves an indication of a performance bottleneck. They have to be associated with dispatch stalls and the instruction starvation condition, UOPS_ISSUED:C1:I1 – RESOURCE_STALLS.ANY. Such stalls are likely to be associated with icache misses and ITLB misses. The precise ITLB miss event can be useful for such issues. The icache and ITLB miss events are listed under event codes 80H, 81H, 82H, 85H, AEH.
B.4.4.2
Front End Code Generation Metrics
The remaining front end events are mostly of use in identifying when details of the code generation interact poorly with the instruction decoding and uop issue to the OOO engine. Examples are length changing prefix issues associated with the use of 16-bit immediates, ROB read port stalls, instruction alignment interfering with the loop detection, and instruction decoding bandwidth limitations. The activity of the LSD is monitored using CMASK values on a signal monitoring activity. Some of these events are listed under event codes 17H, 18H, 1EH, 1FH, 87H, A6H, A8H, D0H, D2H in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
Some instructions (FSIN, FCOS, and other transcendental instructions) are decoded with the assistance of MS-ROM. Frequent occurrences of instructions that require the assistance of MS-ROM to decode complex uop flows are an opportunity to improve instruction selection to reduce such occurrences. The UOPS_DECODED.MS event can be used to identify code regions that could benefit from better instruction selection.
Other situations that can trigger this event are due to FP assists, like performing a numeric operation on denormalized FP values or QNaNs. In such cases the penalty is essentially the uops required for the assist plus the pipeline clearing required to ensure the correct state. Consequently this situation has a very clear signature consisting of MACHINE_CLEAR.CYCLES and uops being inserted by the microcode sequencer, UOPS_DECODED.MS, with the execution penalty being the sum of these two contributions. The event codes for these are listed under D1H and C3H.
B.4.5
Uncore Performance Monitoring Events
The uncore sub-system includes the L3, IMC and Intel QPI units in the diagram shown in Figure B-4. Within the uncore sub-system, the uncore PMU consists of eight general-purpose counters and one fixed counter. The fixed counter in the uncore monitors the unhalted clock cycles in the uncore clock domain, which runs at a different frequency than the core.
The uncore cannot by itself generate a PMI interrupt. While the core PMU can raise a PMI at per-logical-processor specificity, the uncore PMU can cause a PMI at per-core specificity using the interrupt hardware in the processor core. When an uncore counter overflows, a bit pattern is used to specify which cores should be signaled to raise a PMI. The uncore PMU is unaware of the core, Processor ID or Thread ID that caused the event that overflowed a counter. Consequently the most reasonable approach for sampling on uncore events is to raise a PMI on all the logical processors in the package.
There is a wide variety of events that monitor queue occupancies and inserts. There are others that count cacheline transfers, DRAM paging policy statistics, snoop types and responses, and so on. The uncore is the only place the total bandwidth to memory can be measured. This will be discussed explicitly after all the uncore components and their events are described.
B.4.5.1
Global Queue Occupancy
Each processor core has a super queue that buffers requests of memory access traffic due to an L2 miss. The uncore has a global queue (GQ) to service transaction requests from the processor cores and to buffer data traffic that arrives from the L3, IMC, or Intel QPI links. Within the GQ, there are 3 “trackers” for three types of transactions:
• On-package read requests; the tracker queue has 32 entries.
• On-package writeback requests; the tracker queue has 16 entries.
• Requests that arrive from a “peer”; the tracker queue has 12 entries.
A “peer” refers to any requests coming from the Intel® QuickPath Interconnect. The occupancies, inserts, cycles full and cycles not empty for all three trackers can be monitored. Further, as load requests go through a series of stages, the occupancy and inserts associated with the stages can also be monitored, enabling a “cycle accounting” breakdown of the uncore memory accesses due to loads.
When an uncore counter is first programmed to monitor a queue occupancy, for any of the uncore queues, the queue must first be emptied. This is accomplished by the driver of the monitoring software tool issuing a bus lock. This only needs to be done when the counter is first programmed. From that point on the counter will correctly reflect the state of the queue, so it can be repeatedly sampled, for example, without another bus lock being issued.
The uncore events that monitor GQ allocation (UNC_GQ_ALLOC) and GQ tracker occupancy (UNC_GQ_TRACKER_OCCUP) are listed under the event codes 03H and 02H in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B. The selection between the three trackers is specified by the Umask value. The mnemonics of these derivative events use the notation “RT” for the read tracker, “WT” for the write tracker and “PPT” for the peer probe tracker.
Latency can be measured by the average duration of the queue occupancy, if the occupancy stops as soon as the data has been delivered. Thus the ratio of UNC_GQ_TRACKER_OCCUP.X/UNC_GQ_ALLOC.X measures an average duration of queue occupancy, where ‘X’ represents a specific Umask value. The total occupancy period of the read tracker, as measured by:
Total Read Period = UNC_GQ_TRACKER_OCCUP.RT / UNC_GQ_ALLOC.RT
is longer than the data delivery latency because it includes time for extra bookkeeping and cleanup. The measurement:
LLC response Latency = UNC_GQ_TRACKER_OCCUP.RT_TO_LLC_RESP / UNC_GQ_ALLOC.RT_TO_LLC_RESP
is essentially a constant. It does not include the total time to snoop and retrieve a modified line from another core, for example, just the time to scan the L3 and see if the line is or is not present in this socket.
An overall latency for an L3 hit is the weighted average of three terms:
• The latency of a simple hit, where the line has only been used by the core making the request.
• The latencies for accessing clean lines by multiple cores.
• The latencies for accessing dirty lines that have been accessed by multiple cores.
These three components of the L3 hit for loads can be decomposed using the derivative events of OFFCORE_RESPONSE:
• OFFCORE_RESPONSE_0.DEMAND_DATA.L3_HIT_NO_OTHER_CORE.
• OFFCORE_RESPONSE_0.DEMAND_DATA.L3_HIT_OTHER_CORE_HIT.
• OFFCORE_RESPONSE_0.DEMAND_DATA.L3_HIT_OTHER_CORE_HITM.
The event OFFCORE_RESPONSE_0.DEMAND_DATA.LOCAL_CACHE should be used as the denominator to obtain latencies. The individual latencies could have to be measured with microbenchmarks, but the use of the precise latency event will be far more effective, as any bandwidth loading effects will be included. The L3 miss component is the weighted average over three terms:
• The latencies of L3 hits in a cache on another socket (this is described in the previous paragraph).
• The latencies to local DRAM.
• The latencies to remote DRAM.
The local DRAM access and the remote socket access can be decomposed with more uncore events.
Miss to fill latency = UNC_GQ_TRACKER_OCCUP.RT_LLC_MISS / UNC_GQ_ALLOC.RT_LLC_MISS
The uncore GQ events using Umask values associated with the *RTID* mnemonic allow the monitoring of a sub-component of the Miss to fill latency associated with the communications between the GQ and the QHL.
There are uncore PMU events which monitor cycles when the three trackers are not empty (>= 1 entry) or full. These events are listed under the event codes 00H and 01H in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
Because the uncore PMU generally does not differentiate which processor core causes a particular eventing condition, the technique of dividing the latencies by the average queue occupancy in order to determine a penalty does not work for the uncore. Overlapping entries from different cores do not result in overlapping penalties and thus a reduction in stalled cycles. Each core suffers the full latency independently.
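The occupancy/allocation ratio and the weighted average for L3 hits can be sketched as below. All counts and per-component latencies are invented for illustration, the function names are hypothetical, and the occupancy-based result is in uncore clock cycles since the GQ events are counted in the uncore clock domain.

```python
def average_occupancy_cycles(occupancy, allocations):
    """Average queue residency: UNC_GQ_TRACKER_OCCUP.X / UNC_GQ_ALLOC.X."""
    if allocations == 0:
        return 0.0
    return occupancy / allocations

def weighted_latency(counts, latencies):
    """Weighted average latency over components sharing the same keys."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no events counted")
    return sum(counts[k] * latencies[k] for k in counts) / total

# Hypothetical read tracker totals for the miss-to-fill ratio.
miss_to_fill = average_occupancy_cycles(occupancy=120_000, allocations=3_000)

# Hypothetical OFFCORE_RESPONSE_0.DEMAND_DATA.* counts; their sum plays
# the role of the LOCAL_CACHE denominator named in the text.
l3_hit_counts = {"NO_OTHER_CORE": 700, "OTHER_CORE_HIT": 200,
                 "OTHER_CORE_HITM": 100}
# Hypothetical per-component latencies, e.g. from microbenchmarks.
l3_hit_latencies = {"NO_OTHER_CORE": 40, "OTHER_CORE_HIT": 65,
                    "OTHER_CORE_HITM": 75}
l3_hit_latency = weighted_latency(l3_hit_counts, l3_hit_latencies)
print(miss_to_fill, l3_hit_latency)
```

The same weighted_latency helper applies to the L3 miss component once counts and latencies for the remote cache, local DRAM and remote DRAM terms are available.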
To evaluate the correction on a per-core basis one needs the number of cycles there is an entry from the core in question. A *NOT_EMPTY_CORE_N type event would be needed. There is no such event. Consequently, in the cycle decomposition one must use the full latency for the estimate of the penalty. As has been stated before, it is best to use the PEBS latency event, as the data sources are also collected with the latency for the individual sample.
The individual components of the read tracker, discussed above, can also be monitored as busy or full by setting the cmask value to 1 or 32 and applying it to the assorted read tracker occupancy events.
Table B-15. Uncore PMU Events for Occupancy Cycles

Uncore PMU Events                                  Cmask   Umask   Event Code
UNC_GQ_TRACKER_OCCUP.RT_L3_MISS_FULL               32      02H     02H
UNC_GQ_TRACKER_OCCUP.RT_TO_L3_RESP_FULL            32      04H     02H
UNC_GQ_TRACKER_OCCUP.RT_TO_RTID_ACCQUIRED_FULL     32      08H     02H
UNC_GQ_TRACKER_OCCUP.RT_L3_MISS_BUSY               1       02H     02H
UNC_GQ_TRACKER_OCCUP.RT_TO_L3_RESP_BUSY            1       04H     02H
UNC_GQ_TRACKER_OCCUP.RT_TO_RTID_ACCQUIRED_BUSY     1       08H     02H
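As a hedged illustration of how the Cmask, Umask and Event Code columns of Table B-15 combine into a single programming value, the sketch below packs them into one integer. It assumes the uncore event-select register uses the same field positions as the core IA32_PERFEVTSELx registers (event select in bits 7:0, unit mask in bits 15:8, counter mask in bits 31:24). That layout is an assumption here and should be confirmed against Volume 3B; the function name is hypothetical, and enable or other control bits are deliberately left out.

```python
def encode_event_select(event_code: int, umask: int, cmask: int) -> int:
    """Pack event code, unit mask and counter mask into one integer.

    ASSUMPTION: field positions mirror the core IA32_PERFEVTSELx layout
    (event 7:0, umask 15:8, cmask 31:24). Control bits are not set.
    """
    for name, value in (("event_code", event_code), ("umask", umask), ("cmask", cmask)):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} must fit in 8 bits")
    return event_code | (umask << 8) | (cmask << 24)


# Rows from Table B-15: (event name, cmask, umask, event code).
TABLE_B15 = [
    ("UNC_GQ_TRACKER_OCCUP.RT_L3_MISS_FULL", 32, 0x02, 0x02),
    ("UNC_GQ_TRACKER_OCCUP.RT_TO_L3_RESP_FULL", 32, 0x04, 0x02),
    ("UNC_GQ_TRACKER_OCCUP.RT_TO_RTID_ACCQUIRED_FULL", 32, 0x08, 0x02),
    ("UNC_GQ_TRACKER_OCCUP.RT_L3_MISS_BUSY", 1, 0x02, 0x02),
    ("UNC_GQ_TRACKER_OCCUP.RT_TO_L3_RESP_BUSY", 1, 0x04, 0x02),
    ("UNC_GQ_TRACKER_OCCUP.RT_TO_RTID_ACCQUIRED_BUSY", 1, 0x08, 0x02),
]

for name, cmask, umask, event in TABLE_B15:
    print(f"{name:<48} 0x{encode_event_select(event, umask, cmask):08X}")
```

The BUSY and FULL variants differ only in the counter mask: 1 counts cycles with at least one entry, 32 counts cycles in which the tracker is full.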
B.4.5.2
Global Queue Port Events
The GQ data buffer traffic controls the flow of data to and from different sub-systems via separate ports:
• Core traffic: two ports handle data traffic, each port dedicated to a pair of processor cores.
• L3 traffic: one port services L3 data traffic.
• Intel QPI traffic: one port services traffic to the QPI logic.
• IMC traffic: one port services data traffic to the integrated memory controller.
The ports for L3 and core traffic transfer a fixed number of bits per cycle. However, the Intel® QuickPath Interconnect protocols can result in either 8 or 16 bytes being transferred on the read Intel QPI and IMC ports. Consequently these events cannot be used to measure total data transfers and bandwidths. The uncore PMU events that can distinguish traffic flow are listed under event codes 04H and 05H in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
B.4.5.3
Global Queue Snoop Events
Cacheline requests from the cores, from a remote package, or from the I/O Hub are handled by the GQ. When the uncore receives a cacheline request from one of the cores, the GQ first checks the L3 to see if the line is on the package. Because the L3 is inclusive, this answer can be quickly ascertained. If the line is in the L3 and was owned by the requesting core, data can be returned to the core from the L3 directly. If the line is being used by multiple cores, the GQ will snoop the other cores to see if there is a modified copy. If so, the L3 is updated and the line is sent to the requesting core. In the event of an L3 miss, the GQ must send out requests to the local memory controller (or over the Intel QPI links) for the line. A request through the Intel QPI to a remote L3 (or remote DRAM) must be made if data exists in a remote L3 or does not exist in local DRAM. As each physical package has its own local integrated memory controller, the GQ must identify the “home” location of the requested cacheline from the physical address. If the address identifies home as being on the local package, then the GQ makes a simultaneous request to the local memory controller. If home is identified as belonging to the remote package, the request sent over the Intel QPI will also access the remote IMC.
The GQ handles the snoop responses for the cacheline requests that come in from the Intel® QuickPath Interconnect. This snoop traffic corresponds to the queue entries in the peer probe tracker. The snoop responses are divided into requests for locally homed data and remotely homed data. If the line is in a modified state and the GQ is responding to a read request, the line also must be written back to memory. This would be wasted effort for a response to an RFO, as the line will just be modified again, so no writeback is done for RFOs. The snoop responses of locally homed events that can be monitored by an uncore PMU are listed under event code 06H in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B. The snoop responses of remotely homed events are listed under event code 07H. Some related events count the MESI transitions in response to snoops from other caching agents (processors or IOH). Some of these rely on programming an MSR, so they can only be measured one at a time, as there is only one MSR. The Intel performance tools will schedule this correctly by restricting these events to a single general uncore counter.
B.4.5.4
L3 Events
Although the number of L3 hits and misses can be determined from the GQ tracker allocation events, several uncore PMU events are simpler to use. They are listed under event codes 08H and 09H in the uncore event list of Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B. The MESI state breakdown of lines allocated and victimized can also be monitored with the LINES_IN and LINES_OUT events in the uncore, using event codes 0AH and 0BH. Details are listed in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
B.4.6
Intel QuickPath Interconnect Home Logic (QHL)
When a data access misses the L3, causing the GQ of the uncore to send out a transaction request, the Intel QPI fabric will fulfill the request either from the local DRAM controller or from a remote DRAM controller in another physical package. The GQ must identify the “home” location of the requested cacheline from the physical address. If the address identifies home as being on the local package, then the GQ makes a simultaneous request to the local memory controller, the integrated memory controller (IMC). If home is identified as belonging to the remote package, the request is sent to the Intel QPI first and then on to the remote IMC. The Intel QPI logic and the IMC are distinct units in the uncore sub-system. The Intel QPI logic distinguishes the local IMC from the remote IMC using the concepts of “caching agent” and “home agent”. Specifically, the Intel QPI protocol considers each socket as having a “caching agent” and a “home agent”:
• The Caching Agent is the GQ and L3 in the uncore (or an IOH if present).
• The Home Agent is the IMC.
An L3 miss results in simultaneous queries for the line from all the Caching Agents and the Home Agent (wherever it is). QHL requests can be superseded when another source can supply the required line more quickly. L3 misses to locally homed lines, due to on-package requests, are simultaneously directed to the QHL and the Intel QPI. If a remote caching agent supplies the line first, then a signal is sent to the QHL that the transaction is complete. If the remote caching agent returns a modified line in response to a read request, then the data in DRAM must be updated with a writeback of the new version of the line. There is a similar flow of control signals when the Intel QPI simultaneously sends a snoop request for a locally homed line to both the GQ and the QHL. If the L3 has the line, the QHL must be signaled that the transaction was completed by the L3/GQ. If the line in the L3 (or the cores) was modified and the snoop request from the remote package was for a load, then a writeback must be completed by the QHL, and the QHL forwards the line to the Intel QPI to complete the transaction.
The uncore PMU provides events for monitoring this cacheline access and writeback traffic in the uncore by using the QHL opcode matching capability. The opcode matching facility is described in Chapter 33, “Handling Boundary Conditions in a Virtual Machine Monitor” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3C. The uncore PMU event that uses the opcode matching capability is listed under event code 35H. Several of the more useful settings for programming QHL opcode matching are shown in Table B-16.
Table B-16. Common QHL Opcode Matching Facility Programming

Load Latency Precise Events              MSR 0x396            Umask   Event Code
UNC_ADDR_OPCODE_MATCH.IOH.NONE           0                    1H      35H
UNC_ADDR_OPCODE_MATCH.IOH.RSPFWDI        40001900_00000000    1H      35H
UNC_ADDR_OPCODE_MATCH.IOH.RSPFWDS        40001A00_00000000    1H      35H
UNC_ADDR_OPCODE_MATCH.IOH.RSPIWB         40001D00_00000000    1H      35H
UNC_ADDR_OPCODE_MATCH.REMOTE.NONE        0                    2H      35H
UNC_ADDR_OPCODE_MATCH.REMOTE.RSPFWDI     40001900_00000000    2H      35H
UNC_ADDR_OPCODE_MATCH.REMOTE.RSPFWDS     40001A00_00000000    2H      35H
UNC_ADDR_OPCODE_MATCH.REMOTE.RSPIWB      40001D00_00000000    2H      35H
UNC_ADDR_OPCODE_MATCH.LOCAL.NONE         0                    4H      35H
UNC_ADDR_OPCODE_MATCH.LOCAL.RSPFWDI      40001900_00000000    1H      35H
UNC_ADDR_OPCODE_MATCH.LOCAL.RSPFWDS      40001A00_00000000    1H      35H
UNC_ADDR_OPCODE_MATCH.LOCAL.RSPIWB       40001D00_00000000    1H      35H
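The MSR 0x396 column of Table B-16 writes 64-bit values as two 32-bit hexadecimal halves joined by an underscore. The sketch below, with hypothetical helper names, converts that notation into an integer and into the high and low halves that a WRMSR-style interface takes in EDX:EAX. It does not write any MSR.

```python
def parse_msr_value(text: str) -> int:
    """Convert 'HHHHHHHH_LLLLLLLL' (or a plain hex string such as '0') to an integer."""
    return int(text.replace("_", ""), 16)


def split_high_low(value: int) -> tuple:
    """Return (high 32 bits, low 32 bits) of a 64-bit value."""
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


# Opcode-match settings copied from Table B-16.
OPCODE_MATCH = {
    "NONE": "0",
    "RSPFWDI": "40001900_00000000",
    "RSPFWDS": "40001A00_00000000",
    "RSPIWB": "40001D00_00000000",
}

for name, text in OPCODE_MATCH.items():
    high, low = split_high_low(parse_msr_value(text))
    print(f"{name:<8} high=0x{high:08X} low=0x{low:08X}")
```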
These predefined opcode match encodings can be used to monitor HITM accesses. It is the only event that allows profiling the code requesting HITM transfers. The diagrams in Figure B-8 through Figure B-15 show a series of Intel QPI protocol exchanges associated with Data Reads and Reads for Ownership (RFO), after an L3 miss, under a variety of combinations of the local home of the cacheline and the MESI state in the remote cache. Of particular note are the cases where the data comes from the remote QHL even when the data was in the remote L3. These are the Read Data cases with the remote L3 having the line in an M state.
[Diagram not reproduced. Recoverable labels: Socket 1 and Socket 2, each with Cores, Uncore (GQ, LLC, QHL, IMC) and QPI. Messages: DRd, cache lookup and miss, RdData sent to the local home (socket 2 owns this address), SnpData broadcast to all other caching agents, RspI, speculative memory read, QHL data, DataC_E_cmp, line allocated in E state (I->E).]
Figure B-8. RdData Request after LLC Miss to Local Home (Clean Rsp)
[Diagram not reproduced. Recoverable labels: Socket 1 and Socket 2, each with Cores, Uncore (GQ, LLC, QHL, IMC) and QPI. Numbered steps: DRd (1), cache lookup (2), cache miss (3), RdData sent to the remote home (socket 1 owns this address) (4-6), speculative memory read and snoop lookup (7), clean response (8), RspI and QHL data (9), DataC_E_cmp (10-12), line allocated in E state (I->E) (13). RspI indicates a clean snoop.]
Figure B-9. RdData Request after LLC Miss to Remote Home (Clean Rsp)
[Diagram not reproduced. Recoverable labels: Socket 1 and Socket 2, each with Cores, Uncore (GQ, LLC, QHL, IMC) and QPI. Numbered steps: DRd (1), cache lookup (2), cache miss (3), RdData sent to the remote home (socket 1 owns this address) (4-6), speculative memory read and snoop lookup with Hitm, M->I (7), WB (8), RspIWb, WbIData and QHL data (9), DataC_E_cmp (10-12), line allocated in E state (I->E) (13). Data is written back to home; RspIWb is an NDR response that hints to home that the writeback data, WbIData, follows shortly.]
Figure B-10. RdData Request after LLC Miss to Remote Home (Hitm Response)
[Diagram not reproduced. Recoverable labels: Socket 1 and Socket 2, each with Cores, Uncore (GQ, LLC, QHL, IMC) and QPI. Messages: DRd, cache lookup and miss, RdData sent to the local home (socket 2 owns this address), SnpData broadcast to all other caching agents, snoop lookup with Hitm (M->I), RspIWb, WbIData, WB, speculative memory read, QHL data, DataC_E_cmp, line allocated in E state (I->E). Data is written back to home; RspIWb is an NDR response that hints to home that the writeback data, WbIData, follows shortly.]
Figure B-11. RdData Request after LLC Miss to Local Home (Hitm Response)
[Diagram not reproduced. Recoverable labels: Socket 1 and Socket 2, each with Cores, Uncore (GQ, LLC, QHL, IMC) and QPI. Messages: DRd, cache lookup and miss, RdData sent to the local home (socket 2 owns this address), SnpData broadcast to all other caching agents, snoop lookup with hit (E,F->S), DataC_F, RspFwdS, speculative memory read, QHL data, CMP, line allocated in F state (I->F). RspFwdS indicates a hit snoop response with data forwarded to the peer agent; DataC_F indicates data forwarded to the peer agent in F state.]
Figure B-12. RdData Request after LLC Miss to Local Home (Hit Response)
[Diagram not reproduced. Recoverable labels: Socket 1 and Socket 2, each with Cores, Uncore (GQ, LLC, QHL, IMC) and QPI. Messages: RFO, cache lookup and miss, RdInvOwn sent to the remote home (socket 1 owns this address), SnpInvOwn, snoop lookup (S,F,I->I), clean RspI, speculative memory read, data, DataC_E_cmp, line allocated in E state (I->E). RspI indicates a clean snoop response; home sends cmp and data to socket 2.]
Figure B-13. RdInvOwn Request after LLC Miss to Remote Home (Clean Res)
[Diagram not reproduced. Recoverable labels: Socket 1 and Socket 2, each with Cores, Uncore (GQ, LLC, QHL, IMC) and QPI. Messages: RFO, cache lookup and miss, RdInvOwn sent to the remote home (socket 1 owns this address), SnpInvOwn, snoop lookup with HITM (M->I), DataC_M, RspFwdI, speculative memory read, cmp, line allocated in M state (I->M). RspFwdI indicates to home that data has already been forwarded to socket 2.]
Figure B-14. RdInvOwn Request after LLC Miss to Remote Home (Hitm Res)
[Diagram not reproduced. Recoverable labels: Socket 1 and Socket 2, each with Cores, Uncore (GQ, LLC, QHL, IMC) and QPI. Messages: RFO, cache lookup and miss, RdInvOwn sent to the local home (socket 2 owns this address), SnpInvOwn broadcast to all other caching agents, snoop lookup with HIT (E->I), DataC_E, RspFwdI, speculative memory read, QHL data, cmp, line allocated in E state (I->E). RspFwdI indicates to home that data has already been forwarded to socket 2.]
Figure B-15. RdInvOwn Request after LLC Miss to Local Home (Hit Res)
Whether the line is locally or remotely “homed”, it has to be written back to DRAM before the originating GQ receives the line, so it always appears to come from a QHL. The RFO does not do this. However, when responding to a remote RFO (SnpInvOwn) and the line is in an S or F state, the cacheline gets invalidated and the line is sent from the QHL. The point is that the data source might not always be so obvious.
B.4.7
Measuring Bandwidth From the Uncore
Read bandwidth can be measured on a per-core basis using events like OFFCORE_RESPONSE_0.DATA_IN.LOCAL_DRAM and OFFCORE_RESPONSE_0.DATA_IN.REMOTE_DRAM. The total bandwidth includes writes, and these cannot be monitored from the core because they are mostly caused by evictions of modified lines in the L3. Thus a line used and modified by one core can end up being written back to DRAM when it is evicted due to a read on another core doing some completely unrelated task. Modified cached lines and writebacks of uncached lines (e.g. written with non-temporal streaming stores) are handled differently in the uncore, and their writebacks increment various events in different ways. All full lines written to DRAM are counted by the UNC_IMC_WRITES.FULL.* events. This includes the writebacks of modified cached lines and the writes of uncached lines, for example those generated by non-temporal SSE stores. The uncached line writebacks from a remote socket will be counted by UNC_QHL_REQUESTS.REMOTE_WRITES. The uncached writebacks from the local cores are not counted by UNC_QHL_REQUESTS.LOCAL_WRITES, as this event only counts writebacks of locally cached lines. The UNC_IMC_NORMAL_READS.* events only count the reads. The UNC_QHL_REQUESTS.LOCAL_READS and UNC_QHL_REQUESTS.REMOTE_READS events count the reads and the “InvtoE” transactions, which are issued for the uncacheable writes, e.g. USWC/UC writes. This allows the evaluation of the uncacheable writes by computing the difference:
UNC_QHL_REQUESTS.LOCAL_READS + UNC_QHL_REQUESTS.REMOTE_READS - UNC_IMC_NORMAL_READS.ANY
The uncore PMU events that are useful for bandwidth evaluation are listed under event codes 20H, 2CH, and 2FH in Chapter 19, “Performance Monitoring Events” of Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 3B.
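The uncacheable-write estimate described above is a plain difference of three uncore counts. A minimal sketch, assuming the three counts were collected over the same interval; the function name and sample values are hypothetical:

```python
def uncacheable_write_estimate(local_reads: int, remote_reads: int, imc_normal_reads_any: int) -> int:
    """Estimate InvtoE transactions (uncacheable writes such as USWC/UC writes).

    local_reads          -- UNC_QHL_REQUESTS.LOCAL_READS
    remote_reads         -- UNC_QHL_REQUESTS.REMOTE_READS
    imc_normal_reads_any -- UNC_IMC_NORMAL_READS.ANY
    The QHL read events include InvtoE transactions; the IMC event counts reads only.
    """
    return local_reads + remote_reads - imc_normal_reads_any


# Hypothetical counts, for illustration only.
print(uncacheable_write_estimate(local_reads=5_000, remote_reads=1_500, imc_normal_reads_any=6_000))
```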
B.5
PERFORMANCE TUNING TECHNIQUES FOR INTEL® MICROARCHITECTURE CODE NAME SANDY BRIDGE
This section covers various performance tuning techniques using performance monitoring events. Some techniques can be adapted in general to other microarchitectures, but most of the performance events are specific to Intel microarchitecture code name Sandy Bridge.
B.5.1
Correlating Performance Bottleneck to Source Location
Performance analysis tools often sample events to identify hot spots of instruction pointer addresses to help programmers identify source locations of potential performance bottlenecks. The sampling technique requires a service routine to respond to the performance monitoring interrupt (PMI) generated from an overflow condition of the performance counter. There is a finite delay between detection of the eventing condition and capture of the instruction pointer address. This is known as “skid”. In other words, the event skid is the distance between the instruction or instructions that caused the issue and the instruction where the event is tagged. There are a few things to note in general on skid:
• Precise events have a defined event skid of 1 instruction, to the next instruction retired. When the offending instruction is a branch, the event is tagged with the branch target, which can be separated from the branch instruction. Thus sampling with precise events is likely to have less noise in pin-pointing source locations of bottlenecks.
• Using a performance event with an eventing condition that carries a larger performance impact generally has a shorter skid, and vice versa. The following examples illustrate this rule:
  - A store forward block issue can cause a penalty of more than 10 cycles. Sampling a store forward block event almost always tags to the next couple of instructions after the blocked load.
  - On the other hand, sampling loads that forwarded successfully with no penalty will have much larger skids, and is less helpful for performance tuning.
• The closer the eventing condition is to the retirement of the instruction, the shorter the skid. The events in the front end of the pipeline tend to tag to instructions further from the responsible instruction than events that are taken at execution or retirement.
• Cycles counted with the event CPU_CLK_UNHALTED.THREAD often tag in greater counts on the instruction after larger bottlenecks in the pipeline. If cycles are accumulated on an instruction, this is probably due to a bottleneck at the previous instruction.
• It is very difficult to determine the source of low-cost issues that occur in the front end. Front end events can also skid to IPs that precede the actual instructions that are causing the issue.
B.5.2
Hierarchical Top-Down Performance Characterization Methodology and Locating Performance Bottlenecks
Intel microarchitecture code name Sandy Bridge has introduced several performance events which help narrow down which portion of the microarchitecture pipeline is stalled. This starts with a hierarchical approach to characterize where a workload spends CPU cycles in the microarchitecture pipelines. At the top level, there are four areas to which CPU cycles are attributed; they are described below. To determine what portion of the pipeline is stalled, the technique looks at a buffer that queues the micro-ops supplied by the front end and feeds the out-of-order back end (see Section 2.3.1). This buffer is called the micro-op queue. From the micro-op queue viewpoint, there may be four different types of stalls:
• Front end stalls - The front end is delivering less than four micro-ops per cycle when the back end of the pipeline is requesting micro-ops. When these stalls happen, the rename/allocate part of the OOO engine is starved. Thus, execution is said to be front end bound.
• Back end stalls - No micro-ops are being delivered from the micro-op queue due to lack of required resources for accepting more micro-ops in the back end of the pipeline. When these stalls happen, execution is said to be back end bound.
• Bad speculation - The pipeline performs speculative execution of instructions that never successfully retire. The most common case is a branch misprediction, where the pipeline predicts a branch target in order to keep the pipeline full instead of waiting for the branch to execute. If the processor prediction is incorrect, it has to flush the pipeline without retiring the speculated instructions.
• Retiring - The micro-op queue delivers micro-ops that eventually retire. In the common case, the micro-ops originate from the program code. One exception is with assists, where the microcode sequencer generates micro-ops to deal with issues in the pipeline.
The following figure illustrates how the execution opportunities are logically divided.
It is possible to estimate the amount of execution slots spent in each category using the following formulas in conjunction with core PMU performance events in Intel microarchitecture code name Sandy Bridge:
%FE_Bound = 100 * (IDQ_UOPS_NOT_DELIVERED.CORE / N);
%Bad_Speculation = 100 * ((UOPS_ISSUED.ANY - UOPS_RETIRED.RETIRE_SLOTS + 4 * INT_MISC.RECOVERY_CYCLES) / N);
%Retiring = 100 * (UOPS_RETIRED.RETIRE_SLOTS / N);
%BE_Bound = 100 - (%FE_Bound + %Bad_Speculation + %Retiring);
N represents the total number of execution slot opportunities, which is the number of cycles multiplied by four:
• N = 4 * CPU_CLK_UNHALTED.THREAD
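The four top-level formulas can be evaluated together from one set of raw counts. The following is a minimal sketch, not a tool implementation: the function name, the dictionary-based input, and the sample counts are hypothetical, while the event names and arithmetic follow the formulas above.

```python
def top_down_level1(counts: dict) -> dict:
    """Return the four top-level slot percentages from raw event counts."""
    slots = 4 * counts["CPU_CLK_UNHALTED.THREAD"]  # N: four slots per cycle
    if slots <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")
    fe_bound = 100.0 * counts["IDQ_UOPS_NOT_DELIVERED.CORE"] / slots
    bad_speculation = 100.0 * (
        counts["UOPS_ISSUED.ANY"]
        - counts["UOPS_RETIRED.RETIRE_SLOTS"]
        + 4 * counts["INT_MISC.RECOVERY_CYCLES"]
    ) / slots
    retiring = 100.0 * counts["UOPS_RETIRED.RETIRE_SLOTS"] / slots
    # Back end bound is whatever remains of the slot budget.
    be_bound = 100.0 - (fe_bound + bad_speculation + retiring)
    return {
        "FE_Bound": fe_bound,
        "Bad_Speculation": bad_speculation,
        "Retiring": retiring,
        "BE_Bound": be_bound,
    }


# Hypothetical counts, for illustration only.
sample = {
    "CPU_CLK_UNHALTED.THREAD": 1_000,
    "IDQ_UOPS_NOT_DELIVERED.CORE": 800,
    "UOPS_ISSUED.ANY": 2_400,
    "UOPS_RETIRED.RETIRE_SLOTS": 2_000,
    "INT_MISC.RECOVERY_CYCLES": 50,
}
for name, pct in top_down_level1(sample).items():
    print(f"{name:>16}: {pct:5.1f}%")
```

Computing the back end share as a remainder means any measurement noise in the other three categories ends up in BE_Bound, so all events should be collected over the same interval.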
The following sections explain the source of penalty cycles in three categories: back end stalls, front end stalls and bad speculation. They use formulas that can be applied at process, module, function, and instruction granularity.
B.5.2.1
Back End Bound Characterization
Once the %BE_Bound metric raises concern, a user may need to drill down to the next level of possible issues in the back end. Our methodology examines back end stalls based on execution unit occupation at every cycle. Naturally, optimal performance may be achieved when all execution resources are kept busy. Currently, this methodology splits back end bound issues into two categories: memory bound and core bound. “Memory bound” corresponds to stalls related to the memory subsystem. For example, cache misses may eventually cause execution starvation. On the other hand, “core bound”, which corresponds to stalls due to either the Execution or OOO clusters, is a bit trickier. These stalls can manifest either as execution starvation or as non-optimal execution port utilization. For example, a long latency divide operation may serialize the execution, causing execution starvation for some period, while pressure on an execution port that serves specific types of uops might manifest as a small number of ports utilized in a cycle. To calculate this, we use performance monitoring events at the execution units:
%BE_Bound_at_EXE = (CYCLE_ACTIVITY.CYCLES_NO_EXECUTE + UOPS_EXECUTED.THREAD:c1 - UOPS_EXECUTED.THREAD:c2) / CLOCKS
CYCLE_ACTIVITY.CYCLES_NO_EXECUTE counts complete starvation cycles where no uop is executed whatsoever. UOPS_EXECUTED.THREAD:c1 and UOPS_EXECUTED.THREAD:c2 count cycles where at least 1 uop and at least 2 uops were executed in a cycle, respectively. Hence the event count difference measures the cycles when the OOO back end could execute only 1 uop. The %BE_Bound_at_EXE metric is counted at the execution unit pipestages, so the number would not match the %BE_Bound ratio, which is measured at the allocation stage.
However, redundancy is good here, as one can use both counters to confirm the execution is indeed back end bound (both should be high).
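To make the cmask arithmetic concrete, the sketch below evaluates %BE_Bound_at_EXE from four counts. The function and parameter names are hypothetical, CLOCKS is assumed to be the unhalted cycle count for the same interval, and the result is scaled by 100 so that it reads as a percentage.

```python
def be_bound_at_exe(cycles_no_execute: int, cycles_ge_1_uop: int, cycles_ge_2_uops: int, clocks: int) -> float:
    """Percentage of cycles in which the back end executed zero or exactly one uop.

    cycles_no_execute -- CYCLE_ACTIVITY.CYCLES_NO_EXECUTE
    cycles_ge_1_uop   -- UOPS_EXECUTED.THREAD with cmask=1 (at least 1 uop executed)
    cycles_ge_2_uops  -- UOPS_EXECUTED.THREAD with cmask=2 (at least 2 uops executed)
    clocks            -- unhalted cycles for the same interval (assumed meaning of CLOCKS)
    """
    if clocks <= 0:
        raise ValueError("clocks must be positive")
    exactly_one_uop = cycles_ge_1_uop - cycles_ge_2_uops
    return 100.0 * (cycles_no_execute + exactly_one_uop) / clocks


# Hypothetical counts: 300 starved cycles, 200 cycles with exactly one uop.
print(f"{be_bound_at_exe(300, 700, 500, 1_000):.1f}%")
```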
B.5.2.2
Core Bound Characterization
A “back end bound” workload can be identified as “core bound” by the following metric:
%Core_Bound = %BE_Bound_at_EXE - %Memory_Bound
The metric “%Memory_Bound” is described in Section B.5.2.3. Once a workload is identified as “core bound”, the user may want to drill down into OOO or Execution related issues through targeted performance counters, for example those for execution port pressure or for use of FP-chained long-latency arithmetic operations.
B.5.2.3
Memory Bound Characterization
More primitive methods of characterizing performance issues in the memory pipeline tend to use naïve calculations to estimate the penalty of memory stalls. Usually the number of misses to a given cache level is multiplied by a pre-defined latency for that cache level per the CPU specifications, in order to get an estimate of the penalty. While this might work for an in-order processor, it often over-estimates the contribution of memory accesses to CPU cycles for highly out-of-order processors, because memory accesses tend to overlap and the scheduler manages to hide a good portion of the latency. The scheduler might be able to hide some of the memory access stalls by keeping the execution units busy with uops that do not require the memory access data. Thus the penalty for a memory access is incurred when the scheduler has nothing more ready to dispatch and the execution units get starved as a result. It is likely that further uops are either waiting for memory access data, or depend on other non-dispatched uops.
In Intel microarchitecture code name Ivy Bridge, a new performance monitoring event “CYCLE_ACTIVITY.STALLS_LDM_PENDING” is provided to estimate the exposure of memory accesses. We use it to define the “memory bound” metric. This event measures the cycles when there is a non-completed in-flight memory demand load coincident with execution starvation. Note we account only for demand load operations, as uops do not typically wait for (direct) completion of stores or HW prefetches:
%Memory_Bound = CYCLE_ACTIVITY.STALLS_LDM_PENDING / CLOCKS
If a workload is memory bound, it is possible to further characterize its performance with respect to the contribution of the cache hierarchy and DRAM system memory. The L1 cache typically has the shortest latency, which is comparable to ALU units' stalls that are the shortest among all stalls. Yet in certain cases, like loads blocked on older stores, a load might suffer high latency while eventually being satisfied by the L1. There are no fill buffers allocated for L1 hits; instead we use the LDM stalls sub-event, as it accounts for any non-completed load.
%L1 Bound = (CYCLE_ACTIVITY.STALLS_LDM_PENDING - CYCLE_ACTIVITY.STALLS_L1D_PENDING) / CLOCKS
As explained above, L2 Bound is detected as:
%L2 Bound = (CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING) / CLOCKS
In principle, L3 Bound can be calculated similarly by subtracting out the L3 miss contribution. However, an equivalent event to measure L3_PENDING is not available. Nevertheless, we can infer an estimate using the L3_HIT and L3_MISS load count events in conjunction with a correction factor. This estimation can be tolerated because the latencies are longer on L3 and memory. The correction factor MEM_L3_WEIGHT is approximately the ratio of external memory latency to L3 cache latency. A factor of 7 can be used for the third generation Intel Core processor family. Note this correction factor has some dependency on CPU and memory frequencies.
%L3 Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Hit_fraction / CLOCKS
Where L3_Hit_fraction is:
MEM_LOAD_UOPS_RETIRED.LLC_HIT / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_L3_WEIGHT * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS)
To estimate the exposure of DRAM traffic on third generation Intel Core processors, the remainder of L2_PENDING is used for MEM Bound:
%MEM Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING * L3_Miss_fraction / CLOCKS
Where L3_Miss_fraction is:
MEM_L3_WEIGHT * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS / (MEM_LOAD_UOPS_RETIRED.LLC_HIT + MEM_L3_WEIGHT * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS)
Sometimes it is meaningful to refer to all memory stalls outside the core as Uncore Bound:
%Uncore Bound = CYCLE_ACTIVITY.STALLS_L2_PENDING / CLOCKS
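The memory-bound formulas above chain together, so a small worked sketch may help. It assumes CLOCKS is the unhalted cycle count for the interval, uses the suggested weight of 7 as a default, and returns fractions of CLOCKS, as the formulas are written. The function name, the `CLOCKS` dictionary key and the sample counts are hypothetical.

```python
def memory_bound_breakdown(c: dict, mem_l3_weight: float = 7.0) -> dict:
    """Split memory-bound stall cycles by hierarchy level, as fractions of CLOCKS."""
    clocks = c["CLOCKS"]
    if clocks <= 0:
        raise ValueError("CLOCKS must be positive")
    ldm = c["CYCLE_ACTIVITY.STALLS_LDM_PENDING"]
    l1d = c["CYCLE_ACTIVITY.STALLS_L1D_PENDING"]
    l2 = c["CYCLE_ACTIVITY.STALLS_L2_PENDING"]
    hits = c["MEM_LOAD_UOPS_RETIRED.LLC_HIT"]
    weighted_misses = mem_l3_weight * c["MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS"]
    total = hits + weighted_misses
    # With no L3 traffic at all, attribute nothing to either L3 or memory.
    hit_fraction = hits / total if total else 0.0
    miss_fraction = weighted_misses / total if total else 0.0
    return {
        "Memory_Bound": ldm / clocks,
        "L1_Bound": (ldm - l1d) / clocks,
        "L2_Bound": (l1d - l2) / clocks,
        "L3_Bound": l2 * hit_fraction / clocks,
        "MEM_Bound": l2 * miss_fraction / clocks,
        "Uncore_Bound": l2 / clocks,
    }


# Hypothetical counts, for illustration only.
sample = {
    "CLOCKS": 10_000,
    "CYCLE_ACTIVITY.STALLS_LDM_PENDING": 4_000,
    "CYCLE_ACTIVITY.STALLS_L1D_PENDING": 3_000,
    "CYCLE_ACTIVITY.STALLS_L2_PENDING": 2_000,
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT": 900,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS": 100,
}
for name, value in memory_bound_breakdown(sample).items():
    print(f"{name:>13}: {value:.4f}")
```

By construction L3_Bound and MEM_Bound add up to Uncore_Bound, and L1_Bound, L2_Bound and Uncore_Bound add up to Memory_Bound.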
B.5.3
Back End Stalls
Back end stalls have two main sources: memory sub-system stalls and execution stalls. As a first step to understanding the source of back end stalls, use the resource stall event. Before putting micro-ops into the scheduler, the rename stage has to have certain resources allocated. When an application encounters a significant bottleneck at the back end of the pipeline, it runs out of these resources as the pipeline backs up. The RESOURCE_STALLS event tracks stall cycles when a resource could not be allocated. The event breaks up each resource into a separate sub-event so you can track which resource is not available for allocation. Counting these events can help identify the reason for issues in the back end of the pipeline. The resource stall ratios described below can be computed at process, module, function and even instruction granularities, with the cycles counted by CPU_CLK_UNHALTED.THREAD representing the penalty tagged at the same granularity.
Usages of Specific Events
RESOURCE_STALLS.ANY - Counts stall cycles in which the rename stage is unable to put micro-ops into the scheduler, due to lack of resources that have to be allocated at this stage. The event skid tends to be low since it is close to the retirement of the blocking instruction. This event accounts for all stalls counted by the other RESOURCE_STALLS sub-events and also includes the sub-events of RESOURCE_STALLS2. If this ratio is high, count the included sub-events to better isolate the reason for the stall.
%RESOURCE.STALLS.COST = 100 * RESOURCE_STALLS.ANY / CPU_CLK_UNHALTED.THREAD;
RESOURCE_STALLS.SB - Occurs when a store micro-op is ready for allocation and all store buffer entries are in use, usually due to long latency stores in progress. Typically this event tags to the IP after the store instruction that is stalled at allocation.
%RESOURCE.STALLS.SB.COST = 100 * RESOURCE_STALLS.SB / CPU_CLK_UNHALTED.THREAD;
RESOURCE_STALLS.LB - Counts cycles in which a load micro-op is ready for allocation and all load buffer entries are taken, usually due to long latency loads in progress. In many cases the queue to the scheduler fills with micro-ops that depend on the long latency loads before the load buffer gets full.
%RESOURCE.STALLS.LB.COST = 100 * RESOURCE_STALLS.LB / CPU_CLK_UNHALTED.THREAD;
In the above cases the event RESOURCE_STALLS.RS will often count in parallel. The best methodology to further investigate loss in data locality is the high cache line replacement study described in Section B.5.4.2, concentrating on L1 DCache replacements first.
RESOURCE_STALLS.RS - Scheduler slots are typically the first resource that runs out when the pipeline is backed up. However, this can be due to almost any bottleneck in the back end, including long latency loads and instructions backed up at the execute stage. Thus it is recommended to investigate other resource stalls before digging into the stalls tagged to lack of scheduler entries. The skid tends to be low on this event.
%RESOURCE.STALLS.RS.COST = 100 * RESOURCE_STALLS.RS / CPU_CLK_UNHALTED.THREAD;
RESOURCE_STALLS.ROB - Counts cycles when allocation stalls because all the reorder buffer (ROB) entries are taken. This event occurs less frequently than RESOURCE_STALLS.RS and typically indicates that the pipeline is being backed up by a micro-op that is holding all younger micro-ops from retiring, because they have to retire in order.
%RESOURCE.STALLS.ROB.COST = 100 * RESOURCE_STALLS.ROB / CPU_CLK_UNHALTED.THREAD;
RESOURCE_STALLS2.BOB_FULL - Counts when allocation is stalled due to a branch micro-op that is ready for allocation, but the number of branches in progress in the processor has reached the limit.
%RESOURCE.STALLS.BOB.COST = 100 * RESOURCE_STALLS2.BOB_FULL / CPU_CLK_UNHALTED.THREAD;
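The resource stall ratios above all share one shape: a sub-event count divided by unhalted cycles at the same granularity. The following is a minimal Python sketch of that arithmetic, assuming the raw counts have already been collected by some profiling tool and placed in a dictionary keyed by event name. The function name and the sample numbers are hypothetical and only illustrate the calculation.

```python
# Event names as used in the text; a missing event is treated as a zero count.
RESOURCE_STALL_EVENTS = (
    "RESOURCE_STALLS.ANY",
    "RESOURCE_STALLS.SB",
    "RESOURCE_STALLS.LB",
    "RESOURCE_STALLS.RS",
    "RESOURCE_STALLS.ROB",
    "RESOURCE_STALLS2.BOB_FULL",
)


def resource_stall_costs(counts):
    """Return each resource stall event as a percentage of unhalted cycles."""
    cycles = counts["CPU_CLK_UNHALTED.THREAD"]
    if cycles <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")
    return {e: 100.0 * counts.get(e, 0) / cycles for e in RESOURCE_STALL_EVENTS}


# Hypothetical counts for one function (not measured data).
sample_counts = {
    "CPU_CLK_UNHALTED.THREAD": 1_000_000,
    "RESOURCE_STALLS.ANY": 250_000,
    "RESOURCE_STALLS.SB": 40_000,
    "RESOURCE_STALLS.LB": 10_000,
    "RESOURCE_STALLS.RS": 180_000,
    "RESOURCE_STALLS.ROB": 20_000,
}

# Print the sub-events from most to least costly.
ranked = sorted(resource_stall_costs(sample_counts).items(),
                key=lambda kv: kv[1], reverse=True)
for event, pct in ranked:
    print(f"{event:<28} {pct:6.2f}%")
```

Sorting by cost puts RESOURCE_STALLS.ANY first, since it includes the other sub-events. The entries after it suggest which resource to look at next, keeping in mind the advice above to examine other stalls before RESOURCE_STALLS.RS.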
B.5.4
Memory Sub-System Stalls
The following subsections discuss using specific performance monitoring events in Intel microarchitecture code name Sandy Bridge to identify stalls in the memory sub-systems.
B.5.4.1
Accounting for Load Latency
The breakdown of load operation locality can be accomplished at any granularity including process, module, function and instruction. When you find that a load instruction is a bottleneck, investigate it further with the precise load breakdown. If this does not explain the bottleneck, check for other issues which can impact loads. You can use these events to estimate the cost of the load causing a bottleneck, and to obtain a percentage breakdown by memory hierarchy level. Not all tools provide support for precise event sampling. If the precise version (event name ends with the suffix PS) of these events is not supported in a given tool, you can use the non-precise version. The precise load events tag the event to the next instruction retired (IP+1). See the load latency at each hierarchy level in Table 2-17.
Required events
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS - Counts demand loads that hit the first level of the data cache, the L1 DCache. Demand loads are non-speculative load micro-ops.
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS - Counts demand loads that hit the 2nd level cache, the L2.
MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS - Counts demand loads that hit the 3rd level shared cache, the LLC.
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS - Counts demand loads that hit the 3rd level shared cache and are assumed to be present also in a cache of another core, but the cache line was already evicted from there.
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS - Counts demand loads that hit a cache line in a cache of another core and the cache line has not been modified.
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS - Counts demand loads that hit a cache line in the cache of another core and the cache line has been written to by that other core. This event is important for many performance bottlenecks that can occur in multi-threaded applications, such as lock contention and false sharing.
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS - Counts demand loads that missed the LLC. This means that the load is usually satisfied from memory in a client system.
MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS - Counts demand loads that hit in the line fill buffer (LFB). An LFB entry is allocated every time a miss occurs in the L1 DCache. When a load hits at this location it means that a previous load, store or hardware prefetch has already missed in the L1 DCache and the data fetch is in progress. Therefore the cost of a hit in the LFB varies. This event may count cache-line split loads that miss in the L1 DCache but do not miss the LLC.
On 32-byte Intel AVX loads, all loads that miss in the L1 DCache show up as hits in the L1 DCache or hits in the LFB. They never show hits on any other level of the memory hierarchy. Most loads arise from the line fill buffer (LFB) when Intel AVX loads miss in the L1 DCache.
Precise Load Breakdown
The percentage breakdown of each load source can be tagged at any granularity including a single IP, function, module, or process. This is particularly useful at a single instruction to determine the breakdown of where the load was found in the cache hierarchy. The following formula shows how to calculate the percentage of time a load was satisfied by the LLC. Similar formulas can be used for all other hierarchy levels.
%LocL3.HIT = 100 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS / $SumOf_PRECISE_LOADS;
$SumOf_PRECISE_LOADS = MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS + MEM_LOAD_UOPS_RETIRED.L1_HIT_PS + MEM_LOAD_UOPS_RETIRED.L2_HIT_PS + MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS + MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS + MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS;
Estimated Load Penalty
The formulas below help estimate to what degree loads from a certain memory hierarchy level are responsible for a slowdown. The CPU_CLK_UNHALTED.THREAD programmable event represents the penalty in cycles tagged at the same granularity. At the instruction level, the cycle cost of an expensive load tends to skid only one IP, similar to the precise event. The calculations below apply to any granularity (process, module, function or instruction), since the events are precise. Anything representing 10% or more of the total clocks in a granularity of interest should be investigated. If the code has highly dependent loads you can use the MEM_LOAD_UOPS_RETIRED.L1_HIT_PS event to determine if the loads are hit by the five cycle latency of the L1 DCache.
Estimated cost of L2 latency
%L2.COST = 100 * 12 * MEM_LOAD_UOPS_RETIRED.L2_HIT_PS / CPU_CLK_UNHALTED.THREAD;
Estimated cost of L3 hits
%L3.COST = 100 * 26 * MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS / CPU_CLK_UNHALTED.THREAD;
Estimated cost of hits in the cache of other cores
%HIT.COST = 100 * 43 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS / CPU_CLK_UNHALTED.THREAD;
Estimated cost of memory latency
%MEMORY.COST = 100 * 200 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS / CPU_CLK_UNHALTED.THREAD;
Actual memory latency can vary greatly depending on memory parameters. The amount of concurrent memory traffic often reduces the effective cost of a given memory hierarchy level. Typically, the estimates above may be on the pessimistic side (as in pointer-chasing situations). Often, cache misses will manifest as delaying and bunching on the retirement of instructions. The precise loads breakdown can provide estimates of the distribution of hierarchy levels where the load is satisfied.
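The precise load breakdown and the estimated load penalty described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: event counts arrive in a dictionary keyed by event name, costs are expressed as a percentage of unhalted cycles, and the latency weights (12, 26, 43 and 200 cycles) are the estimates quoted in the text. The function names and the 10% default threshold parameter are hypothetical helpers, not part of any tool.

```python
# The eight precise load events that make up $SumOf_PRECISE_LOADS.
PRECISE_LOAD_EVENTS = (
    "MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS",
    "MEM_LOAD_UOPS_RETIRED.L1_HIT_PS",
    "MEM_LOAD_UOPS_RETIRED.L2_HIT_PS",
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS",
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS",
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS",
)

# Estimated cycles per load, taken from the cost formulas in the text.
LATENCY_CYCLES = {
    "MEM_LOAD_UOPS_RETIRED.L2_HIT_PS": 12,
    "MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS": 26,
    "MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS": 43,
    "MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS": 200,
}


def load_breakdown(counts):
    """Percentage of precise loads satisfied at each hierarchy level."""
    total = sum(counts.get(e, 0) for e in PRECISE_LOAD_EVENTS)
    if total == 0:
        return {e: 0.0 for e in PRECISE_LOAD_EVENTS}
    return {e: 100.0 * counts.get(e, 0) / total for e in PRECISE_LOAD_EVENTS}


def load_penalties(counts, threshold_pct=10.0):
    """Map event -> (estimated % of cycles, True if at or above the threshold)."""
    cycles = counts["CPU_CLK_UNHALTED.THREAD"]
    if cycles <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")
    penalties = {}
    for event, latency in LATENCY_CYCLES.items():
        pct = 100.0 * latency * counts.get(event, 0) / cycles
        penalties[event] = (pct, pct >= threshold_pct)
    return penalties
```

Because the weights are fixed latencies, the result inherits the caveat stated above: overlapping memory traffic usually makes the real cost lower than the estimate.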
Given a significant impact from a particular cache level, the first step is to find where heavy cache line replacements are occurring in the code. This could coincide with your hot portions of code detected by the memory hierarchy breakdown, but often does not. For instance, regular traversal of a large data structure can unintentionally clear out levels of cache. If hits of non-modified or modified data in another core have high estimated cost and are hot at locations in the code, it can be due to locking, sharing or false sharing issues between threads. If load latency in memory hierarchy levels further from the L1 DCache does not justify the amount of cycles spent on a load, try one of the following:
• Eliminate unneeded load operations, such as spilling general purpose registers to XMM registers rather than memory.
• Continue searching for issues impacting load instructions described in Section B.5.4.4.
B.5.4.2
Cache-line Replacement Analysis
When an application has many cache misses, it is a good idea to determine where cache lines are being replaced at the highest frequency. The instructions responsible for a high amount of cache replacements are not always where the application is spending the majority of its time, since replacements can be driven by the hardware prefetchers and store operations which in the common case do not hold up the pipeline. Typically traversing large arrays or data structures can cause heavy cache line replacements.
Required events:
L1D.REPLACEMENT - Replacements in the 1st level data cache.
L2_LINES_IN.ALL - Cache lines being brought into the L2 cache.
Usages of events:
Identifying the replacements that potentially cause performance loss can be done at process, module, and function level. Do it in two steps:
• Use the precise load breakdown to identify the memory hierarchy level at which loads are satisfied and cause the highest penalty.
• Identify, using the formulas below, which portion of code causes the majority of the replacements in the level below the one that satisfies these high penalty loads.
For example, if there is high penalty due to loads hitting the LLC, check the code which is causing replacements in the L2 and the L1. In the formulas below, the numerators are the replacements accounted for a module or function. The denominators are the sum of all replacements in a cache level for all processes. This enables you to identify the portion of code that causes the majority of the replacements.
L1D Cache Replacements
%L1D.REPLACEMENT = 100 * L1D.REPLACEMENT / SumOverAllProcesses(L1D.REPLACEMENT);
L2 Cache Replacements
%L2.REPLACEMENT = 100 * L2_LINES_IN.ALL / SumOverAllProcesses(L2_LINES_IN.ALL);
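The replacement ratios above compare one module or function against the system-wide total. As a rough sketch in Python, assuming a profiler has already attributed L1D.REPLACEMENT (or L2_LINES_IN.ALL) counts to named modules across all processes, the ranking could look like this. The function name and the module names in the usage note are hypothetical.

```python
def replacement_share(per_unit_counts):
    """Rank code units by their share (in percent) of all cache replacements.

    per_unit_counts maps a module or function name to its replacement count;
    the sum over the mapping plays the role of SumOverAllProcesses(...).
    """
    total = sum(per_unit_counts.values())
    if total == 0:
        return []
    shares = [(name, 100.0 * n / total) for name, n in per_unit_counts.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)
```

For instance, `replacement_share({"libfoo.so": 750, "app": 250})` places `libfoo.so` first with three quarters of the replacements, which would make it the first place to look for large array or data structure traversals.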
B.5.4.3
Lock Contention Analysis
The amount of contention on locks is critical in scalability analysis of multi-threaded applications. A typical ring3 lock almost always results in the execution of an atomic instruction. An atomic instruction is either an XCHG instruction involving a memory address or one of the following instructions with memory destination and lock prefix: ADD, ADC, AND, BTC, BTR, BTS, CMPXCHG, CMPXCHG8B, DEC, INC, NEG, NOT, OR, SBB, SUB, XOR or XADD. Precise events enable you to get an idea of the contention on any lock. Many locking APIs start with an atomic instruction in ring3 and back off a contended lock by jumping into ring0. This means many locking APIs can be very costly in low contention scenarios. To estimate the amount of contention on a locked instruction, you can measure the number of times the cache line containing the memory destination of an atomic instruction is found modified in another core.
Required events:
MEM_UOPS_RETIRED.LOCK_LOADS_PS - Counts the number of atomic instructions which are retired with a precise skid of IP+1.
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS - Counts the occurrences in which the load hits a modified cache line in another core. This event is important for many performance bottlenecks that can occur in multi-core systems, such as lock contention and false sharing.
Usages of events:
The lock contention factor gives the percentage of locked operations executed that contend with another core and therefore have a high penalty. Usually a lock contention factor over 5% is worth investigating on a hot lock. A heavily contended lock may impact the performance of multiple threads.
%LOCK.CONTENTION = 100 * MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS / MEM_UOPS_RETIRED.LOCK_LOADS_PS;
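The lock contention factor reduces to a single ratio plus the 5% rule of thumb from the text. A minimal Python sketch follows, assuming both counts were sampled at the IP following the same locked instruction. The function names are illustrative only.

```python
def lock_contention_pct(xsnp_hitm, lock_loads):
    """%LOCK.CONTENTION: share of locked loads that hit a line modified elsewhere.

    xsnp_hitm  -- MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS count
    lock_loads -- MEM_UOPS_RETIRED.LOCK_LOADS_PS count
    """
    if lock_loads == 0:
        return 0.0
    return 100.0 * xsnp_hitm / lock_loads


def is_contended(xsnp_hitm, lock_loads, threshold_pct=5.0):
    """Apply the 'over 5% is worth investigating' guideline from the text."""
    return lock_contention_pct(xsnp_hitm, lock_loads) > threshold_pct
```

The threshold only matters on a hot lock. A high factor on a lock that is rarely taken contributes little to total cycles.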
B.5.4.4
Other Memory Access Issues
Store Forwarding Blocked
When store forwarding is not possible the dependent loads are blocked. The average penalty for a store forward block is 13 cycles. Since many cases of store forwarding blocked were fixed in prior architectures, the most common case in code today involves storing to a smaller memory space than an ensuing larger load.
Required events:
LD_BLOCKS.STORE_FORWARD - Counts the number of times a store forward opportunity is blocked due to the inability of the architecture to forward a small store to a larger load, and some rare alignment cases.
Usages of Events:
Use the following formula to estimate the cost of the store forward block. The event LD_BLOCKS.STORE_FORWARD tends to be tagged to the next IP after the attempted load, so it is recommended to look at this issue at the instruction level. However it is possible to inspect the ratio at any granularity: process, module, function or IP.
%STORE.FORWARD.BLOCK.COST = 100 * LD_BLOCKS.STORE_FORWARD * 13 / CPU_CLK_UNHALTED.THREAD;
After you find a load that is blocked from store forwarding, you need to find the location of the store. Typically, about 60% of all store forward blocked issues are caused by stores in the last 10 instructions executed prior to the load. The most common case where we see store forward blocked is a small store that is unable to forward to a larger load. For example, the code below writes to a byte pointer address and then reads from a four-byte (dword) memory space:
and byte ptr [ebx], 7f
and dword ptr [ebx], ecx
To fix a store forward block it is usually best to fix the store operation and not the load.
Cache Line Splits
Starting from Intel microarchitecture code name Nehalem, the L1 DCache has split registers which enable it to handle loads and stores that span two cache lines in a faster manner. This puts the cost of split loads at about five cycles, as long as split registers are available, instead of the 20 cycles required in earlier microarchitectures. Handling of split stores is usually hidden, but if there are many of them they can stall allocation due to a full store buffer, or they can consume split registers that may be needed for handling split loads. You can still get solid quantifiable gains from eliminating cache line splits.
Required events:
MEM_UOPS_RETIRED.SPLIT_LOADS_PS - Counts the number of demand loads that span two cache lines. The event is precise.
MEM_UOPS_RETIRED.SPLIT_STORES_PS - Counts the number of stores that span two cache lines. The event is precise.
Usages of events:
Finding split loads is fairly easy because they usually tag the majority of their cost to the next IP which is executed. The ratio below can be used at any granularity: process, module, function, and IP after split.
%SPLIT.LOAD.COST = 100 * MEM_UOPS_RETIRED.SPLIT_LOADS_PS * 5 / CPU_CLK_UNHALTED.THREAD;
Split store penalty is more difficult to find using an estimated cost, because in typical cases stores do not push out the retirement of instructions. To detect a significant amount of split stores, divide their number by the total number of stores retired at that IP.
SPLIT.STORE.RATIO = MEM_UOPS_RETIRED.SPLIT_STORES_PS / MEM_UOPS_RETIRED.ANY_STORES_PS;
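The store forward and cache line split estimates above weight an event count by a fixed per-event penalty. The Python sketch below makes the weights explicit. The 13-cycle and 5-cycle constants come from the text, while the function names are hypothetical and the inputs are assumed to be raw counts gathered at one granularity.

```python
STORE_FORWARD_BLOCK_CYCLES = 13  # average penalty quoted in the text
SPLIT_LOAD_CYCLES = 5            # split load cost when split registers are free


def weighted_cost_pct(event_count, cycles_per_event, unhalted_cycles):
    """Generic form: 100 * count * penalty / CPU_CLK_UNHALTED.THREAD."""
    if unhalted_cycles <= 0:
        raise ValueError("unhalted_cycles must be positive")
    return 100.0 * event_count * cycles_per_event / unhalted_cycles


def store_forward_block_cost(ld_blocks_store_forward, unhalted_cycles):
    return weighted_cost_pct(ld_blocks_store_forward,
                             STORE_FORWARD_BLOCK_CYCLES, unhalted_cycles)


def split_load_cost(split_loads, unhalted_cycles):
    return weighted_cost_pct(split_loads, SPLIT_LOAD_CYCLES, unhalted_cycles)


def split_store_ratio(split_stores, all_stores):
    """Fraction (not a percentage) of retired stores that span two cache lines."""
    return split_stores / all_stores if all_stores else 0.0
```

Split stores get a plain ratio rather than a cycle cost because, as noted above, they rarely delay retirement directly.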
4k Aliasing
A 4k aliasing conflict between loads and stores causes a reissue on the load. Five cycles is used as an estimate in the model below.
Required Events:
LD_BLOCKS_PARTIAL.ADDRESS_ALIAS - Counts the number of loads that have a partial address match with preceding stores, causing the load to be reissued.
Usages of events:
%4KALIAS.COST = 100 * LD_BLOCKS_PARTIAL.ADDRESS_ALIAS * 5 / CPU_CLK_UNHALTED.THREAD;
Load and Store Address Translation
There are two levels of translation look-aside buffer (TLB) for linear to physical address translation. A miss in the DTLB, the first level TLB, that hits in the STLB, the second level TLB, incurs a seven cycle penalty. Missing in the STLB requires the processor to walk through page table entries that contain the address translation. These walks have variable cost depending on the location of the page table entries. The walk duration is a fairly good estimate of the cost of STLB misses.
Required events:
DTLB_LOAD_MISSES.STLB_HIT - Counts loads that miss the DTLB and hit in the STLB. This event has a low skid and hence can be used at the IP level.
DTLB_LOAD_MISSES.WALK_DURATION - Duration of page walks in cycles following STLB misses. Event skid is typically one instruction, enabling you to detect the issue at instruction, function, module or process granularities.
MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS - Precise event for loads which have their translation miss the STLB. The event counts only the first load from a page that initiates the page walk.
Usage of events:
Cost of STLB hits on loads:
%STLB.HIT.COST = 100 * DTLB_LOAD_MISSES.STLB_HIT * 7 / CPU_CLK_UNHALTED.THREAD;
Cost of page walks:
%STLB.LOAD.MISS.WALK.COST = 100 * DTLB_LOAD_MISSES.WALK_DURATION / CPU_CLK_UNHALTED.THREAD;
Use the precise STLB miss event at the IP level to determine exactly which instruction and source line suffers from frequent STLB misses.
%STLB.LOAD.MISS = 100 * MEM_UOPS_RETIRED.STLB_MISS_LOADS_PS / MEM_UOPS_RETIRED.ANY_LOADS_PS;
Large walk durations, of hundreds of cycles, are an indication that the page tables have been thrown out of the LLC. To determine the average cost of a page walk use the following ratio:
STLB.LOAD.MISS.AVGCOST = DTLB_LOAD_MISSES.WALK_DURATION / DTLB_LOAD_MISSES.WALK_COMPLETED;
To a lesser extent than loads, STLB misses on stores can be a bottleneck. If the store itself is a large bottleneck, cycles will tag to the next IP after the store.
%STLB.STORE.MISS = 100 * MEM_UOPS_RETIRED.STLB_MISS_STORES_PS / MEM_UOPS_RETIRED.ANY_STORES_PS;
Increasing data locality reduces DTLB/STLB misses. One may consider using a commercial-grade memory allocator to improve data locality. Compilers which offer profile guided optimizations may reorder global variables to increase data locality, if the compiler can operate on the whole module. For issues with a large amount of time spent in page walks, server and HPC applications may be able to use large pages for data.
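The address translation ratios above can be combined into one small report. The Python sketch below is an illustration rather than a tool: it assumes the counts are supplied in a dictionary keyed by the event names used in this section, uses the seven cycle STLB hit penalty quoted in the text, and introduces a hypothetical `tlb_report` helper. The result keys are illustrative labels, not event names.

```python
STLB_HIT_CYCLES = 7  # DTLB miss that hits the STLB, per the text


def tlb_report(counts):
    """Compute the load-side TLB ratios from raw event counts."""
    cycles = counts["CPU_CLK_UNHALTED.THREAD"]
    if cycles <= 0:
        raise ValueError("CPU_CLK_UNHALTED.THREAD must be positive")
    walk_duration = counts.get("DTLB_LOAD_MISSES.WALK_DURATION", 0)
    walks_done = counts.get("DTLB_LOAD_MISSES.WALK_COMPLETED", 0)
    stlb_hits = counts.get("DTLB_LOAD_MISSES.STLB_HIT", 0)
    return {
        # Percentage of cycles spent on DTLB misses that hit the STLB.
        "stlb_hit_cost_pct": 100.0 * stlb_hits * STLB_HIT_CYCLES / cycles,
        # Percentage of cycles spent in page walks.
        "walk_cost_pct": 100.0 * walk_duration / cycles,
        # Average cycles per completed walk; None when no walk completed.
        "avg_walk_cycles": walk_duration / walks_done if walks_done else None,
    }
```

An average walk cost in the hundreds of cycles would match the situation described above, where page table entries are no longer resident in the LLC and large pages may be worth evaluating.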
B.5.5
Execution Stalls
The following subsections discuss using specific performance monitoring events in Intel microarchitecture code name Sandy Bridge to identify stalls in the out-of-order engine.
B.5.5.1
Longer Instruction Latencies
Some microarchitectural changes manifested in longer latency for some legacy instructions in existing code. It is possible to detect some of these situations:
• Three-operand slow LEA instructions (see Section 3.5.1.3).
• Flags merge micro-op - These merges are primarily required by "shift cl" instructions (see Section 3.5.2.6).
These events tend to have a skid as high as 10 instructions because the eventing condition is detected early in the pipeline.
Event usage:
To use these events effectively without being distracted by the event skid, you can use them to locate performance issues at the process, module and function granularities, but not at the instruction level. To identify issues at the instruction IP granularity, one can perform static analysis on the functions identified by these events. To estimate the contribution of these events to the code latency, divide them by the cycles at the same granularity. To estimate the overall impact, start with the total cycles due to these issues and, if significant, continue to search for the exact reason using the sub-events.
Total cycles spent in the specified scenarios:
Flags Merge micro-op ratio:
%FLAGS.MERGE.UOP = 100 * PARTIAL_RAT_STALLS.FLAGS_MERGE_UOP_CYCLES / CPU_CLK_UNHALTED.THREAD;
Slow LEA instructions allocated:
%SLOW.LEA.WINDOW = 100 * PARTIAL_RAT_STALLS.SLOW_LEA_WINDOW / CPU_CLK_UNHALTED.THREAD;
B.5.5.2
Assists
Assists usually involve the microcode sequencer that helps handle the assist. Determining the number of cycles where microcode is generated from the microcode sequencer is often a good methodology to determine the total cost of the assist. If the overall cost of assists is high, a breakdown of assists into specific types will be useful.
Estimating the total cost of assists using microcode sequencer cycles:
%ASSISTS.COST = 100 * IDQ.MS_CYCLES / CPU_CLK_UNHALTED.THREAD;
Floating-point assists:
Denormal inputs for X87 instructions require an FP assist, potentially costing hundreds of cycles.
%FP.ASSISTS = 100 * FP_ASSIST.ANY / INST_RETIRED.ANY;
Transitions between Intel SSE and Intel AVX:
The transitions between SSE and AVX code are explained in detail in Section 11.3.1. The typical cost is about 75 cycles.
%AVX2SSE.TRANSITION.COST = 100 * 75 * OTHER_ASSISTS.AVX_TO_SSE / CPU_CLK_UNHALTED.THREAD;
%SSE2AVX.TRANSITION.COST = 100 * 75 * OTHER_ASSISTS.SSE_TO_AVX / CPU_CLK_UNHALTED.THREAD;
32-byte AVX store instructions that span two pages require an assist that costs roughly 150 cycles. A large amount of microcode tagged to the IP after a 32-byte AVX store is a good sign that an assist has occurred.
%AVX.STORE.ASSIST.COST = 100 * 150 * OTHER_ASSISTS.AVX_STORE / CPU_CLK_UNHALTED.THREAD;
B.5.6
Bad Speculation
This section discusses mispredicted branch instructions resulting in a pipeline flush.
B.5.6.1
Branch Mispredicts
The largest challenge with mispredicted branches is finding the branch which caused them. Branch mispredictions incur a penalty of about 20 cycles. The cost varies based upon the misprediction, and whether the correct path is found in the Decoded ICache or in the legacy decode pipeline.
Required Events:
BR_MISP_RETIRED.ALL_BRANCHES_PS is a precise event that counts branches for which the branch target was incorrectly predicted. Since this is a precise event that skids to the next instruction, it tags to the first instruction in the correct path after the branch misprediction. This study can be performed at the process, module, function or instruction granularity.
Usages of Events:
Use the following ratio to estimate the cost of mispredicted branches:
%BR.MISP.COST = 100 * 20 * BR_MISP_RETIRED.ALL_BRANCHES_PS / CPU_CLK_UNHALTED.THREAD;
B.5.7
Front End Stalls
Stalls in the front end should not be investigated unless the analysis in Section B.5.2 showed at least 30% of a granularity being bound in the front end. This section explains the main issues that can cause delays in the front end of the pipeline. Events detected in the front end have unpredictable skid. Therefore do not try to associate the penalty at the IP level. Stay at the function, module, and process level for these events.
B.5.7.1
Understanding the Micro-op Delivery Rate
Usages of Counters
The event IDQ_UOPS_NOT_DELIVERED counts when the maximum of four micro-ops are not delivered to the rename stage, while it is requesting micro-ops. When the pipeline is backed up the rename stage does not request any further micro-ops from the front end. The diagram above shows how this event tracks micro-ops between the micro-op queue and the rename stage.
You can use the IDQ_UOPS_NOT_DELIVERED event to break down the distribution of cycles when 0, 1, 2, 3 micro-ops are delivered from the front end.
Percentage of cycles the front end is effective, or execution is back end bound:
%FE.DELIVERING = 100 * (CPU_CLK_UNHALTED.THREAD - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE) / CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering three micro-ops per cycle:
%FE.DELIVER.3UOPS = 100 * (IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE) / CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering two micro-ops per cycle:
%FE.DELIVER.2UOPS = 100 * (IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE) / CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering one micro-op per cycle:
%FE.DELIVER.1UOPS = 100 * (IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE - IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / CPU_CLK_UNHALTED.THREAD;
Percentage of cycles the front end is delivering zero micro-ops per cycle:
%FE.DELIVER.0UOPS = 100 * (IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE) / CPU_CLK_UNHALTED.THREAD;
Average Micro-ops Delivered per Cycle: This ratio assumes that the front end could potentially deliver four micro-ops per cycle when bound in the back end.
AVG.uops.per.cycle = (4 * (%FE.DELIVERING) + 3 * (%FE.DELIVER.3UOPS) + 2 * (%FE.DELIVER.2UOPS) + (%FE.DELIVER.1UOPS)) / 100
Seeing the distribution of the micro-ops being delivered in a cycle is a hint at the front end bottlenecks that might be occurring. Issues such as LCPs and penalties from switching from the Decoded ICache to the legacy decode pipeline tend to result in zero micro-ops being delivered for several cycles. Fetch bandwidth issues and decoder stalls result in fewer than four micro-ops delivered per cycle.
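The delivery distribution above is built from cumulative "at most N micro-ops delivered" cycle counts, so each bucket is the difference between two neighboring counters. The Python sketch below works through that subtraction and the resulting average. It assumes the four IDQ_UOPS_NOT_DELIVERED counts and the unhalted cycle count were collected at the same granularity; the function name and argument names are hypothetical.

```python
def frontend_delivery(cycles, le3, le2, le1, zero):
    """Return ({uops_per_cycle: percent_of_cycles}, average_uops_per_cycle).

    cycles -- CPU_CLK_UNHALTED.THREAD
    le3    -- IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_3_UOP_DELIV.CORE
    le2    -- IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_2_UOP_DELIV.CORE
    le1    -- IDQ_UOPS_NOT_DELIVERED.CYCLES_LE_1_UOP_DELIV.CORE
    zero   -- IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE
    """
    if cycles <= 0:
        raise ValueError("cycles must be positive")

    def pct(n):
        return 100.0 * n / cycles

    distribution = {
        4: pct(cycles - le3),  # %FE.DELIVERING (or back end bound)
        3: pct(le3 - le2),     # %FE.DELIVER.3UOPS
        2: pct(le2 - le1),     # %FE.DELIVER.2UOPS
        1: pct(le1 - zero),    # %FE.DELIVER.1UOPS
        0: pct(zero),          # %FE.DELIVER.0UOPS
    }
    # Weighted average; the zero-uop bucket contributes nothing to the sum.
    average = sum(uops * share for uops, share in distribution.items()) / 100.0
    return distribution, average
```

Since the buckets partition all unhalted cycles, the five percentages should sum to 100. A result that does not usually means the counters were collected over different intervals.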
B.5.7.2
Understanding the Sources of the Micro-op Queue
The micro-op queue can get micro-ops from the following sources:
• Decoded ICache.
• Legacy decode pipeline.
• Microcode sequencer (MS).
A typical distribution is approximately 80% of the micro-ops coming from the Decoded ICache, 15% coming from the legacy decode pipeline and 5% coming from the microcode sequencer. Excessive micro-ops coming from the legacy decode pipeline can be a warning sign that the Decoded ICache is not working effectively. A large portion of micro-ops coming from the microcode sequencer may be benign, such as complex instructions or string operations, but can also be due to code assists handling undesired situations like Intel SSE to Intel AVX code transitions.
Description of Counters Required:
IDQ.DSB_UOPS - Micro-ops delivered to the micro-op queue from the Decoded ICache.
IDQ.MITE_UOPS - Micro-ops delivered to the micro-op queue from the legacy decode pipeline.
IDQ.MS_UOPS - Micro-ops delivered from the microcode sequencer.
Usage of Counters:
Percentage of micro-ops coming from the Decoded ICache:
%UOPS.DSB = 100 * IDQ.DSB_UOPS / ALL_IDQ_UOPS;
Percentage of micro-ops coming from the legacy decode pipeline:
%UOPS.MITE = 100 * IDQ.MITE_UOPS / ALL_IDQ_UOPS;
Percentage of micro-ops coming from the microcode sequencer:
%UOPS.MS = 100 * IDQ.MS_UOPS / ALL_IDQ_UOPS;
ALL_IDQ_UOPS = (IDQ.DSB_UOPS + IDQ.MITE_UOPS + IDQ.MS_UOPS);
If your application is not bound in the front end then whether micro-ops are coming from the legacy decode pipeline or the Decoded ICache is of lesser importance. Excessive micro-ops coming from the microcode sequencer are worth investigating further to see if assists might be a problem. Cases to investigate are listed below:
• (%FE_BOUND > 30%) and (%UOPS.DSB < 70%)
We use a threshold of 30% to define a "front end bound" case. This threshold may be applicable to many situations, but may also vary somewhat across different workloads.
— Investigate why micro-ops are not coming from the Decoded ICache.
— Investigate issues which can impact the legacy decode pipeline.
• (%FE_BOUND > 30%) and (%UOPS.DSB > 70%)
— Investigate switches from the Decoded ICache to the legacy decode pipeline, since it may be switching to run portions of code that are too small to be effective.
— Look at the amount of bad speculation, since branch mispredictions still impact FE performance.
— Determine the average number of micro-ops being delivered per 32-byte chunk hit. If there are many taken branches from one 32-byte chunk into another, it impacts the micro-ops being delivered per cycle.
— Micro-op delivery from the Decoded ICache may be an issue which is not covered.
• (%FE_BOUND < 20%) and (%UOPS.MS > 25%)
We use a threshold of 20% to define a "front end not bound" case. This threshold may be applicable to many situations, but may also vary somewhat across different workloads. The following steps can help determine why micro-ops came from the microcode, in order of most common to least common.
— Long latency instructions - Any instruction over four micro-ops starts the microcode sequencer. Some instructions such as transcendentals can generate many micro-ops from the microcode.
— String operations - String operations can produce a large amount of microcode. In some cases there are assists which can occur due to string operations such as REP MOVSB with trip count greater than 3, which costs 70+ cycles.
— Assists - See Section B.5.5.2.
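The three cases listed above amount to a small decision procedure over %FE_BOUND and the micro-op source percentages. A hedged Python sketch of that procedure follows. It assumes %FE_BOUND has already been computed by the analysis referenced earlier in this appendix and is passed in as a number, and it uses the 30%, 70%, 20% and 25% thresholds exactly as stated. The function names and the returned labels are hypothetical.

```python
def uop_source_shares(dsb_uops, mite_uops, ms_uops):
    """Percent of micro-ops from the Decoded ICache, legacy decode and MS."""
    total = dsb_uops + mite_uops + ms_uops  # ALL_IDQ_UOPS
    if total == 0:
        return {"DSB": 0.0, "MITE": 0.0, "MS": 0.0}
    return {
        "DSB": 100.0 * dsb_uops / total,    # IDQ.DSB_UOPS share
        "MITE": 100.0 * mite_uops / total,  # IDQ.MITE_UOPS share
        "MS": 100.0 * ms_uops / total,      # IDQ.MS_UOPS share
    }


def frontend_case(fe_bound_pct, shares):
    """Pick which of the listed cases applies, if any."""
    if fe_bound_pct > 30 and shares["DSB"] < 70:
        return "fe-bound-low-dsb"    # look at DSB misses and legacy decode
    if fe_bound_pct > 30 and shares["DSB"] > 70:
        return "fe-bound-high-dsb"   # look at DSB-to-MITE switches, mispredicts
    if fe_bound_pct < 20 and shares["MS"] > 25:
        return "not-fe-bound-high-ms"  # look at microcode sources and assists
    return "no-listed-case"
```

Values that fall exactly on a threshold, or between 20% and 30% front end bound, match none of the cases in this sketch, which mirrors the strict inequalities in the list. The text itself notes that the thresholds vary across workloads.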
B.5.7.3
The Decoded ICache
The Decoded ICache has many advantages over the legacy decode pipeline. It eliminates many bottlenecks of the legacy decode pipeline such as instructions decoded into more than one micro-op and length changing prefix (LCP) stalls. A switch to the legacy decode pipeline from the Decoded ICache only occurs when a lookup in the Decoded ICache fails and usually costs anywhere from zero to three cycles in the front end of the pipeline.
Required events:
The Decoded ICache events all have large skids and the exact instruction where they are tagged is usually not the source of the problem, so only look for this issue at the process, module and function granularities.
DSB2MITE_SWITCHES.PENALTY_CYCLES - Counts the cycles attributed to the switch from the Decoded ICache to the legacy decode pipeline, excluding cycles when the micro-op queue cannot accept micro-ops because it is back end bound.
DSB2MITE_SWITCHES.COUNT - Counts the number of switches between the Decoded ICache and the legacy decode pipeline.
DSB_FILL.ALL_CANCEL - Counts when fills to the Decoded ICache are canceled.
DSB_FILL.EXCEED_DSB_LINES - Counts when a fill is canceled because the allocated lines for the Decoded ICache have exceeded three for the 32-byte chunk.
Usage of Events:
Since these studies involve front end events, do not try to tag the event to a specific instruction.
Determining the cost of switches from the Decoded ICache to the legacy decode pipeline:
%DSB2MITE.SWITCH.COST = 100 * DSB2MITE_SWITCHES.PENALTY_CYCLES / CPU_CLK_UNHALTED.THREAD;
Determining the average cost per Decoded ICache switch to the legacy front end:
AVG.DSB2MITE.SWITCH.COST = DSB2MITE_SWITCHES.PENALTY_CYCLES / DSB2MITE_SWITCHES.COUNT;
Determining causes of misses in the Decoded ICache
There are no partial hits in the Decoded ICache. If any micro-op that is part of that lookup on the 32-byte chunk is missing, a Decoded ICache miss occurs on all micro-ops for that transaction. There are three primary reasons for missing micro-ops in the Decoded ICache:
• Portions of a 32-byte chunk of code were not able to fit within three ways of the Decoded ICache.
• A frequently run portion of your code section is too large for the Decoded ICache. This case is more common on server applications since client applications tend to have a smaller set of code which is "hot".
• The Decoded ICache is getting flushed, for example when an ITLB entry is evicted.
To determine if a portion of the 32-byte code is unable to fit into three lines within the Decoded ICache, use the DSB_FILL.EXCEED_DSB_LINES event at the process, module or function granularities:

%DSB.EXCEED.WAY.LIMIT = 100 * DSB_FILL.EXCEED_DSB_LINES / DSB_FILL.ALL_CANCEL;
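The three Decoded ICache formulas in this section can be evaluated together once the raw counts are collected. The following is a minimal sketch, not a tool interface: the function name `dsb_metrics`, its parameter names, and the sample counts are illustrative assumptions, and the zero-denominator guard is an added convenience.

```python
def dsb_metrics(penalty_cycles, switch_count, exceed_dsb_lines,
                all_cancel, clk_unhalted_thread):
    """Evaluate the Decoded ICache formulas from raw event counts."""

    def ratio(num, den):
        # A zero denominator (empty sample) yields 0.0 instead of raising.
        return num / den if den else 0.0

    return {
        "%DSB2MITE.SWITCH.COST": 100.0 * ratio(penalty_cycles, clk_unhalted_thread),
        "AVG.DSB2MITE.SWITCH.COST": ratio(penalty_cycles, switch_count),
        "%DSB.EXCEED.WAY.LIMIT": 100.0 * ratio(exceed_dsb_lines, all_cancel),
    }


# Made-up counts, for illustration only.
metrics = dsb_metrics(penalty_cycles=50_000, switch_count=25_000,
                      exceed_dsb_lines=300, all_cancel=1_200,
                      clk_unhalted_thread=1_000_000)
for name, value in metrics.items():
    print(f"{name} = {value:.2f}")
```

Because these events have large skids, such a calculation is only meaningful on counts aggregated at the process, module or function level.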
B.5.7.4
Issues in the Legacy Decode Pipeline
If a large percentage of the micro-ops going to the micro-op queue are being delivered from the legacy decode pipeline, you should check to see if there are bottlenecks impacting that stage. The most common bottlenecks in the legacy decode pipeline are:
• Fetch not providing enough instructions. This happens when hot code is poorly aligned. For example, if the hot code being fetched to be run is on the 15th byte, then only one byte is fetched.
• Length changing prefix stalls in the instruction length decoder.
Instructions that are decoded into two to four micro-ops may introduce a bubble in the decoder throughput. If the instruction queue, preceding the decoders, becomes full, this indicates that these instructions may cause a penalty.

%ILD.STALL.COST = 100 * ILD_STALL.LCP * 3 / CPU_CLK_UNHALTED.THREAD;
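As a small worked sketch of the %ILD.STALL.COST formula, the helper below applies the factor of 3 that appears in the formula to a raw ILD_STALL.LCP count. The function and constant names are hypothetical, and the sample counts are made up.

```python
# Multiplier taken from the %ILD.STALL.COST formula in the text.
LCP_STALL_FACTOR = 3


def ild_stall_cost_pct(ild_stall_lcp, clk_unhalted_thread):
    """Return the estimated percentage of cycles lost to LCP stalls."""
    if clk_unhalted_thread == 0:
        return 0.0  # nothing was sampled
    return 100.0 * ild_stall_lcp * LCP_STALL_FACTOR / clk_unhalted_thread


# Illustrative counts only: 10,000 LCP stalls over one million cycles.
print(ild_stall_cost_pct(10_000, 1_000_000))
```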
B.5.7.5
Instruction Cache
Applications with large hot code sections tend to run into many issues with the instruction cache. This is more typical in server applications.

Required events:

ICACHE.MISSES - Counts the number of instruction byte fetches that miss the ICache.

Usage of events:

To determine whether ICache misses are causing the issue, compare them to the instructions retired event count, using the same granularity (process, module, or function). Anything over 1% of instructions retired can be a significant issue.

ICACHE.PER.INST.RET = ICACHE.MISSES / INST_RETIRED.ANY;

If ICache misses are causing a significant problem, try to reduce the size of your hot code section, using profile guided optimizations. Most compilers have options for text reordering, which helps reduce the number of pages your application is covering. If the application makes significant use of macros, try to either convert them to functions, or use intelligent linking to eliminate repeated code.
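The 1% guideline above can be checked mechanically. This is a minimal sketch under the assumption that both counts were collected at the same granularity; the function names and the sample counts are illustrative, not part of any tool.

```python
def icache_misses_per_inst(icache_misses, inst_retired_any):
    """ICACHE.PER.INST.RET = ICACHE.MISSES / INST_RETIRED.ANY."""
    return icache_misses / inst_retired_any if inst_retired_any else 0.0


def icache_is_significant(icache_misses, inst_retired_any, threshold=0.01):
    """Flag the case the text describes: misses above 1% of instructions retired."""
    return icache_misses_per_inst(icache_misses, inst_retired_any) > threshold


# Illustrative counts only.
print(icache_is_significant(30_000, 1_000_000))  # 3% of instructions retired
print(icache_is_significant(5_000, 1_000_000))   # 0.5% of instructions retired
```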
B.6
USING PERFORMANCE EVENTS OF INTEL® CORE™ SOLO AND INTEL® CORE™ DUO PROCESSORS
There are performance events specific to the microarchitecture of Intel Core Solo and Intel Core Duo processors (see also Chapter 19 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B).
B.6.1
Understanding the Results in a Performance Counter
Each performance event detects a well-defined microarchitectural condition occurring in the core while the core is active. A core is active when:
• It's running code (excluding the halt instruction).
• It's being snooped by the other core or a logical processor on the platform. This can also happen when the core is halted.
Some microarchitectural conditions are applicable to a sub-system shared by more than one core, and some performance events provide an event mask (or unit mask) that allows qualification at the physical processor boundary or at the bus agent boundary. Some events allow qualifications that permit the counting of microarchitectural conditions associated with a particular core versus counts from all cores in a physical processor (see L2 and bus related events in Chapter 19 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B).

When a multi-threaded workload does not use all cores continuously, a performance counter counting a core-specific condition may progress to some extent on the halted core and then stop progressing, or a unit mask may be qualified to continue counting occurrences of the condition attributed to either processor core. Typically, one can adjust the highest two bits (bits 15:14 of the IA32_PERFEVTSELx MSR) in the unit mask field to distinguish such asymmetry (see Chapter 17, "Debug, Branch Profile, TSC, and Quality of Service," of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B).

There are three cycle-counting events which will not progress on a halted core, even if the halted core is being snooped. These are: Unhalted core cycles, Unhalted reference cycles, and Unhalted bus cycles. All three events are detected for the unit selected by event 3CH.

Some events detect microarchitectural conditions but are limited in their ability to identify the originating core or physical processor. For example, bus_drdy_clocks may be programmed with a unit mask of 20H to include all agents on a bus. In this case, the performance counter in each core will report nearly identical values.
Performance tools interpreting counts must take into account that it is only necessary to equate bus activity with the event count from one core (and not use the sum from each core).

The above is also applicable when the core-specificity sub field (bits 15:14 of the IA32_PERFEVTSELx MSR) within an event mask is programmed with 11B. The result reported by the performance counter on each core will be nearly identical.
B.6.2
Ratio Interpretation
Ratios of two events are useful for analyzing various characteristics of a workload. It may be possible to acquire such ratios at multiple granularities, for example: (1) per-application thread, (2) per logical processor, (3) per core, and (4) per physical processor.

The first ratio is most useful from a software development perspective, but requires multi-threaded applications to manage processor affinity explicitly for each application thread. The other options provide insights on hardware utilization.

In general, collect measurements (for all events in a ratio) in the same run. This should be done because:
• If measuring ratios for a multi-threaded workload, getting results for all events in the same run enables you to understand which event counter values belong to each thread.
• Some events, such as writebacks, may have non-deterministic behavior for different runs. In such a case, only measurements collected in the same run yield meaningful ratio values.
B.6.3
Notes on Selected Events
This section provides event-specific notes for interpreting performance events listed in Chapter 19 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B.
• L2_Reject_Cycles, event number 30H: This event counts the cycles during which the L2 cache rejected new access requests.
• L2_No_Request_Cycles, event number 32H: This event counts cycles during which no requests from the L1 or prefetches to the L2 cache were issued.
• Unhalted_Core_Cycles, event number 3CH, unit mask 00H: This event counts the smallest unit of time recognized by an active core. In many operating systems, the idle task is implemented using the HLT instruction. In such operating systems, clock ticks for the idle task are not counted. A transition due to Enhanced Intel SpeedStep Technology may change the operating frequency of a core. Therefore, using this event to initiate time-based sampling can create artifacts.
• Unhalted_Ref_Cycles, event number 3CH, unit mask 01H: This event guarantees a uniform interval for each cycle being counted. Specifically, counts increment at bus clock cycles while the core is active. The cycles can be converted to the core clock domain by multiplying by the bus ratio, which sets the core clock frequency.
• Serial_Execution_Cycles, event number 3CH, unit mask 02H: This event counts the bus cycles during which the core is actively executing code (non-halted) while the other core in the physical processor is halted.
• L1_Pref_Req, event number 4FH, unit mask 00H: This event counts the number of times the Data Cache Unit (DCU) requests to prefetch a data cache line from the L2 cache. Requests can be rejected when the L2 cache is busy. Rejected requests are re-submitted.
• DCU_Snoop_to_Share, event number 78H, unit mask 01H: This event counts the number of times the DCU is snooped for a cache line needed by the other core. The cache line is missing in the L1 instruction cache or data cache of the other core; or it is set for read-only, when the other core wants to write to it. These snoops are done through the DCU store port. Frequent DCU snoops may conflict with stores to the DCU, and this may increase store latency and impact performance.
• Bus_Not_In_Use, event number 7DH, unit mask 00H: This event counts the number of bus cycles for which the core does not have a transaction waiting for completion on the bus.
• Bus_Snoops, event number 77H, unit mask 00H: This event counts the number of CLEAN, HIT, or HITM responses to external snoops detected on the bus. In a single-processor system, CLEAN and HIT responses are not likely to happen. In a multiprocessor system this event indicates an L2 miss in one processor that did not find the missed data on other processors. In a single-processor system, an HITM response indicates that an L1 miss (instruction or data) found the missed cache line in the other core in a modified state. In a multiprocessor system, this event also indicates that an L1 miss (instruction or data) found the missed cache line in another core in a modified state.
B.7
DRILL-DOWN TECHNIQUES FOR PERFORMANCE ANALYSIS
Software performance intertwines code and microarchitectural characteristics of the processor. Performance monitoring events provide insights into these interactions. Each microarchitecture often provides a large set of performance events that target different sub-systems within the microarchitecture. Having a methodical approach to select key performance events will likely improve a programmer's understanding of the performance bottlenecks and improve the efficiency of code-tuning effort.

Recent generations of Intel 64 and IA-32 processors feature microarchitectures using an out-of-order execution engine. They are also accompanied by an in-order front end and retirement logic that enforces program order. Superscalar hardware, buffering and speculative execution often complicate the interpretation of performance events and software-visible performance bottlenecks.
This section discusses a methodology of using performance events to drill down on likely areas of performance bottleneck. By narrowing down to a small set of performance events, the programmer can take advantage of Intel VTune Performance Analyzer to correlate performance bottlenecks with source code locations and apply coding recommendations discussed in Chapter 3 through Chapter 8. Although the general principles of our method can be applied to different microarchitectures, this section will use performance events available in processors based on Intel Core microarchitecture for simplicity.

Performance tuning usually centers around reducing the time it takes to complete a well-defined workload. Performance events can be used to measure the elapsed time between the start and end of a workload. Thus, reducing the elapsed time of completing a workload is equivalent to reducing measured processor cycles.

The drill-down methodology can be summarized as four phases of performance event measurements to help characterize interactions of the code with key pipe stages or sub-systems of the microarchitecture. The relation of the performance event drill-down methodology to the software tuning feedback loop is illustrated in Figure B-16.
[Figure B-16 diagram. Views: Start_to_Finish View (Total_Cycles_Completion); RS View (Issuing_uops, Not_Issuing_uops); Execution View (Retiring_uops, Non_retiring_uops, Stalled). Tuning Focus: Vectorize w/ SIMD; Code Layout, Branch Misprediction; Stalls Drill-down (Store Fwd, LCP, Cache Miss, ...). Tuning Consistency: identify hot spot code, apply fix; apply one fix at a time; repeat from the top.]
Figure B-16. Performance Events Drill-Down and Software Tuning Feedback Loop

Typically, the logic in performance monitoring hardware measures microarchitectural conditions that vary across different counting domains, ranging from cycles, micro-ops, address references, instances, etc. The drill-down methodology attempts to provide an intuitive, cycle-based view across different phases by making suitable approximations that are described below:
• Total cycle measurement: This is the start to finish view of the total number of cycles to complete the workload of interest. In typical performance tuning situations, the metric Total_cycles can be measured by the event CPU_CLK_UNHALTED.CORE (see Chapter 19, "Performance Monitoring Events," of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B).
• Cycle composition at issue port: The reservation station (RS) dispatches micro-ops for execution so that the program can make forward progress. Hence the metric Total_cycles can be decomposed as consisting of two exclusive components: Cycles_not_issuing_uops, representing cycles that the RS is not issuing micro-ops for execution, and Cycles_issuing_uops, representing cycles that the RS is issuing micro-ops for execution. The latter component includes micro-ops in the architected code path or in the speculative code path.
• Cycle composition of OOO execution: The out-of-order engine provides multiple execution units that can execute micro-ops in parallel. If one execution unit stalls, it does not necessarily imply the program execution is stalled. Our methodology attempts to construct a cycle-composition view that approximates the progress of program execution. The three relevant metrics are: Cycles_stalled, Cycles_non_retiring_uops, and Cycles_retiring_uops.
• Execution stall analysis: From the cycle compositions of overall program execution, the programmer can narrow down the selection of performance events to further pin-point unproductive interaction between the workload and a micro-architectural sub-system.
When cycles lost to a stalled microarchitectural sub-system, or to unproductive speculative execution, are identified, the programmer can use VTune Analyzer to correlate each significant performance impact to a source code location. If the performance impact of stalls or misprediction is insignificant, VTune can also identify the source locations of hot functions, so the programmer can evaluate the benefits of vectorization on those hot functions.
B.7.1
Cycle Composition at Issue Port
Recent processor microarchitectures employ out-of-order engines that execute streams of micro-ops natively, while decoding program instructions into micro-ops in the front end. The metric Total_cycles alone is opaque with respect to decomposing cycles that are productive or non-productive for program execution. To establish a consistent cycle-based decomposition, we construct two metrics that can be measured using performance events available in processors based on Intel Core microarchitecture. These are:
• Cycles_not_issuing_uops: This can be measured by the event RS_UOPS_DISPATCHED, setting the INV bit and specifying a counter mask (CMASK) value of 1 in the target performance event select (IA32_PERFEVTSELx) MSR (see Chapter 18 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B). In VTune Analyzer, the special values for CMASK and INV are already configured for the VTune event name RS_UOPS_DISPATCHED.CYCLES_NONE.
• Cycles_issuing_uops: This can be measured using the event RS_UOPS_DISPATCHED, clearing the INV bit and specifying a counter mask (CMASK) value of 1 in the target performance event select MSR.
Note that the cycle decomposition view here is approximate in nature; it does not distinguish specificities, such as whether the RS is full or empty, or transient situations where the RS is empty but some in-flight micro-ops are getting retired.
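To make the INV and CMASK settings concrete, the following sketch builds the two IA32_PERFEVTSELx values described above. The bit positions follow the architectural layout of the event select register (event select in bits 7:0, unit mask in bits 15:8, USR bit 16, OS bit 17, EN bit 22, INV bit 23, CMASK in bits 31:24). The helper name `encode_evtsel` is hypothetical, and the event code 0xA0 for RS_UOPS_DISPATCHED is an assumption that should be confirmed against the event tables in the Software Developer's Manual, Volume 3B.

```python
def encode_evtsel(event_select, unit_mask=0, cmask=0, inv=False,
                  usr=True, os_mode=True, enable=True):
    """Pack the fields of an IA32_PERFEVTSELx value into an integer."""
    value = (event_select & 0xFF) | ((unit_mask & 0xFF) << 8)
    if usr:
        value |= 1 << 16          # count at privilege levels 1-3
    if os_mode:
        value |= 1 << 17          # count at privilege level 0
    if enable:
        value |= 1 << 22          # enable the counter
    if inv:
        value |= 1 << 23          # invert the CMASK comparison
    value |= (cmask & 0xFF) << 24  # counter mask threshold
    return value


# Assumed event code; verify in the SDM before use.
RS_UOPS_DISPATCHED = 0xA0

# Cycles in which fewer than 1 micro-op was dispatched (INV=1, CMASK=1).
cycles_not_issuing = encode_evtsel(RS_UOPS_DISPATCHED, cmask=1, inv=True)
# Cycles in which at least 1 micro-op was dispatched (INV=0, CMASK=1).
cycles_issuing = encode_evtsel(RS_UOPS_DISPATCHED, cmask=1, inv=False)
print(hex(cycles_not_issuing), hex(cycles_issuing))
```

With CMASK and INV both cleared, the same event counts dispatched micro-ops rather than cycles.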
B.7.2
Cycle Composition of OOO Execution
In an OOO engine, speculative execution is an important part of making forward progress of the program. But speculative execution of micro-ops in the shadow of a mispredicted code path represents unproductive work that consumes execution resources and execution bandwidth.

Cycles_not_issuing_uops, by definition, represents the cycles that the OOO engine is stalled (Cycles_stalled). As an approximation, this can be interpreted as the cycles that the program is not making forward progress.

The micro-ops that are issued for execution do not necessarily end in retirement. Those micro-ops that do not reach retirement do not help forward progress of program execution. Hence, a further approximation is made in the formalism of decomposition of Cycles_issuing_uops into:
• Cycles_non_retiring_uops: Although there isn't a direct event to measure the cycles associated with non-retiring micro-ops, we will derive this metric from available performance events and several assumptions:
  - A constant issue rate of micro-ops flowing through the issue port. Thus, we define: uops_rate = Dispatch_uops / Cycles_issuing_uops, where Dispatch_uops can be measured with RS_UOPS_DISPATCHED, clearing the INV bit and the CMASK.
  - We approximate the number of non-productive, non-retiring micro-ops by non_productive_uops = Dispatch_uops - executed_retired_uops, where executed_retired_uops represents productive micro-ops contributing towards forward progress that consumed execution bandwidth.
  - The executed_retired_uops can be approximated by the sum of two contributions: num_retired_uops (measured by the event UOPS_RETIRED.ANY) and num_fused_uops (measured by the event UOPS_RETIRED.FUSED).
  Thus, Cycles_non_retiring_uops = non_productive_uops / uops_rate.
• Cycles_retiring_uops: This can be derived from Cycles_retiring_uops = num_retired_uops / uops_rate.
The cycle-decomposition methodology here does not distinguish situations where productive micro-ops and non-productive micro-ops may be dispatched in the same cycle into the OOO engine. This approximation may be reasonable because, heuristically, a high contribution of non-retiring micro-ops likely correlates with congestion in the OOO engine, which subsequently causes the program to stall.

Evaluations of these three components (Cycles_non_retiring_uops, Cycles_stalled, Cycles_retiring_uops) relative to Total_cycles can help steer tuning effort in the following directions:
• If the contribution from Cycles_non_retiring_uops is high, focusing on code layout and reducing branch mispredictions will be important.
• If both the contributions from Cycles_non_retiring_uops and Cycles_stalled are insignificant, the focus for performance tuning should be directed to vectorization or other techniques to improve retirement throughput of hot functions.
• If the contribution from Cycles_stalled is high, additional drill-down may be necessary to locate bottlenecks that lie deeper in the microarchitecture pipeline.
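The decomposition in this section reduces to a few divisions once the counts are available. The sketch below follows the formulas exactly as stated in the text, including the use of num_retired_uops alone for Cycles_retiring_uops. The function name, parameter names and sample counts are illustrative assumptions.

```python
def cycle_composition(cycles_not_issuing, cycles_issuing, dispatch_uops,
                      num_retired_uops, num_fused_uops):
    """Approximate the three OOO execution cycle components."""
    if cycles_issuing == 0:
        raise ValueError("Cycles_issuing_uops must be non-zero")

    # Assumption from the text: a constant issue rate through the issue port.
    uops_rate = dispatch_uops / cycles_issuing

    executed_retired_uops = num_retired_uops + num_fused_uops
    non_productive_uops = dispatch_uops - executed_retired_uops

    return {
        # By definition, the stalled cycles are the not-issuing cycles.
        "Cycles_stalled": cycles_not_issuing,
        "Cycles_non_retiring_uops": non_productive_uops / uops_rate,
        "Cycles_retiring_uops": num_retired_uops / uops_rate,
    }


# Made-up counts: 1200 micro-ops dispatched over 600 issuing cycles.
view = cycle_composition(cycles_not_issuing=400, cycles_issuing=600,
                         dispatch_uops=1200, num_retired_uops=900,
                         num_fused_uops=100)
total_cycles = 400 + 600
for name, cycles in view.items():
    print(f"{name}: {cycles:.0f} cycles ({100.0 * cycles / total_cycles:.1f}% of total)")
```

Comparing each component against Total_cycles, as in the last loop, is what steers the choice among the tuning directions listed above.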
B.7.3
Drill-Down on Performance Stalls
In some situations, it may be useful to evaluate cycles lost to stalls associated with various stress points in the microarchitecture and sum up the contributions from each candidate stress point. This approach implies a very gross simplification and introduces complications that may be difficult to reconcile with the superscalar nature and buffering in an OOO engine.

Due to the variations of counting domains associated with different performance events, cycle-based estimation of performance impact at each stress point may carry different degrees of error due to over-estimation of exposures or under-estimation.

Over-estimation is likely to occur when overall performance impact for a given cause is estimated by multiplying the per-instance cost by an event count that measures the number of occurrences of that microarchitectural condition. Consequently, the sum of multiple contributions of lost cycles due to different stress points may exceed the more accurate metric Cycles_stalled.

However, an approach that sums up lost cycles associated with individual stress points may still be beneficial as an iterative indicator to measure the effectiveness of the code tuning loop when tuning code to fix the performance impact of each stress point. The remainder of this sub-section discusses a few common causes of performance bottlenecks that can be counted by performance events and fixed by following coding recommendations described in this manual.

The following items discuss several common stress points of the microarchitecture:
• L2 Miss Impact: An L2 load miss may expose the full latency of the memory sub-system. The latency of accessing system memory varies with different chipsets, generally on the order of more than a hundred cycles. Server chipsets tend to exhibit longer latency than desktop chipsets. The number of L2 cache miss references can be measured by MEM_LOAD_RETIRED.L2_LINE_MISS. An estimation of overall L2 miss impact by multiplying system memory latency with the number of L2 misses ignores the OOO engine's ability to handle multiple outstanding load misses. Multiplication of latency and number of L2 misses implies that each L2 miss occurs serially.
  To improve the accuracy of estimating L2 miss impact, an alternative technique should also be considered, using the event BUS_REQUEST_OUTSTANDING with a CMASK value of 1. This alternative technique effectively measures the cycles that the OOO engine is waiting for data from the outstanding bus read requests. It can overcome the over-estimation of multiplying memory latency with the number of L2 misses.
• L2 Hit Impact: Memory accesses from L2 will incur the cost of L2 latency (see Table 2-27). The number of cache line references that hit L2 can be measured by the difference between two events: MEM_LOAD_RETIRED.L1D_LINE_MISS - MEM_LOAD_RETIRED.L2_LINE_MISS. An estimation of overall L2 hit impact by multiplying the L2 hit latency with the number of L2 hit references ignores the OOO engine's ability to handle multiple outstanding load misses.
• L1 DTLB Miss Impact: The cost of a DTLB lookup miss is about 10 cycles. The event MEM_LOAD_RETIRED.DTLB_MISS measures the number of load micro-ops that experienced a DTLB miss.
• LCP Impact: The overall impact of LCP stalls can be directly measured by the event ILD_STALLS. The event ILD_STALLS measures the number of times the slow decoder was triggered; the cost of each instance is 6 cycles.
• Store forwarding stall Impact: When a store forwarding situation does not meet address or size requirements imposed by hardware, a stall occurs. The delay varies for different store forwarding stall situations. Consequently, there are several performance events that provide fine-grain specificity to detect different store-forwarding stall conditions. These include:
  - A load blocked by a preceding store to an unknown address: This situation can be measured by the event Load_Blocks.Sta. The per-instance cost is about 5 cycles.
  - A load that partially overlaps with a preceding store, or a 4-KByte aliased address between a load and a preceding store: These two situations can be measured by the event Load_Blocks.Overlap_store.
  - A load spanning across a cache line boundary: This can be measured by Load_Blocks.Until_Retire. The per-instance cost is about 20 cycles.
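The per-stress-point estimate described above can be sketched as a table of per-instance costs multiplied by event counts. The fixed costs below are the approximate figures quoted in this section. The two latency arguments are placeholders that depend on the platform (system memory latency and the L2 latency from Table 2-27); the function name, dictionary keys for the L2 estimates, and the sample values are illustrative assumptions. Load_Blocks.Overlap_store is omitted because no per-instance cost is given for it.

```python
# Approximate per-instance costs (cycles) quoted in this section.
PER_INSTANCE_COST = {
    "MEM_LOAD_RETIRED.DTLB_MISS": 10,
    "ILD_STALLS": 6,
    "LOAD_BLOCKS.STA": 5,
    "LOAD_BLOCKS.UNTIL_RETIRE": 20,
}


def estimate_stall_cycles(counts, l2_miss_latency, l2_hit_latency):
    """Multiply event counts by per-instance cost for each stress point."""
    estimate = {name: counts.get(name, 0) * cost
                for name, cost in PER_INSTANCE_COST.items()}

    l1d_miss = counts.get("MEM_LOAD_RETIRED.L1D_LINE_MISS", 0)
    l2_miss = counts.get("MEM_LOAD_RETIRED.L2_LINE_MISS", 0)
    # Serial-miss assumption: this ignores overlapping outstanding misses.
    estimate["L2 miss"] = l2_miss * l2_miss_latency
    estimate["L2 hit"] = (l1d_miss - l2_miss) * l2_hit_latency
    return estimate


# Made-up counts and placeholder latencies, for illustration only.
sample = {
    "MEM_LOAD_RETIRED.L1D_LINE_MISS": 1000,
    "MEM_LOAD_RETIRED.L2_LINE_MISS": 100,
    "MEM_LOAD_RETIRED.DTLB_MISS": 50,
    "ILD_STALLS": 20,
    "LOAD_BLOCKS.STA": 10,
    "LOAD_BLOCKS.UNTIL_RETIRE": 5,
}
estimate = estimate_stall_cycles(sample, l2_miss_latency=200, l2_hit_latency=14)
print(estimate, sum(estimate.values()))
```

As the text cautions, the sum of these products can exceed the measured Cycles_stalled, so it is better used to track progress between tuning iterations than as an absolute figure.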
B.8
EVENT RATIOS FOR INTEL CORE MICROARCHITECTURE
Appendix B.7 provides examples of using performance events to quickly diagnose performance bottlenecks. This section provides additional information on using performance events to evaluate metrics that can help in a wide range of performance analysis, workload characterization, and performance tuning. Note that many performance event names in the Intel Core microarchitecture carry the format of XXXX.YYY. This notation derives from the general convention that XXXX typically corresponds to a unique event select code in the performance event select register (IA32_PERFEVTSELx), while YYY corresponds to a unique sub-event mask that uniquely defines a specific microarchitectural condition (see Chapter 18 and Chapter 19 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3B).
B.8.1
Clocks Per Instructions Retired Ratio (CPI)
1. Clocks Per Instruction Retired Ratio (CPI): CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY

The Intel Core microarchitecture is capable of reaching a CPI as low as 0.25 in ideal situations, but most code has a higher CPI. A greater CPI value for a given workload indicates that it has more opportunity for code tuning to improve performance. The CPI is an overall metric; it does not provide specificity of what microarchitectural sub-system may be contributing to a high CPI value.

The following subsections define a list of event ratios that are useful to characterize interactions with the front end, execution, and memory.
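As a quick worked example of Ratio 1, the sketch below computes CPI and compares it with the ideal value of 0.25 mentioned above. The function name, the constant name and the sample counts are illustrative assumptions.

```python
IDEAL_CPI = 0.25  # best case cited in the text for Intel Core microarchitecture


def cpi(clk_unhalted_core, inst_retired_any):
    """CPI = CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY."""
    if inst_retired_any == 0:
        raise ValueError("INST_RETIRED.ANY must be non-zero")
    return clk_unhalted_core / inst_retired_any


# Made-up counts: two million cycles for one million retired instructions.
value = cpi(2_000_000, 1_000_000)
print(f"CPI = {value:.2f}, {value / IDEAL_CPI:.1f}x the ideal CPI")
```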
B.8.2
Front End Ratios
2. RS Full Ratio: RESOURCE_STALLS.RS_FULL / CPU_CLK_UNHALTED.CORE * 100

3. ROB Full Ratio: RESOURCE_STALLS.ROB_FULL / CPU_CLK_UNHALTED.CORE * 100

4. Load or Store Buffer Full Ratio: RESOURCE_STALLS.LD_ST / CPU_CLK_UNHALTED.CORE * 100

When there is a low value for the ROB Full Ratio, RS Full Ratio, and Load or Store Buffer Full Ratio, and a high CPI, it is likely that the front end cannot provide instructions and micro-ops at a rate high enough to fill the buffers in the out-of-order engine, and therefore it is starved waiting for micro-ops to execute. In this case, check further for other front end performance issues.
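Ratios 2 through 4 and the starvation heuristic that follows them can be sketched as below. The text does not define what counts as a "low" ratio or a "high" CPI, so the two thresholds are hypothetical defaults chosen only for illustration, as are the function names and sample counts.

```python
def buffer_full_ratios(rs_full, rob_full, ld_st, clk_unhalted_core):
    """Return Ratios 2-4 as percentages of unhalted core cycles."""
    def pct(stall_cycles):
        return 100.0 * stall_cycles / clk_unhalted_core if clk_unhalted_core else 0.0

    return {
        "RS Full Ratio": pct(rs_full),
        "ROB Full Ratio": pct(rob_full),
        "Load or Store Buffer Full Ratio": pct(ld_st),
    }


def looks_front_end_starved(ratios, cpi, max_ratio_pct=5.0, min_cpi=1.0):
    """Hypothetical thresholds: all buffer-full ratios low while CPI is high."""
    return cpi > min_cpi and all(v < max_ratio_pct for v in ratios.values())


# Made-up counts, for illustration only.
ratios = buffer_full_ratios(rs_full=10_000, rob_full=20_000, ld_st=5_000,
                            clk_unhalted_core=1_000_000)
print(ratios, looks_front_end_starved(ratios, cpi=2.0))
```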
B.8.2.1
Code Locality
5. Instruction Fetch Stall: CYCLES_L1I_MEM_STALLED / CPU_CLK_UNHALTED.CORE * 100

The Instruction Fetch Stall ratio is the percentage of cycles during which the Instruction Fetch Unit (IFU) cannot provide cache lines for decoding due to cache and Instruction TLB (ITLB) misses. A high value for this ratio indicates potential opportunities to improve performance by reducing the working set size of code pages and instructions being executed, hence improving code locality.

6. ITLB Miss Rate: ITLB_MISS_RETIRED / INST_RETIRED.ANY

A high ITLB Miss Rate indicates that the executed code is spread over too many pages and causes many instruction TLB misses. Retired ITLB misses cause the pipeline to naturally drain, while the miss stalls fetching of more instructions.

7. L1 Instruction Cache Miss Rate: L1I_MISSES / INST_RETIRED.ANY

A high value for L1 Instruction Cache Miss Rate indicates that the code working set is bigger than the L1 instruction cache. Reducing the code working set may improve performance.

8. L2 Instruction Cache Line Miss Rate: L2_IFETCH.SELF.I_STATE / INST_RETIRED.ANY

An L2 Instruction Cache Line Miss Rate higher than zero indicates that instruction cache line misses from the L2 cache may have a noticeable impact on program performance.
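The four code locality ratios share two denominators, so they are convenient to compute together. This is a minimal sketch; the function name, parameter names and sample counts are assumptions made for illustration.

```python
def code_locality_ratios(cycles_l1i_mem_stalled, itlb_miss_retired, l1i_misses,
                         l2_ifetch_self_i_state, clk_unhalted_core,
                         inst_retired_any):
    """Return Ratios 5-8 from raw event counts."""
    def div(num, den):
        return num / den if den else 0.0

    return {
        "Instruction Fetch Stall (%)": 100.0 * div(cycles_l1i_mem_stalled,
                                                   clk_unhalted_core),
        "ITLB Miss Rate": div(itlb_miss_retired, inst_retired_any),
        "L1 Instruction Cache Miss Rate": div(l1i_misses, inst_retired_any),
        "L2 Instruction Cache Line Miss Rate": div(l2_ifetch_self_i_state,
                                                   inst_retired_any),
    }


# Made-up counts, for illustration only.
ratios = code_locality_ratios(cycles_l1i_mem_stalled=80_000,
                              itlb_miss_retired=500, l1i_misses=4_000,
                              l2_ifetch_self_i_state=0,
                              clk_unhalted_core=1_000_000,
                              inst_retired_any=2_000_000)
for name, value in ratios.items():
    print(f"{name}: {value:g}")
```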
B.8.2.2
Branching and Front End
9. BACLEAR Performance Impact: 7 * BACLEARS / CPU_CLK_UNHALTED.CORE

A high value for the BACLEAR Performance Impact ratio usually indicates that the code has many branches such that they cannot be consumed by the Branch Prediction Unit.

10. Taken Branch Bubble: (BR_TKN_BUBBLE_1 + BR_TKN_BUBBLE_2) / CPU_CLK_UNHALTED.CORE

A high value for the Taken Branch Bubble ratio indicates that the code contains many taken branches coming one after the other and causing bubbles in the front end. This may affect performance only if it is not covered by execution latencies and stalls later in the pipe.
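Ratios 9 and 10 can be evaluated as shown in this sketch. The factor of 7 is the per-event weight that appears in the Ratio 9 formula; the function names and sample counts are illustrative assumptions.

```python
BACLEAR_WEIGHT = 7  # multiplier from the Ratio 9 formula


def baclear_impact(baclears, clk_unhalted_core):
    """Ratio 9: 7 * BACLEARS / CPU_CLK_UNHALTED.CORE."""
    return BACLEAR_WEIGHT * baclears / clk_unhalted_core if clk_unhalted_core else 0.0


def taken_branch_bubble(br_tkn_bubble_1, br_tkn_bubble_2, clk_unhalted_core):
    """Ratio 10: (BR_TKN_BUBBLE_1 + BR_TKN_BUBBLE_2) / CPU_CLK_UNHALTED.CORE."""
    if clk_unhalted_core == 0:
        return 0.0
    return (br_tkn_bubble_1 + br_tkn_bubble_2) / clk_unhalted_core


# Made-up counts, for illustration only.
print(baclear_impact(10_000, 1_000_000))
print(taken_branch_bubble(3_000, 2_000, 1_000_000))
```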
B.8.2.3
Stack Pointer Tracker
11. ESP Synchronization: ESP.SYNCH / ESP.ADDITIONS

The ESP Synchronization ratio calculates the ratio of explicit uses of ESP (for example, by a load or store instruction) to implicit uses (for example, by a PUSH or POP instruction). The expected ratio value is 0.2 or lower. If the ratio is higher, consider rearranging your code to avoid ESP synchronization events.
B.8.2.4
Macro-fusion
12. Macro-Fusion: UOPS_RETIRED.MACRO_FUSION / INST_RETIRED.ANY

The Macro-Fusion ratio calculates how many of the retired instructions were fused to a single micro-op. You may find this ratio is high for a 32-bit binary executable but significantly lower for the equivalent 64-bit binary, and the 64-bit binary performs slower than the 32-bit binary. A possible reason is that the 32-bit binary benefited significantly from macro-fusion.
B.8.2.5
Length Changing Prefix (LCP) Stalls
13. LCP Delays Detected: ILD_STALL / CPU_CLK_UNHALTED.CORE

A high value of the LCP Delays Detected ratio indicates that many Length Changing Prefix (LCP) delays occur in the measured code.
B.8.2.6
Self Modifying Code Detection
14. Self Modifying Code Clear Performance Impact: MACHINE_NUKES.SMC * 150 / CPU_CLK_UNHALTED.CORE * 100

A program that writes into code sections and shortly afterwards executes the generated code may incur severe penalties. Self Modifying Code Performance Impact estimates the percentage of cycles that the program spends on self-modifying code penalties.
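A short worked sketch of Ratio 14 follows. The weight of 150 is the per-event multiplier that appears in the formula; the function name and sample counts are illustrative assumptions.

```python
SMC_WEIGHT = 150  # multiplier from the Ratio 14 formula


def smc_impact_pct(machine_nukes_smc, clk_unhalted_core):
    """Ratio 14: MACHINE_NUKES.SMC * 150 / CPU_CLK_UNHALTED.CORE * 100."""
    if clk_unhalted_core == 0:
        return 0.0
    return 100.0 * machine_nukes_smc * SMC_WEIGHT / clk_unhalted_core


# Made-up counts: 100 SMC machine clears over one million cycles.
print(f"{smc_impact_pct(100, 1_000_000):.2f}% of cycles")
```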
B.8.3
Branch Prediction Ratios
Appendix B.8.2.2 discusses branching that impacts the front end performance. This section describes event ratios that are commonly used to characterize branch mispredictions.
B.8.3.1
Branch Mispredictions
15. Branch Misprediction Performance Impact: RESOURCE_STALLS.BR_MISS_CLEAR / CPU_CLK_UNHALTED.CORE * 100
With the Branch Misprediction Performance Impact, you can tell the percentage of cycles that the processor spends in recovering from branch mispredictions.
16. Branch Misprediction per Micro-Op Retired: BR_INST_RETIRED.MISPRED / UOPS_RETIRED.ANY
The ratio Branch Misprediction per Micro-Op Retired indicates if the code suffers from many branch mispredictions. In this case, improving the predictability of branches can have a noticeable impact on the performance of your code.
In addition, the performance impact of each branch misprediction might be high. This happens if the code prior to the mispredicted branch has high CPI, such as cache misses, which cannot be parallelized with following code due to the branch misprediction. Reducing the CPI of this code will reduce the misprediction performance impact. See other ratios to identify these cases.
You can use the precise event BR_INST_RETIRED.MISPRED to detect the actual targets of the mispredicted branches. This may help you to identify the mispredicted branch.
B.8.3.2
Virtual Tables and Indirect Calls
17. Virtual Table Usage: BR_IND_CALL_EXEC / INST_RETIRED.ANY
A high value for the ratio Virtual Table Usage indicates that the code includes many indirect calls. The destination address of an indirect call is hard to predict.
18. Virtual Table Misuse: BR_CALL_MISSP_EXEC / BR_INST_RETIRED.MISPRED
A high value of the Branch Misprediction Performance Impact ratio (Ratio 15) together with a high Virtual Table Misuse ratio indicates that significant time is spent due to mispredicted indirect function calls.
In addition to explicit use of function pointers in C code, indirect calls are used for implementing inheritance, abstract classes, and virtual methods in C++.
B.8.3.3
Mispredicted Returns
19. Mispredicted Return Instruction Rate: BR_RET_MISSP_EXEC / BR_RET_EXEC
The processor has a special mechanism that tracks CALL-RETURN pairs. The processor assumes that every CALL instruction has a matching RETURN instruction. If a RETURN instruction restores a return address that is not the one stored during the matching CALL, the code incurs a misprediction penalty.
B.8.4
Execution Ratios
This section covers event ratios that can provide insights into the interactions of micro-ops with RS, ROB, execution units, and so forth.
B.8.4.1
Resource Stalls
A high value for the RS Full Ratio (Ratio 2) indicates that the Reservation Station (RS) often gets full with micro-ops due to long dependency chains. The micro-ops that get into the RS cannot execute because they wait for their operands to be computed by previous micro-ops, or they wait for a free execution unit. This prevents exploiting the parallelism provided by the multiple execution units.
A high value for the ROB Full Ratio (Ratio 3) indicates that the reorder buffer (ROB) often gets full with micro-ops. This usually implies long-latency operations, such as L2 cache demand misses.
B.8.4.2
ROB Read Port Stalls
20. ROB Read Port Stall Rate: RAT_STALLS.ROB_READ_PORT / CPU_CLK_UNHALTED.CORE
The ratio ROB Read Port Stall Rate identifies ROB read port stalls. However, it should be used only if the number of resource stalls, as indicated by Resource Stall Ratio, is low.
B.8.4.3
Partial Register Stalls
21. Partial Register Stalls Ratio: RAT_STALLS.PARTIAL_CYCLES / CPU_CLK_UNHALTED.CORE * 100
Frequent accesses to registers that cause partial stalls increase access latency and decrease performance. Partial Register Stalls Ratio is the percentage of cycles when partial stalls occur.
B.8.4.4
Partial Flag Stalls
22. Partial Flag Stalls Ratio: RAT_STALLS.FLAGS / CPU_CLK_UNHALTED.CORE
Partial flag stalls have a high penalty and they can be easily avoided. However, in some cases, Partial Flag Stalls Ratio might be high although there are no real flag stalls. There are a few instructions that partially modify the RFLAGS register and may cause partial flag stalls. The most popular are the shift instructions (SAR, SAL, SHR, and SHL) and the INC and DEC instructions.
B.8.4.5
Bypass Between Execution Domains
23. Delayed Bypass to FP Operation Rate: DELAYED_BYPASS.FP / CPU_CLK_UNHALTED.CORE
24. Delayed Bypass to SIMD Operation Rate: DELAYED_BYPASS.SIMD / CPU_CLK_UNHALTED.CORE
25. Delayed Bypass to Load Operation Rate: DELAYED_BYPASS.LOAD / CPU_CLK_UNHALTED.CORE
Domain bypass adds one cycle to instruction latency. To identify frequent domain bypasses in the code you can use the above ratios.
B.8.4.6
Floating-Point Performance Ratios
26. Floating-Point Instructions Ratio: X87_OPS_RETIRED.ANY / INST_RETIRED.ANY * 100
Significant floating-point activity indicates that specialized optimizations for floating-point algorithms may be applicable.
27. FP Assist Performance Impact: FP_ASSIST * 80 / CPU_CLK_UNHALTED.CORE * 100
Floating-Point assist is activated for non-regular FP values like denormals and NANs. FP assist is extremely slow compared to regular FP execution. Different assists incur different penalties. FP Assist Performance Impact estimates the overall impact.
28. Divider Busy: IDLE_DURING_DIV / CPU_CLK_UNHALTED.CORE * 100
A high value for the Divider Busy ratio indicates that the divider is busy and no other execution unit or load operation is in progress for many cycles. Using this ratio ignores L1 data cache misses and L2 cache misses that can be executed in parallel and hide the divider penalty.
29. Floating-Point Control Word Stall Ratio: RESOURCE_STALLS.FPCW / CPU_CLK_UNHALTED.CORE * 100
Frequent modifications to the Floating-Point Control Word (FPCW) might significantly decrease performance. The main reason for changing FPCW is for changing rounding mode when doing FP to integer conversions.
B.8.5
Memory Sub-System - Access Conflicts Ratios
A high value for Load or Store Buffer Full Ratio (Ratio 4) indicates that the load buffer or store buffer is frequently full, hence new micro-ops cannot enter the execution pipeline. This can reduce execution parallelism and decrease performance.
30. Load Rate: L1D_CACHE_LD.MESI / CPU_CLK_UNHALTED.CORE
One memory read operation can be served by a core each cycle. A high “Load Rate” indicates that execution may be bound by memory read operations.
31. Store Order Block: STORE_BLOCK.ORDER / CPU_CLK_UNHALTED.CORE * 100
Store Order Block ratio is the percentage of cycles that store operations, which miss the L2 cache, block committing data of later stores to the memory sub-system. This behavior can further cause the store buffer to fill up (see Ratio 4).
B.8.5.1
Loads Blocked by the L1 Data Cache
32. Loads Blocked by L1 Data Cache Rate: LOAD_BLOCK.L1D / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by L1 Data Cache Rate” indicates that load operations are blocked by the L1 data cache due to lack of resources, usually as a result of many simultaneous L1 data cache misses.
B.8.5.2
4K Aliasing and Store Forwarding Block Detection
33. Loads Blocked by Overlapping Store Rate: LOAD_BLOCK.OVERLAP_STORE / CPU_CLK_UNHALTED.CORE
4K aliasing and store forwarding block are two different scenarios in which loads are blocked by preceding stores for different reasons. Both scenarios are detected by the same event: LOAD_BLOCK.OVERLAP_STORE. A high value for “Loads Blocked by Overlapping Store Rate” indicates that either 4K aliasing or store forwarding block may affect performance.
B.8.5.3
Load Block by Preceding Stores
34. Loads Blocked by Unknown Store Address Rate: LOAD_BLOCK.STA / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by Unknown Store Address Rate” indicates that loads are frequently blocked by preceding stores with unknown address and implies a performance penalty.
35. Loads Blocked by Unknown Store Data Rate: LOAD_BLOCK.STD / CPU_CLK_UNHALTED.CORE
A high value for “Loads Blocked by Unknown Store Data Rate” indicates that loads are frequently blocked by preceding stores with unknown data and implies a performance penalty.
B.8.5.4
Memory Disambiguation
The memory disambiguation feature of Intel Core microarchitecture eliminates most of the non-required load blocks by stores with unknown address. When this feature fails (possibly due to flaky load-store disambiguation cases), both the event LOAD_BLOCK.STA and the event MEMORY_DISAMBIGUATION.RESET are counted.
B.8.5.5
Load Operation Address Translation
36. L0 DTLB Miss due to Loads - Performance Impact: DTLB_MISSES.L0_MISS_LD * 2 / CPU_CLK_UNHALTED.CORE
A high number of DTLB0 misses indicates that the data set that the workload uses spans a number of pages that is bigger than the DTLB0. The high number of misses is expected to impact workload performance only if the CPI (Ratio 1) is low, around 0.8. Otherwise, it is likely that the DTLB0 miss cycles are hidden by other latencies.
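The CPI condition attached to Ratio 36 can be made explicit in a small Python sketch. The function name, the sample counts, and the reading of "low, around 0.8" as "at or below 0.8" are assumptions for illustration only; CPI is taken here as CPU_CLK_UNHALTED.CORE / INST_RETIRED.ANY.

```python
def dtlb0_load_miss_impact(l0_miss_ld, cpu_clk_unhalted_core, inst_retired_any,
                           cpi_threshold=0.8):
    """Return (impact, cpi, likely_exposed) for Ratio 36.

    impact is the fraction of cycles attributed to DTLB0 load misses,
    using the 2-cycle weight from the formula. likely_exposed is True
    only when CPI is at or below the assumed threshold.
    """
    impact = l0_miss_ld * 2 / cpu_clk_unhalted_core
    cpi = cpu_clk_unhalted_core / inst_retired_any
    return impact, cpi, cpi <= cpi_threshold


# Hypothetical counts: low-CPI workload with many DTLB0 load misses.
impact, cpi, exposed = dtlb0_load_miss_impact(2_000_000, 10_000_000, 20_000_000)
print(f"impact={impact:.2f} cpi={cpi:.2f} likely exposed={exposed}")
```

When the third value is False, the text above suggests the DTLB0 miss cycles are probably hidden by other latencies, so the computed impact should not be taken at face value.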
B.8.6
Memory Sub-System - Cache Misses Ratios
B.8.6.1
Locating Cache Misses in the Code
Intel Core microarchitecture provides you with precise events for retired load instructions that miss the L1 data cache or the L2 cache. As precise events, they provide the instruction pointer of the instruction following the one that caused the event. Therefore the instruction that comes immediately prior to the pointed instruction is the one that causes the cache miss. These events are most helpful to quickly identify on which loads to focus to fix a performance problem. The events are:
MEM_LOAD_RETIRE.L1D_MISS
MEM_LOAD_RETIRE.L1D_LINE_MISS
MEM_LOAD_RETIRE.L2_MISS
MEM_LOAD_RETIRE.L2_LINE_MISS
B.8.6.2
L1 Data Cache Misses
37. L1 Data Cache Miss Rate: L1D_REPL / INST_RETIRED.ANY
A high value for L1 Data Cache Miss Rate indicates that the code misses the L1 data cache too often and pays the penalty of accessing the L2 cache. See also Loads Blocked by L1 Data Cache Rate (Ratio 32).
You can count separately cache misses due to loads, stores, and locked operations using the events L1D_CACHE_LD.I_STATE, L1D_CACHE_ST.I_STATE, and L1D_CACHE_LOCK.I_STATE, respectively.
B.8.6.3
L2 Cache Misses
38. L2 Cache Miss Rate: L2_LINES_IN.SELF.ANY / INST_RETIRED.ANY
A high L2 Cache Miss Rate indicates that the running workload has a data set larger than the L2 cache. Some of the data might be evicted without being used. Unless all the required data is brought ahead of time by the hardware prefetcher or software prefetching instructions, bringing data from memory has a significant impact on performance.
39. L2 Cache Demand Miss Rate: L2_LINES_IN.SELF.DEMAND / INST_RETIRED.ANY
A high value for L2 Cache Demand Miss Rate indicates that the hardware prefetchers are not exploited to bring the data this workload needs. Data is brought from memory when needed to be used and the workload bears memory latency for each such access.
B.8.7
Memory Sub-system - Prefetching
B.8.7.1
L1 Data Prefetching
The event L1D_PREFETCH.REQUESTS is counted whenever the DCU attempts to prefetch cache lines from the L2 (or memory) to the DCU. If you expect the DCU prefetchers to work and to count this event, but instead you detect the event MEM_LOAD_RETIRE.L1D_MISS, it might be that the IP prefetcher suffers from load instruction address collision of several loads.
B.8.7.2
L2 Hardware Prefetching
With the event L2_LD.SELF.PREFETCH.MESI you can count the number of prefetch requests that were made to the L2 by the L2 hardware prefetchers. The actual number of cache lines prefetched to the L2 is counted by the event L2_LD.SELF.PREFETCH.I_STATE.
B.8.7.3
Software Prefetching
The events for software prefetching cover each level of prefetching separately.
40. Useful PrefetchT0 Ratio: SSE_PRE_MISS.L1 / SSE_PRE_EXEC.L1 * 100
41. Useful PrefetchT1 and PrefetchT2 Ratio: SSE_PRE_MISS.L2 / SSE_PRE_EXEC.L2 * 100
A low value for any of the prefetch usefulness ratios indicates that some of the SSE prefetch instructions prefetch data that is already in the caches.
42. Late PrefetchT0 Ratio: LOAD_HIT_PRE / SSE_PRE_EXEC.L1
43. Late PrefetchT1 and PrefetchT2 Ratio: LOAD_HIT_PRE / SSE_PRE_EXEC.L2
A high value for any of the late prefetch ratios indicates that software prefetch instructions are issued too late and the load operations that use the prefetched data are waiting for the cache line to arrive.
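The usefulness and lateness ratios for one prefetch level can be computed together, as in the following Python sketch. The function name and the sample counts are hypothetical; the caller passes the .L1 events for Ratios 40 and 42, or the .L2 events for Ratios 41 and 43.

```python
def prefetch_ratios(sse_pre_miss, sse_pre_exec, load_hit_pre):
    """Return (useful_percent, late_ratio) for one software prefetch level.

    useful_percent: share of executed prefetches that missed the cache,
                    i.e. fetched data that was not already present.
    late_ratio:     loads that hit an in-flight prefetch, per executed prefetch.
    """
    if sse_pre_exec <= 0:
        raise ValueError("no prefetch instructions were executed")
    useful_percent = sse_pre_miss / sse_pre_exec * 100
    late_ratio = load_hit_pre / sse_pre_exec
    return useful_percent, late_ratio


# Hypothetical PrefetchT0 counts: 1,000 executed, 900 missed the cache,
# 50 loads arrived while the prefetched line was still in flight.
useful, late = prefetch_ratios(900, 1_000, 50)
print(f"useful={useful:.1f}% late={late:.3f}")
```

A low first value suggests redundant prefetches; a high second value suggests the prefetch should be issued earlier relative to the consuming load.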
B.8.8
Memory Sub-system - TLB Miss Ratios
44. TLB miss penalty: PAGE_WALKS.CYCLES / CPU_CLK_UNHALTED.CORE * 100
A high value for the TLB miss penalty ratio indicates that many cycles are spent on TLB misses. Reducing the number of TLB misses may improve performance. This ratio does not include DTLB0 miss penalties (see Ratio 36).
The following ratios help to focus on the kind of memory accesses that cause TLB misses most frequently. See “ITLB Miss Rate” (Ratio 6) for TLB misses due to instruction fetch.
45. DTLB Miss Rate: DTLB_MISSES.ANY / INST_RETIRED.ANY
A high value for DTLB Miss Rate indicates that the code accesses too many data pages within a short time, and causes many Data TLB misses.
46. DTLB Miss Rate due to Loads: DTLB_MISSES.MISS_LD / INST_RETIRED.ANY
A high value for DTLB Miss Rate due to Loads indicates that the code loads data from too many pages within a short time, and causes many Data TLB misses. DTLB misses due to load operations may have a significant impact, since the DTLB miss increases the load operation latency. This ratio does not include DTLB0 miss penalties (see Ratio 36).
To precisely locate load instructions that caused DTLB misses you can use the precise event MEM_LOAD_RETIRE.DTLB_MISS.
47. DTLB Miss Rate due to Stores: DTLB_MISSES.MISS_ST / INST_RETIRED.ANY
A high value for DTLB Miss Rate due to Stores indicates that the code accesses too many data pages within a short time, and causes many Data TLB misses due to store operations. These misses can impact performance if they do not occur in parallel to other instructions. In addition, if there are many stores in a row, some of them missing the DTLB, it may cause stalls due to a full store buffer.
B.8.9
Memory Sub-system - Core Interaction
B.8.9.1
Modified Data Sharing
48. Modified Data Sharing Ratio: EXT_SNOOP.ALL_AGENTS.HITM / INST_RETIRED.ANY
Frequent occurrences of modified data sharing may be due to two threads using and modifying data laid in one cache line. Modified data sharing causes L2 cache misses. When it happens unintentionally (also known as false sharing) it usually causes demand misses that have a high penalty. When false sharing is removed, code performance can dramatically improve.
49. Local Modified Data Sharing Ratio: EXT_SNOOP.THIS_AGENT.HITM / INST_RETIRED.ANY
Modified Data Sharing Ratio indicates the amount of total modified data sharing observed in the system. For systems with several processors you can use Local Modified Data Sharing Ratio to indicate the amount of modified data sharing between two cores in the same processor. (In systems with one processor the two ratios are similar.)
B.8.9.2
Fast Synchronization Penalty
50. Locked Operations Impact: (L1D_CACHE_LOCK_DURATION + 20 * L1D_CACHE_LOCK.MESI) / CPU_CLK_UNHALTED.CORE * 100
Fast synchronization is frequently implemented using locked memory accesses. A high value for Locked Operations Impact indicates that locked operations used in the workload have a high penalty. The latency of a locked operation depends on the location of the data: L1 data cache, L2 cache, other core cache or memory.
B.8.9.3
Simultaneous Extensive Stores and Load Misses
51. Store Block by Snoop Ratio: (STORE_BLOCK.SNOOP / CPU_CLK_UNHALTED.CORE) * 100
A high value for “Store Block by Snoop Ratio” indicates that store operations are frequently blocked and performance is reduced. This happens when one core executes a dense stream of stores while the other core in the processor frequently snoops it for cache lines missing in its L1 data cache.
B.8.10
Memory Sub-system - Bus Characterization
B.8.10.1
Bus Utilization
52. Bus Utilization: BUS_TRANS_ANY.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100
Bus Utilization is the percentage of bus cycles used for transferring bus transactions of any type. In single processor systems most of the bus transactions carry data. In multiprocessor systems some of the bus transactions are used to coordinate cache states to keep data coherency.
53. Data Bus Utilization: BUS_DRDY_CLOCKS.ALL_AGENTS / CPU_CLK_UNHALTED.BUS * 100
Data Bus Utilization is the percentage of bus cycles used for transferring data among all bus agents in the system, including processors and memory. High bus utilization indicates heavy traffic between the processor(s) and memory. Memory sub-system latency can impact the performance of the program. For compute-intensive applications with high bus utilization, look for opportunities to improve data and code locality. For other types of applications (for example, copying large amounts of data from one memory area to another), try to maximize bus utilization.
54. Bus Not Ready Ratio: BUS_BNR_DRV.ALL_AGENTS * 2 / CPU_CLK_UNHALTED.BUS * 100
Bus Not Ready Ratio estimates the percentage of bus cycles during which new bus transactions cannot start. A high value for Bus Not Ready Ratio indicates that the bus is highly loaded. As a result of the Bus Not Ready (BNR) signal, new bus transactions might defer and their latency will have a higher impact on program performance.
55. Burst Read in Bus Utilization: BUS_TRANS_BRD.SELF * 2 / CPU_CLK_UNHALTED.BUS * 100
A high value for Burst Read in Bus Utilization indicates that bus and memory latency of burst read operations may impact the performance of the program.
56. RFO in Bus Utilization: BUS_TRANS_RFO.SELF * 2 / CPU_CLK_UNHALTED.BUS * 100
A high value for RFO in Bus Utilization indicates that latency of Read For Ownership (RFO) transactions may impact the performance of the program. RFO transactions may have a higher impact on the program performance compared to other burst read operations (for example, as a result of loads that missed the L2). See also Ratio 31.
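The bus characterization ratios share one denominator, so they can be evaluated from a single set of counter readings. The Python sketch below is illustrative only: the function name, the dictionary layout, and the sample counts are assumptions, while the event names and the factors of 2 are taken from Ratios 52 through 56.

```python
def bus_ratios(counts):
    """Compute Ratios 52-56 (percent of bus cycles) from raw event counts."""
    bus_clk = counts["CPU_CLK_UNHALTED.BUS"]
    if bus_clk <= 0:
        raise ValueError("bus cycle count must be positive")

    def pct(event, weight=1):
        # weight is the per-event bus-cycle factor used by the formula
        return counts[event] * weight / bus_clk * 100

    return {
        "bus_utilization": pct("BUS_TRANS_ANY.ALL_AGENTS", 2),
        "data_bus_utilization": pct("BUS_DRDY_CLOCKS.ALL_AGENTS"),
        "bus_not_ready": pct("BUS_BNR_DRV.ALL_AGENTS", 2),
        "burst_read": pct("BUS_TRANS_BRD.SELF", 2),
        "rfo": pct("BUS_TRANS_RFO.SELF", 2),
    }


# Hypothetical counter readings for one measurement interval.
sample = {
    "CPU_CLK_UNHALTED.BUS": 1_000_000,
    "BUS_TRANS_ANY.ALL_AGENTS": 200_000,
    "BUS_DRDY_CLOCKS.ALL_AGENTS": 300_000,
    "BUS_BNR_DRV.ALL_AGENTS": 50_000,
    "BUS_TRANS_BRD.SELF": 120_000,
    "BUS_TRANS_RFO.SELF": 30_000,
}
for name, value in bus_ratios(sample).items():
    print(f"{name}: {value:.1f}%")
```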
B.8.10.2
Modified Cache Lines Eviction
57. L2 Modified Lines Eviction Rate: L2_M_LINES_OUT.SELF.ANY / INST_RETIRED.ANY
When a new cache line is brought from memory, an existing cache line, possibly modified, is evicted from the L2 cache to make space for the new line. Frequent evictions of modified lines from the L2 cache increase the latency of the L2 cache misses and consume bus bandwidth.
58. Explicit WB in Bus Utilization: BUS_TRANS_WB.SELF * 2 / CPU_CLK_UNHALTED.BUS * 100
Explicit Write-back in Bus Utilization considers modified cache line evictions not only from the L2 cache but also from the L1 data cache. It represents the percentage of bus cycles used for explicit write-backs from the processor to memory.
APPENDIX C
INSTRUCTION LATENCY AND THROUGHPUT
This appendix contains tables showing the latency and throughput associated with commonly used instructions¹. The instruction timing data varies across processor families/models. It contains the following sections:
• Appendix C.1, “Overview”: Provides an overview of issues related to instruction selection and scheduling.
• Appendix C.2, “Definitions”: Presents definitions.
• Appendix C.3, “Latency and Throughput”: Lists instruction throughput and latency associated with commonly-used instructions.
C.1
OVERVIEW
This appendix provides information to assembly language programmers and compiler writers. The information aids in the selection of instruction sequences (to minimize chain latency) and in the arrangement of instructions (assists in hardware processing). The performance impact of applying the information has been shown to be on the order of several percent. This is for applications not dominated by other performance factors, such as:
• Cache miss latencies.
• Bus bandwidth.
• I/O bandwidth.
Instruction selection and scheduling matters when the programmer has already addressed the performance issues discussed in Chapter 2:
• Observe store forwarding restrictions.
• Avoid cache line and memory order buffer splits.
• Do not inhibit branch prediction.
• Minimize the use of xchg instructions on memory locations.
While several items on the above list involve selecting the right instruction, this appendix focuses on the following issues. These are listed in priority order, though which item contributes most to performance varies by application:
• Maximize the flow of µops into the execution core. Instructions which consist of more than four µops require additional steps from microcode ROM. Instructions with longer micro-op flows incur a delay in the front end and reduce the supply of micro-ops to the execution core.
  In Pentium 4 and Intel Xeon processors, transfers to microcode ROM often reduce how efficiently µops can be packed into the trace cache. Where possible, it is advisable to select instructions with four or fewer µops. For example, a 32-bit integer multiply with a memory operand fits in the trace cache without going to microcode, while a 16-bit integer multiply to memory does not.
• Avoid resource conflicts. Interleaving instructions so that they don’t compete for the same port or execution unit can increase throughput. For example, alternate PADDQ and PMULUDQ (each has a throughput of one issue per two clock cycles). When interleaved, they can achieve an effective throughput of one instruction per cycle because they use the same port but different execution units. Selecting instructions with fast throughput also helps to preserve issue port bandwidth, hide latency and allows for higher software performance.
• Minimize the latency of dependency chains that are on the critical path. For example, an operation to shift left by two bits executes faster when encoded as two adds than when it is encoded as a shift. If latency is not an issue, the shift results in a denser byte encoding.

1. Although instruction latency may be useful in some limited situations (e.g., a tight loop with a dependency chain that exposes instruction latency), software optimization on super-scalar, out-of-order microarchitecture, in general, will benefit much more from increasing the effective throughput of the larger-scale code path. Coding techniques that rely on instruction latency alone to influence the scheduling of instructions are likely to be sub-optimal, as such techniques are likely to interfere with the out-of-order machine or restrict the amount of instruction-level parallelism.
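The PADDQ/PMULUDQ interleaving example can be checked with a toy issue model in Python. This is a simplified sketch under stated assumptions, not a description of the hardware scheduler: instructions issue in order through one shared port that accepts one instruction per cycle, and each execution unit accepts a new instruction only every two cycles. The function name is hypothetical.

```python
def issue_cycles(ops, unit_interval=2):
    """Cycles needed to issue all ops in order through one shared port.

    ops is a list of instruction names; each distinct name is assumed to
    map to its own execution unit that is busy for unit_interval cycles.
    """
    next_free = {}   # unit name -> earliest cycle the unit accepts a new op
    port_free = 0    # earliest cycle the shared issue port is free
    for op in ops:
        start = max(port_free, next_free.get(op, 0))
        next_free[op] = start + unit_interval
        port_free = start + 1
    return port_free


back_to_back = ["PADDQ"] * 8
interleaved = ["PADDQ", "PMULUDQ"] * 4
print(issue_cycles(back_to_back))  # same unit every time: the unit is the bottleneck
print(issue_cycles(interleaved))   # alternating units: the port is the bottleneck
```

Under this model eight back-to-back PADDQ instructions take 15 cycles to issue, while eight interleaved instructions take 8, which matches the one-per-cycle effective throughput described in the resource-conflict item.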
In addition to the general and specific rules, coding guidelines and the instruction data provided in this manual, you can take advantage of the software performance analysis and tuning toolset available at http://developer.intel.com/software/products/index.htm. The tools include the Intel VTune Performance Analyzer, with its performance-monitoring capabilities.
C.2
DEFINITIONS
The data is listed in several tables. The tables contain the following:
• Instruction Name: The assembly mnemonic of each instruction.
• Latency: The number of clock cycles that are required for the execution core to complete the execution of all of the µops that form an instruction.
• Throughput: The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.
• The case of RDRAND instruction latency and throughput is an exception to the definitions above, because the hardware facility that executes the RDRAND instruction resides in the uncore and is shared by all processor cores and logical processors in a physical package. The software observable latency and throughput using the sequence of “rdrand followed by jnc” in a single-thread scenario can be as low as ~100 cycles. In a third-generation Intel Core processor based on Intel microarchitecture code name Ivy Bridge, the total bandwidth to deliver random numbers via RDRAND by the uncore is about 500 MBytes/sec. Within the same processor core microarchitecture and different uncore implementations, RDRAND latency/throughput can vary across Intel Core and Intel Xeon processors.
C.3
LATENCY AND THROUGHPUT
This section presents the latency and throughput information for commonly-used instructions including: MMX technology, Streaming SIMD Extensions, subsequent generations of SIMD instruction extensions, and most of the frequently used general-purpose integer and x87 floating-point instructions.
Due to the complexity of dynamic execution and out-of-order nature of the execution core, the instruction latency data may not be sufficient to accurately predict realistic performance of actual code sequences based on adding instruction latency data.
• Instruction latency data is useful when tuning a dependency chain. However, dependency chains limit the out-of-order core’s ability to execute micro-ops in parallel. Instruction throughput data are useful when tuning parallel code unencumbered by dependency chains.
• Numeric data in the tables is:
  - Approximate and subject to change in future implementations of the microarchitecture.
  - Not meant to be used as reference for instruction-level performance benchmarks. Comparison of instruction-level performance of microprocessors that are based on different microarchitectures is a complex subject and requires information that is beyond the scope of this manual.
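The distinction between latency-bound and throughput-bound code can be illustrated with a first-order estimate in Python. This is a rough sketch, not a performance predictor: it ignores front-end limits, port contention with other instructions, and memory effects, and the function names are hypothetical. The sample figures are the VPMULLD values listed for 06_4E, 06_5E in Table C-4 (latency 10, throughput 1).

```python
def chain_cycles(n, latency):
    """n instructions where each consumes the previous result (latency-bound)."""
    return n * latency


def independent_cycles(n, latency, throughput):
    """n independent instructions: one issues every `throughput` cycles,
    and the last one still needs `latency` cycles to complete."""
    if n == 0:
        return 0
    return (n - 1) * throughput + latency


n = 100
print(chain_cycles(n, latency=10))                      # dependent chain
print(independent_cycles(n, latency=10, throughput=1))  # independent operations
```

With these inputs the dependent chain is estimated at 1000 cycles and the independent sequence at 109 cycles, which is why latency figures matter for dependency chains and throughput figures matter for parallel code.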
Comparisons of latency and throughput data between different microarchitectures can be misleading.
Appendix C.3.1 provides latency and throughput data for the register-to-register instruction type. Appendix C.3.3 discusses how to adjust latency and throughput specifications for the register-to-memory and memory-to-register instructions.
In some cases, the latency or throughput figures given are just one half of a clock. This occurs only for the double-speed ALUs.
C.3.1
Latency and Throughput with Register Operands
Instruction latency and throughput data are presented in Table C-4 through Table C-18. Tables include AESNI, SSE4.2, SSE4.1, Supplemental Streaming SIMD Extension 3, Streaming SIMD Extension 3, Streaming SIMD Extension 2, Streaming SIMD Extension, MMX technology and most common Intel 64 and IA-32 instructions. Instruction latency and throughput for different processor microarchitectures are in separate columns.
Processor instruction timing data is implementation specific; it can vary between model encodings within the same family encoding (e.g. model = 3 vs model < 2). Separate sets of instruction latency and throughput are shown in the columns for CPUID signature 0xF2n and 0xF3n. The column represented by 0xF3n also applies to Intel processors with CPUID signature 0xF4n and 0xF6n. The notation 0xF2n represents the hex value of the lower 12 bits of the EAX register reported by the CPUID instruction with input value of EAX = 1; ‘F’ indicates the family encoding value is 15, ‘2’ indicates the model encoding is 2, ‘n’ indicates it applies to any value in the stepping encoding.
Intel Core Solo and Intel Core Duo processors are represented by 06_0EH. Processors based on 65 nm Intel Core microarchitecture are represented by 06_0FH. Processors based on Enhanced Intel Core microarchitecture are represented by 06_17H and 06_1DH.
CPUID family/model signatures of processors based on Intel microarchitecture code name Nehalem are represented by 06_1AH, 06_1EH, 06_1FH, and 06_2EH. Processors based on Intel microarchitecture code name Westmere are represented by 06_25H, 06_2CH and 06_2FH. Processors based on Intel microarchitecture code name Sandy Bridge are represented by 06_2AH and 06_2DH. Processors based on Intel microarchitecture code name Ivy Bridge are represented by 06_3AH and 06_3EH. Processors based on Intel microarchitecture code name Haswell are represented by 06_3CH, 06_45H and 06_46H.
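The signature notation can be derived from the EAX value returned by CPUID with EAX = 1, as in the following Python sketch. The text above describes only the lower 12 bits; the extended family and extended model fields used here follow the CPUID description in the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 2A. The function names are hypothetical, the sample EAX values are illustrative, and obtaining EAX from the processor is outside the scope of the sketch.

```python
def display_family_model(eax):
    """Split a CPUID.01H EAX value into (DisplayFamily, DisplayModel, stepping)."""
    stepping = eax & 0xF
    model = (eax >> 4) & 0xF
    family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is added only when the base family field is 0FH.
    display_family = family + ext_family if family == 0xF else family
    # The extended model is prepended only for family 06H or 0FH.
    if family in (0x6, 0xF):
        display_model = (ext_model << 4) + model
    else:
        display_model = model
    return display_family, display_model, stepping


def signature(eax):
    """Format the DisplayFamily_DisplayModel notation used in the tables."""
    family, model, _ = display_family_model(eax)
    return f"{family:02X}_{model:02X}H"


print(signature(0x000306A9))  # family 6, extended model 3, model AH
print(signature(0x00000F29))  # family 0FH, model 2 (the 0xF2n notation)
```

The first call yields 06_3AH and the second 0F_02H; the stepping nibble is decoded but does not take part in the notation.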
Table C-1. CPUID Signature Values of Recent Intel Microarchitectures

DisplayFamily_DisplayModel          Recent Intel Microarchitectures
06_4EH, 06_5EH                      Skylake microarchitecture
06_3DH, 06_47H, 06_56H              Broadwell microarchitecture
06_3CH, 06_45H, 06_46H, 06_3FH      Haswell microarchitecture
06_3AH, 06_3EH                      Ivy Bridge microarchitecture
06_2AH, 06_2DH                      Sandy Bridge microarchitecture
06_25H, 06_2CH, 06_2FH              Intel microarchitecture Westmere
06_1AH, 06_1EH, 06_1FH, 06_2EH      Intel microarchitecture Nehalem
06_17H, 06_1DH                      Enhanced Intel Core microarchitecture
06_0FH                              Intel Core microarchitecture
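Table C-1 can be expressed as a lookup from DisplayFamily and DisplayModel to a microarchitecture name, for example when selecting which column of the latency tables to read. The Python sketch below copies the table rows; the constant and function names are hypothetical, and signatures not listed in Table C-1 return None.

```python
# DisplayModel values for DisplayFamily 06H, copied from Table C-1.
MICROARCHITECTURES = {
    "Skylake microarchitecture": {0x4E, 0x5E},
    "Broadwell microarchitecture": {0x3D, 0x47, 0x56},
    "Haswell microarchitecture": {0x3C, 0x45, 0x46, 0x3F},
    "Ivy Bridge microarchitecture": {0x3A, 0x3E},
    "Sandy Bridge microarchitecture": {0x2A, 0x2D},
    "Intel microarchitecture Westmere": {0x25, 0x2C, 0x2F},
    "Intel microarchitecture Nehalem": {0x1A, 0x1E, 0x1F, 0x2E},
    "Enhanced Intel Core microarchitecture": {0x17, 0x1D},
    "Intel Core microarchitecture": {0x0F},
}


def microarchitecture_for(display_family, display_model):
    """Return the Table C-1 name for a signature, or None if it is not listed."""
    if display_family != 0x06:
        return None
    for name, models in MICROARCHITECTURES.items():
        if display_model in models:
            return name
    return None


print(microarchitecture_for(0x06, 0x3A))
print(microarchitecture_for(0x0F, 0x02))
```

A microarchitecture name identifies a table column only; it does not establish that a given instruction set extension is available on a particular processor.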
Instruction latency varies by microarchitecture. Table C-2 lists the SIMD extensions introduced in recent microarchitectures. Each microarchitecture may be associated with more than one signature value given by the CPUID’s “display_family” and “display_model”. Not all instruction set extensions are enabled in all processors associated with a particular family/model designation. To determine whether a given instruction set extension is supported, software must use the appropriate CPUID feature flag as described in Intel® 64 and IA-32 Architectures Software Developer’s Manual, Volume 2A.
Table C-2. Instruction Extensions Introduction by Microarchitectures (CPUID Signature)

SIMD Instruction              DisplayFamily_DisplayModel
Extensions                    06_4EH, 06_3DH, 06_3CH, 06_3AH, 06_2AH, 06_25H, 06_1AH, 06_17H,
                              06_5EH  06_47H, 06_45H, 06_3EH  06_2DH  06_2CH, 06_1EH, 06_1DH
                                      06_56H  06_46H,                 06_2FH  06_1FH,
                                              06_3FH                          06_2EH
CLFLUSHOPT                    Yes     No      No      No      No      No      No      No
ADX, RDSEED                   Yes     Yes     No      No      No      No      No      No
AVX2, FMA, BMI1, BMI2         Yes     Yes     Yes     No      No      No      No      No
F16C, RDRAND, RWFSGSBASE      Yes     Yes     Yes     Yes     No      No      No      No
AVX                           Yes     Yes     Yes     Yes     Yes     No      No      No
AESNI, PCLMULQDQ              Yes     Yes     Yes     Yes     Yes     Yes     No      No
SSE4.2, POPCNT                Yes     Yes     Yes     Yes     Yes     Yes     Yes     No
SSE4.1                        Yes     Yes     Yes     Yes     Yes     Yes     Yes     Yes
SSSE3                         Yes     Yes     Yes     Yes     Yes     Yes     Yes     Yes
SSE3                          Yes     Yes     Yes     Yes     Yes     Yes     Yes     Yes
SSE2                          Yes     Yes     Yes     Yes     Yes     Yes     Yes     Yes
SSE                           Yes     Yes     Yes     Yes     Yes     Yes     Yes     Yes
MMX                           Yes     Yes     Yes     Yes     Yes     Yes     Yes     Yes
Table C-3. BMI1, BMI2 and General Purpose Instructions

Columns are grouped by DisplayFamily_DisplayModel.

| Instruction | Latency¹: 06_4E, 06_5E | Latency¹: 06_3D, 06_47, 06_56 | Throughput: 06_4E, 06_5E | Throughput: 06_3D, 06_47, 06_56 |
|---|---|---|---|---|
| ADCX | 1 | 1 | 1 | 1 |
| ADOX | 1 | 1 | 1 | 1 |
| RDSEED | Similar to RDRAND | Similar to RDRAND | Similar to RDRAND | Similar to RDRAND |
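Table C-4. 256-bit AVX2 Instructions

Columns are grouped by DisplayFamily_DisplayModel. A letter in parentheses refers to the footnotes at the end of the table.

| Instruction | Latency¹: 06_4E, 06_5E | Latency¹: 06_3D, 06_47, 06_56 | Latency¹: 06_3C, 06_45, 06_46, 06_3F | Throughput: 06_4E, 06_5E | Throughput: 06_3D, 06_47, 06_56 | Throughput: 06_3C, 06_45, 06_46, 06_3F |
|---|---|---|---|---|---|---|
| VEXTRACTI128 xmm1, ymm2, imm | 1 | 1 | 1 | 1 | 1 | 1 |
| VMPSADBW | 4 | 6 | 6 | 2 | 2 | 2 |
| VPACKUSDW/SSWB | 1 | 1 | 1 | 1 | 1 | 1 |
| VPADDB/D/W/Q | 1 | 1 | 1 | 0.33 | 0.5 | 0.5 |
| VPADDSB | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| VPADDUSB | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| VPALIGNR | 1 | 1 | 1 | 1 | 1 | 1 |
| VPAVGB | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| VPBLENDD | 1 | 1 | 1 | 0.33 | 0.33 | 0.33 |
| VPBLENDW | 1 | 1 | 1 | 1 | 1 | 1 |
| VPBLENDVB | 1 | 2 | 2 | 1 | 2 | 2 |
| VPBROADCASTB/D/SS/SD | 3 | 3 | 3 | 1 | 1 | 1 |
| VPCMPEQB/W/D | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| VPCMPEQQ | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| VPCMPGTQ | 3 | 5 | 5 | 1 | 1 | 1 |
| VPHADDW/D/SW | 3 | 3 | 3 | 2 | 2 | 2 |
| VINSERTI128 ymm1, ymm2, xmm, imm | 3 | 3 | 3 | 1 | 1 | 1 |
| VPMADDWD | 5 (b) | 5 | 5 | 0.5 | 1 | 1 |
| VPMADDUBSW | 5 (b) | 5 | 5 | 0.5 | 1 | 1 |
| VPMAXSD | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| VPMAXUD | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| VPMOVSX | 3 | 3 | 3 | 1 | 1 | 1 |
| VPMOVZX | 3 | 3 | 3 | 1 | 1 | 1 |
| VPMULDQ/UDQ | 5 (b) | 5 | 5 | 0.5 | 1 | 1 |
| VPMULHRSW | 5 (b) | 5 | 5 | 0.5 | 1 | 1 |
| VPMULHW/LW | 5 (b) | 5 | 5 | 0.5 | 1 | 1 |
| VPMULLD | 10 (b) | 10 | 10 | 1 | 2 | 2 |
| VPOR/VPXOR | 1 | 1 | 1 | 0.33 | 0.33 | 0.33 |
| VPSADBW | 3 | 5 | 5 | 1 | 1 | 1 |
| VPSHUFB | 1 | 1 | 1 | 1 | 1 | 1 |
| VPSHUFD | 1 | 1 | 1 | 1 | 1 | 1 |
| VPSHUFLW/HW | 1 | 1 | 1 | 1 | 1 | 1 |
| VPSIGNB/D/W/Q | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| VPERMD/PS | 3 | 3 | 3 | 1 | 1 | 1 |
| VPSLLVD/Q | 2 | 2 | 2 | 0.5 | 2 | 2 |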
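Table C-4. 256-bit AVX2 Instructions (Contd.)

Columns are grouped by DisplayFamily_DisplayModel. A letter in parentheses refers to the footnotes at the end of the table.

| Instruction | Latency¹: 06_4E, 06_5E | Latency¹: 06_3D, 06_47, 06_56 | Latency¹: 06_3C, 06_45, 06_46, 06_3F | Throughput: 06_4E, 06_5E | Throughput: 06_3D, 06_47, 06_56 | Throughput: 06_3C, 06_45, 06_46, 06_3F |
|---|---|---|---|---|---|---|
| VPSRAVD | 2 | 2 | 2 | 0.5 | 2 | 2 |
| VPSRAD/W ymm1, ymm2, imm8 | 1 | 1 | 1 | 1 | 1 | 1 |
| VPSLLDQ ymm1, ymm2, imm8 | 1 | 1 | 1 | 1 | 1 | 1 |
| VPSLLQ/D/W ymm1, ymm2, imm8 | 1 | 1 | 1 | 1 | 1 | 1 |
| VPSLLQ/D/W ymm, ymm, ymm | 4 | 4 | 4 | 1 | 1 | 1 |
| VPUNPCKHBW/WD/DQ/QDQ | 1 | 1 | 1 | 1 | 1 | 1 |
| VPUNPCKLBW/WD/DQ/QDQ | 1 | 1 | 1 | 1 | 1 | 1 |
| ALL VFMA | 4 | 5 | 5 | 0.5 | 0.5 | 0.5 |
| VPMASKMOVD/Q mem, ymm (d), ymm | | | | 1 | 2 | 2 |
| VPMASKMOVD/Q NUL, msk_0, ymm | | | | >200 (e) | 2 | 2 |
| VPMASKMOVD/Q ymm, ymm (d), mem | 11 | 8 | 8 | 1 | 2 | 2 |
| VPMASKMOVD/Q ymm, msk_0, [base+index] (f) | >200 | ~200 | ~200 | >200 | ~200 | ~200 |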
b: Includes a 1-cycle bubble due to bypass.
c: Includes two 1-cycle bubbles due to bypass.
d: MASKMOV instruction timing measured with L1 reference and mask register selecting at least one element.
e: A MASKMOV store instruction with a mask value selecting 0 elements and an illegal address (NUL or non-NUL) incurs delay due to assist.
f: A MASKMOV load instruction with a mask value selecting 0 elements and certain addressing forms incurs delay due to assist.
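The latency and throughput columns answer different questions, which a rough back-of-the-envelope model can make concrete. The sketch below assumes that the throughput column gives the number of cycles between issuing successive independent instances of an instruction, and that nothing else (ports, front end, memory, bypass delays) limits execution. The type and function names are hypothetical, and the results are lower-bound estimates, not measurements.

```c
/* Timing of one instruction in core clock cycles, as read from a table row. */
typedef struct {
    double latency;     /* cycles until the result is available */
    double throughput;  /* cycles between successive independent issues (assumed meaning) */
} insn_timing;

/* n instances where each one consumes the previous result: latency bound. */
double dependent_chain_cycles(insn_timing t, unsigned n)
{
    return n * t.latency;
}

/* n independent instances: steady-state issue, then the last result drains. */
double independent_cycles(insn_timing t, unsigned n)
{
    if (n == 0)
        return 0.0;
    return (n - 1) * t.throughput + t.latency;
}
```

Using the VPMULLD entry for 06_4E, 06_5E (latency 10, throughput 1), eight dependent multiplies are estimated at 80 cycles, while eight independent multiplies are estimated at 17 cycles under this model.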
Table C-5. Gather Timing Data from L1D*

Columns are grouped by DisplayFamily_DisplayModel.

| Instruction | Latency¹: 06_4E, 06_5E | Latency¹: 06_3D, 06_47, 06_56 | Latency¹: 06_3C/45/46/3F | Throughput: 06_4E, 06_5E | Throughput: 06_3D, 06_47, 06_56 | Throughput: 06_3C/45/46/3F |
|---|---|---|---|---|---|---|
| VPGATHERDD/PS xmm, [vi128], xmm | ~20 | ~17 | ~14 | ~4 | ~5 | ~7 |
| VPGATHERQQ/PD xmm, [vi128], xmm | ~18 | ~15 | ~12 | ~3 | ~4 | ~5 |
| VPGATHERDD/PS ymm, [vi256], ymm | ~22 | ~19 | ~20 | ~5 | ~6 | ~10 |
| VPGATHERQQ/PD ymm, [vi256], ymm | ~20 | ~16 | ~15 | ~4 | ~5 | ~7 |

* Gather instructions fetch data elements via memory references. The timing data shown applies to memory references that reside within the L1 data cache and all mask elements selected.
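For readers unfamiliar with what a gather does per element, the following scalar model may help. It is a sketch of the behavior of a masked doubleword gather written as a plain loop, not the instruction itself and not a way to reproduce the timings above. The function name is hypothetical, the scale factor is fixed at the element size, and fault handling is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Scalar model of a masked 32-bit gather over n elements.
 * An element is loaded only when the most significant bit of its
 * mask entry is set; other destination elements are left unchanged.
 * The mask is cleared on completion, as the instruction does.
 */
void gather_dd_model(int32_t *dst, const int32_t *base,
                     const int32_t *index, int32_t *mask, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (mask[i] < 0)               /* sign bit set: element selected */
            dst[i] = base[index[i]];   /* one independent memory reference */
        mask[i] = 0;
    }
}
```

Each selected element is a separate memory reference, which is consistent with gather latencies in Table C-5 being well above those of ordinary vector loads.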
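Table C-6. BMI1, BMI2 and General Purpose Instructions

Columns are grouped by DisplayFamily_DisplayModel.

| Instruction | Latency¹: 06_4E, 06_5E | Latency¹: 06_3D, 06_47, 06_56 | Latency¹: 06_3C/45/46/3F | Throughput: 06_4E, 06_5E | Throughput: 06_3D, 06_47, 06_56 | Throughput: 06_3C/45/46/3F |
|---|---|---|---|---|---|---|
| ANDN | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| BEXTR | 2 | 2 | 2 | 0.5 | 0.5 | 0.5 |
| BLSI/BLSMSK/BLSR | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| BZHI | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| MULX r64, r64, r64 | 4 | 4 | 4 | 1 | 1 | 1 |
| PDEP/PEXT r64, r64, r64 | 3 | 3 | 3 | 1 | 1 | 1 |
| RORX r64, r64, imm8 | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| SARX/SHLX/SHRX r64, r64, r64 | 1 | 1 | 1 | 0.5 | 0.5 | 0.5 |
| LZCNT/TZCNT | 3 | 3 | 3 | 1 | 1 | 1 |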
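Several of the BMI1 and BMI2 instructions in Table C-6 replace short sequences of ordinary bitwise operations. The sketch below writes out portable C equivalents for the result values of ANDN, BLSI, BLSMSK, BLSR and BZHI on 64-bit operands. The function names are hypothetical, flag results are not modeled, and a compiler may or may not turn these expressions into the corresponding instructions.

```c
#include <stdint.h>

/* ANDN: bitwise AND of the inverted first operand with the second. */
uint64_t andn64(uint64_t a, uint64_t b) { return ~a & b; }

/* BLSI: isolate the lowest set bit. */
uint64_t blsi64(uint64_t x) { return x & (0 - x); }

/* BLSMSK: mask of all bits up to and including the lowest set bit. */
uint64_t blsmsk64(uint64_t x) { return x ^ (x - 1); }

/* BLSR: reset (clear) the lowest set bit. */
uint64_t blsr64(uint64_t x) { return x & (x - 1); }

/* BZHI: zero the bits at position n and above; n is taken from the low byte. */
uint64_t bzhi64(uint64_t x, unsigned n)
{
    n &= 0xFF;
    return (n < 64) ? (x & ((UINT64_C(1) << n) - 1)) : x;
}
```

For example, with x = 0x58 the sketch gives 0x08 for BLSI, 0x0F for BLSMSK and 0x50 for BLSR.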
Table C-7. F16C,RDRAND Instructions Instruction
Latency 1
DisplayFamily_DisplayModel
06_4E, 06_5E
06_3D, 06_47, 06_56
06_3C/ 45/46/ 3F
06_3A/ 3E
06_4E, 06_5E
06_3D, 06_47, 06_56
06_3C/ 45/46/ 3F
06_3A/ 3E
RDRAND* r64
Varies
Varies
Varies