Parallel Computer Architecture: A Hardware/Software Approach 1558603433, 1865843830, 9781558603431

The most exciting development in parallel computer architecture is the convergence of traditionally disparate approaches


English Pages 1025 [1056] Year 1998;1999


Table of contents :
Cover......Page 1
About The Authors......Page 5
Title......Page 6
Copyright......Page 7
Contents......Page 12
Foreword......Page 10
Preface......Page 22
1. Introduction......Page 32
1.1 Why Parallel Architecture......Page 35
1.1.1 Application Trends......Page 37
1.1.2 Technology Trends......Page 43
1.1.3 Architectural Trends......Page 45
1.1.4 Supercomputers......Page 52
1.1.5 Summary......Page 54
1.2.1 Communication Architecture......Page 56
1.2.2 Shared Address Space......Page 59
1.2.3 Message Passing......Page 68
1.2.4 Convergence......Page 73
1.2.5 Data Parallel Processing......Page 75
1.2.6 Other Parallel Architectures......Page 78
1.2.7 A Generic Parallel Architecture......Page 81
1.3 Fundamental Design Issues......Page 83
1.3.2 Programming Model Requirements......Page 84
1.3.3 Communication and Replication......Page 89
1.3.4 Performance......Page 90
1.4 Concluding Remarks......Page 94
1.5 Historical References......Page 97
1.6 Exercises......Page 101
2. Parallel Programs......Page 106
2.1 Parallel Application Case Studies......Page 107
2.1.1 Simulating Ocean Currents......Page 108
2.1.2 Simulating the Evolution of Galaxies......Page 109
2.1.3 Visualizing Complex Scenes Using Ray Tracing......Page 110
2.1.4 Mining Data for Associations......Page 111
2.2 The Parallelization Process......Page 112
2.2.1 Steps in the Process......Page 113
2.2.2 Parallelizing Computation versus Data......Page 121
2.2.3 Goals of the Parallelization Process......Page 122
2.3.1 The Equation Solver Kernel......Page 123
2.3.2 Decomposition......Page 124
2.3.3 Assignment......Page 129
2.3.4 Orchestration under the Data Parallel Model......Page 130
2.3.5 Orchestration under the Shared Address Space Model......Page 132
2.3.6 Orchestration under the Message-Passing Model......Page 139
2.4 Concluding Remarks......Page 147
2.5 Exercises......Page 148
3. Programming For Performance......Page 152
3.1.1 Load Balance and Synchronization Wait Time......Page 154
3.1.2 Reducing Inherent Communication......Page 162
3.1.3 Reducing the Extra Work......Page 166
3.1.4 Summary......Page 167
3.2 Data Access And Communication In A Multimemory System......Page 168
3.2.1 A Multiprocessor as an Extended Memory Hierarchy......Page 169
3.2.2 Artifactual Communication in the Extended Memory Hierarchy......Page 170
3.2.3 Artifactual Communication and Replication: The Working Set Perspective......Page 171
3.3.1 Reducing Artifactual Communication......Page 173
3.3.2 Structuring Communication to Reduce Cost......Page 181
3.4 Performance Factors From The Processor's Perspective......Page 187
3.5 The Parallel Application Case Studies: An In-Depth Look......Page 191
3.5.1 Ocean......Page 192
3.5.2 Barnes-Hut......Page 197
3.5.3 Raytrace......Page 205
3.5.4 Data Mining......Page 209
3.6 Implications For Programming Models......Page 213
3.6.2 Replication......Page 215
3.6.3 Overhead and Granularity of Communication......Page 217
3.6.4 Block Data Transfer......Page 218
3.6.6 Hardware Cost and Design Complexity......Page 219
3.6.8 Summary......Page 220
3.7 Concluding Remarks......Page 221
3.8 Exercises......Page 223
4. Workload-Driven Evaluation......Page 230
4.1.1 Basic Measures of Multiprocessor Performance......Page 233
4.1.2 Why Worry about Scaling?......Page 235
4.1.3 Key Issues in Scaling......Page 237
4.1.4 Scaling Models and Speedup Measures......Page 238
4.1.5 Impact of Scaling Models on the Equation Solver Kernel......Page 242
4.1.6 Scaling Workload Parameters......Page 244
4.2.1 Performance Isolation Using Microbenchmarks......Page 246
4.2.2 Choosing Workloads......Page 247
4.2.3 Evaluating a Fixed-Size Machine......Page 252
4.2.4 Varying Machine Size......Page 257
4.2.5 Choosing Performance Metrics......Page 259
4.3 Evaluating An Architectural Idea or Trade-Off......Page 262
4.3.1 Multiprocessor Simulation......Page 264
4.3.2 Scaling Down Problem and Machine Parameters for Simulation......Page 265
4.3.3 Dealing with the Parameter Space: An Example Evaluation......Page 269
4.4 Illustrating Workload Characterization......Page 274
4.4.1 Workload Case Studies......Page 275
4.4.2 Workload Characteristics......Page 284
4.5 Concluding Remarks......Page 293
4.6 Exercises......Page 294
5. Shared Memory Multiprocessors......Page 300
5.1.1 The Cache Coherence Problem......Page 304
5.1.2 Cache Coherence through Bus Snooping......Page 308
5.2 Memory Consistency......Page 314
5.2.1 Sequential Consistency......Page 317
5.2.2 Sufficient Conditions for Preserving Sequential Consistency......Page 320
5.3 Design Space For Snooping Protocols......Page 322
5.3.1 A Three-State (MSI) Write-Back Invalidation Protocol......Page 324
5.3.2 A Four-State (MESI) Write-Back Invalidation Protocol......Page 330
5.3.3 A Four-State (Dragon) Write-Back Update Protocol......Page 332
5.4 Assessing Protocol Design Trade-Offs......Page 336
5.4.1 Methodology......Page 337
5.4.2 Bandwidth Requirement under the MESI Protocol......Page 338
5.4.3 Impact of Protocol Optimizations......Page 342
5.4.4 Trade-Offs in Cache Block Size......Page 344
5.4.5 Update-Based versus Invalidation-Based Protocols......Page 360
5.5 Synchronization......Page 365
5.5.1 Components of a Synchronization Event......Page 366
5.5.2 Role of the User and System......Page 367
5.5.3 Mutual Exclusion......Page 368
5.5.4 Point-to-Point Event Synchronization......Page 383
5.5.5 Global (Barrier) Event Synchronization......Page 384
5.5.6 Synchronization Summary......Page 389
5.6 Implications For Software......Page 390
5.7 Concluding Remarks......Page 397
5.8 Exercises......Page 398
6. Snoop-Based Multiprocessor Design......Page 408
6.1 Correctness Requirements......Page 409
6.2 Base Design: Single-Level Caches With An Atomic Bus......Page 411
6.2.1 Cache Controller and Tag Design......Page 412
6.2.2 Reporting Snoop Results......Page 413
6.2.3 Dealing with Write Backs......Page 415
6.2.5 Nonatomic State Transitions......Page 416
6.2.6 Serialization......Page 419
6.2.8 Livelock and Starvation......Page 421
6.2.9 Implementing Atomic Operations......Page 422
6.3 Multilevel Cache Hierarchies......Page 424
6.3.1 Maintaining Inclusion......Page 425
6.3.2 Propagating Transactions for Coherence in the Hierarchy......Page 427
6.4 Split-Transaction Bus......Page 429
6.4.2 Bus Design and Request-Response Matching......Page 431
6.4.3 Snoop Results and Conflicting Requests......Page 433
6.4.5 Path of a Cache Miss......Page 435
6.4.6 Serialization and Sequential Consistency......Page 437
6.4.7 Alternative Design Choices......Page 440
6.4.8 Split-Transaction Bus with Multilevel Caches......Page 441
6.4.9 Supporting Multiple Outstanding Misses from a Processor......Page 444
6.5 Case Studies: SGI Challenge And Sun Enterprise 6000......Page 446
6.5.1 SGI Powerpath-2 System Bus......Page 448
6.5.2 SGI Processor and Memory Subsystems......Page 451
6.5.3 SGI I/O Subsystem......Page 453
6.5.5 Sun Gigaplane System Bus......Page 455
6.5.6 Sun Processor and Memory Subsystem......Page 458
6.5.9 Application Performance......Page 460
6.6 Extending Cache Coherence......Page 464
6.6.1 Shared Cache Designs......Page 465
6.6.2 Coherence for Virtually Indexed Caches......Page 468
6.6.3 Translation Lookaside Buffer Coherence......Page 470
6.6.4 Snoop-Based Cache Coherence on Rings......Page 472
6.6.5 Scaling Data and Snoop Bandwidth in Bus-Based Systems......Page 476
6.8 Exercises......Page 477
7. Scalable Multiprocessors......Page 484
7.1 Scalability......Page 487
7.1.1 Bandwidth Scaling......Page 488
7.1.2 Latency Scaling......Page 491
7.1.3 Cost Scaling......Page 492
7.1.4 Physical Scaling......Page 493
7.1.5 Scaling in a Generic Parallel Architecture......Page 498
7.2 Realizing Programming Models......Page 499
7.2.1 Primitive Network Transactions......Page 501
7.2.2 Shared Address Space......Page 504
7.2.3 Message Passing......Page 507
7.2.4 Active Messages......Page 512
7.2.5 Common Challenges......Page 513
7.2.6 Communication Architecture Design Space......Page 516
7.3.1 Node-to-Network Interface......Page 517
7.3.3 A Case Study: nCUBE/2......Page 519
7.3.4 Typical LAN Interfaces......Page 521
7.4.1 Node-to-Network Interface......Page 522
7.4.2 Case Study: Thinking Machines CM-5......Page 524
7.4.3 User-Level Handlers......Page 525
7.5 Dedicated Message Processing......Page 527
7.5.1 Case Study: Intel Paragon......Page 530
7.5.2 Case Study: Meiko CS-2......Page 534
7.6 Shared Physical Address Space......Page 537
7.6.1 Case Study: CRAY T3D......Page 539
7.6.2 Case Study: CRAY T3E......Page 543
7.7 Clusters And Networks Of Workstations......Page 544
7.7.1 Case Study: Myrinet SBUS Lanai......Page 547
7.7.2 Case Study: PCI Memory Channel......Page 549
7.8.1 Network Transaction Performance......Page 553
7.8.2 Shared Address Space Operations......Page 558
7.8.3 Message-Passing Operations......Page 559
7.8.4 Application-Level Performance......Page 562
7.9.1 Algorithms for Locks......Page 569
7.9.2 Algorithms for Barriers......Page 573
7.11 Exercises......Page 579
8. Directory-Based Cache Coherence......Page 584
8.1 Scalable Cache Coherence......Page 589
8.2 Overview Of Directory-Based Approaches......Page 590
8.2.1 Operation of a Simple Directory Scheme......Page 591
8.2.2 Scaling......Page 595
8.2.3 Alternatives for Organizing Directories......Page 596
8.3.1 Data Sharing Patterns for Directory Schemes......Page 602
8.3.2 Local versus Remote Traffic......Page 609
8.4 Design Challenges For Directory Protocols......Page 610
8.4.1 Performance......Page 615
8.4.2 Correctness......Page 620
8.5 Memory-Based Directory Protocols: The SGI Origin System......Page 627
8.5.1 Cache Coherence Protocol......Page 628
8.5.2 Dealing with Correctness Issues......Page 635
8.5.3 Details of Directory Structure......Page 640
8.5.4 Protocol Extensions......Page 641
8.5.5 Overview of the Origin2000 Hardware......Page 643
8.5.6 Hub Implementation......Page 645
8.5.7 Performance Characteristics......Page 649
8.6 Cache-Based Directory Protocols: The Sequent NUMA-Q......Page 653
8.6.1 Cache Coherence Protocol......Page 655
8.6.2 Dealing with Correctness Issues......Page 663
8.6.3 Protocol Extensions......Page 665
8.6.4 Overview of NUMA-Q Hardware......Page 666
8.6.5 Protocol Interactions with SMP Node......Page 668
8.6.6 IQ-Link Implementation......Page 670
8.6.7 Performance Characteristics......Page 672
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor......Page 674
8.7 Performance Parameters And Protocol Performance......Page 676
8.8 Synchronization......Page 679
8.8.1 Performance of Synchronization Algorithms......Page 680
8.8.2 Implementing Atomic Primitives......Page 682
8.9 Implications For Parallel Software......Page 683
8.10.1 Reducing Directory Storage Overhead......Page 686
8.10.2 Hierarchical Coherence......Page 690
8.11 Concluding Remarks......Page 700
8.12 Exercises......Page 703
9. Hardware/Software Trade-Offs......Page 710
9.1 Relaxed Memory Consistency Models......Page 712
9.1.1 The System Specification......Page 717
9.1.2 The Programmer’s Interface......Page 725
9.1.4 Consistency Models in Real Multiprocessor Systems......Page 729
9.2.1 Tertiary Caches......Page 731
9.2.2 Cache-Only Memory Architectures (COMA)......Page 732
9.3 Reducing Hardware Cost......Page 736
9.3.2 Access Control through Code Instrumentation......Page 738
9.3.3 Page-Based Access Control: Shared Virtual Memory......Page 740
9.3.4 Access Control through Language and Compiler Support......Page 752
9.4 Putting It All Together: A Taxonomy And Simple COMA......Page 755
9.4.1 Putting It All Together: Simple COMA and Stache......Page 757
9.5 Implications For Parallel Software......Page 760
9.6.1 Flexibility and Address Constraints in CC-NUMA Systems......Page 761
9.6.2 Implementing Relaxed Memory Consistency in Software......Page 763
9.7 Concluding Remarks......Page 770
9.8 Exercises......Page 771
10. Interconnection Network Design......Page 780
10.1 Basic Definitions......Page 781
10.2.1 Latency......Page 786
10.2.2 Bandwidth......Page 792
10.3.1 Links......Page 795
10.3.2 Switches......Page 798
10.4.1 Fully Connected Network......Page 799
10.4.3 Multidimensional Meshes and Tori......Page 800
10.4.4 Trees......Page 803
10.4.5 Butterflies......Page 805
10.4.6 Hypercubes......Page 809
10.5 Evaluating Design Trade-Offs In Network Topology......Page 810
10.5.1 Unloaded Latency......Page 811
10.5.2 Latency under Load......Page 816
10.6.1 Routing Mechanisms......Page 820
10.6.2 Deterministic Routing......Page 821
10.6.3 Deadlock Freedom......Page 822
10.6.4 Virtual Channels......Page 826
10.6.5 Up*-Down* Routing......Page 827
10.6.6 Turn-Model Routing......Page 828
10.6.7 Adaptive Routing......Page 830
10.7 Switch Design......Page 832
10.7.2 Internal Datapath......Page 833
10.7.3 Channel Buffers......Page 835
10.7.4 Output Scheduling......Page 839
10.7.5 Stacked Dimension Switches......Page 841
10.8.1 Parallel Computer Networks versus LANs and WANs......Page 842
10.8.2 Link-Level Flow Control......Page 844
10.8.3 End-to-End Flow Control......Page 847
10.9.1 CRAY T3D Network......Page 849
10.9.2 IBM SP-1, SP-2 Network......Page 851
10.9.3 Scalable Coherent Interface......Page 853
10.9.4 SGI Origin Network......Page 856
10.9.5 Myricom Network......Page 857
10.10 Concluding Remarks......Page 858
10.11 Exercises......Page 859
11. Latency Tolerance......Page 862
11.1 Overview Of Latency Tolerance......Page 865
11.1.1 Latency Tolerance and the Communication Pipeline......Page 867
11.1.2 Approaches......Page 868
11.1.3 Fundamental Requirements, Benefits, and Limitations......Page 871
11.2 Latency Tolerance In Explicit Message Passing......Page 878
11.2.3 Precommunication......Page 879
11.2.5 Multithreading......Page 881
11.3 Latency Tolerance In A Shared Address Space......Page 882
11.3.1 Structure of Communication......Page 883
11.4.1 Techniques and Mechanisms......Page 884
11.4.2 Policy Issues and Trade-Offs......Page 885
11.4.3 Performance Benefits......Page 887
11.5 Proceeding Past Long-Latency Events......Page 894
11.5.1 Proceeding Past Writes......Page 895
11.5.2 Proceeding Past Reads......Page 899
11.5.3 Summary......Page 907
11.6.1 Shared Address Space without Caching of Shared Data......Page 908
11.6.2 Cache-Coherent Shared Address Space......Page 910
11.6.3 Performance Benefits......Page 922
11.7 Multithreading In A Shared Address Space......Page 927
11.7.1 Techniques and Mechanisms......Page 929
11.7.2 Performance Benefits......Page 941
11.7.3 Implementation Issues for the Blocked Scheme......Page 945
11.7.4 Implementation Issues for the Interleaved Scheme......Page 948
11.7.5 Integrating Multithreading with Multiple-Issue Processors......Page 951
11.8 Lockup-Free Cache Design......Page 953
11.9 Concluding Remarks......Page 957
11.10 Exercises......Page 958
12. Future Directions......Page 966
12.1 Technology And Architecture......Page 967
12.1.1 Evolutionary Scenario......Page 968
12.1.2 Hitting a Wall......Page 971
12.1.3 Potential Breakthroughs......Page 975
12.2.1 Evolutionary Scenario......Page 986
12.2.2 Hitting a Wall......Page 991
12.2.3 Potential Breakthroughs......Page 992
A.2 TPC......Page 994
A.3 SPLASH......Page 996
A.4 NAS Parallel Benchmarks......Page 997
A.5 PARKBENCH......Page 998
A.6 Other Ongoing Efforts......Page 999
References......Page 1000
Index......Page 1026
