Parallel Dynamic and Transient Simulation of Large-Scale Power Systems: A High Performance Computing Solution

This textbook introduces methods of accelerating transient stability (dynamic) simulation and electromagnetic transient simulation of large-scale power systems.


Table of contents:
Preface
Acknowledgements
Contents
Acronyms
1 Many-Core Processors
1.1 Introduction
1.2 Single-Thread, Multi-Thread, and Massive-Thread Programming
1.3 Performance of Parallel Programs
1.3.1 Amdahl's Law
1.3.2 Gustafson-Barsis' Law
1.4 NVIDIA® GPU Architecture
1.5 CUDA Abstraction
1.5.1 Performance Tuning
1.5.2 Heterogeneous Programming
1.5.3 Dynamic Parallelism
1.6 Programming of Multi-core CPUs
1.7 Summary
2 Large-Scale Transient Stability Simulation of AC Systems
2.1 Introduction
2.2 Definitions
2.2.1 Transient
2.2.2 Stability
2.2.3 Transient Stability
2.3 Transient Stability Problem and Classical Solutions
2.3.1 Transient Stability Modeling
2.4 Parallel Solution of Large-Scale DAE Systems
2.4.1 Tearing
2.4.2 Relaxation
2.5 Power System Specific Approaches
2.5.1 Diakoptics
2.5.2 Parallel-in-Space Methods
2.5.3 Parallel-in-Time Methods
2.5.4 Waveform Relaxation
2.5.5 Instantaneous Relaxation
2.5.6 Coherency-Based System Partitioning
2.5.7 Types of Parallelism Utilized for Transient Stability Simulation
2.6 SIMD-Based Standard Transient Stability Simulation on the GPU
2.6.1 GPU-Based Programming Models
2.6.1.1 Hybrid GPU-CPU Simulation
2.6.1.2 GPU-Only Simulation
2.6.2 Case Study
2.6.2.1 Simulation Accuracy Evaluation
2.6.2.2 Computational Efficiency Evaluation
2.6.2.3 Discussion
2.7 Multi-GPU Implementation of Large-Scale Transient Stability Simulation
2.7.1 Computing System Architecture
2.7.2 Multi-GPU Programming
2.7.3 Implementation of Parallel Transient Stability Methods on Tesla® S1070
2.7.3.1 Tearing Methods on Multiple GPUs
2.7.3.2 Relaxation Methods on Multiple GPUs
2.7.4 Case Studies
2.7.4.1 Transparency
2.7.4.2 Scalability
2.7.4.3 LU Factorization Timing
2.8 Summary
3 Large-Scale Electromagnetic Transient Simulation of AC Systems
3.1 Introduction
3.2 Massive-Threading Parallel Equipment and Numerical Solver Modules
3.2.1 Linear Passive Components
3.2.1.1 Massive-Thread Parallel Implementation
3.2.2 Transmission Line
3.2.2.1 Frequency-Domain Formulation
3.2.2.2 Time-Domain Implementation
3.2.2.3 Interpolation
3.2.2.4 Massive-Thread Parallel Implementation
3.2.3 Nonlinear Components
3.2.3.1 Formulation
3.2.3.2 Massive-Thread Parallel Implementation
3.2.4 Forward-Backward Substitution with LU Factorization
3.2.4.1 Formulation
3.2.4.2 Massive-Thread Parallel Implementation
3.2.5 Compensation Method Interface
3.2.6 Synchronous Machine
3.2.6.1 Electrical Part
3.2.6.2 Mechanical Part
3.2.6.3 Control System Interface
3.2.6.4 Massive-Thread Parallel Implementation
3.2.7 Transformer with Magnetic Saturation
3.3 Shattering Network Decomposition
3.3.1 First-Level Decomposition (Coarse-Grained)
3.3.2 Second-Level Decomposition (Fine-Grained)
3.3.2.1 Compensation Network Decomposition
3.3.2.2 Jacobian Domain Decomposition
3.3.3 Massively Parallel EMT Simulator
3.3.3.1 Simulator Framework
3.3.3.2 Fine-Grained EMT Simulation Implementation on GPUs
3.3.3.3 Linear Side
3.3.3.4 Nonlinear Side
3.3.4 Balance and Synchronization
3.4 Simulation Case Studies
3.4.1 Case Study A
3.4.2 Case Study B
3.5 Summary
4 Device-Level Modeling and Transient Simulation of Power Electronic Switches
4.1 Introduction
4.2 Nonlinear Behavioral Model
4.2.1 Diode Behavioral Model
4.2.1.1 Basic Model Description
4.2.1.2 Parallel Massive-Thread Mapping
4.2.2 IGBT Behavioral Model
4.2.2.1 Parallel Massive-Thread Mapping
4.2.3 Electrothermal Network
4.2.4 Complete IGBT/Diode Model
4.2.5 Model Validation
4.3 Physics-Based Model
4.3.1 Physics-Based Diode Model
4.3.1.1 Model Formulation
4.3.1.2 Physics-Based Diode EMT Model
4.3.1.3 Parallel Massive-Thread Mapping
4.3.2 Physics-Based Nonlinear IGBT Model
4.3.2.1 Model Formulation
4.3.2.2 Model Discretization and Linearization
4.3.2.3 Parallel Massive-Thread Mapping
4.3.3 Model Validation
4.4 Nonlinear Dynamic Model
4.4.1 Dynamic Transistor Model
4.4.1.1 Parasitic Dynamics
4.4.1.2 Freewheeling Diode
4.4.2 IGBT EMT Model Derivation
4.4.2.1 Parallel Massive-Thread Mapping
4.4.3 Wideband SiC MOSFET Model
4.4.4 IGBT Model Validation
4.5 Predefined Curve-Fitting Model
4.5.1 Model Validation
4.6 High-Order Nonlinear Model Equivalent Circuit
4.7 Summary
5 Large-Scale Electromagnetic Transient Simulation of DC Grids
5.1 Introduction
5.2 Generic MTDC Grid Fine-Grained Partitioning
5.2.1 Level-One Partitioning: Universal Line Model
5.2.2 Level-Two Partitioning: TLM Link
5.2.3 MTDC Multi-Level Partitioning Scheme
5.2.4 Level-Three Partitioning: Coupled Voltage-Current Sources
5.2.4.1 MMC Internal Reconfiguration
5.2.4.2 MMC GPU Kernel
5.2.4.3 Hybrid HVDC Circuit Breaker Modeling
5.2.4.4 HHB GPU Kernel
5.2.5 Kernel Dynamic Parallelism for Large-Scale MTDC Grids
5.2.6 GPU Implementation Results and Validation
5.2.6.1 Basic MMC Tests
5.2.6.2 Point-to-Point HVDC Transmission Tests
5.2.6.3 MTDC Grid Test Cases
5.3 General Nonlinear MMC Parallel Solution Method
5.3.1 Massive-Thread Parallel Implementation of Newton-Raphson Method
5.3.1.1 Algorithm
5.3.1.2 Massive-Thread Parallel Implementation
5.3.2 Block Jacobian Matrix Decomposition for MMC
5.3.2.1 Matrix Update Using a Relaxation Algorithm
5.3.2.2 Partial LU Decomposition for MMC
5.3.2.3 Blocked Forward and Backward Substitutions
5.3.3 Parallel Massive-Thread Mapping
5.3.4 Predictor-Corrector Variable Time-Stepping Scheme
5.3.5 Case Studies and Data Analysis
5.3.5.1 Test Case Setup
5.3.5.2 Results and Comparison
5.3.5.3 Execution Time and Speedup Comparison
5.4 TR-Based Nonlinear MMC Parallel Solution Method
5.4.1 Wind Farm Modeling
5.4.1.1 Induction Machine Model
5.4.1.2 DFIG Aggregation
5.4.2 Nonlinear MMC Modeling and Computation
5.4.2.1 IGBT/Diode Grouping
5.4.2.2 Fine-Grained MMC GPU Kernel Design
5.4.2.3 Parallel Simulation Architecture
5.4.3 GPU Implementation Results and Validation
5.4.3.1 Wind Farm Integration Dynamics
5.4.3.2 MTDC System Tests
5.5 Hierarchical MMC Device-Level Modeling
5.5.1 Heterogeneous Computing Architecture
5.5.1.1 Boundary Definition
5.5.1.2 CPU/GPU Computing Architecture
5.5.2 Heterogeneous CPU/GPU HPC Results and Validation
5.5.2.1 HBSM-MMC-Based DC Systems
5.5.2.2 CDSM MMC-HVDC
5.6 Summary
6 Heterogeneous Co-simulation of AC-DC Grids with Renewable Energy
6.1 Introduction
6.2 Variable Time-Stepping Simulation
6.2.1 Variable Time-Stepping Schemes
6.2.1.1 Event-Correlated Criterion
6.2.1.2 Local Truncation Error
6.2.1.3 Newton-Raphson Iteration Count
6.2.2 Hybrid Time-Step Control and Synchronization
6.2.3 VTS-Based MMC Models
6.2.3.1 TSSM-Based MMC VTS Model
6.2.3.2 MMC Main Circuit VTS Model
6.2.3.3 NBM-Based VTS Submodule Model
6.2.4 VTS-Based MMC GPU Kernel Design
6.2.5 VTS Simulation Results and Validation
6.2.5.1 System Setup
6.2.5.2 VTS in Device- and System-Level Simulation on CPU
6.2.5.3 Nonlinear MTDC System Preview on GPU
6.3 Heterogeneous CPU-GPU Computing
6.3.1 Detailed Photovoltaic System EMT Model
6.3.1.1 Basic PV Unit
6.3.1.2 Scalable PV Array Model
6.3.2 PV-Integrated AC-DC Grid
6.3.2.1 AC Grid
6.3.2.2 Multi-Terminal DC System
6.3.2.3 PV Plant
6.3.2.4 EMT-Dynamic Co-simulation Interfaces
6.3.3 Heterogeneous Computing Framework
6.3.3.1 CPU-GPU Program Architecture
6.3.3.2 Co-simulation Implementation
6.3.4 EMT-Dynamic Co-simulation Results
6.3.4.1 PV Array
6.3.4.2 AC-DC Grid Interaction
6.4 Adaptive Sequential-Parallel Simulation
6.4.1 Wind Generation Model Reconfiguration
6.4.1.1 Induction Generator Model
6.4.1.2 Three-Phase Transformer
6.4.1.3 DFIG Converter System
6.4.2 Integrated AC-DC Grid Modeling
6.4.2.1 Detailed EMT Modeling
6.4.3 Adaptive Sequential-Parallel Processing
6.4.3.1 Heterogeneous CPU-GPU Processing Boundary Definition
6.4.3.2 Adaptive Sequential-Parallel Processing Framework
6.4.4 EMT-Dynamic Co-simulation Results
6.5 Summary
7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids
7.1 Introduction
7.2 Parallel-in-Time Modeling
7.2.1 Parareal Algorithm
7.2.2 Component Models
7.2.2.1 Inductor/Capacitor Models
7.2.2.2 Transformer Model
7.2.2.3 Generator Model
7.2.3 Transmission Line Model
7.3 Parareal Application to AC System EMT Simulation
7.3.1 Component-Based System Architecture
7.3.2 Fixed Algorithm Implementation
7.3.2.1 Initialization
7.3.2.2 Parallel Operation
7.3.2.3 Sequential Update
7.3.2.4 Output
7.3.3 Windowed Algorithm Implementation
7.3.4 Case Studies
7.4 Parallel-in-Time EMT Modeling of MMCs for DC Grid Simulation
7.4.1 Modular Multilevel Converter Modeling
7.4.1.1 Three-Phase MMC Modeling
7.4.1.2 Ideal Switch Model
7.4.1.3 Transient Curve-Fitting Model
7.4.2 Parallel-in-Time Implementations
7.4.2.1 Parareal Implementation
7.4.2.2 Hybrid Model
7.4.2.3 Transmission Line-Based PiT+PiS Integration
7.4.3 Simulation Case Studies
7.4.3.1 Case 1: Single MMC
7.4.3.2 Case 2: CIGRÉ Multi-Terminal DC Grid
7.5 Hybrid Parallel-in-Time-and-Space Dynamic-EMT Simulation of AC-DC Grids
7.5.1 AC and DC System Modeling
7.5.2 GPU-Based AC Grid TS Simulation
7.5.3 CPU-Based MTDC Grid EMT Simulation
7.5.4 Case Studies and Performance Evaluation
7.5.4.1 Validation of PiT and AC-DC Co-simulation
7.5.4.2 Computational Performance
7.6 Summary
8 Multi-Physics Modeling and Simulation of AC-DC Grids
8.1 Introduction
8.2 Component-Level Thermo-Electromagnetic Nonlinear Transient Finite Element Modeling of Solid-State Transformer for DC Grid Studies
8.2.1 MMC Nonlinear Component-Level Modeling
8.2.1.1 Fine-Grained MMC Partitioning
8.2.1.2 MMC-Based DC Grid
8.2.2 Finite Element Transformer Model
8.2.2.1 Finite Element Model and Field-Circuit Coupling
8.2.2.2 Matrix-Free Finite Element Solution
8.2.2.3 Multi-Domain Interfacing
8.2.3 Coupled Field-Circuit SST Kernel
8.2.4 Massively Parallel Co-simulation Results
8.2.4.1 SST in DC Grid
8.2.4.2 Finite Element Results
8.3 Integrated Field-Transient Parallel Simulation of Converter Transformer Interaction with MMC in Multi-Terminal DC Grid
8.3.1 Coupled Thermo-Electromagnetic Model of Converter Transformer
8.3.1.1 Finite Element Model for Magnetic Field
8.3.2 Electrothermal Modeling of MMC
8.3.3 Parallel Implementation of Integrated Field-Circuit Model
8.3.4 Case Study and Results
8.3.4.1 Case Description and Setup
8.3.4.2 External Network Simulation Results
8.3.4.3 Finite Element Simulation Results
8.4 Finite-Difference Relaxation-Based Parallel Computation of Ionized Field of HVDC Lines
8.4.1 Problem Description
8.4.1.1 Assumptions for Modeling
8.4.1.2 Governing Equations
8.4.1.3 Boundary Conditions
8.4.2 Predictor-Corrector Strategy
8.4.3 Finite-Difference Relaxation Methodology
8.4.3.1 Domain Discretization and FDR
8.4.3.2 Jacobi Method and Convergence Condition
8.4.3.3 Differentiated Grid Size
8.4.4 Massively Parallel Implementation
8.4.4.1 Data Dependency and Parallelism
8.4.4.2 Parallelization on CPU and GPU
8.4.5 Case Study and Result Comparison
8.4.5.1 Unipolar Case Study
8.4.5.2 Practical Bipolar Case Study
8.5 Space-Time-Parallel 3-D Finite Element Transformer Model for EMT Simulation
8.5.1 FEM Formulation for Eddy Current Analysis
8.5.1.1 Reduced Magnetic Potential Formulation
8.5.1.2 Finite Elements and Discretized Formulation
8.5.2 Space Parallelism: Refined Nonlinear FEM Solver
8.5.2.1 Linear Matrix-Free Preconditioned Conjugate Gradient Solver with Element-Level Parallelism
8.5.2.2 Isolating Nonlinearities with the Adaptive Transmission Line Decoupling
8.5.2.3 Adaptive Transmission Line Decoupling-Based Finite Element Formulation
8.5.3 Flux Extraction Coupling Finite Element Model with External Circuit
8.5.4 Time Parallelism: Parareal Algorithm for Field-Circuit Co-simulation
8.5.5 Case Studies
8.5.5.1 Static Case
8.5.5.2 Space-Time-Parallel Field-Circuit Co-simulation
8.6 Summary
A Parameters for Case Studies
A.1 Chapter 2
A.2 Chapter 3
A.2.1 Parameters for Case Study A
A.2.2 Parameters for Case Study B
A.3 Chapter 4
A.3.1 Parameters for Case Study in Sect. 4.2
A.3.2 Parameters for Case Study in Sect. 4.3
A.3.3 Parameters for Case Study in Sect. 4.4
A.3.4 Parameters for Case Study in Sect. 4.5
A.4 Chapter 5
A.4.1 Parameters for Case Study in Sect. 5.2
A.4.2 Parameters for Case Study in Sect. 5.3
A.4.3 Parameters for Case Study in Sect. 5.4
A.4.4 Parameters for Case Study in Sect. 5.5
A.5 Chapter 6
A.5.1 Parameters for Case Study in Sect. 6.2
A.5.2 Parameters for Case Study in Sect. 6.3
A.5.3 Parameters for Case Study in Sect. 6.4
A.6 Chapter 7
A.7 Chapter 8
A.7.1 Parameters for System in Fig. 8.2
A.7.2 Parameters for Case Study in Sect. 8.3.4
A.7.3 Parameters for Cases in Sect. 8.4.5
References
Index

Venkata Dinavahi • Ning Lin

Parallel Dynamic and Transient Simulation of Large-Scale Power Systems: A High Performance Computing Solution

Venkata Dinavahi, Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada

Ning Lin, Powertech Labs, Surrey, BC, Canada

ISBN 978-3-030-86781-2
ISBN 978-3-030-86782-9 (eBook)
https://doi.org/10.1007/978-3-030-86782-9

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

Devī Māhātmyam 12-10:

Oṁ sarvamaṅgala-māṅgalye śive sarvārtha-sādhike;
śaraṇye tryambake gauri Nārāyaṇi namo 'stu te.

Who is the auspiciousness in all the auspicious, auspiciousness herself, complete with all the auspicious attributes, who fulfils all desires, refuge of all, three-eyed, bright-featured, O Nārāyaṇi, salutations to you.

Preface

Electrical power systems are large and complex interconnected systems containing myriad equipment such as conventional and green energy sources, transmission lines, sophisticated power electronic technologies such as High Voltage Direct Current (HVDC) links and Flexible AC Transmission Systems (FACTS), and a variety of complex loads. Increasing energy demand by a plethora of sensitive loads, inclement weather conditions, stringent reliability constraints, and heightened environmental concerns that favour greater integration of renewables are creating conditions for a highly stressed power generation and transmission system. Under these circumstances there is a higher probability of relatively harmless events metastasizing into major outages leading to blackouts and massive societal disruption and hardship to the population. Accurate mathematical modeling and computer simulation play a crucial role in the planning, design, operation, and control of power systems as they are moved ever closer to their security limits. While the application of digital computer simulation spans the entire frequency spectrum of power system studies, two of the most important and computationally demanding studies related to large-scale power systems are the transient stability (TS) simulation and the electromagnetic transient (EMT) simulation.

Transient stability simulation is an integral part of the dynamic security assessment (DSA) program executed on the energy control center computers to ensure the stability of the power system. This analysis takes considerable time to complete for a realistic-size power system due to the computationally onerous time-domain solution of thousands of nonlinear differential algebraic equations. Assuming single-phase fundamental frequency behavior, these equations must be solved using a time-step that is a thousandth of a second. The TS simulation has to be carried out for each condition of a large set of credible contingencies on the grid to determine security limits and devise adequate control actions.

Electromagnetic transients are temporary phenomena, such as changes of voltages or currents in a short time slice caused by the excitation due to switching operations, faults, lightning strikes, and other disturbances in power systems. Although they are short and fast, transients significantly impact the operation, stability, reliability, and economics of the power system. For example, transients can damage equipment insulation, activate control or protective systems, and cause large-scale system interruption. Invariably, EMT simulation requires detailed modeling of system equipment over a wide frequency range, inclusion of nonlinear behavior, consideration of all phases, and a time-step that is a millionth of a second.

The graphics processing unit (GPU) was originally designed to accelerate the computation for image and video processing, which require a great amount of lighting and rendering calculations. Therefore, it has a large number of processing elements working similarly, unlike the central processing unit (CPU). Based on its many-core architecture, the GPU brought super-computing to the masses and increased the compute measure of gigaflops per dollar while simultaneously reducing the power requirements of the compute cluster, and applications for general-purpose computing on graphics processing units rapidly mushroomed.

For power system applications, historically most of the simulation programs for TS or EMT analysis were designed based on the idea of a computer system of the 1980s. They were single-thread programs which ran sequentially on computer systems equipped with single-core CPUs. Although the CPU clock speed steadily increased while the transistor size decreased through subsequent generations of CPUs to fuel a sustained increase in the speed of these programs from the 1990s to the 2000s, it has now saturated to curtail chip power dissipation and accommodate manufacturing constraints. The computer industry has therefore transitioned to multi-core and many-core chip architectures to improve overall processing performance, so that the performance of compute systems is no longer decided only by the clock frequency. However, this advance cannot be exploited by single-threaded power system simulation programs. This means that off-line power system simulators can hardly derive any benefit from the progress of computer chip innovations; indeed, they might even slow down considerably due to a lowered clock frequency or due to an increase in equipment modeling complexity or power system size.

This book provides practical methods to accelerate the TS and EMT simulations on many-core processors for large-scale AC-DC grids. Special emphasis is placed on detailed device-level models for power system equipment and parallel-in-space-and-time decomposition techniques for simulating large-scale systems. While the content of the book is focused on off-line simulation, the developed models and methods can also be deployed on modern GPUs for real-time applications. Simulation case studies provided in the book for realistic AC-DC grids reinforce the theoretical concepts.

With a comprehensive presentation, it is expected that this book will benefit two groups of readers. Graduate students in universities pursuing masters and doctoral degrees will find useful information that can motivate ideas for future research projects. Professional engineers and scientists in the industry will find a timely and valuable reference for solving potential problems in their design and development activities.

The book is organized as follows: Chap. 1 introduces many-core processors including their hardware architecture, programming abstractions, and parallel performance. Chapters 2 and 3 present the TS and EMT simulations, respectively, of large-scale AC grids. Fundamental modeling methods of power system equipment, discrete-time solution methods, decomposition techniques, and implementation issues on GPUs are discussed in these chapters. Chapter 4 addresses the topic of power electronic switch models ranging from simplified system-level models to detailed device-level physics-based electro-thermal models. The methods for decomposing these models, the nonlinear solution process, and their mapping to the many-core architecture of the GPUs are also discussed. Then, Chap. 5 covers in depth the EMT simulation of large-scale DC grids with detailed device-level models. The topic of hybrid EMT and TS simulation of large-scale AC-DC grids integrated with renewable energy sources is addressed in Chap. 6, which also includes variable time-stepping and adaptive heterogeneous compute paradigms. Chapter 7 is devoted to parallel-in-time-and-space EMT and TS simulations of AC-DC grids. Finally, Chap. 8 presents multi-physics modeling and simulation approaches wherein selected system components undergo a 2D or 3D finite-element simulation that is interfaced with the EMT simulation of the host network.

Edmonton, Alberta, Canada
June 2021

Venkata Dinavahi

Acknowledgements

We express our gratitude to the former graduate students and members of the Real-Time Experimental Laboratory (RTX-LAB) at the University of Alberta for allowing us to use material from their co-authored publications in this book: Dr. Vahid Jalili, Dr. Zhiyin Zhou, Shenhao Yan, Dr. Peng Liu, Tianshi Cheng, and Dr. Ruimin Zhu. We would also like to express our sincere thanks to the publishing team at Springer Nature, especially to Mr. Michael McCabe and Mr. Shabib Shaikh, and Ms. Aravajy Meenahkumary at Straive for their timely feedback and for facilitating an efficient publication process.



Acronyms

AC  Alternating Current
API  Application Programming Interface
ASP2  Adaptive Sequential Parallel Processing
ATP  Alternative Transients Program
AVC  Average Value Control
AVM  Average Value Model
AVR  Automatic Voltage Regulator
BC  Balancing Control
BDF  Backward Differentiation Formula
BJT  Bipolar Junction Transistor
CB  Control Block
CDSM  Clamp Double Sub-module
CMI  Compensation Method Interface
CN  Connecting Network
CS  Control Subsystem
CPU  Central Processing Unit
CUDA  Compute Unified Device Architecture
CAD  Computer Aided Design
DAE  Differential Algebraic Equation
DC  Direct Current
DCS  DC System
DDE  Delay Differential Equations
DDR4  Double Data Rate Fourth-generation
DEM  Detailed Equivalent Model
DFIG  Doubly-Fed Induction Generator
DOF  Degree of Freedom
DRAM  Dynamic Random Access Memory
DSA  Dynamic Security Assessment
DSP  Digital Signal Processing (Processor)
EbE  Element-by-Element
EM  Electromagnetic
EMS  Energy Management System
EMT  Electromagnetic Transient
EMTP  Electromagnetic Transient Program
FACTS  Flexible Alternating Current Transmission System
FBSM  Full-Bridge Sub-module
FBSLUM  Forward Backward Substitution with LU Factorization Module
FDLM  Frequency Dependent Line Model
FDR  Finite Difference Relaxation
FET  Field Effect Transistor
FFT  Fast Fourier Transform
FIFO  First-In First-Out
FPGA  Field Programmable Gate Array
FPU  Floating Point Unit
FTS  Fixed Time Stepping
FWD  Free Wheeling Diode
GC  Grid Connected
GF  Grid Forming
GFLOPS  Giga Floating Operations Per Second
G-J  Gauss-Jordan
GJ-IR  Gauss Jacobi—Instantaneous Relaxation
GPU  Graphics Processing Unit
GPGPU  General Purpose Computing on Graphics Processing Units
GS-IR  Gauss Seidel—Instantaneous Relaxation
GSVSC  Grid Side VSC
HBSM  Half-Bridge Sub-module
HHB  Hybrid HVDC Breaker
HPC  High Performance Computing
HV  High Voltage
HVDC  High Voltage Direct Current
I/O  Input/Output
IGBT  Insulated Gate Bipolar Transistor
IGBT-AE  IGBT Analog Equivalent
IGBT-DLE  IGBT Discretized Linearized Equivalent
ILU  Incomplete LU
IR  Instantaneous Relaxation
KCL  Kirchhoff's Current Law
kV  Kilo Volt
KVL  Kirchhoff's Voltage Law
LAN  Local Area Network
LCS  Load Commutation Switch
LB  Linear Block
LS  Linear Subsystem
LPE  Linear Passive Element
LPR  Line Protection
LTE  Local Truncation Error
MB  Main Breaker
MCPU  Multi-Core CPU
MFT  Medium Frequency Transformer
MKL  Math Kernel Library
MIMD  Multiple Input Multiple Data
MMC  Modular Multi-Level Converter
MOSFET  Metal Oxide Semiconductor Field Effect Transistor
MOV  Metal Oxide Varistor
MPI  Message Passing Interface
MTDC  Multi-Terminal DC
MPSoC  Multi Processor System-on-Chip
MV  Medium Voltage
MW  Megawatt
NBM  Nonlinear Behavior Model
NDD  Nodal Domain Decomposition
NE  Network Equivalent
NLB  Nonlinear Block
NLC  Nearest Level Control
NLD  Nonlinear Diode
NLE  Nonlinear Element
NLM  Nearest Level Modulation
NLS  Nonlinear Subsystem
NPC  Neutral-Point-Clamped
N-R  Newton-Raphson
NRI  Newton Raphson Interface
ODE  Ordinary Differential Equation
OS  Operating System
OWF  Offshore Wind Farm
PCC  Point of Common Coupling
PCIe  Peripheral Component Interconnect Express
PD  Phase-Disposition
PDE  Partial Differential Equation
PiS  Parallel-in-Space
PiT  Parallel-in-Time
PLL  Phase Locked Loop
PPPR  Potential Parallel Processing Region
PSC  Phase-Shifted Carrier
PSM  Programmable Switch Matrix
PSS  Power System Stabilizer
PV  Photovoltaic
PWLD  Piecewise Linear Diode
PWM  Pulse Width Modulation
RAM  Random Access Memory
RCB  Residual Current Breaker
RMVP  Reduced Matrix Vector Potential
RSVSC  Rotor Side VSC
SDK  Software Development Kit
SDS  Sub-domain Solver
SiC  Silicon Carbide
SIMD  Single Instruction Multiple Data
SIMT  Single Instruction Multiple Thread
SM  Streaming Multiprocessor or Sub-module
SPWM  Sinusoidal Pulse Width Modulation
SST  Solid State Transformer
TCFM  Transient Curve Fitting Model
TL  Transmission Line
TR  Topological Reconfiguration
TS  Transient Stability
TLM  Transmission Line Modeling
TSSM  Two State Switch Model
TWM  Traveling Wave Model
UFD  Ultrafast Disconnector
ULM  Universal Line Model
ULPEM  Unified Linear Passive Element Model
UMM  Universal Machine Model
USB  Universal Serial Bus
VCCS  Voltage Controlled Current Source
VDHN  Very Dishonest Newton
VSC  Voltage Source Converter
VTS  Variable Time Stepping
WR  Waveform Relaxation
WT  Wind Turbine

1 Many-Core Processors

1.1 Introduction

The earliest ancestors of dedicated graphics processors were originally designed to accelerate visualization tasks in research labs and flight simulators. Later they found their way into commercial workstations, personal computers, and entertainment consoles [1]. By the end of the 1990s, PC add-in graphics cards had been developed as fixed-function accelerators for graphical processing operations (such as geometry processing, rasterization, fragment processing, and frame buffer processing). Around this time, the term "graphics processing unit" arose to refer to this hardware used for graphics acceleration [2]. In the early 2000s, the GPU was a fixed-function accelerator developed to meet the need for fast graphics in the video game and animation industries [1, 3]. The demand to render more realistic and stylized images in these applications increased with time. The obstacle in the fixed-function GPU was its lack of generality to express complicated graphical operations such as shading and lighting that are imperative for producing high-quality visualizations. The answer to this problem was to replace the fixed-function operations with user-specified functions. Developers, therefore, focused on improving both the application programming interface (API) and the GPU hardware. The result of this evolution is a powerful programmable processor with enormous arithmetic capability which could be exploited not only for graphics applications but also for general-purpose computing (GPGPU) [4, 5]. Taking advantage of the GPU's massively parallel architecture, GPGPU applications quickly mushroomed to include intensive computations in diverse industries including healthcare and life sciences, financial services, media and entertainment, retail, cloud services, energy, robotics, transportation, and telecommunications, to name just a few. The impact of GPGPU in the field of artificial intelligence [6–8], both in machine learning and deep learning applications, has been particularly powerful.


This chapter provides a brief overview of the hardware architecture of the many-core GPU and its compute unified device architecture (CUDA) programming abstraction. In addition, the programming of multi-core CPUs is also discussed. The specific single instruction multiple data (SIMD) architecture of the GPU made it a successful accelerator/processor in applications where a large number of high-precision computations must be performed on a data-parallel structure of input elements, just as in graphics applications. A data-parallel application consists of large streams of data elements, in the form of matrices and vectors, on which identical compute instructions are executed. With a proper numerical algorithm that can exploit the massively parallel hardware architecture and efficient CUDA programming, the computational gains obtained can be truly astounding.

1.2 Single-Thread, Multi-Thread, and Massive-Thread Programming

In a traditional single-tasking operating system (OS), the CPU can execute the instructions of only one task at any point in time. The execution of code and access of data are serial and sequential since the program is single-threaded. The concept of multi-threading arose with multitasking operating systems, which appeared much earlier than multi-core hardware. Since a CPU core executes much faster than peripherals such as DRAM, hard drives, and I/O ports, multi-threading reduces the time wasted waiting for these low-speed peripherals. Although still running on a single-core CPU, the threads of a multi-threaded program are concurrent by sharing CPU time and switching context, as scheduled by the multitasking OS. Multi-threaded programs achieved truly parallel execution only after the advent of multi-core CPUs; conversely, a program can exploit the computing power of a multi-core CPU only if it supports multi-threading. Therefore, executing a single-threaded power system simulation program on a multi-core architecture is inefficient because the code is executed on a single core, one instruction after another in a homogeneous fashion, unable to exploit the full resources of the underlying hardware. The overall performance of the code can be severely degraded, especially when simulating large-scale systems with high data throughput requirements. A multi-threaded parallel code can provide a substantial gain in speed and throughput over a single-threaded code on a multi-core CPU. Even on single-core processor systems, multi-threading can add a palpable performance improvement in most applications. The implementation of multi-threading, however, is not straightforward because most problems are naturally coupled and sequential. Serial data structures and algorithms must be redesigned to accommodate the multi-threading pattern.

The idea of massive-threading is based on one of the most advantageous modern processor techniques: the many-core processor [9]. The original motivation of the many-core processor was to accelerate 3D graphics for digital graphics processing. It normally contains many cores organized as multiple streaming multiprocessors (SMs) with a massively parallel pipeline, rather than the few cores of a conventional CPU. With the development of fast semiconductor materials and manufacturing technology in the integrated circuit industry, more and more cores (up to thousands nowadays) are being integrated into one chip. However, unlike the cores in a common multi-core CPU, the GPU cores are lightweight processors without complicated thread control; thus the SIMD technique, invented for the vector supercomputers of the 1970s, is widely applied in GPUs for both graphics and general-purpose computing, and it effectively accelerates data-independent computations. Since the GPU was natively designed for graphics applications, its functions were initially difficult to use for general-purpose computing, requiring a computer engineer to have enough graphics-processing knowledge to translate an ordinary mathematical problem into a graphics problem. Several development platforms are provided to ease GPGPU development; they are briefly introduced as follows:

• CUDA™ [10] offers a C-like language to develop GPGPU applications on GPUs provided by NVIDIA®.
• OpenCL™ [11] is an open framework for developing GPGPU programs that can be executed across various platforms, supporting the GPUs of AMD®, Intel®, and NVIDIA®. Initiated by Apple® with the collaboration of a group of GPU providers, OpenCL is developed by the nonprofit organization Khronos Group™.
• DirectCompute® [12], a part of DirectX 11, is a set of APIs provided by Microsoft® that supports GPGPU on DirectX 11-capable GPUs under Windows®.

All the above development platforms abstract the hardware resources of the GPU and provide a relatively uniform programming interface, making it easier for software developers to enable massive threading in their parallel programs.
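To make the massive-threading model concrete, the following minimal CUDA sketch launches one lightweight thread per vector element, the kind of data-parallel workload described above. It is an illustrative example rather than code from the simulators developed later in this book; the kernel name, problem size, and use of unified memory are assumptions chosen only to keep the listing short.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread handles exactly one element: a data-parallel (SIMD-style) workload.
    __global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;                  // 1M elements (illustrative size)
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);           // unified memory keeps the sketch short
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int threadsPerBlock = 256;              // a multiple of the 32-thread warp size
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        vectorAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
        cudaDeviceSynchronize();                // wait for the kernel to finish

        printf("c[0] = %f\n", c[0]);            // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }

A single-threaded CPU version would perform the same 2^20 additions in one sequential loop; here the work is spread across thousands of 32-thread warps that the hardware schedules onto the streaming multiprocessors described in Sect. 1.4.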

1.3

Performance of Parallel Programs

In order to approximate the performance in parallel computation, there are two basic criteria giving the theoretical speedup in latency of the execution. One is Amdahl’s law, and the other is Gustafson-Barsis’ law, which addresses the problem in different scopes.

1.3.1

Amdahl’s Law

Amdahl’s law, based on the assumption of a fixed workload, gives the formula in (1.1) as follows: S=

1 (1 − P ) + P /N

(P ∈ [0, 1]),

(1.1)

4

1 Many-Core Processors 10 P=90% P=80% P=70% P=60% P=50% P=40% P=30% P=20% P=10%

9 8

Speedup

7 6 5 4 3 2 1

1

2

4

8

16

32

64

128

256

512

1024

Number of threads Fig. 1.1 Illustration of Amdahl’s law

where S is the execution speedup, P is the parallel portion of the whole computation, and N is the number of threads [13]. According to the above Eq. (1.1), the maximum performance that a parallel system can achieve is bound by the portion of parallel part in the whole task. No matter how many computing resources N are obtained, the max speedup S has a limitation respecting the parallel portion P . In the plotted Fig. 1.1, the speedup can never reach 2 when the portion rate is lower than 50%, even if the number of threads is given as many as 1024. Considering the unlimited resource, N → ∞,

P /N → 0,

(1.2)

the theoretical max performance is obtained as Smax =

1 1−P

(P ∈ [0, 1]),

(1.3)

which shows that the maximum speedup of a fixed size problem is decided by the parallelism of the problem itself instead of the hardware resources of parallel computing system. Therefore, increasing the parallel proportion is the key to optimize a fixed size problem in parallel computing.

1.4 NVIDIA® GPU Architecture

P=90% P=80% P=70% P=60% P=50% P=40% P=30% P=20% P=10%

1,000

800

Speedup

5

600 400 200 0 1

128

256

384

512

640

768

896

1,024

Number of threads Fig. 1.2 Illustration of Gustafson-Barsis’ law

1.3.2

Gustafson-Barsis’ Law

Different from Amdahl’s law, in Gustafson-Barsis’ law, the workload of problem is assumed to keep increasing to fulfill the existing computing resource, which is given as S = (1 − P ) + N ∗ P

(P ∈ [0, 1]),

(1.4)

where S is the execution speedup, P is the parallel portion of the whole computation, and N is the number of threads [14]. With the above formula (1.4), the overall performance of the parallel computing system will continue climbing, only if enough computing resources are offered, whereas the parallel proportion of the problem only influences the difficulty of reaching the higher speedup. For example, to obtain 200 times acceleration, it needs 256 threads when the portion rate is 80%; however, it requires 1000 threads when the portion rate is only 20% (Fig. 1.2). Therefore, the performance enhancement of a parallel system is unlimited when the computation acquires enough computing resources for an open problem.

1.4

NVIDIA® GPU Architecture

Between 2008 and 2020, NVIDIA® developed eight generations of successively powerful GPU (Tesla to Ampere) architectures. The block diagram of NVIDIA® GP104 (Pascal architecture) [15] is sketched in Fig. 1.3, which shows that it consists of multiple GPU processing clusters (GPCs), texture processing clusters (TPCs),

6

1 Many-Core Processors

PCI Express 3.0 Host Interface GigaThread Engine

GPC TPC PolyMorph Engine

SM

Raster Engine TPC PolyMorph Engine

SM

TPC

TPC

PolyMorph Engine

SM

PolyMorph Engine

SM

TPC PolyMorph Engine

SM

4 L2 Cache

Fig. 1.3 Hardware architecture of Pascal® GP104 GPU

and streaming multiprocessors (SMs). Specifically, the GP104 consists of four GPCs, five TPCs per GPC, and one SM per TPC. Thus, there are 20 SMs per GPU. On the other hand, the GV100 (Volta) [16] has a similar architecture; nevertheless, it features a larger device; for example, it contains six GPCs each having 14 SMs equally distributed in its seven TPCs and eight 512-bit memory controllers. As shown in Fig. 1.4a, inside an NVIDIA® GP104 streaming multiprocessor, there is 96 KB shared memory, up to 128 CUDA cores, and 256 KB of register file buffer. The SM can schedule warps of 32 threads to CUDA cores. Thus, with 20 highly parallel SMs, the GPU is equipped with a total of 2560 CUDA cores. As a comparison, the streaming multiprocessor of the GV100 GPU is illustrated in Fig. 1.4b. There are 64 FP32 cores, 64 INT32 cores, eight tensor cores, and four texture units. Therefore, with a total number of 80 streaming multiprocessors, the V100 GPU has 5120 FP32 cores and 5120 INT32 cores and a total of 6144 KB of L2 cache. The GPU CUDA compute capability is another concern as it unlocks the new hardware features of each GPU; for example, the GP104 and GV100 architectures have a CUDA compute capability of 6.1 and 7.0, respectively. The full comparison of features of the eight GPU generations is given in Table 1.1. It is evident that over a 12-year period, successively smaller manufacturing process utilization led to substantial increase in the billions of transistors accommodated on the chip, leading to higher core count and greater compute capacity measured in GFLOPS.

1.5 CUDA Abstraction

7

Instruction Cache

L1 Instruction Cache 4

Instruction Buffer

4

L0 Instruction Cache

Warp Scheduler

Warp Scheduler

Register File (16384×32 bit )

Register File (16384×32 bit )

Core

Core

Core

Core

LD/ST

SFU

FP64

INT

INT

FP32

Core

Core

Core

Core

LD/ST

SFU

FP64

INT

INT

FP32

FP32

FP64

INT

INT

FP32

FP32

FP64

INT

INT

FP32

FP32

FP64

INT

INT

FP32

FP32

FP64

INT

INT

FP32

FP32

FP64

INT

INT

FP32

FP32

INT

INT

FP32

FP32

Core

Core

Core

Core

LD/ST

SFU

Core

Core

Core

Core

LD/ST

SFU

Core

Core

Core

Core

LD/ST

SFU

Core

Core

Core

Core

LD/ST

SFU

Core

Core

Core

Core

LD/ST

SFU

FP64

Core

Core

Core

Core

LD/ST

SFU

LD /ST

LD /ST

LD /ST

LD /ST

LD /ST

FP32

LD /ST

Tensor Core

LD /ST

LD /ST

Texture/L1 Cache

Texture/L2 Data Cache

Shared Memory 96KB

128KB Shared Memory

(a)

(b)

Tensor Core

SFU

Fig. 1.4 Streaming multiprocessor architecture of (a) Pascal GP104 GPU and (b) Volta GV100 GPU

1.5

CUDA Abstraction

CUDA is a parallel computing software platform [17] introduced by NVIDIA to access the hardware resources of the GPU. It offers application programming interfaces, libraries, and compiler for software developers to use GPU for generalpurpose processing (GPGPU). With CUDA runtime, the GPU architectures are abstracted into CUDA specs, which decide how to map the parallel requests to hardware entities when the GPGPU program is being developed. It provides a unified interface for CUDA-supported GPU according to the compute capability version regardless of the different details of each generation device. Thus, the programmer needs to only focus on their algorithm design and need not concern themselves about the GPU hardware too much. Based on the CUDA attraction, the whole parallel computing platform including CUP and GPU is described as a host-device system, as shown in Fig. 1.5. When developing parallel application with CUDA, the programmer follows the standard of the CUDA capability given by the NVIDIA driver, such as version 6.1 listed in Table 1.2, which defines thread hierarchy and memory organization. Since each generation GPU hardware is bound with specific CUDA version, the configuration of parallel computing resource including threads and memory running on GPU is based on the CUDA capability version. Different from the heavyweight cores in multi-core CPU whose threads are almost independent workers, the threads in SIMT GPU are numerous but

8

1 Many-Core Processors

Table 1.1 Specifications of NVIDIA® GPU architectures Chip SMs CUDA Cores Base clock (MHz) Boost clock (MHz) GFLOPs (single precision) Memory (GB) Memory BW (GB/s) L2 cache size (KB) TDP (W) Transistors (billions) Die size (mm2 ) Process (nm) Compute Capability Launch Year

Tesla GT200 30 240

Fermi GF100 16 512

Kepler GK104 8 1536

Maxwell GM204 16 2048

Pascal GP104 20 2560

Volta GV100 80 5120

Turing TU104 40 2560

Ampere GA100 108 6912

1296

1300

745

899

810

1200

585

765



1300



1178

1063

1455

1590

1410

622

1030

4577

9560

5443

14,899

8100

19,500

4

6

4

8

8

16

16

40

102

144

160

160

192

900

320

1555

0.256

768

512

2048

2048

6144

4096

40,960

187.8 1.4

247 3.2

225 3.54

300 5.2

75 7.2

300 21.1

70 18.6

250 54.2

576

529

294

398

314

815

545

826

65

40

28

28

16

12

12

7

1.3

2.0

3.0

5.2

6.1

7.0

7.5

8.6

2008

2011

2012

2015

2016

2017

2018

2020

lightweight. Thus, the performance of GPU-based computation depends to a great extent on the workload distribution and resource utilization. The C-extended functions in CUDA, called kernel, run on GPU, device side, in parallel by different threads controlled by CPU, host side [17]. All threads are grouped into blocks and then grids. Each GPU device presents itself as a grid, in which there are up to 32 active blocks [15]. The threads in a block are grouped by warps. There are up to four active warps per block. Although a block maximally supports 1024 threads, only up to 32 threads in one warp can run simultaneously. The initial data in device are copied from host through PCIe bus, and the results also have to be transferred back to host via PCIe bus again, which causes serial delay. There are three major types of memory in CUDA abstraction: global memory, which is large and can be accessed by both host and device; shared memory, which

1.5 CUDA Abstraction

9

Gridn Execution Queue Grid1 Grid0 Execution Queue

Coren Core1 Core0

PCIe Bus

Multi-core CPU

Host Memory

Block0

Block1

Blockn

Warp

Warp

Warp

Thread0 Thread1 Thread2 Thread3 Thread4

Thread0 Thread1 Thread2 Thread3 Thread4

Thread0 Thread1 Thread2 Thread3 Thread4

Threadn

Threadn

Threadn

Shared Men / Cache

Shared Men / Cache

Shared Men / Cache

Hardware L2 Cache

Global Memory Fig. 1.5 CUDA abstraction of device (GPU) and host (CPU) Table 1.2 Specifications of CUDA compute capability 6.1 Device Total amount of global memory CUDA cores Total amount of constant memory Total amount of shared memory per block Total number of registers per block Warp size Maximum number of threads per block Max dimension size of a thread block (x,y,z) Max dimension size of a grid size (x,y,z) Concurrent copy and kernel execution

GeForce GTX 1080 (Pascal) 8192 MB 2560 64 KB 48 KB 64,000 32 1024 (1024, 1024, 64) (2 G, 64 K, 64 K) Yes, with two copy engines

is small, can be accessed by all threads in a block, and is even faster than global one; registers, which is limited, can only be accessed by each thread, and is the fastest one. Table 1.3 lists the typical bandwidth of major memory types in CUDA. Although the global memories have high bandwidth, the data exchange channel,

10

1 Many-Core Processors

Table 1.3 Memory bandwidth

Type Host to device Global memory Shared memory Cores

Bandwidth 6 GB/s 226 GB/s 2.6 TB/s 5 TB/s

Table 1.4 CUDA Libraries Library CUBLAS CUDART CUFFT CUSOLVER CUSPARSE

Description CUDA Basic Linear Algebra Subroutines CUDA RunTime CUDA Fast Fourier Transform library CUDA-based collection of dense and sparse direct solvers CUDA Sparse Matrix

PCIe bus, between host and device is slow; thus avoiding those transfers unless they are absolutely necessary is vital for computational efficiency. Besides that the compiler is extended to the industry-standard programming languages including C, C++, and Fortran for general programmers, CUDA platform offers the interfaces to other computing platforms, including OpenCL, DirectCompute OpenGL, and C++ AMP. In addition, CUDA is supported by various languages, such as Python, Perl, Java, Ruby, and MATLAB, as a third-part plug-in. CUDA toolkits also come with several libraries as listed in Table 1.4. Developers can choose some of libraries on-demand to simplify their programming. The parallel execution of a user algorithm on the GPU starts by deploying the execution configuration for dimensions of thread, block, and grid before the kernel is called. The configuration information can be retrieved inside the kernel by the built-in variables, including gridDim, blockIdx, blockDim, and threadIdx. There are different types of functions classified by type qualifiers in CUDA, such as __device__, __global__, and __host__. The __device__ qualifier declared function is • executed on the device, • callable from the device only. The __global__ qualifier declared function is • • • • •

executed on the device, callable from the host or device, returned void type, specified execution configuration, asynchronous execution.

1.5 CUDA Abstraction

11

The __host__ qualifier declared function is • executed on the host, • callable from the host only. According to the above criterion, a __global__ function cannot be __host__. Similarly, the variables are also classified by type qualifiers in CUDA, such as __device__, __constant__, and __shared__. The __device__ qualifier declared variable is • located in global memory on the device, • accessible from all the threads within the grid. The __constant__ qualifier declared function is • located in constant memory space on the device, • accessible from all the threads within the grid. The __shared__ qualifier declared function is • located in the shared-memory space of a thread block, • accessible from all the threads within the block. __device__ and __constant__ valuables have the lifetime of an application, while __shared__ valuable has the lifetime of the block.

1.5.1

Performance Tuning

As shown in Fig. 1.6, the first inner-level step characteristic (zoomed-in balloon) shows 32 threads working in parallel, while the second outer-level step characteristic shows four active warps in parallel to make up the total 128 executing threads. Therefore, lowering occupation in one block as well as raising some number of blocks with the same total number of threads is an optimal way to increase efficiency. In each block, there is 48 KB of shared memory which is roughly 10× faster and has 100× lower latency than non-cached global memory, whereas each thread has up to 255 registers running at the same speed as the cores. The overwhelming performance improvement was shown with avoiding and optimizing communication for parallel numerical linear algebra algorithms in various supercomputing platforms including GPU [18]. Making a full play of this critical resource can significantly increase the efficiency of computation, which requires the user to scale the problem perfectly and unroll the for loops in a particular way [19]. In addition to one task being accelerated by many threads, concurrent execution is also supported on GPU. As shown in Fig. 1.7, the typical nonconcurrent kernel is in three steps with one copy engine GPU:

12

1 Many-Core Processors

Time (µs)

3

1.04

1.02

2

1

1 0

128 256 384 512 640 768 896 1,024

0

16

32

48

64

80

96

112 128

Number of threads Fig. 1.6 CUDA compute performance related to the number of threads in one CUDA block

Non-concurrent Copy Engine

Host to Device (H-D)

Kernel Engine

Device to Host (D-H) Execute Steam (ES)

Sequential Concurrent Copy Engine

H-D1 H-D2 H-D3 H-D4

Kernel Engine

D-H1 D-H2 D-H3 D-H4 ES1

ES2

ES3

ES4

One Copy Engine Overlap Copy Engine Kernel Engine

H-D1 H-D2 H-D3 H-D4 D-H1 D-H2 D-H3 D-H4 ES1

ES2

ES3

ES4

Two Copy Engines Overlap Copy Engine1 H-D1 H-D2 H-D3 H-D4 Kernel Engine Copy Engine2

ES1

ES2

ES3

ES4

D-H1 D-H2 D-H3 D-H4

Fig. 1.7 Concurrent execution overlap for data transfer delay

• Copy data from host to device first by copy engine. • Execute in the default stream by kernel engine. • Copy results back to host from device by copy engine. The calculation can also be finished by multiple streams. In sequential concurrent, the performance is the same as nonconcurrent; however, different streams can run in overlap; thus, the calculation time can be completely hidden with one copy engine. Furthermore, the maximum performance can be reached using the hardware with two copy engines, where most of the time of data transfer is covered, and the device memory limitation is effectively relieved since the runtime device memory usage is

1.5 CUDA Abstraction

13

divided by multiple streams. According to CUDA compute capability specs, version 6.1 supports concurrent copy and kernel execution with two copy engines.

1.5.2

Heterogeneous Programming

In high-performance computing (HPC) applications that use multiple CPUs and GPUs, it is advantageous to have a common memory and address space. The traditional CUDA programming model assumed that the host and the device maintained their distinct memory spaces, and therefore, the global memory, constant memory, and texture memory spaces were made visible to the kernels through calls to the CUDA runtime. The newer unified memory [20] provides a managed coherent memory space that is accessible to all CPUs and GPUs and allows efficient heterogeneous programming by eliminating explicitly copying data back and forth between the hosts and devices. It also maximizes the data access speed to the processor that requires the most data. The unified managed memory allows all CUDA operations that were originally valid on a single device memory, and it is a feature that has been included since CUDA 6.0-enabling GPUs that have the Pascal architecture or newer.

1.5.3

Dynamic Parallelism

For the first few generations of CUDA-capable GPUs, only one level of parallelism was allowed: instances of parallel kernels were launched from the host CPU and executed on the device GPU. This meant that algorithms that had multiple nested loops, irregular loop structures, or other constructs had to be modified to conform to the mold of single-level parallelism for implementation. Starting with the Kepler architecture (CUDA compute capability 3.5 and higher), the concept of dynamic parallelism [20] was introduced wherein the running parent kernel could create and synchronize child kernels on the device itself. This reduced the need to transfer execution control and data between the host and device as the child kernel launch decisions were made at runtime by the threads executing on the device. Furthermore, this feature enabled significant increase in flexibility and transparency of CUDA programming by allowing generation of multiple levels of parallelism dynamically in response to data-driven workloads. Figure 1.8 shows a typical complete process of heterogeneous computing with CUDA dynamic parallelism. Each thread in the parent grid launched by Kernelx can invoke its child grids; however, with a proper control flow, the launch of child grids can be restricted to only those that require further parallel operations. For example, the pseudo for the demonstrated computing architecture can be generally written as __global__ void Kernelx (datatype *data){ device_data_manipulation(data); if (threadIdx.x == specified_thread){

14

1 Many-Core Processors

CPU (host) Start

Parent grid

GPU (device) Kernelx

Memcpy Kernelx Child grids

Kernelz Memcpy End

Kernela

Kernelb

Kernelc

Fig. 1.8 Computational architecture with dynamic parallelism

Kernela >(data);//child grid Kernelb >(data);//child grid Kernelc >(data);//child grid cudaDeviceSynchronize(); }; device_other_operation(data); } void main(){ datatype *data; data_manipulation(data); Kernelx >(data);//parent grid Kernely >(data); cudaDeviceSynchronize(); CPU_operation(data); }

1.6

Programming of Multi-core CPUs

In addition to GPU simulation, CPU simulation is also carried out in this book for speed comparison. The presence of a large number of repetitive components means

1.6 Programming of Multi-core CPUs

15

the CPU implementation would be extraordinarily inefficient if they proceeded in a sequential manner. Thus, the multi-core CPU is utilized to accelerate the simulation speed by distributing the tasks among those cores. The variables of identical circuit components are grouped as an array, so when the program is executed on a singlecore CPU, it takes the form of for loop, as taking the variable ao, for example, for (int i=0;i

max(ti , to ) (two copy engines) . (one copy engine) t i + to

Algorithm 3.2 Nonlinear side parallel solution repeat for each NLB do update RHS (3.137) update Jacobian matrix J (3.138) for each external bus node and internal node do solve for ν k and χ k ν (n+1) ← ν (n) k k + ν (n+1) (n) ← χ k + χ χk calculate currents ι (3.134) update ν c and ιc by interchange until | ν (n+1) − ν (n) |< and | f (n+1) − f (n) |<

(3.140)

122

3 Large-Scale Electromagnetic Transient Simulation of AC Systems texe

ti Stream1 Stream2

D1in

K11

K1s1

K12

D2in

K22

K21 D3in

Stream3

to

K2s2

K23

K31

Dnin

Streamn

D1out

K32

K n1

K n3

D2out K3s3

D3out

Kn4

Knsn

Dnout

Fig. 3.38 Concurrent execution with multiple streams CUDA Events

LBs

LB

GPUs

Linear Queue

Kernels

LB

Kernel

Kernel

Streams Kernel

Kernel

LBs

NLBs

Kernels NLB

NLB

Kernel

Kernel

GPUs

Nonlinear Queue Kernel

Kernel

Streams

Kernel

Kernel

NLBs

Kernel

Kernel

CUDA Events

Fig. 3.39 Mechanism of computational load balancing and event synchronization

The data transfer cost can be effectively covered only if texe > (ti + to ), which is scheduled delicately. In addition, the execution of streams can also be concurrent when the GPU hardware still has enough resources available, which increases the overall occupation of GPU since the kernels were designed with low occupancy.

3.3.4

Balance and Synchronization

Since the large-scale system has already been decomposed into LBs and NLBs of similar size relevantly small, the computing tasks can be assigned to each GPU evenly with a round-robin scheme with the task queues [138], if more than two GPUs are present on the simulation platform, as shown in Fig. 3.39. There are several criteria that are followed during the workload distribution: • Linear and nonlinear subsystems are processed in different groups of GPUs separately.

3.4 Simulation Case Studies

123

• All blocks belonging to one subsystem are assigned to the same GPU due to data interchange inside the subsystem. • Linear blocks with the same size can be grouped in multiple CUDA kernels and apportioned to different CUDA streams. • Nonlinear blocks with the same components can be grouped in multiple CUDA kernels and apportioned to different CUDA streams. • CUDA kernels inside the queue are synchronized by CUDA events.

3.4

Simulation Case Studies

In order to show the accuracy of transients and the acceleration for the proposed GPU-based parallel EMT simulator, two case study cases are utilized: • In the first case study, various transient behaviors are presented for a relatively small test system, and the simulation results are validated by the EMT software ATP and EMTP-RV . • In the second test case, the acceleration performance of GPU, whose execution times on various system scales are compared to those of EMTP-RV , is shown and analyzed by running the EMT simulation on the extended large-scale power systems. The parameters for the case studies are given in Appendix A.2.

3.4.1

Case Study A

The synchronous machine (SM), two transformers (T1 , T2 ), and the arrester (MOV) are nonlinear components in the test system, as shown in Fig. 3.40. The first switch (SW1 ) closes at 0.01 s, the ground fault happens at 0.15 s, and then the second switch (SW2 ) opens at 0.19 s to clear the fault. The total simulation time is 0.3 s with 20 μs time-step. All results of GPU-based EMT simulation are compared with those of EMTP-RV and ATP.

Bus1 SW1 G

T1

Bus2

Line1

Bus3

Line2

Bus4

Fig. 3.40 Single-line diagram for Case Study A

T2

SW2 Bus5

MOV

124

3 Large-Scale Electromagnetic Transient Simulation of AC Systems

The three-phase voltages at Bus2 are shown in Fig. 3.41, which are the output voltages of the step-up transformer, T1 ; the three-phase currents through Bus2 are shown in Fig. 3.42, which are the currents through the transformer, T1 ; the threephase voltages at Bus3 are shown in Fig. 3.43, which are the waveforms after transmission and the input of step-down transformer, T2 t; the power angle and electromagnetic torque waveforms of the synchronous machine, G, are shown in Fig. 3.44; and the active and reactive power of the case study A is shown in Fig. 3.45. When the switches activate and fault happens in the circuit of Case Study A, the power electromagnetic transients are clearly demonstrated by the proposed GPUbased parallel simulation in the waveforms of voltages, currents, power angle, electromagnetic torque, active power, and reactive power, which illustrate good agreement with the results from EMTP-RV and ATP. Although the different synchronous machine model (SM type 58) and transmission line model (line type JMarti) are used in ATP other than GPU-based parallel simulation and EMTP-RV , the results are nevertheless close enough to represent designed transient phenomena. Due to more sophisticated models applied, there are more details on the transient waveforms from the GPU simulation.

3.4.2

Case Study B

In order to show the acceleration of GPU-based EMT simulation, large-scale power systems are built, which are based on the IEEE 39-bus network as shown in Fig. 3.46. Considering the interconnection is a path of power grid growth, the largescale networks are obtained by duplicating the scale 1 system and interconnecting the resulting systems by transmission lines. As shown in Table 3.2, the test systems are extended up to 3 × 79,872 (239,616) buses. All networks are decomposed into LBs, NLBs, and CBs after fine-grained decomposition in the unified patterns. For instance, the 39-bus network is divided into 28 LBs, 21 NLBs, and 10 CBs, as shown in Fig. 3.47. The simulation is based on CPU, one-GPU, and twoGPU computational systems from 0 to 100 ms with 20 μs time-step, respectively, using double-precision and 64-bit operation system. All test cases are extended sufficiently long to suppress the deviation of the software timer, which starts after reading the circuit Netlist and parameters, including network decomposition, memory copy, component model calculation, linear/nonlinear solution, node voltage/current update, result output, and transmission delay. The scaled test networks are given in Table 3.2, including network size, bus number, and partition. The execution time for each network is listed in order of network size and categorized by the type of computing systems as well as the speedup referred to the performance on CPU. In the plotted Fig. 3.48 along the oneGPU speedup curve, the speedup increases slowly when network size is small (lower than four scales) since the GPU cannot be fed enough workload; for the network scale from 4 to 32, the acceleration climbs fast, showing the computational power of the GPU is released by fetching a greater amount of data; when the network size is more than 32, the performance approaches a constant since the computational

3.4 Simulation Case Studies

125

(a) Sim Bus2 Va Sim Bus2 Vb

20

Voltage (kV)

Sim Bus2 Vc

10 0

-10 -20 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(b)

EMTP-RV Bus2 Va EMTP-RV Bus2 Vb

Voltage (kV)

20

EMTP-RV Bus2 Vc

10 0

-10 -20 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(c) ATP Bus2 Va ATP Bus2 Vb

Voltage (kV)

20

ATP Bus2 Vc

10 0

-10 -20 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

Fig. 3.41 Three-phase voltages comparison at Bus2 of Case Study A. (a) Bus2 voltages from GPU-based simulation. (b) Bus2 voltages from EMTP-RV . (c) Bus2 voltages from ATP

126

3 Large-Scale Electromagnetic Transient Simulation of AC Systems

(a) Sim Bus2 Ia

100

Sim Bus2 Ib Sim Bus2 Ic

Current (A)

50

0

-50

-100 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(b) EMTP-RV Bus2 Ia

100

EMTP-RV Bus2 Ib EMTP-RV Bus2 Ic

Current (A)

50

0

-50

-100 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(c)

ATP Bus2 Ia

100

ATP Bus2 Ib ATP Bus2 Ic

Current (A)

50

0

-50

-100 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

Fig. 3.42 Three-phase currents comparison through Bus2 of Case Study A. (a) Bus2 currents from GPU-based simulation. (b) Bus2 currents from EMTP-RV . (c) Bus2 currents from ATP

3.4 Simulation Case Studies

127

(a) Sim Bus3 Va Sim Bus3 Vb

20

Voltage (kV)

Sim Bus3 Vc

10 0

-10 -20 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(b) EMTP-RV Bus3 Va EMTP-RV Bus3 Vb

Voltage (kV)

20

EMTP-RV Bus3 Vc

10 0

-10 -20 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(c)

ATP Bus3 Va ATP Bus3 Vb

Voltage (kV)

20

ATP Bus3 Vc

10 0

-10 -20 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

Fig. 3.43 Three-phase voltages comparison at Bus3 of Case Study A. (a) Bus3 voltages from GPU-based simulation. (b) Bus3 voltages from EMTP-RV . (c) Bus3 voltages from ATP

128

3 Large-Scale Electromagnetic Transient Simulation of AC Systems

(a) Sim Angle Angle Sim Sim Torque

4

60

3

40

2

20

1

0

0

-20

Torque (kNm)

Angle (10-3 rad)

80

-1 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(b)

EMTP-RV Angle Angle EMTP-RV EMTP-RV Torque

4

60

3

40

2

20

1

0

0

-20

Torque (kNm)

Angle (10-3 rad)

80

-1 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(c) ATP Angle Angle ATP ATP Torque

4

60

3

40

2

20

1

0

0

-20

Torque (kNm)

Angle (10-3 rad)

80

-1 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

Fig. 3.44 Synchronous machine angle and torque of Case Study A. (a) Angle and torque from GPU-based simulation. (b) Angle and torque from EMTP-RV . (c) Angle and torque from ATP

3.4 Simulation Case Studies

129

(a) Sim P

1

Sim Q

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

-0.2

Reactive power (MVAR)

Active power (MW)

1

-0.2 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(b) EMTP-RV P

1

EMTP-RV Q

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

-0.2

Reactive power (MVAR)

Active power (MW)

1

-0.2 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

(c) ATP P

1.2

ATP Q

1

1

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

-0.2

Reactive power (MVAR)

Active power (MW)

1.2

-0.2 0

0.05

0.1

0.15

0.2

0.25

0.3

Simulation time (s)

Fig. 3.45 Active power and reactive power of Case Study A. (a) P and Q from GPU-based simulation. (b) P and Q from EMTP-RV . (c) P and Q from ATP

130

3 Large-Scale Electromagnetic Transient Simulation of AC Systems 1

16

8

4

2

2048

32

64

Fig. 3.46 System scale extension for Case Study B First-level decomposition 39-bus network

Second-level decomposition 17 LSs

28 LBs

11 NLSs

21 NLBs

10 CSs

10 CBs

Fig. 3.47 Fine-grained decomposition for the 39-bus system

capability of GPU closes to saturation. In the case of two-GPU system, the trend of speedup increase is similar to the one-GPU case except that the saturation point is put off because of the doubled computational capability. Due to the nonlinear relationship of the execution time to the system scale, the bar diagrams of execution times for various system scales are zoomed using a logarithmic axis to obtain a detailed view. Additionally, it can be noticed that the performance of CPU is also enhanced by the proposed decomposition method since the divided circuit blocks

3.5 Summary

131

Table 3.2 Comparison of execution time for various networks among CPU, single GPU, and multi-GPU for simulation duration 100 ms with time-step 20 μs Scale 1 2 4 8 16 32 64 128 256 512 1024 2048

3φ buses 39 78 156 312 624 1248 2496 4992 9984 19,968 39,936 79,872

Blocks LBs 28 56 112 224 448 896 1792 3584 7168 14,336 28,672 57,344

NLBs 21 42 84 168 336 672 1344 2688 5376 10,752 21,504 43,008

Execution time (s) CBs EMTP-RV CPU 1-GPU 2-GPU 10 1.24 1.38 1.16 0.93 20 2.71 2.87 1.52 1.36 40 5.54 5.53 2.03 1.77 80 11.94 11.22 2.96 2.37 160 26.50 23.12 4.63 3.35 320 60.65 48.16 7.98 5.19 640 142.31 100.11 14.42 8.72 1280 323.26 210.76 28.36 15.83 2560 705.92 439.26 55.47 30.23 5120 1513.35 892.45 109.29 57.78 10,240 3314.24 1863.66 225.17 116.86 20,480 7033.61 3796.37 454.48 234.37

Speedup CPU 1-GPU 0.89 1.06 0.94 1.78 1.00 2.73 1.06 4.04 1.15 5.73 1.26 7.60 1.42 9.87 1.53 11.40 1.61 12.73 1.70 13.85 1.78 14.72 1.85 15.48

2-GPU 1.33 1.99 3.14 5.03 7.91 11.68 16.31 20.42 23.35 26.19 28.36 30.01

simplify the sparse data structure to a dense one along with the increasing system scale; thus the systems can be solved by dense solver and avoid the extra cost of dealing with the sparse structure, such as nonzero elements analysis, which is involved in every solution. In that case, the computation load is almost linearly related to the system scale comparing with the nonlinear traditional sparse solver. Owing to the shattering network decomposition, the computation load can be well distributed to each compute device so that the overall performance of a computing system is decided by the number of processors, following Gustafson– Barsis’s law [14]. The average execution time for one time-step is listed in Table 3.3 due to the different convergent speed of each time-step. When the network scale is up to 211 (2048), which is close to the memory limitation of the compute system, the two-GPU system doubles the performance of the one-GPU system and attains 30 times faster than EMTP-RV .

3.5

Summary

Electromagnetic transient simulation of large-scale nonlinear power grids is computationally demanding, and it is therefore imperative to accelerate the simulation which is used to conduct a wide variety of studies by electric utilities. The models of typical components in power system were described in this chapter, which are the unified linear passive element model for linear loads, the frequencydependent universal line model for transmission lines, the universal machine model for synchronous machines, and the lumped model for transformer with magnetic saturation. The numerical solvers included LU factorization with forward and backward substitutions for solving linear systems and the Newton–Raphson method

3 Large-Scale Electromagnetic Transient Simulation of AC Systems

EMTP-RVtime time EMTP-RV

30

CPU timetime CPU 1-GPU timetime 1-GPU

6000

25

Execution time (s)

2-GPU time

2-GPU time

CPU speedup

20

1-GPU speedup

4000

2-GPU speedup

15

Speedup

132

10

2000

5 0

0 20

21

22

23

25

24

26

27

28

29

210

211

EMTP-RV time

102

EMTP-RV time

CPU time

CPU time

1-GPU time

1-GPU time

2-GPU time

2-GPU time

104 103

1

10

102

100

101 2

0

1

2

2

2

3

2

4

2

2

5

6

2

Network scale

7

2

8

2

9

2

2

10

Zoomed execution time (s)

Zoomed execution time (s)

Network scale

11

2

Network scale

Fig. 3.48 Execution time and speedup for varying scales of test networks on CPU-, one-GPUand two-GPU-based programs compared to EMTP-RV Table 3.3 Average execution time (ms) for one time-step Scale EMTP-RV CPU One GPU Two GPUs

20 0.25 0.28 0.23 0.19

21 0.54 0.57 0.30 0.27

22 1.11 1.11 0.41 0.35

23 2.39 2.24 0.59 0.47

24 25 26 27 28 29 210 211 5.30 12.13 28.46 64.65 141.18 302.67 662.85 1406.72 4.62 9.63 20.02 42.15 87.85 178.49 372.73 759.27 0.93 1.60 2.88 5.67 11.09 21.86 45.03 90.90 0.67 1.04 1.74 3.17 6.05 11.56 23.37 46.87

for nonlinear components. From their theoretical formulas, the massively parallel modules and solution algorithms were elaborated. For large-scale massively parallel EMT simulation, the multilevel shattering network decomposition is introduced, of which the first level is coarse-grained based on propagation delay and the second level is fine-grained for linear and nonlinear solution, respectively. The purpose of the decomposition is to divide the

3.5 Summary

133

large system into subsystems that are relatively small, which can optimally fit to the SIMT execution features and release the computational power of GPU as possible. At the fine-grained level, the compensation network decomposition is proposed for linear subsystem, which partitions the linear subsystem into linear blocks and connection network. The solution of linear subsystem is obtained by solving linear blocks in parallel and compensating the results with the solution of connection network. The Jacobian domain decomposition decoupled the Jacobian matrix to accelerate the solution of Newton equations, which parallelized the calculation and improved the convergence. All the component models employed are detailed, and the solution is fully iterative. Along with the increasing scale, fully decomposed simulation can be easily deployed on to multi-GPU computing system to implement the massively parallel algorithms so that the maximum performance of simulation can be obtained according Gustafson–Barsis’s law. The accuracy of the proposed massively parallel EMT simulation was verified with mainstream EMT simulation tools, and significant computational performance improvement was shown by the comparisons between CPU-based and GPU-based EMT simulators. The massively parallel implementation can be extended to include other components in AC and DC power systems as shown in the following chapters.

4

Device-Level Modeling and Transient Simulation of Power Electronic Switches

4.1

Introduction

Continuous integration of power electronics systems into the AC grid has lead to increasingly complex modern power systems. The power converter becomes essential to fundamental sectors of the electrical power system, i.e., generation, transmission, and distribution, in an effort to diversifying power supply, improving system reliability and stability, reducing the cost, etc. While the role that a power converter plays is dependent on its controller, the performance is largely affected by its circuit mainly composed of power semiconductor switches. For instance, the capacity of a power switch, typically the insulated gate bipolar transistor (IGBT) for grid-connected applications, determines the active and reactive power that a converter is able to contribute to the surrounding system. Therefore, the inclusion of various power semiconductor switch models in the library of an EMT simulation tool essentially expands its scope of system study. The power semiconductor switch model determines the extent of power converter information that can be revealed by simulation. A power converter model generally falls within one of the following categories, where each of them has a number of variants: 1. Average value model (AVM) 2. Ideal switch model 3. Device-level model The complete classification of switch models into seven types is provided in [139]. The AVM can be deemed as a system-level model for taking the overall converter as a modeling objective, and since the individuality of a power semiconductor switch is ignored, switching-related phenomena such as switching harmonics are unavailable. An ideal switch model distinguishes the turn-on and turn-off behaviors but is only limited to the states of being ON and OFF. A device-level model imitates © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 V. Dinavahi, N. Lin, Parallel Dynamic and Transient Simulation of Large-Scale Power Systems, https://doi.org/10.1007/978-3-030-86782-9_4

135

136

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

the behaviors of a real power semiconductor switch, including the turn-on and turnoff switching transients and the ON-state V –I characteristics. Among all types of models, the two-state switch model (TSSM) with fixed onand off-state resistances is prevalent in current mainstream EMT-type simulation tools of electric power systems such as PSCAD™ /EMTDC™ . Nevertheless, it is deemed as an ideal switch model which falls short of revealing transient behavior of a switch, as well as the effects associated with it; e.g., the inaccessibility of current overshoots during the switching process could lead to an underestimation of its actual stress and subsequently the inappropriate selection of a device type with inadequate capacity. To reveal the electromagnetic transient environment of the electric power system integrated with power electronic apparatus for purposes such as accurate system study and prototype design guidance, a fully detailed model resembling a real power converter is always preferred. The nonlinear device-level switch models are prevalent in professional power electronics simulators such as Pspice® , Multisim® , and SaberRD® , which offer comprehensive information of a converter. It is also imperative to include this type of model in EMT-type solvers for large-scale systems to gain adequate insight. Aiming at introducing how power semiconductor switch models are engaged in EMT simulation, this chapter focuses on three typical high-order device-level IGBT models that can be included in the libraries of electric power system simulators. Meanwhile, an occasionally useful predefined model is also categorized. The detailed circuit model is described first, followed by the approach that realizes its EMT simulation and CUDA kernel design. The universal equivalent circuit method which improves the computation efficiency is also provided in the end.

4.2

Nonlinear Behavioral Model

Nonlinear behavioral modeling is a technique that relies on the properties of various basic electrical circuit components to imitate the performance of a power semiconductor switch at the macro level. This fundamental feature determines that the nonlinear behavioral model (NBM) has many variants since an electrical phenomenon can be described by different combinations of circuit elements. The IGBT and its antiparallel diode can be deemed as two separate parts during their respective modeling and then be merged into one equivalent circuit following discretization and linearization.

4.2.1

Diode Behavioral Model

4.2.1.1 Basic Model Description The unidirectional conduction feature is the minimum requirement of a diode nonlinear behavioral model, and the two-node TSSM is the simplest form; it conducts when the semiconductor P –N junction is forwardly biased, i.e., its voltage

4.2 Nonlinear Behavioral Model

A

id (t)

id (t)

Ijeq

Static model

NLD

-

A

Static model

+ vj

137

Gj 2

+ vL L

ir

ir=K*vL

RL

GL

Reverse recovery

K

ILeq

3

GRL

K

(a)

1

(b)

Reverse recovery

Fig. 4.1 Diode behavioral modeling: (a) nonlinear behavioral model and (b) time-domain equivalent circuit

Vj is above the threshold Vf d . Represented by the symbol NLD in Fig. 4.1a, its EMT model can be described by following equations: ! Gj =

GON , GOF F ,

(Vj ≥ Vf d ) , (Vj < Vf d )

Ij eq = −Gj · Vf d ,

(4.1) (4.2)

where Ij eq is the companion current, as given in Fig. 4.1b. Thus, during the establishment of the admittance matrix, the diode is represented by conductance GON and GOF F with a typical value of 1000 S and 1 × 10−6 S, respectively. A substitution is based on its static I –V characteristics that reveal an exponential relationship between the junction voltage and static current Id , as  Vj  Vb Id = Is · e − 1 ,

(4.3)

where Is is the leakage current and Vb the junction barrier potential. As the above equation indicates an implicit conductance that is not readily applicable to EMT simulation, time-domain discretization and linearization should be carried out to yield a Norton equivalent circuit. The admittance of the diode is derived by partial derivative, and the companion current induced by the nonlinearity can also be subsequently calculated,

138

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

∂Id Is VVj = e b, ∂Vj Vb

(4.4)

Ij eq = Id − Gj · Vj .

(4.5)

Gj =

To further increase the diode model fidelity, more circuit components can be added to simulate other phenomena. The reverse recovery process is the most prominent dynamic feature of a power diode, and its waveform can be approximated by the parallel RL –L pair. Then, the node number increases to three. The resistor RL has a much smaller value than the equivalent resistance of NLD so that it does not incur any significant impact on the static characteristics. Once the diode becomes reversely biased, NLD exhibits high impedance, and the inductor current begins to decay through RL and so does the terminal voltage vL following a polarity reverse. As this internally circulating current does not flow to external circuits, a voltagecontrolled current source (VCCS) with a coefficient K is added between the anode (A) and cathode (K). The reverse recovery parameters RL , L, and K follow the equations below [140]: RL =

Irrm K= L



dt dIr



ln10 · L  , dt trr − Irrm dI r

⎤⎞(−1) −I − I Fo rrm ⎦⎠ ⎝1 − exp ⎣   , dIr L dt K + R1L ⎛

(4.6)



(4.7)

r where Irrm represents the peak of reverse recovery current, dI dt is the current slope at turn-off, and IF o is the instantaneous on-state current before the diode starts to turn off. The resistor RL can participate in admittance matrix formation simply by its reciprocal GRL , and the VCCS is allocated to the current vector. The inductor whose i–v characteristics are described by a differential equation, on the other hand, requires discretization. The first-order Backward Euler method is sufficient since a small time-step, denoted as t, is guaranteed in device-level simulation. Thus, its EMT model can be expressed as

GL =

L , t

ILeq (t) = iL (t − t),

(4.8) (4.9)

which describes the Norton equivalent circuit comprising conductance GL and history current ILeq . An iterative process exists between the inductor history current and its actual current during simulation, and according to Kirchhoff’s current law

4.2 Nonlinear Behavioral Model

139

(KCL), the inductor actual current iL (t) can be derived by iL (t) = ILeq (t) + GL · vL (t).

(4.10)

The availability of all conductance and companion currents enables the solution of the diode’s three nodal voltages organized as a vector, i.e., v = G−1 · Ieq ,

(4.11)

where the 3 × 3 admittance matrix, arranged from anode to cathode, takes the form of [141] ⎡

⎤ Gj K − Gj −K G = ⎣ −Gj Gj + GL + GRL −GL − GRL ⎦ , 0 −GL − GRL − K GL + GRL + K

(4.12)

and the corresponding companion current vector is T  Ieq = −Ij eq , Ij eq − ILeq , ILeq .

(4.13)

It is noted that the above admittance matrix and companion current vector are normally subject to an expansion of dimension when the diode is connected to other circuit components in the simulation, and consequently, the voltage vector contains more elements.

4.2.1.2 Parallel Massive-Thread Mapping As shown in Fig. 4.2 and described in Algorithm 4.1, there are two kernels in the massive-thread parallel nonlinear behavioral diode model. The static characteristics and reverse recovery are proceeded sequentially as Kernel0 to prepare for the matrix equation (4.11). The voltages vL and Vj are computed based on the v(t) from the last Newton-Raphson iteration or time-step, and ILeq is updated as iL at the previous calculation. The nonlinear system is then solved in nonlinear solution units (NSs) of Kernel1 . The convergence of vector v(t) as an aggregation of nodal voltages of all diodes is checked to determine whether the process should move to the next time-step or iteration.

4.2.2

IGBT Behavioral Model

Like the diode, typical IGBT behaviors including the function of a fully controllable switch and the tail current can be simulated by circuit components. As shown in Fig. 4.3a, the diode PWLD represents unidirectional conduction of the IGBT; i.e., the current iC only flows from the collector (C) to emitter (E). It can be taken as a two-state conductor with a forward threshold voltage Vf d , as given below:

140

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Node voltage v(t)

ILeq ILeq(t-∆t)

Global Memory

Kernel0 Static & Re verse Rec overy

SR1 SR2

Kernel1

Nonlinear solution

NS1

Yes

v(t)

NS2

Ieq(t)

SRn

...

vK(t)

G(t)

...

vA(t)

iL(t)

v(t) converge?

No

NSn

Fig. 4.2 Multi-thread parallel implementation of nonlinear behavioral diode model

Algorithm 4.1 Diode kernel procedure DIODE BEHAVIORAL MODEL Static characteristics: Calculate Vj (t) Compute Gj and Ij eq by (4.4) and (4.5) Reverse Recovery: Calculate GL , vL (t) Calculate ILeq (t) by (4.9) Update iL (t) in (4.10)

 Kernel0

Complete module: Solve matrix equation (4.11) Update v(t) Store ILeq (t) into global memory if v(t) not converged then go to Kernel0 else Store v(t) into global memory

! Gpwld =

GON , GOF F ,

 Kernel1

(Vpwld ≥ Vf d ) , (Vpwld < Vf d )

Ipwldeq = −Gpwld · Vf d .

(4.14) (4.15)

The MOSFET behavior represented by the current source imos is the pivotal part of the nonlinear IGBT model. Together with capacitors Cge and Ccg and the inherent gate resistor rg , it reflects static and dynamic performance other than the tail current by the following three sections corresponding to the off-state, the on-state, and the transient state, respectively [142]:

4.2 Nonlinear Behavioral Model

141

vCE (t)

vCE (t)

iC (t) Collector (C)

Emitter (E)

itail

Gate (G)

ICceeq GCce 2

itail Inode1 (n1)

iC (t)

iC (t)

Cce

C

Inode2 (n2)

PWLD

imos

Ipwldeq

3

Gpwld

Rtail

Rg

(a)

5

E

GRtail

E ICcgeq

Inode3 (n3)

Ccg

GCtail

imos

1

Ctail

C

ICtaileq

ICgeeq

4

GCcg

Cge

GCge 1/Rg

G

(b)

G

Fig. 4.3 IGBT behavioral modeling: (a) nonlinear behavioral model and (b) distinct view of discretized equivalent circuit

imos

⎧ (vCge < Vt )||(vd ≤ 0) ⎪ ⎨ 0, 1 (z+1) (z+2) − b2 · vd , vd < (y · (vCge − Vt )) x , = a2 · vd ⎪ ⎩ (vCge −Vt )2 (others) a1 +b1 ·(vCge −Vt ) ,

(4.16)

where a1 , b1 , a2 , b2 , x, y, and z are coefficients and Vt is the channel threshold voltage. As can be seen, the current imos is jointly determined by its own terminal voltages vd and vCge and the voltage Cge . Hence, discretization of the component yields conductance Gmosvd and transconductance Gmosvcge derived by taking partial derivatives with respect to vd and vCge , respectively:

Gmosvd

⎧ ⎪ ⎨ 0, = a2 (z + 1)vdz − b2 (z + 2)vd(z+1) , ⎪ ⎩ 0,

Gmosvcge =

⎧ ⎪ ⎪ 0, ⎨ ⎪ ⎪ ⎩

(vCge < Vt )||(vd ≤ 0) 1 vd < (y · (vCge − Vt )) x , (others) (4.17) (vCge < Vt )||(vd ≤ 0) 1 vd < (y · (vCge − Vt )) x .

(z+1) ∂a2 2 − ∂v∂bCge · vd(z+2) , ∂vCge · vd 2·(vCge −Vt ) b1 ·(vCge −Vt )2 a1 +b1 ·(vCge −Vt ) − (a1 +b1 ·(vCge −Vt ))2 ,

(others) (4.18)

142

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

The corresponding companion current Imoseq can then be formulated as [143] Imoseq = imos − Gmosvd · vd − Gmosvcge · vCge .

(4.19)

While a change of current direction in the diode reverse recovery process is realized by a parallel RL –L pair, the IGBT tail current which emerges during turn-off does not have a direction change and, therefore, can be reproduced by the Rtail –Ctail pair, with the VCCS expressed below:  itail =

0, 

vtail Rtail



− imos · irat ,

vtail Rtail vtail Rtail

≤ imos > imos

,

(4.20)

where vtail denotes the voltage of Rtail –Ctail pair and irat is a coefficient. The tail current is dependent on three voltages, i.e., vtail , vd , and vcge , the last two of which are related to imos . Hence, discretization of the component yields transconductance Gtailvd , Gtailvcge , and Gtailvtail :  Gtailvd =  Gtailvcge =  Gtailvtail =

0, − Gmosvd irat ,

vtail Rtail ≤ imos vtail Rtail > imos

,

(4.21)

0,

vtail Rtail ≤ imos vtail Rtail > imos

,

(4.22)

vtail Rtail ≤ imos vtail Rtail > imos

.

(4.23)



Gmosvcge irat ,

0, 1 Rtail ·irat ,

Similarly, the companion current of itail is Itaileq = itail − Gtailvd · vd − Gtailvcge · vCge − Gtailvtail · vtail .

(4.24)

In the IGBT behavioral model, Cge and Ctail have a fixed capacitance, and therefore, their Norton equivalent circuits, i.e., GCge –ICgeeq and GCtail –ICtaileq , can be obtained using the following general equations based on backward Euler method, GC =

C , t

(4.25)

iC = GC · [vC (t) − vC (t − t)],

(4.26)

ICeq = −GC · vC (t − t).

(4.27)

4.2 Nonlinear Behavioral Model

143

The Norton equivalent circuits of nonlinear capacitors Ccg and Cce can be obtained in the same manner, take the former one for example,

GCcg =

⎧    Ccg −M ⎨ ccgo· 1+ vvcgo ⎩ ccgo t

ICcgeq =

t

,

,

vCcg > 0 , vCcg ≤ 0

(4.28)

qCcg (t) − qCcg (t − t) − GCcg · vCcg (t), t

(4.29)

where M is the Miller capacitance exponent coefficient with a default value of 0.5. A distinct view of the discretized equivalent circuit of the nonlinear behavioral IGBT model is shown in Fig. 4.3b. With a node sequence given in the figure, a 5 × 5 admittance matrix and its five-element companion current vector can be established based on the relations among various nodes, as given in (4.30) and (4.31). Then, the nodal voltage can be solved using (4.11). ⎡

Gpwld

−Gpwld

0

0

0



⎥ ⎢ ⎥ ⎢ ⎥ ⎢ −G Gtailvtail − Gmosvcge + −Gmosvcge − ⎥ ⎢ pwld GCce + GCcg + ⎥ ⎢ ⎢ Gmosvd + Gtailvd Gmosvd − Gtailvd Gtailvcge Gtailvtail − GCce ⎥ ⎥ ⎢ ⎥ ⎢ +Gpwld −GCcg −Gtailvcge ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 0 −Gmosvd Gmosvd + −Gmosvcge Gmosvcge − ⎥ ⎢ ⎥ ⎢ ⎥. Gtail − GCtail Gtail + GCtail G=⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 0 −GCcg 0 GCge + GCcg −GCge − ⎥ ⎢ ⎥ ⎢ 1 1 + Rg ⎥ ⎢ R g ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ 0 −GCce − Gtailvd − −GCge − GCge + GCce + Gtail ⎥ ⎢ ⎥ ⎢ ⎣ Gtailvd Gtail − GCtail Gtailvcge +Gtailvtail + Gtailvcge ⎦ 1 1 −Gtailvtail − Rg +GCtail + Rg (4.30) ⎡ ⎢ Ieq = ⎢ ⎣

−Ipwldeq ,

⎤T Ipwldeq − Imoseq Imoseq ICcgeq − ICgeeq Itaileq + ICtaileq ⎥ V −ICceeq − ICcgeq −ICtaileq , + Rgg , +ICgeeq + ICceeq ⎥ . ⎦ V −Itaileq , − Rgg (4.31)

4.2.2.1 Parallel Massive-Thread Mapping The nonlinear behavioral IGBT model contains several parts, which can be written into a corresponding number of kernels or taken as two kernels for massivethread parallel processing to reduce global memory exchange frequency, as shown in Fig. 4.4 and Algorithm 4.2. Therefore, the first kernel Kernel0 contains the

144

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Node Voltage v(t), Gate Voltage vg (t)

qce, qcg, vtail, vcge

Global Memory

Kernel0

vg (t)

Nonlinear Solution

IGBT1

NS1

G(t)

IGBT2

Ieq(t)

IGBTn

NS2

...

...

v(t)

Kernel1

IGBT Model

Yes

v(t)

v(t) converge? No

NSn

Fig. 4.4 Multi-thread parallel implementation of nonlinear behavioral IGBT model

entire IGBT model to prepare for the admittance matrix (4.30) and companion current vector (4.31). Their processes are largely the same, i.e., first calculating the terminal voltage based on the voltage vector of the previous solution or initialized at the beginning of the simulation, followed by computation of conductance and companion current. For nonlinear current sources and capacitors, their actual currents are used in the update of companion currents, which can be immediately obtained in PWLD and linear capacitors. Then, the IGBT is solved by multiple NS threads invoked by Kernel1 . Like in the diode, convergence check on vector v(t) is carried out to determine whether the process should move to the next time-step or continue with another iteration.

4.2.3

Electrothermal Network

The instantaneous power loss of an IGBT—including its freewheeling diode—due to the static conduction and switching process is the source of heat that dissipates through the IGBT module case and raises its junction temperature that consequently affects the device performance. Calculated as PI GBT = vCE (t) · iC (t)

(4.32)

in continuous time domain, the power loss equation needs to be discretized for EMT simulation purposes. For an arbitrary period between T0 and T , the average power loss is generally calculated by the following equation: & P

I GBT

=

vce (t)ic (t)dt . T − T0

(4.33)

4.2 Nonlinear Behavioral Model

145

Algorithm 4.2 IGBT kernel procedure IGBT BEHAVIORAL MODEL PWLD: Calculate Vpwld Calculate Gpwld , Ipwldeq by (4.14) and (4.15) MOSFET: Calculate vCge and vd Compute Gmosvd and Gmosvcge by (4.17) and (4.18) Compute imos and Imoseq by (4.16) and (4.19) tail current: Calculate vtail Compute Gtailvd , Gtailvcge and Gtailvtail by (4.21)–(4.23) Calculate itail and Itaileq by (4.20) and (4.24) capacitors: Calculate vC Compute GC , iC by (4.25)–(4.26) Update ICeq by (4.27) or (4.29) Complete module: Form matrices G and Ieq by (4.30)–(4.31) Solve matrix equation (4.11) Update v(t) Store qce , qcg , vtail and vcge into global memory if v(t) not converged then go to Kernel0 else Store v(t) into global memory

 Kernel0

 Kernel1

Thus, in digital simulation, it is formulated as &T PI GBT =

T0 vCE · iC dt

T − T0

=

 Nt  n n n+1 n+1 t n=1 vCE iC + vCE iC 2(T − T0 )

(4.34)

where Nt is the number of discrete sections in that period. Two circuit-based dynamic electrothermal networks are established in Fig. 4.5a,b for cooling system capacity evaluation, i.e., the Cauer type and Foster type [144]. The power loss of IGBT is taken as a time-varying current injection, whose terminal voltage is deemed as the junction temperature Tvj . The dynamic junction to case thermal impedance has the following expression: Zth =

n 

Rth(i) (1 − e

− τt

i

),

(4.35)

i=1

where Rth(i) and τi are constants available in manufacturer’s device datasheet. For semiconductor switch electrothermal analysis, the thermal impedance is embodied by a corresponding number of R–C pairs. In the Cauer-type network, the thermal resistance of the module case and heat sink is taken as five serial resistors,

146

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Rth1

PIGBT Tvj

Cth1

Rth2

Cth2

Module Case

Rth3

Rth4

Cth3

Cth4

(a)

Rth5

Cth5 Te Heatsink

Cth1

Cth2

Cth3

Cth4

Cth5

Rth1

Rth2

Rth3

Rth4

Rth5 Te

Ih1

Ih2

Rh1

Rh2

PIGBT Tvj

(b)

Ih3

Ih4

Ih5

Rh3

Rh4

Rh5 Te

PIGBT Tvj

(c) Fig. 4.5 IGBT electrothermal transient network: (a) Cauer-type, (b) Foster-type, and (c) Fostertype EMT models

while the dynamic diffusion is realized by the five grounded capacitors. In contrast, the Foster-type network reflects the dynamic process by cascaded parallel R–C combinations. In both equivalent circuits, the ambient temperature is represented by the voltage source Te , whereas the ground symbol means temperature reference normally set as 0◦ C. Though both equivalent circuits have the same number of electrical nodes, those in Foster type can be easily merged. With the capacitance being calculated by Cth(i) =

τi , Rth(i)

(4.36)

the EMT model of Foster type can be easily derived, as shown in Fig. 4.5c. Consequently, the junction temperature can be calculated as [145] Tvj (t) =

5  PI GBT (t) + Ih(i) (t − t) i=1

−1 Rth(i) + GCth (i)

+ Te , i = 1 − 5,

(4.37)

where GCth (i) and its companion current Ihi , also known as history current, can be obtained by different numerical integration approaches. For example, with TLM

4.2 Nonlinear Behavioral Model

147

stub model, it is GCth (i) =

t , 2Cth(i)

(4.38)

i Ih(i) (t) = 2 · tC(i) (t) · GCth (i) ,

(4.39)

i in which tC(i) is the incident pulse of capacitor’s TLM stub model and is updated by i tC(i) (t) =

PI GBT (t) + Ih(i) (t) −1 Rth(i) + GCth (i)

i − tC(i) (t − t).

(4.40)

Similarly, when the trapezoidal integration method is applied to the capacitors, GCth (i) keeps the same form due to an identical numerical order, whereas the history current is recursively expressed as Ih(i) (t) = Ih(i) (t − t) + 2vCth (i) (t) · GCth (i) ,

(4.41)

where vCth (i) denotes the corresponding capacitor voltage. The device datasheet provided by the manufacturer normally gives a few curves of some key parameters under two typical temperatures such as 25◦ C and 125◦ C. To obtain a thorough relation between a parameter and the junction temperature covering all operational range, linear interpolation can be used; for example, parameter y is expressed as y(Tvj ) = k · Tvj + p.

(4.42)

With more available data under different junction temperatures, nonlinear functions can be employed so as to describe the dynamic electrothermal features more precisely. The Cauer-type can be either calculated by using the nodal voltage equation (4.11) or the following format: 

Tvj = 

PI GBT + Ih0 

G0

(4.43)

,



where G0 and Ih0 are 

Ih1



Ih0 = Ih1 + 



Rth1 G1 + 1

G0 = GCth 1 +

1 

(4.44)

,

Rth1 G1 + 1

,

(4.45)

148

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

or recursively expressed as 

Ih(i+1)



Ih(i) = Ih(i+1) +



Rth(i+1) G(i+1) + 1

(4.46)

, i = 0, 1, 2, 3

(4.47)

1



Gi = GCth (i+1) +

, i = 0, 1, 2, 3



Rth(i+1) G(i+1) + 1

while the last section is 

Ih4 = Ih5 + 

Te , Rth5

G4 = GCth 5 +

4.2.4

1 . Rth5

(4.48)

(4.49)

Complete IGBT/Diode Model

A complete nonlinear IGBT/diode model is available when the two devices are connected in an antiparallel manner, as shown in Fig. 4.6a. The diode shares its two terminals with its semiconductor counterpart, and when the pair participates in circuit simulation, the admittance matrix has a dimension of six, which can be formed node by node. The above model is practical for small-scale power converter simulation as the CPU can solve the corresponding matrix with convergent results. However, it will become burdensome with a rising number of IGBTs, which could also be connected to each other; moreover, the Rtail –Ctail pair in the IGBT is the main source that contributes to numerical divergence. Thus, a simplified model is given in Fig. 4.6b. As the Rtail –Ctail pair only impacts the tail current itail , it can be removed from its initial position and constitute an independent circuit. In EMT simulation, after obtaining imos , the value is used to compute the nodal voltage of the RC pair, which further yields the tail current. The freewheeling diode can be kept unchanged, or under certain circumstances, its reverse recovery could be neglected, only retaining the NLD for higher computational efficiency. In this case, discretization of the controlled current source itail as given in (4.21)– (4.24) becomes unnecessary, since its value can be directly calculated by (4.20). This simplified model shortens the computation time and reduces the chance of numerical divergence. Since behavioral modeling provides both the IGBT and diode with flexibility, there could be other variants, such as Fig. 4.6c,d.

4.2 Nonlinear Behavioral Model

C

149

1

C

E

PWLD

G G

IGBT

itail

3

4

Rg Cge

Rtail

Rg

Te Ploss

2

Cce

imos

K

imos

I

itail

Zth

Cge

Rtail C tail

A

Te

4

E

(b) C

1

Ploss

K G

imos

Ccg

C

Ccg

Zth

NLD

itail

Ctail

G

(c) 1

Ploss

K I

itail

imos RL

2

Rtail C tail E

Te

5

imos

Cge

irr

A E

Rg

RL 4

Rtail

Cge

L

2

3

Rg

Zth

5

NLD

(a)

PWLD

Ccg

L

A

1

C

3

Cce

6

E

G

RL ir

Ctail

Ploss

K

2

imos

Ccg

Diode

3

NLD

A

4

L

Zth irr Te

(d) Fig. 4.6 Complete nonlinear IGBT/diode behavioral model: (a) full model, (b) reduced-order model I, (c) reduced-order model II, and (d) reduced-order model III

150

4.2.5

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Model Validation

The IGBT/diode modeling method was verified by the commercial device-level simulation tool SaberRD® using its default Siemens IGBT module BSM300GA160D since it provides switch models that were experimentally validated, and the parameters are listed in Appendix A.3.1. In Table 4.1, the execution time of device-level simulation was compared by computing single-phase modular multilevel converters (MMCs) for a 100 ms duration with 100 ns as the time-step. It takes SaberRD® up to 1700 s to compute a nine-level converter, and the results are no longer convergent once the voltage level reaches 11. Therefore, it is infeasible for a commercial device-level simulation package to conduct power system computation. In the meantime, the IGBT/diode model with decoupling method was also tested on the single-core CPU and the NVIDIA Tesla® V100 GPU. This device-level model enables the CPU to achieve a speedup SP1 of almost 18 times in nine-level MMC, and the speedup SP2 by GPU was near 11. The GPU overtakes the CPU when the MMC level reaches 21 since its speedup over CPU SP3 is greater than one. As the scale of power converters with high-order nonlinear power semiconductor switch models that current component-level simulation tools are able to solve is very limited due to numerical instability, a single-phase nine-level MMC with a reasonable DC bus voltage of 8 kV which is the maximum scale that SaberRD® can solve was employed for the IGBT NBM model validation. Some device-level results from the MMC are given in Fig. 4.7. With a dead time T = 5 μs, a gate resistance of 10 , and a voltage of ±15 V, the switching transients are normal in Fig. 4.7a–c. A slight overshoot was observed in the IGBT turn-on current, and the diode reverse recovery process accounts for this phenomenon. Figure 4.7d shows the impact of the gate driving conditions on switching transients that is only available in device-level modeling. Adjusting the gate turn-off voltage to 0 V leads to a tremendous current overshoot, which means this driving condition is hazardous to the IGBT. And when the gate resistance is set to 15 , the current rises more slowly as the time interval t2 is slightly larger than t1 . SaberRD® simulation was also conducted, and a good agreement validates the IGBT/diode nonlinear behavioral model designed into a GPU kernel. Table 4.1 NBM-based MMC execution time by various platforms for 100 ms duration

MMC Level 5-L 7-L 9-L 11-L 21-L 33-L

Execution time (s) SaberRD CPU1 709 56.2 1240 80.3 1720 98.4 – 121.2 – 260.3 – 368.9

GPU 159.1 159.8 163.7 163.3 206.0 238.4

Speedup SP1 SP2 12.6 4.5 15.4 7.8 17.5 10.5 – – – – – –

SP3 0.35 0.50 0.60 0.74 1.26 1.55

4.2 Nonlinear Behavioral Model

vCE (kV), iC/D (kA)

1.0

151

SaberRD® 1.0

0.8

NBM

vCE

0.6

0.6

0.4

0.4

iC

0.2

4.0

6.0

8.0

t/μs

1.0

vCE (kV), iC/D (kA)

iC

0.0 2.0

(a)

SaberRD

0.8

®

NBM

vCE

0.6 0.4 0.2

iD

0.0 0.0

SaberRD® NBM

0.2

0.0 0.0

vCE

0.8

0.0

2.0

4.0

(c)

6.0

8.0

t/μs

6.0

8.0

(b) 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0

t/μs

SaberRD®

Rg=10Ω, Vg=15V/0V

NBM

Rg=10Ω, Vg=±15V Rg=15Ω, Vg=±15V

iC

0.0 2.0

4.0

t1 t2

t1 2.0

4.0

(d)

6.0

8.0

t/μs

Fig. 4.7 Type I IGBT/diode switching transients: (a) turn-on, (b) turn-off, (c) diode reverse recovery, and (d) IGBT turn-on current under different gate conditions

In Fig. 4.8a–c, the switching patterns between different tools are compared. The NBM leads to exact static and dynamic current waveforms to SaberRD® . In contrast, PSCAD™ /EMTDC™ was not able to give the actual current stress of an IGBT during operation, as the switching transients could not be observed. Moreover, the model also determines the simulation accuracy. It is shown by Fig. 4.8c that with default IGBT and diode model TSSM1 and a typical time-step of 20 μs in PSCAD™ /EMTDC™ , a current disparity I = 6 A out of 190 A is witnessed even under steady state. The result from PSCAD™ /EMTDC™ becomes closer to the NBM when an approximate on-state resistance and voltage drop of the IGBT/diode pair are set to its switch model TSSM2 . Applying the Fostertype electrothermal network, the junction temperatures of the two complementary switches in a submodule can be obtained, as given in Fig. 4.8d. A dramatic temperature surge is observed when the nine-level MMC starts to operate, and the curve decreases gradually along with the converter’s entry into the steady state. The correctness of these results is validated by SaberRD® . The switching transients peculiar to a device-level model, e.g., the IGBT turnon currents and voltages, are provided. In Fig. 4.9a, a 5 μs dead time is set for

152

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

IGBT turn-on

1.0

SaberRD

1.0 0.5

40

NBM overshoot PSCAD TSSM2

(a)

60

80

iS2 (kA)

0 SaberRD®

-0.5

NBM 20

NBM, ΔT=20μs

NBM 20

t/ms

(b)

ΔI

PSCAD

(c)

60

80

t/ms

S1

S1 S2

20 0

t/ms

67.5ºC

80

40

40

80

135ºC

60

-0.5

60

S2

100

0

20

40

120

NBM, ΔT=5μs

PSCAD TSSM1

IGBT turn-on

®

Tvj (ºC)

iS1 (kA)

-1.0

diode reverse recovery

0.5

0

-0.5

iS2 (kA)

1.0

diode reverse recovery

0.5

20

SaberRD®

NBM

40

80

(d)

60

t/ms

Fig. 4.8 Type I IGBT/diode switching pattern and junction temperature: (a) upper switch current, (b) lower switch current, (c) IGBT junction temperatures, and (d) switching pattern difference between device-level model and two-state switch model

the two complimentary IGBTs, and their current waveforms demonstrate tolerable overshoots; on the other hand, when the dead time is canceled, a tremendous surge which exceeds the capacity of the BSM300GA160D IGBT module is observed in both switches, as given in Fig. 4.9b. More apparent results are provided by showing the junction temperatures in Fig. 4.9c. As can be seen, in the former case, both of the IGBTs maintain a normal temperature of around 55◦ C, while without an appropriate dead time, the junction temperatures could reach over 100◦ C, meaning the IGBTs cannot operate in such scenarios. The good match between the fourthorder model and the SaberRD® fifth-order model indicates that this model can be used in real applications to provide a thorough insight into component-level details that otherwise current commercial EMT-type solvers are unable to achieve.

4.3

Physics-Based Model

The physics-based model uses various basic circuit elements to describe the microlevel physics phenomena of a power semiconductor switch. Similar to the behavioral model, it could also have a number of variants since a semiconductor physics

4.3 Physics-Based Model

1.0

153

vCE (kV), iC (kA)

Proposed Model

0.8

vCE1

SaberRD

Proposed Model

vCE2

SaberRD

0.6 0.4 0.2

iC1

0.0

-0.2 0

3

1.0

4 2

iC2 5

1 6

3

0

t (μs) (a)

4 2

vCE (kV), iC (kA)

Proposed Model

vCE1

SaberRD

0.8

5

1 6

Proposed Model

vCE2

SaberRD

0.6 0.4

iC1

0.2

iC2

0.0

-0.2 0

1

2

3

4

5

6

1

0

t (μs)

2

3

4

5

6

(b)

55

Tvj2

50

Tvj1

Tvj2

Tvj (ºC)

45 40 35 30 25

Tvj1 0

20

40

60

80 100 120 140 160 180

0

t (ms)

20

40

60

80 100 120 140 160 180

(c)

Fig. 4.9 IGBT type III model component-level validation: (a) switching dead time 5μs, (b) no switching dead time, and (c) junction temperatures when the dead time is 5μs (left) and 0 (right)

154

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

phenomenon can be represented by various circuit components, and the overall model is also obtained based on the equivalent circuit of the two individual parts.

4.3.1

Physics-Based Diode Model

4.3.1.1 Model Formulation The physical structure of a p-i-n power diode is shown in Fig. 4.10a [146]. The reverse recovery happens when turning off a forward-conducting diode rapidly as described by the following equations [147]: iR (t) = 0=

qE (t) − qM (t) , TM

(4.50)

dqM (t) qM (t) qE (t) − qM (t) + − , dt τ TM

(4.51)

qE (t) = IS τ [e

vE (t) VT

− 1],

(4.52)

where iR (t) is the diffusion current in the intrinsic (i) region; qE (t) and qM (t), respectively, represent charge variable in the junction area and in the middle of i-region; TM is the diffusion transit time across i-region; τ is the lifetime of recombination; IS is the diode saturation current constant; vE is the junction voltage; and VT is the thermal voltage constant. The voltage drop across half of i-region vF (t) is described as vF (t) =

VT TM RM0 i(t) , qM (t)RM0 + VT TM

(4.53)

where i(t) is diode current and RM0 is the initial resistance of this region. The emitter recombination phenomenon can be described by

v(t)

i(t)

i(t) vE(t)

A i(t)

CJ(t)

P

+

vA(t)

vF(t) vE(t)

vF(t)

RS

rF(t)

iR(t)

(a)

2vE(t) gR

2vF(t) vin(t)

iReq

CJ(t)

N iE(t)

+

K

rF(t) Forward recovery Reverse recovery

gE

i-region iE(t)

v(t)

+

K

A

RS

vK(t) Contact resistor

iEeq gJ

Emitter recombination Junction capacitance

iJeq

(b)

Fig. 4.10 Physics-based diode modeling: (a) physical structure of power diode and (b) equivalent circuit

4.3 Physics-Based Model

155

iE (t) = ISE [e

2vE (t) VT

− 1],

(4.54)

where vE represents the junction voltage and iE (t) and ISE are the end-region recombination current and emitter saturation current, respectively. Then, the voltage across the diode is v(t) = 2vF (t) + 2vE (t) + RS · i(t),

(4.55)

where RS is the contact resistance, and i(t) = iR (t) + iE (t) +

dqJ (t) . dt

(4.56)

The charge stored in the junction capacitance qJ (t) which contributes to the diode current i(t) takes the form of  qJ (t) = CJ (t)d[2vE (t)], (4.57) with the junction capacitance CJ (t) given as CJ (t) =

⎧ ⎨ ⎩

CJ 0 , 2v (t) [1− φE ]m B m·CJ 0 ·2vE (t) φB ·0.5m+1

vE < CJ 0 − (m − 1) 0.5 vE ≥ m,

φB 4

φB 4

(4.58)

where CJ 0 is the zero-biased junction capacitance, φB is the built-in potential, and m is the junction grading coefficient.

4.3.1.2 Physics-Based Diode EMT Model The discrete and linearized equivalent circuit of the diode nonlinear physics-based model is mandatory for EMT simulation. It implies in (4.50) that the companion circuit of reverse recovery phenomenon is dependent on qE and qM . Discretization M (t) of the differential term dqdt in (4.51) by trapezoidal rule leads to qM (t) =

t · qE (t) qhist (t − t)  + , 1 + k12t 2TM 1 + k12t

(4.59)

where the history term takes the form of   t t · k1 + · qE (t), qhist (t) = qM (t) 1 − 2 2TM k1 =

1 1 + . τ TM

(4.60)

(4.61)

156

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Then, the equivalent conductance gR and companion current iReq can be derived as gR = iReq = k2 IS τ [e

vE (t) VT

vE (t) ∂iR 1 = k2 IS τ · · e VT , ∂(2vE ) 2VT

(4.62)

− 1] − k3 qhist (t − t) − gR · 2vE (t),

(4.63)

where k2 =

1 t  − TM 2TM2 1 +

tk1 2

 , and k3 =



1

TM 1 +

tk1 2

.

(4.64)

Taking partial derivative of (4.54) yields the discretized equivalent circuit for emitter recombination (gE , iEeq ): gE =

ISE 2vVE (t) ∂iE = e T , ∂(2vE ) VT

iEeq = ISE [e

2vE (t) VT

− 1] − gE · 2vE (t).

(4.65)

(4.66)

The equivalent circuit of junction capacitance is gJ =

2 CJ (t), t

iJ eq = iJ (t) − gJ · 2vE (t),

(4.67) (4.68)

and according to (4.56) and (4.58), iJ (t) =

2 2 · qJ (t) − · qJ (t − t) − iJ (t − t), t t

⎧ −φB CJ 0 2vE (t) (1−m) , vE (t) < φ4B ⎪ 1−m · [1 − φB ] ⎨ m+2 2 qJ (t) = 2 mCJ 0 vE (t) − φB ⎪ ⎩ 2m+1 (m − 1)CJ 0 vE (t). vE (t) ≥ φ4B

(4.69)

(4.70)

The equivalent resistance of forward recovery is rF (t) =

2VT TM RM0 2vF (t) = . i(t) qM (t)RM0 + VT TM

(4.71)

4.3 Physics-Based Model

157

Node voltage v(t)

qhist qhist(t-Δt)

Kernel0 Reverse r ecovery & Junction capacita nce

RJ1

RJn

qhist(t) gR(t) iReq(t) gJ(t) iJeq(t)

Kernel1 Nonlinear solution

NS1

Yes

v(t)

NS2

...

...

vE(t)

RJ2

Global Memory

vE(t) converge? No

NSn

Fig. 4.11 Massive-thread parallel implementation of power diode physics-based model

The final three-node equivalent circuit for the diode is organized as in Fig. 4.10b. The admittance matrix and companion current vector in (4.11) are expressed as ⎡

⎤ −gR − gE − gJ 0 g R + gE + gJ ⎢ ⎥ 1 1 − rF (t)+R G = ⎣ −gR − gE − gJ gR + gE + gJ + rF (t)+R , S S ⎦ 1 1 0 − rF (t)+RS rF (t)+RS

(4.72)

T  Ieq = −iReq − iEeq − iJ eq , iReq + iEeq + iJ eq , 0 .

(4.73)

In massive-thread parallel Newton-Raphson iteration, the diode nodal voltages (n+1) can be updated by previous nth iteration values at the next calculation V Diode until the solution is converged.

4.3.1.3 Parallel Massive-Thread Mapping As shown in Fig. 4.11 and described in Algorithm 4.3, there are two kernels in the massive-thread parallel diode model. The dynamic conductance gR and equivalent reverse recovery current iReq are updated by the junction voltage vE , and the equivalent conductance gJ and current source iJ eq are updated by the junction capacitance CJ in reverse recovery and junction capacitance units (RJs) of Kernel0 . The nonlinear system is solved using the massive-thread parallel Newton-Raphson iteration method in nonlinear solution units (NSs) of Kernel1 . The convergence of v(t) is checked, and it determines whether the process will move to the next time step.

158

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Algorithm 4.3 Diode kernel procedure PHYSICS-BASED POWER DIODE MODULE Reverse recovery: Calculate qE (t) from vE (t) in (4.52) Calculate qM (t) from qE (t) and qhist (t − t) using (4.59) Update qhist (t) in (4.60) Calculate gR and iReq from vE (t) as (4.62) and (4.63) Junction capacitance: Calculate CJ (t) from vE (t) as (4.58) Compute gJ and iJ eq from vE (t) as (4.67) and (4.68) Complete module: Solve matrix equation (4.11) Update v(t) if v(t) not converged then go to reverse recovery else Store vA , vin , and vK into global memory

4.3.2

 Kernel0

 Kernel1

Physics-Based Nonlinear IGBT Model

4.3.2.1 Model Formulation Based on Hefner’s physics-based model [148], the IGBT is described as the combination of a bipolar transistor and a MOSFET. Since these internal devices are differently structured from standard microelectronic devices, a regional approach is adopted to identify the phenomenological circuit of IGBT as shown in Fig. 4.12a. An analog equivalent circuit, shown in Fig. 4.12b, makes it possible to implement the model in circuit simulators by replacing the BJT with base and collector current sources and MOSFET with a current source, which represents the currents between each of the terminals and internal nodes in terms of nonlinear functions [149]. Currents The steady-state collector current icss of BJT is formulated as icss =

4bDp Q iT + , 1 + b (1 + b)W 2

(4.74)

where b is the ambipolar mobility ratio, Dp represents hole diffusivity, and the anode current iT and quasi-neutral base width W are shown as follows: iT =

vAe , rb

W = WB − Wbcj = WB −

(4.75) '

2 si (vds + 0.6) . qNscl

(4.76)

4.3 Physics-Based Model

159

Gate

Gate

Collector

g Cgd

g Coxd

Collector

MOSFET

s

Cgdj Cdsj d

rb

d

Cm

Coxs

c

n+

Depletion Region

p+

Cgs

Cdsj imult

b

Ccer

Cmult

b

BJT

n-

Cebj+Cebd e

c

imos s

Ceb

+

p

ibss

e

icss

rb Anode

Emitter

(a)

Ccer

(b)

Fig. 4.12 (a) Phenomenological physical structure of IGBT and (b) analog equivalent circuit of IGBT (IGBT-AE)

The instantaneous excess-carrier base charge Q is given as

Q=

⎧ ⎪ ⎪ ⎨Q1 max(Q1 , Q2 ) ⎪ ⎪ ⎩Q 2

veb ≤ 0 0 < veb < 0.6 .

(4.77)

veb ≥ 0.6

The emitter-base depletion charge Q1 and excess-carrier quasi-neutral base charge Q2 have the following expression: ( Q1 = Qbi − A 2qNB si (0.6 − veb ),   W , Q2 = p0 qALtanh 2L

(4.78) (4.79)

where Qbi is the emitter-base junction built-in charge, NB is the base doping concentration, vAe represents the voltage across rb , q is the electron charge, p0 is the carrier concentration at the emitter end of the base, A is the device active area, L is the ambipolar diffusion length, WB represents the metallurgical base width, Wbcj is the base-collector depletion width, si is the silicon dielectric constant, vds means the drain-source voltage, and Nscl is the base-collector space concentration. When the emitter-base voltage veb ≥ 0, p0 can be an iterative numerical solution due to the relationship with Q2 in (4.79) based on comparison between Q1 and Q2 . The

160

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Newton-Raphson method is adopted to solve p0 in the following equation: qveb = ln kT

)

*



p0 + NB (NB + p0 ) − βln NB

p0 1 + 2 NB ni

 ,

(4.80)

where β=

2μp . μn + μp

(4.81)

If 0 < veb < 0.6 and Q1 > Q2 , instead of the numerical solution, p0 is modified as p0 =

Q1 qALtanh

 W .

(4.82)

2L

The base resistance rb in (4.75) is expressed as rb =

⎧ ⎨

W qμn ANB W ⎩ qμeff Aneff

veb ≤ 0 veb > 0

,

(4.83)

where μn and μeff stand for electron mobility and effective mobility and neff is the effective doping concentration. The steady-state base current ibss is caused by decay of excess base charge of recombination in the base and electron injection in the emitter, expressed as follows: ibss =

2 i 4Q2 Nscl Q sne + , 2 2 τH L QB ni

(4.84)

where τH L is the base high-level lifetime, isne is the emitter electron saturation current, ni is the intrinsic carrier concentration, and QB representing the background mobile carrier base charge is formulated as QB = qAW Nscl .

(4.85)

And the MOSFET channel current is expressed as

imos =

⎧ ⎪ ⎪ ⎨0

vgs < vT

K (v − v )v − ⎪ p gs 2 T ds ⎪ ⎩ Kp (vgs −vT ) 2

2 Kp vds 2

vds ≤ vgs − vT ,

(4.86)

vds > vgs − vT

where Kp is the MOSFET transconductance parameter, vgs is the gain-source voltage, and vT is the MOSFET channel threshold voltage.

4.3 Physics-Based Model

161

In addition, there is an avalanche multiplication current imult in Fig. 4.12b. Due to thermal generation in the depletion region and carrier multiplication, which is a key factor to determine the avalanche breakdown voltage, and the leakage current, imult is given as imult = (M − 1)(imos + icss + iccer ) + Migen ,

(4.87)

where the avalanche multiplication factor M is given as   vds −BVn , M = 1− BVcb0

(4.88)

where BVcb0 is the open-emitter collector-base breakdown voltage and the collector-base thermally generated current Igen has the following expression: ' igen

qni A = τH L

2 si vbc . qNscl

(4.89)

Capacitance and Charges The gate-source capacitance Cgs in the analog module is a constant, while all the others are charge related: Qgs = Cgs vgs .

(4.90)

Depending on the operation condition, the gate-drain capacitance Cgd has two sections Cgd =

 Coxd

vds ≤ vgs − vT d

Cgdj Coxd Cgdj +Coxd

vds > vgs − vT d

,

(4.91)

where vT d is the gate-drain overlap depletion threshold voltage and Coxd is the gatedrain overlap capacitance. And the capacitance Cgdj is given as Cgdj =

Agd si , Wgdj

(4.92)

where Agd is the gate-drain overlap area, si is the silicon dielectric constant, and Wgdj is the gate-drain overlap depletion width given as follows: ' Wgdj =

2 si (vdg + VT d ) . qNscl

(4.93)

162

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

And the charge of Cgd takes the form of Qgd =

⎧ ⎨Coxd vdg 2 ⎩ qNB si Agd Coxd



Coxd Wgdj

si Agd



Coxd Wgdj

si Agd

− ln 1 +



vds ≤ vgs − vT d − Coxd vT d

vds > vgs − vT d (4.94)

Similarly, other capacitances are related to a certain depletion capacitance, and the depletion capacitance depends on active area and width. The drain-source depletion capacitance Cdsj is related to (A-Agd ) and drain-source depletion width Wdsj as follows: Cdsj =

(A − Agd ) si . Wdsj

(4.95)

And the charge Qds is ( Qds = Ads 2 si (vds + 0.6)qNscl . ∂Qeb ∂Veb

according to the following

(Q − Qbi )2 , 2qNB si A2

(4.97)

The emitter-base capacitance Ceb is solved for equation: Vebj = 0.6 −

(4.96)

as Ceb = −

qNB si A2 . Q − Qbi

(4.98)

And Qeb is the same as Q given in (4.77). The collector-emitter redistribution capacitance Ccer is solved from the ambipolar diffusion equation as Ccer =

QCbcj , 3QB

(4.99)

where QB is the background mobile carrier base charge and the base-collector depletion capacitance Cbcj is the same as Cdsj . The carrier multiplication charges and capacitance are related to Ccer , given as Qmult = (M − 1)Qce ,

(4.100)

Cmult = (M − 1)Ccer .

(4.101)

.

4.3 Physics-Based Model

163

4.3.2.2 Model Discretization and Linearization Similar to the power diode model, the analog equivalent circuit model (IGBT-AE) in Fig. 4.12b containing nonlinear and time-varying elements is transformed into the discretized and linearized equivalent circuit (IGBT-DLE) as shown in Fig. 4.13. After linearization of the four current sources and the conductivity-modulated base resistance rb , these five elements are converted into the time-discrete form. Take the imos at the (n + 1)th iteration for example: n+1 n imos = imos +

  ∂i n   n ∂imos n+1 n+1 n n vgs + mos vds − vgs − vds ∂vgs ∂vds

n+1 n n n+1 n + gmosgs vgs + gmosds vds , = imoseq

(4.102)

Collector

iCgseq gCgs

iCgdeq gCgd

Gate

Emitter Fig. 4.13 Discretized and linearized equivalent circuit of IGBT (IGBT-DLE)

gcssbcvbc

iCmulteq gcssaevae

imulteq gCmultbc gcssebveb

gmultebveb icsseq

gmultaevae iCcereq gTae

gmultgsvgs gCcerbcvbc

gbsseb iTeq

gTbcvbc

gbssbcvbs gTebveb

ibsseq

gCeb

iCebeq

imoseq

gCdsj

iCdsjeq

gmosgsvgs

gmultds

gmosds

164

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

where n n n n = imos − gmosgs vgs − gmosds vds . imoseq

(4.103)

Similarly, applying the method to iT , icss , ibss , and imult yields n+1 n+1 n+1 iTn+1 = iTn eq + gTn ae vae + gTn bc vbc + gTn eb veb ,

(4.104)

n+1 n+1 n+1 n n n n+1 n icss = icsseq + gcssbc vbc + gcssae vae + gcsseb veb ,

(4.105)

n+1 n+1 n+1 n n n ibss = ibsseq + ibsseb veb + gbssbc vbc ,

(4.106)

n+1 n+1 n+1 n+1 n n n n n+1 n imult = imulteq + gmultds vds + gmultds vds + gmultae vae + gmulteb veb . (4.107)

All the capacitors can be replaced by pairs of conductance in parallel with a current source using integration methods such as Euler, trapezoidal on charges at t + t. For example, applying the trapezoidal method on Qgs gives h [iQgs (t + t) + iQgs (t)], 2

(4.108)

2 [Qgs (t + t) − Qgs (t)] − iQgs (t) h

(4.109)

Qgs (t + t) = Qgs (t) + and solution of iQgs yields iQgs (t + t) =

= iQgseq (t) + GCgs (t + t)vgs (t + t), iQgseq (t) = iQgs (t) − GCgs (t)vgs (t).

(4.110)

With the conversion above, IGBT-DLE is modeled as in Fig. 4.13. Applying KCL to nodes gate, collector, base, emitter, and anode gives the five equations as follows:

−imult −

dQmult dt

dQgd + imos dt

dQgs dQds − dt dt dQgs dQdsj dQCcer − − imos − − icss − dt dt dt dQdsj dQmult dQeb + imult + − ibss − + dt dt dt dQeb dQCcer + ibss + + icss − iT dt dt

= iG

(4.111)

= iC

(4.112)

=0

(4.113)

=0

(4.114)

iT = iA

(4.115)

4.3 Physics-Based Model

165

Based on the detailed linearized equivalent circuit, a 5 × 5 conductance matrix GI GBT can be established for the matrix equation, GI GBT · V I GBT = I IeqGBT ,

(4.116)

where V I GBT = [vc I IeqGBT = [iceq

vg

va

igeq

ve ]T ,

vd

iaeq

(4.117)

ieeq ]T ,

ideq

(4.118)

and iceq = imulteq + iCmulteq + icsseq + iCcereq + iCgseq + iCdsj eq + imoseq , (4.119) igeq = −iCgseq + iCdgeq ,

(4.120)

iaeq = −iT eq ,

(4.121)

ideq = iCcebeq + ibsseq − imoseq − iCdgeq − iCdsj eq − imulteq − iCmulteq , (4.122) ieeq = −icsseq − ibsseq − iCcereq − iCcebeq + iT eq ,

(4.123)

GI GBT = gmultds +gmultgs +gCmultbc + gcssbc +gCcerbc +gCgs + gCdsj +gmosds +gmosgs

−gmultgs − gCgs − gmosgs

−gCgs

gCgs + gCdg

0

−gCdg

0

−gT bc

0

gT ae

gT bc − gT eb

−gT ae + gT eb

gbssbc −gmosgs −gmosds − gCdsj −gmultds −gmultgs − gCmultbc

gmosgs − gCdg + gmultgs

gmultae

−gcssbc −gCcerbc − gbssbc +gT bc

0

gcssae − gT ae

−gmultae − gcssae

−gmultds −gCmultbc −gcssbc + gcsseb −gCcerbc −gdsj − gmosds +gmulteb

gCeb −gbssbc +gbsseb + gmosds +gCdg +gCdsj + gmultds −gmulteb −gCmultbc gcssbc −gcsseb +gCcerbc − gbsseb +gbssbc −gCeb − gT bc +gT eb

−gmultae −gcssae + gcsseb −gmulteb

−gCeb +gbsseb − gmultae +gmulteb −gcssae +gcsseb +gbsseb + +gCeb +gT ae −gT eb

,

(4.124) Similarly, when using the Newton-Raphson method to solve the matrix equation, instead of solving nodal voltage vector V I GBT directly, V is obtained to update V I GBT iteratively. Therefore, (4.116) evolves to GI GBT V = −I,

(4.125)

where −I = [i c

ig

ia

id

i e ]T ,

(4.126)

166

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

i c = imulteq + iCmulteq + icsseq + iCcereq + iCgseq + iCdsj eq + imoseq + (gmultgs − gCgs + gmosgs )(vg − vc ) + (gCmultbc + gmultds + gCdsj + gmosds + gcssbc + gCcerbc )(vd − vc ) + (gmultae + gcssae )(va − ve ) + (gmulteb + gcsseb )(ve − vd ),

(4.127)

i = −iCgseq + iCdgeq + gCgs (vc − vg ) + gCdg (vd − vg ),

(4.128)

i a = −iT eq + gT ae (va − ve ) + gT bc (vb − vc ) + gT eb (ve − vb ),

(4.129)

g

i d = iCcebeq + ibsseq − imoseq − iCdgeq − iCdsj eq − imulteq − iCmulteq + (gbssbc − gmultds − gCmultbc − gCdsj − gmosds )(vb − vc ) + (gCeb + gbsseb − gmulteb )(ve − vb ) − gCgd (vb − vg ) − (gmosgs + gmultgs )(vg − vc ) − gmultae (va − ve ),

(4.130)

i e = −icsseq − ibsseq − iCcereq − iCcebeq + iT eq + (gT bc − gcssbc − gbssbc − gCcerbc )(vb − vc ) + (gT ae − gcssae )(va − ve ) + (gT eb − gbsseb − gCeb − gcsseb )(ve − vb ).

(4.131)

4.3.2.3 Parallel Massive-thread Mapping The massive-thread concurrent implementation of a number of IGBTs, noted as n, is shown in Fig. 4.14 and described in Algorithm 4.4. A total number of 12 kernels are involved in the module. Kernel0 and Kernel1 check the PN junction and FET junction voltage limitations in successive Newton-Raphson iterations, and the main equivalent parameters updating is accomplished in Kernel2 to Kernel9 , including rb , six nonlinear capacitors, and five current source approximation. Kernel10 is in charge of establishing the iterative Jacobian matrix equation in (4.125) and solving V I GBT using LU decomposition or Gaussian elimination. Kernel11 assigns n CUDA blocks containing five threads per block to update voltage vector in each IGBT. If V I GBT converges, it means the iterative solution for the current timestep is accomplished, and the nodal voltage vector V I GBT will be stored into global memory and used as the initial value for next time-step. Otherwise, a new iteration starts by checking junction limitation. The six nonlinear capacitors Cgd , Cgs , Cdsj , Cmult , Ceb , and Ccer are updated in Kernel5 and Kernel6 for charges, currents, equivalent conductance, and parallel current sources. Intermediate parameters p0 , Q, Q1 , Q2 , and M are processed in p0 , Q, and M units in Kernel2 , Kernel3 , and Kernel4 . With the capacitance known, the equivalent conductance GQ and parallel current source IQeq are updated in Kernel8 , followed by equivalent conductance GQ and parallel current source I Qeq update in Kernel10 . Similarly, for the approximation to current sources imos , iT , icss , ibss , and imult , Kernel9 updates GI and I I eq . In this way, parameters in IGBT-DLE are updated. Solving for V I GBT in the matrix equation (4.125) is accomplished in Kernel10 . And Kernel11 is to update V I GBT from previous iteration value and V I GBT .

4.3 Physics-Based Model

167

Node voltages V Charge in capacitors Q

Global Memory

Current through capacitors IQ Time step size ∆t Collector-emitter redistribution capacitance Ccer

vneb(t) n vn-1eb(t) Kernel0 v eb(t) n n v bc(t) BJT v bc(t) vn-1bc(t)

Vn(t) vneb(t) Kernel2

p0

Q(t-∆t)

vngs(t) Kernel1 n v gs(t) vn-1gs(t) FET

IQ(t-∆t)

vngs(t) vnds(t)

Kernel3

Q

n Kernel8 gs(t) Qngd(t) InQ(t) Cgd IGBTn (t) C gd AE1 Cgs Qn(t) Qnds(t) Cdsj Ceb Cndsj(t) IGBTQn (t) Cncer(t) AE2 In (t) Qeq Cneb(t) eb n Kernel 6 Q ce(t) Qnce(t-∆t) GnQ(t) n n Q mult(t) I Qce(t-∆t) Cnmult(t) IGBTCn (t-∆t)

Kernel5 Q

vds(t-∆t)

Kernel4

M

cer

AEn

Mn(t) Ccer vnds(t-∆t) Cmult

Kernel9

∆tn ∆t Vn(t)

GnI(t)

Kernel7

rb

iT ibss icss imos imult

IGBTAE1 ∆Vn(t) IGBTAE2 Kernel10 IGBTAEn ...

...

Ccer(t-∆t)

Kernel11

Jacobian

InIeq(t) No

Vn(t) ∆V(t) converge? Yes

Vn(t)

Fig. 4.14 Massive-thread parallel implementation of IGBT

Algorithm 4.4 IGBT kernel procedure PHYSICS-BASED IGBT MODULE Check PN junction voltage veb (t) and vbc (t) from last iteration value Check MOSFET junction voltage vgs (t) from last iteration value Update parameters from IGBT-AE to IGBT-DLE Solve p0 as (4.80) to (4.82) Calculate Q, Q1 , and Q2 as (4.77) to (4.79) Update intermediate parameter M as (4.88) Calculate C and Q in Cgd , Cgs , Cdsj , and Ceb as (4.90) to (4.98) Calculate C and Q in Ccer and Cmult as (4.100) to (4.101) Update I Q , I Qeq , and GQ in all capacitors as (4.108) to (4.110) Calculate rb as (4.83) Calculate GI and I I eq for all current sources as (4.74) to (4.107 ) Build matrix equation (4.125) using (4.120) to (4.131) and solve for V Update V I GBT (t) for current iteration Check convergence of V I GBT if V I GBT converges then Store V I GBT to global memory and update t else Start from checking junction iterative limitation

 Kernel0  Kernel1  Kernel2  Kernel3  Kernel4  Kernel5  Kernel7  Kernel8  Kernel6  Kernel9  Kernel10  Kernel11

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Vce (V)

800 600

Vce Ic

Gate on

td(on)=1.09μs

Vce Ic

Gate on

td(on)=1.12μs

100

400

50

200

tr=0.19μs

tr=0.20μs

0

0 0

2

4

6

Gate off

t (μs) 0 (a) Vce Ic

2

4

6

t (μs) Vce Ic

Gate off

80

400 td(off)=3.1μs

200

td(off)=3.1μs

tf=0.36μs

Ic (A)

Vce (V)

600

Ic (A)

168

40

tf=0.35μs

0

0 0

2

4

t (μs) (b)

0

2

4

t (μs)

Fig. 4.15 IGBT switching results comparison between GPU simulation (left) and SaberRD® (right): (a) turn-on process and (b) turn-off process Table 4.2 Device-level switching times and power dissipation of IGBT and diode

Switching time (μs) Power dissipation (W) SaberRD GPU SaberRD GPU I GBT I GBT td(on) 0.10 0.098 Pon 112.31 113.02 GBT tIr GBT 0.17 0.15 PIoff 75.01 74.84 GBT 0.33 tId(off )

tIfGBT tDiode rr

4.3.3

0.34

GBT 287.52 PIcond

0.67

0.65

0.66

0.64

PDiode cond PDiode rr

289.06

7.48

7.55

9.95

10.07

Model Validation

The device-level simulation results are given in Fig. 4.15 which shows the detailed turn-on and turn-off switching times, voltages, and currents. Table 4.2 gives the power dissipation of the IGBT and its freewheeling diode under switching and conduction conditions. As can be seen from Fig. 4.15 and Table 4.2, the massively parallel GPU simulation results agree with the SaberRD quite well.

4.4

Nonlinear Dynamic Model

The nonlinear dynamic model is a physics-based model and therefore is also established according to the phenomenological circuit in Fig. 4.12a, with a BJT sharing as its base and collector, respectively, and the drain and source of a MOSFET as the mainframe, which is complemented by several passive elements and resistors to construct a complete model, as shown in Fig. 4.16a [150, 151].

4.4 Nonlinear Dynamic Model

169

C

1 RC+LC

2 Cbe Ccg

3

Cdg

G RG Cge

BJT C CE c

b

d

g

4

e

Rp

5 RA

E RE+LE

(a) 1 RC

C

G Rg

2

Rp B5

Cdg 5'

F1 Cge

FET

E 8

F2

3'

CCE

CD

Rb

irr

D 5

RA

(b)

7

iT

F3 B4 4' F4

RE

BJT

B1

B3

5'

3 4

6

S

8

Ccg

irr

D

s

B2

CD

7

Cds FET

Cbe

Rb

iT

6

S

Fig. 4.16 IGBT dynamic modeling: (a) advanced dynamic model and (b) discrete nonreciprocal EMT companion circuit

4.4.1

Dynamic Transistor Model

The model mainly reflects the static characteristics of the IGBT through the Darlington connection of MOSFET and BJT, while the junction and diffusion capacitances indicate the dynamic components of the switch. The classical Shichman-Hodges model [152] is used to describe the N-channel MOSFET static characteristics in three different regions, namely, the cutoff, linear, and saturation:

170

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

⎧ ⎪ ⎪ ⎨0, Vgs ≤ Vth   ds ds , Vds ≤ Vsat Ids (Vgs , Vds ) = VVsat Isat (Vgs , Vds ) · 2 − VVsat ⎪ ⎪ ⎩ Isat (Vgs , Vds ), Vds > Vsat

(4.132)

where Ids is the drain current; Vds denotes the drain-source voltage; Vgs and Vth represent the voltages of gate-source and the channel threshold, respectively; and variables relating to saturation are Isat = K(Vgs − Vth )NF ET · (1 + λVds ),

(4.133)

Vsat = AF ET (Vgs − Vth )MF ET ,

(4.134)

μn Cox Weff , 2 Leff

K=

(4.135)

where λ and μn are channel length modulation parameter and carrier mobility, Cox is the gate capacitance per unit area, Weff and Leff are effective channel width and length, and the remaining symbols with subscript F ET are constants. In the PNP-BJT, with a current gain β, the collector current maintains a proportional relationship with the base current, as [153] Ic = βIb , Ib = ISBJ T

 V  be MBJ T VT e −1 ,

(4.136) (4.137)

where Vbe is the base-emitter voltage, VT is the thermal voltage, and ISBJ T and MBJ T are coefficients.

4.4.1.1 Parasitic Dynamics The distribution of holes and electrons in the semiconductor yields parasitic capacitance. Following the changes of junction bias, a dynamic process when the positive and negative charges accumulate or neutralize with each other contributes to switching transients. Therefore, stray capacitors should be included in an integral nonlinear high-order model. The capacitors, including Cgc , Cce , Cds , and Cge , are generally under either the enhancement or depletion state, depending on how the junction is biased, i.e., VJ∗N C = Ms · Vdiff − VJ N C ,

(4.138)

where Vdiff and VJ N C are diffusion and junction voltages, respectively; Ms is a coefficient; and transition between the two states occurs when VJ∗N C crosses zero. Then, the capacitors, generally referred to as Cxy , can be written as [154]

4.4 Nonlinear Dynamic Model

Cxy (VJ∗N C ) =

171

   ⎧

α·VJ∗NC (1−ρ) ⎪ C (κ − 1) · 1 − exp − + 1 , VJ∗N C ≥ 0 ⎪ 0 (κ−1)Vdiff ⎪ ⎞ ⎨ ⎛ ⎪ C0 ⎝ρ + ⎪ ⎪ ⎩



1−ρ  ⎠ , α ∗

VJ∗N C < 0

V

1− VJ NC diff

(4.139) where C0 is the reference capacitance and α, κ, and ρ are factors. Taking the form of ' L C + LE , (4.140) Rs (VJ∗N C ) = Ddp Cxy (VJ∗N C ) in which Ddp is a coefficient, some additional damping resistances can be added in series to C–E and C–G capacitors to avoid possible oscillation between them and the terminal parasitic inductance LC and LE . The diffusion capacitors Cbe between the BJT base-emitter and CD in the freewheeling diode, on the other hand, can uniformly be expressed as Cdiff = τ

i(t) + K(Vgs − Vth )NF ET , Mdiff · VT

(4.141)

where i(t) is the BJT base current or diode anode current, τ denotes carrier lifetime, and Mdiff is a coefficient. Parameters τtail and δtail influence the off-switch current edge, as the latter shapes the falling edge and the former sets the duration. The injected current Itail at load current Iload is expressed as [155]:   t , Itail = δtail · Iload · exp − τtail

(4.142)

All the dynamic parameters described above have working point dependency on current I , voltage V , and junction temperature T as Cx = Cx0 · xI (I ) · xV (V ) · xT (T ),  xI (I ) = CC0 + (1 − CC0 )

 xT (T ) =

TJ TN OM

CC

IN OM 

xV (V ) = VC0 + (1 − VC0 )

I

V VN OM

(4.143) ,

(4.144)

,

(4.145)

V C

T C .

(4.146)

172

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

4.4.1.2 Freewheeling Diode The freewheeling diode connected to the IGBT in an antiparallel position has the following typical static characteristics:  V  D (4.147) ID = IDsat · e MD VT − 1 , where MD is an ideality factor, IDsat means saturation current, and VD is the forward-biased voltage between anode and cathode. The current-dependent series resistor is calculated by RBk Rb = + . ID 1 + INOM

(4.148)

In the above equation, RBk is the diode’s bulk resistance, and IN OM is the nominal collector current. The waveform of reverse recovery current is defined as the form of a piecewise function with five sections [156]. All sections are connected so that the curve remains differentiable at the transition from one region to the other. The maximum current Irrmax is obtained by  Irrmax = k τf wd −

1 τf wd



1 Tn

  ts 1 − e τf wd .

(4.149)

where parameter k represents the turn-off slope, τ is the carrier lifetime, Tn is the electron transit time, and ts when Irrmax is reached is calculated from the reverse recovery charge Qrr and the form factors.

4.4.2

IGBT EMT Model Derivation

The IGBT/FWD model is described in continuous time domain, and therefore, discretization prior to digital simulation implementation on any processor is mandatory, as shown in Fig. 4.16b [157]. Among a variety of integration methods, the first-order implicit backward Euler is preferred since it results in fewer computations without compromising accuracy due to a small time-step that is compulsory for device-level simulation. Many internal components are determined by multiple factors; e.g., (4.132)– (4.135) indicate that the MOSFET current is not only dependent on its terminal voltage Vds but also Vgs that exerts between gate and source. Therefore, partial derivatives which lead to both conductance and transconductance, respectively, are performed, e.g., for the former,

4.4 Nonlinear Dynamic Model

⎧ ⎪ ⎪0, ⎨

Gds

∂Ids sat = = ∂I ∂Vds ⎪ ∂Vds ⎪ ⎩ ∂Isat ∂Vds

173

2 2Vsat Vds −Vds 2 Vsat

+ 2Isat



Vsat −Vds 2 Vsat



= λK(Vgs − Vth )NF ET .

Vgs ≤ Vth , Vds ≤ Vsat Vds > Vsat

(4.150) The transconductance Gds,gs can also be derived by taking ∂Ids /∂Vgs in the same manner. Then, the companion current in branch F1 can be found as Idseq = Ids − Gds Vds − Gds,gs Vgs .

(4.151)

Being nonreciprocal between two nodes is one distinct feature of transconductance; e.g., other than Gds in F1 , node 3 also links to nodes 4 and 5 with -Gds,gs and Gds,gs , respectively, but the latter two nodes do not have any transconductance pointing to node 3. Therefore, virtual nodes 4 and 5 are introduced to reflect the unilateral relationship in branches F3 and F2 . Similarly, node 5 also has Gds,gs in F4 , while the reverse is not the case. The BJT companion model is conceptually the same, with its conductance expressed by GBJ T =

Vbe ∂Ib ISBJ T = e MBJ T VT , ∂Vbe MBJ T VT

(4.152)

and the accompanying current contribution in branch B1 as IBJ T eq = Ib − GBJ T Vbe .

(4.153)

As another three-terminal device, the BJT discretization also yields transconductance that demonstrates unidirectionality. Among the five branches, B1 and B2 are reciprocal between nodes 2–5 and 2–3, respectively; however, to nodes 3 and 5, B3 – B5 are unilateral, and therefore, two virtual nodes 3 and 5 are introduced. Table 4.3 lists all admittance G contributed by the two three-terminal nonlinear devices and the node-to-node relations where N-m denotes node m. The nonreciprocal relationship between different nodes is fully shown; e.g., when establishing the admittance matrix, node 3 contributes branch B3 to node 2, while the reverse is not the case. Therefore, the admittance matrix turns out to be asymmetrical. The nonlinear capacitors also distinguish themselves from linear elements. Since Cxy is solely dependent on its terminal voltage, the companion current is calculated in a manner given in (4.153), while the actual current is calculated by ICxy (t) =

QCxy (t) − QCxy (t − t) t

,

(4.154)

174

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Table 4.3 Admittance contributed by MOSFET and BJT BJT G FET G

B1 βB2 F1 Gds //Idseq

B2 GBJ T //IBJ Teq F2 Gds,gs

Node-to-node relation Node N-2 3 N-2 1 Bi + B5 N-3 B2 N-4 0 N-5 B1

 QCxy (t) = =

t

t0

B3 βGBJ T F3 -Gds,gs

N-3 B2 + B3 B2 + F 1 0 F 1 + B4

B4 -βGBJ T F4 Gds,gs

N-4 0 F3 0 F4

B5 -βGBJ T –

N-5 B1 + B5 F1 + F2 0 F 1 + F 4 + B1 + B4

Cxy (VJ∗N C )dVJ∗N C

 C0 [(κ − 1) · (V − C0 (ρV + B(1 +

exp(V ·A) ) + V ] + C0 (κ−1) , A A V 1−α ) − B),

Vdiff

VJ∗N C ≥ 0

VJ∗N C < 0

(4.155) where A and B are A= B=

α(1 − ρ)Vdiff , κ −1

(4.156)

(1 − ρ)Vdiff . 1−α

(4.157)

The equivalent conductance is expressed as GCxy (t) =

Cxy (VJ∗N C ) , t

(4.158)

just as other linear and nonlinear capacitors. For diffusion capacitors, the charge is first derived using integration QCdiff (t) = τ K(Vgs − Vth )NF ET (eV ·VT M

−1

− 1),

(4.159)

and then the current can be calculated according to variation of the charges in two neighboring time-steps, i.e., (4.154). The diode static I –V characteristic takes the same form as the nonlinear behavioral model (4.3) and can therefore be linearized identically.

4.4 Nonlinear Dynamic Model

175

As annotated, the entire IGBT model has eight nodes, meaning a single such device corresponds to an admittance matrix of 8 × 8 and a current vector with eight elements in the process of nodal voltage solution using the following equation: Gk (t) · Uk+1 (t) = Jk (t),

(4.160)

where k denotes the Newton-Raphson iteration count, since the nonlinearity requires the admittance matrix and the current vector can be updated in every calculation based on the nodal voltage vector at the same iteration Uk (t).

GsC

−GsC

0

0

0

0

0

0

−GsC

(bn + 1)GBJT + 1 +G +G sC + b R

−(bn + 1)GBJT −

−Geg

−Gec

−Gdif

−Gb

0

Gmosvgs

−Gmosvds −

0

0

0

0

0

0

0

0

P

Gec + Geg +

0

GCeb + Gdif −GBJT − R1 −GCeb

G=

0

0

0 0 0

P

−Geg

−bnGBJT − Gec

−Gdif −Gb 0

1 RP

− GCeb

Gmosvds + GBJT 1 RP

+ GCeb 0

−Gmosvds + bnGBJT

Gmosvgs 1 Rg

+ Ggs + Geg

−Gmosvgs − Ggs

0 0 0

− R1

P

− Ggs

Gmosvds + Gmosvgs + Ggs + Gsaux + Gec

Gsaux Gf wd +

1 Rg

.

+

1 Rg

−Gsaux

Gsaux + GsE +

−Gf wd

−GsE

0 0

0 0

Gdif −Gf wd −GsE

Gb + Gf wd 0

0 GsE

(4.161)

Ieq =

−IsC ,

−(bn + 1)Ibeq − Ieceq −Iegeq − ICcbeq + Idif eq − IsE ,

−Imoseq + Ibeq + ICebeq ,

Vs Rg

− Igseq

Imoseq + bnIbeq

+Iegeq ,

+Igseq + Ieceq −Isaux

Vs −If wdeq − R − g

Idif eq − Isaux + IsE

If wdeq

−IsE

T

.

(4.162)

4.4.2.1 Parallel Massive-Thread Mapping Like the other two nonlinear IGBT models, the nonlinear IGBT dynamic model also consists of many parts, which can be written into individual kernels like in the physics-based model, or an integral one to reduce global memory exchange frequency. Figure 4.17 inherits the nonlinear behavioral model kernel design method by using two generic portions, i.e., the model equations and the nonlinear solver. In this scenario, only five global variables are needed, four of which are both inputs and outputs, i.e., the nodal voltage vector v, IGBT gate-source voltage VGS , and nonlinear capacitor charges QCE , QCG , and QEB . The remaining global variable is vs , which, unlike its counterparts, does not involve Newton-Raphson iteration. Algorithm 4.5 shows the detailed process. Based on the nodal voltage vector, different parts of the dynamic model are calculated one after another to obtain the admittance and companion current of the discrete circuit. Then, all the elements are grouped as the admittance matrix and current vector which are used to solve the nodal voltage vector in Kernel1 . Variables like VGS , QCE , QCG , and QEB are stored in the global memory regardless of the convergence noticing that they will be

176

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

Node Voltage v(t), Gate Voltage vs(t)

VGS, QCE, QCG, QEB Kernel0

Global Memory Kernel1

IGBT Model

Nonlinear Solver

IGBT1

NS1

IGBT2

v(t)

... IGBTn

G(t) Ieq(t)

NS2

...

vs(t)

Yes

v(t)

v(t) converge? No

NSn

Fig. 4.17 Multi-thread parallel implementation of nonlinear dynamic IGBT model

used in the next computation, whereas vector v(t) is stored on condition of numerical convergence. It should be pointed out that like in other models, the gate voltage vs (t) always remains constant during an arbitrary time step.

4.4.3

Wideband SiC MOSFET Model

The MOSFET is another type of prevalent power semiconductor switch. Since in Fig. 4.16 the IGBT dynamic model combines the MOSFET with BJT, the SiC MOSFET model can be conveniently derived by removing circuit parts mainly related to the BJT, as given in Fig. 4.18. Therefore, all remaining parts have the same mathematical expressions other than the nonlinear resistance RN L , which represents the behavior of channel length modulation, expressed by its I –V characteristics as [158]   V V RV , tanh + I (V ) = R0 RV R1

(4.163)

where RV , R0 , and R1 can be extracted from the datasheet. The discrete time-domain EMT model also has eight nodes, and the admittance and companion current vector can be derived in a manner similar to its IGBT counterpart.

4.4 Nonlinear Dynamic Model

177

Algorithm 4.5 IGBT kernel procedure IGBT DYNAMIC MODEL FWD: Calculate VD Calculate Gf wd and If wd update If wdeq MOSFET: Calculate vgs and vds Compute Gmosvds and Gmosvgs Compute ids and Idseq BJT: Calculate veb Compute GBJ T and Ib Calculate IBJ T eq Capacitors: Calculate vC Compute GC and iC Update ICeq

 Kernel0

Complete module: Form matrices G and Ieq by (4.161)–(4.162) Solve matrix equation (4.160) Update v(t) Store VGS , QCE , QCG , and QEB into global memory if v(t) not converged then go to Kernel0 else Store v(t) into global memory

D

1

2

2

RNL G RG

4

6

3

Rb

FET

7

Cdif

RNL

Cdg G Rg

3

RA

(a)

5'

F1

5 RE+LE

Cgs S

8

FET

8

Rb

7

4

D

Cgs

1 RD

D

RD+LD

Cdg

 Kernel1

F2

F3 4' F4

D 5 RA

RE

S

(b)

CD

6

Fig. 4.18 SiC MOSFET dynamic modeling: (a) advanced dynamic model and (b) discrete nonreciprocal EMT companion circuit

178

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

4.4.4

IGBT Model Validation

A simple test of the IGBT dynamic model is carried out using a half-bridge submodule (HBSM) MMC under normal operation. In Fig. 4.19a–b, the upper IGBT switching transients under two different freewheeling times, as well as the power loss in one cycle, are compared. Commanding ON/OFF states of the IGBT without any freewheeling time as normally seen in the ideal switch model leads to a dramatic collector current overshoot that is detrimental to the converter. On the other hand, the inclusion of a 10% freewheeling time mitigates the turn-on surge, and the IGBT switching power loss reduces almost by one order of magnitude. These switching transients demonstrate a good agreement with results from ANSYS/Simplorer® , whose IGBT/diode model has been experimentally validated against real devices and shows a decent accuracy from both static and dynamic perspectives [159]. The accumulated energy losses under these two scenarios also show a good agreement with the validation tool in Fig. 4.19c. In contrast, the ideal switch model is unable to reveal power loss; neither could it reflect the relationship between current waveform shape and freewheeling time. Therefore, a high-order device-level model is unique for being able to assess a converter design.

4.5

Predefined Curve-Fitting Model

In the system-level simulation, the IGBT is idealized by the two-state switch model with fixed on- and off-state resistances. Therefore, fast simulation speed can be gained at the cost of fidelity. On the other hand, complex, nonlinear models involving a lot of device physics as introduced above are too burdensome to compute and, more often than not, suffer from numerical divergence, albeit they are more accurate. Therefore, the transient curve-fitting model (TCFM) with key parameters being dynamically adjustable becomes the best choice, which provides device-level information and also ensures computational efficiency [160, 161]. The TCFM introduced in this section is datasheet-driven, so its parameters can be easily extracted. It contains two parts: the static characteristics and the switching transients. The former is realized by a current-dependent resistor, whose value is calculated based on the static V –I characteristics available in the manufacturer’s datasheet, and by piecewise linearizing the terminal voltage VCE as a function of collector current IC , it can be expressed by rs (Ic , Vg , Tvj ) =

VCE = a1 (Ic , Vg , Tvj ) + a2 (Ic , Vg , Tvj ) · IC−1 , IC

(4.164)

where a1 and a2 are coefficients dependent on a number of factors, such as the gate voltage Vg and junction temperature Tvj . In the meantime, switching transients which include turn-on and turn-off processes are modeled separately. The parameters reflecting these two stages are the rise time tr and fall time tf , both of which are sensitive to Vg , Tvj , Rg , and IC

4.5 Predefined Curve-Fitting Model

400

700

iC /A, vCE/V

179

600

400

Ploss (kW)

vCE

500

ΔI

400

Ploss (kW)

vCE ΔI

300

tr

200

0

16ms

tr

0

16ms

iC

100 0

iC (a)

70

iC /A, vCE/V

500

70

vCE

400

Ploss (kW)

vCE

Ploss (kW)

300

tr

200

iC

0 16ms

100

tr

iC

0 16ms

0 0

0.5

1.5 2.0

2.5

3.0 3.5

4.0

t/μs 0 (b)

Proposed model

100

®

ANSYS/Simplorer

80

W/J

1.0

60

1.0

1.5 2.0

2.5

3.0 3.5

70

Proposed model

60

ANSYS/Simplorer®

50 40

IGBT2

40

0.5

4.0

t/μs

IGBT2

30

IGBT1

IGBT1

20

20

10 0

20

40

60

80

100 120 140 160

t/ms 0 (c)

20

40

60

80

100 120 140 160

t/ms

Fig. 4.19 IGBT switching transients ((a)–(b) left: physics model; right: ANSYS/Simplorer® model): (a) without freewheeling time, (b) 10% freewheeling time, and (c) accumulative energy loss with (L) and without (R) freewheeling time

and less affected by external circuits. Normally datasheets do not provide the times under different gate voltage; nevertheless, in actual application, Vg is typically fixed, and it is reasonable not to consider it. For the other three factors, the piecewise linear method can be applied to individual nonlinear curves. However, when all of them are taken into consideration, the linear curve falls short of describing them. Therefore, according to Stone-Weierstrass theorem, the combined effect can be described by the following polynomial function:

180

4 Device-Level Modeling and Transient Simulation of Power Electronic Switches

G r (tr,f) Vg

vCg

isw1

C

ig

S1

Cg iC

S1

Vdc

rs vCE

(a)

iL

T1 T2

E

isw2

(b)

Fig. 4.20 IGBT curve-fitting model: (a) model equivalent circuit and (b) IGBTs in a bridge structure

tr,f (s1 , s2 , s3 ) = k0 ·

2 , i=1

si +

i=j  i,j =1,2,3

ki si sj + +

3 

bi si + b0 ,

(4.165)

i=1

where the variable s_i represents one of the three factors and k_i, b_i are constants. The turn-on and turn-off waveforms of the 5SNA 2000K450300 StakPak IGBT Module are obtained from a bridge-structure test circuit [162], which ensures an electromagnetic environment identical to that of most power converters, meaning the impact of the IGBT freewheeling diode is automatically taken into account. A controlled current source i_C produces the transient currents, as given in Fig. 4.20a. Taking the turn-on current in Fig. 4.21a as an example: when the IGBT is ordered to turn on, V_g is taken as binary 1, so the capacitor C_g is charged with a time constant of \tau = r \cdot C_g. Stipulating that it takes the collector current i_C a time of \tau to reach its maximum, and combined with the fact that this part of i_C is virtually a straight line, r can be calculated from

\frac{(90\% - 10\%) I_C}{t_r} = \frac{(160\% - 0\%) I_C}{r \cdot C_g},    (4.166)

where C_g is set to 1 nF. Afterward, i_C rises much more slowly and eventually begins to drop; this part of the curve is approximately exponential, while v_{Cg} keeps rising. Depending on the required curve-fitting accuracy, multiple exponential sections can be used. Thus, the controlled current source can be generally expressed as

i_C(t) = \begin{cases} 0, & v_{Cg} = 1, \\ i_C(t-\Delta t) + k_1 \cdot \Delta t/\tau, & v_{Cg} \le 1 - e^{-2t_r/\tau}, \\ k_2 \left( i_C(t-\Delta t) - 1 \right) e^{-\Delta t/\tau} + I_C, & v_{Cg} > 1 - e^{-2t_r/\tau}, \end{cases}    (4.167)

where k_1 and k_2 denote the rise and fall rates of the current and I_C is the steady-state current, which is added to ensure the current does not fall to zero. After a sufficiently long time, the IGBT enters steady state and v_{Cg} is set to one, since its actual value is very close to it. In the meantime, the switch S_1 opens, and the IGBT is treated as a resistor. Similarly, when the turn-off process begins, V_g = 0 and v_{Cg} gradually falls from 1 V.


Fig. 4.21 5SNA 2000K450300 StakPak IGBT Module experimental (top) and simulated (bottom) switching transients: (a) turn-on waveforms, (b) turn-off waveforms, and (c) diode reverse recovery

The transient voltage curves can be obtained in the same manner, using v_{Cg} as the control signal. The steady-state value is critical in deciding the amplitude of the transient waveforms. For a power converter, it can be extracted from the external line current i_L and the DC bus voltage V_{dc}, because both change slowly and are the current and voltage that the IGBTs are actually subjected to. As Fig. 4.20b shows, since the two switches T_1 and T_2 of the bridge are complementary, and the simulated IGBT waveform already accounts for the reverse recovery of the opposite diode, only the active IGBT is calculated, while the terminal voltage and current of the opposite device follow from the nodal quantities; i.e., when T_1 is active, the voltage and current of T_2 are v_{CE2} = V_{dc} - v_{CE1} and i_{sw2} = i_L + i_{sw1}, respectively. The results are given in Fig. 4.21, where the simulated curves show good agreement with the experimental waveforms.

4.5.1 Model Validation

In order to retain IGBT device-level information, a maximum time-step of 1 μs is chosen for computing the TCFM. The switching transient of a semiconductor switch is reflected directly by the rise/fall time and ultimately affects the junction temperature. The curves simulated with (4.165) and the experimental results available in the datasheet under different collector currents are shown in Fig. 4.22a. Three linear sections are used for the t_f–I_C curve, while the rise time curve requires only two sections; when the y-axis is made logarithmic, as in the datasheet, both curves bend. The relationship between the IGBT average power loss and its switching frequency is drawn in Fig. 4.22b.



Fig. 4.22 IGBT device-level performance: (a) variation of turn-on and turn-off times and (b) averaged power loss under different switching frequencies

The average losses in both switches rise steadily with the frequency, as the transient power loss becomes increasingly significant. Figure 4.23 shows device-level results from a five-level MMC with a 16 kV DC bus, 2 kHz carrier frequency, and 2 kA, 60 Hz output current. The more severe power loss on the lower switch results in a higher junction temperature than that of its counterpart, since the upper diode and lower IGBT are subjected to a larger current, as shown in Fig. 4.23b, d, which also reveal intensive diode reverse recovery and IGBT turn-on overshoot currents that are not available in the ideal switch model. These device-level results agree with those from SaberRD, an industry-standard tool frequently consulted for guidance on power converter design evaluation.

4.6 High-Order Nonlinear Model Equivalent Circuit

All the aforementioned nonlinear high-order power semiconductor switch models contain a number of internal nodes, and each of them results in a high-order admittance matrix prior to the solution of nodal voltages. Meanwhile, as these IGBT/diode pairs are nonlinear, an iterative process is always required for numerical convergence. This twofold challenge slows down the micro-level simulation significantly even for a single-device study, let alone when a number of them are solved together, which could be far beyond the computational capability of current processors. Therefore, the network equivalence method is described in this section to expedite the simulation and improve its numerical stability. A typical example is given in Fig. 4.24a, where five IGBTs and seven diodes exist in a clamped double submodule (CDSM) taken from an MMC. This circuit unit therefore has up to 35 nodes, which result in a large 35 × 35 admittance matrix, accompanied by excessive iterations and a high chance of numerical divergence.


Fig. 4.23 IGBT device-level performance (TCFM (left); SaberRD (right)): (a) SM upper IGBT junction temperature, (b) upper switch current waveform, (c) SM lower IGBT junction temperature, and (d) lower switch current waveform



Fig. 4.24 Network equivalence: (a) a multi-device circuit unit example and (b) two-port Thévenin/Norton equivalent circuit derivation

It can be found that of all the nodes in a single CDSM, 30 are IGBT/FWD internal nodes, while the remaining five, i.e., nodes 1–5, are external nodes. The apparent fact that solving a five-node circuit is remarkably faster than its 35-node counterpart prompts a further simplification. During the iterative solution of a multi-IGBT/FWD circuit unit, the admittance matrix and current vector are updated in each calculation following a temporary solution of nodal voltages, meaning that the IGBT/FWD EMT model, including the conductance, transconductance, and companion current sources, is known. Then, the three-terminal device can be treated as a two-port network, since the gate voltage is constant within a time-step and consequently does not change any circuit parameters. Two sets of inputs are sufficient for deriving the IGBT/FWD network's equivalent circuit [163], as given in Fig. 4.24b. The first one is an open circuit that yields the Thévenin internal voltage V_{eq}. Noticing that one of the IGBT/FWD's eight nodes is grounded when solving the two-port circuit, the actual dimension of the admittance matrix reduces to seven. Using the following expanded form of (4.160), the terminal voltage U_{1o}, which is also V_{eq}, can be obtained:


\begin{bmatrix} U_{1o} \\ U_2 \\ \vdots \\ U_7 \\ 0 \end{bmatrix} = \begin{bmatrix} G_{11} & G_{12} & \cdots & G_{17} & G_{18} \\ G_{21} & G_{22} & \cdots & G_{27} & G_{28} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ G_{71} & G_{72} & \cdots & G_{77} & G_{78} \\ G_{81} & G_{82} & \cdots & G_{87} & G_{88} \end{bmatrix}^{-1} \cdot \begin{bmatrix} J_1 \\ J_2 \\ \vdots \\ J_7 \\ J_8 \end{bmatrix}.    (4.168)

In the above equation, the elements G_{ij} and J_i belong solely to the IGBT/FWD, and no external circuit is involved since the two-port network is in the open-circuit state. In the second circuit, a DC source I_s is imposed on the semiconductor switch. Thus, the external current is added to the first element of J, and U_{1s}, the first element of U, can be obtained from the following equation:

\begin{bmatrix} U_{1s} \\ U_2 \\ \vdots \\ U_7 \\ 0 \end{bmatrix} = \begin{bmatrix} G_{11} & G_{12} & \cdots & G_{17} & G_{18} \\ G_{21} & G_{22} & \cdots & G_{27} & G_{28} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ G_{71} & G_{72} & \cdots & G_{77} & G_{78} \\ G_{81} & G_{82} & \cdots & G_{87} & G_{88} \end{bmatrix}^{-1} \cdot \begin{bmatrix} J_1 + I_s \\ J_2 \\ \vdots \\ J_7 \\ J_8 \end{bmatrix}.    (4.169)

Taking the two-port IGBT/FWD network as a Norton circuit, its resistance and companion current can be derived as

R_{eq} = \frac{U_{1s} - U_{1o}}{I_s},    (4.170)

I_{eq} = \frac{U_{1o}}{R_{eq}}.    (4.171)

Solving a seventh-order matrix equation, rather than directly merging components, is adopted because several suspended nodes in the EMT model make it incomplete. Following the derivation of the IGBT/FWD equivalent network, the iterative Newton-Raphson process reduces to calculating each IGBT/FWD with a dimension of seven, rather than the overall circuit involving many nonlinear high-order semiconductor switches. For example, the solution of the CDSM-MMC consequently becomes solving five individual IGBTs first and then the five-node circuit unit.
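The reduction in (4.168)–(4.171) amounts to two linear solves of the same seventh-order system with different right-hand sides. The sketch below illustrates this under the assumption that a dense 7 × 7 solver is available; solve7() is a placeholder, not a routine from the text, and the array layout is illustrative.

/* Two-port Norton reduction of one IGBT/FWD following (4.168)-(4.171).
 * solve7() stands for any dense 7x7 linear solver (e.g., LU factorization
 * with forward/backward substitution) and is assumed to exist elsewhere. */
void solve7(const double G[7][7], const double J[7], double U[7]);

void norton_equivalent(const double G[7][7], const double J[7],
                       double Is, double *Req, double *Ieq)
{
    double U_open[7], U_src[7], J_src[7];

    solve7(G, J, U_open);                  /* open-circuit solve: U1o = Veq */

    for (int i = 0; i < 7; ++i) J_src[i] = J[i];
    J_src[0] += Is;                        /* inject the probing source Is  */
    solve7(G, J_src, U_src);               /* second solve: U1s             */

    *Req = (U_src[0] - U_open[0]) / Is;    /* (4.170) */
    *Ieq =  U_open[0] / (*Req);            /* (4.171) */
}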

4.7 Summary

This chapter introduced three nonlinear device-level power semiconductor switch models for the electromagnetic transient simulation of power electronic converters. The IGBT behavioral model imitates the performance of an actual device by using fundamental circuit components, and the dynamic model describes the IGBT from a structural point of view, while their physics-based counterpart involves detailed device physics. The high accuracy achieved is accompanied by a heavy computational burden owing to the high numerical order of the models and the repetitive Newton-Raphson iterations required to solve the nonlinear nodal equations. To expedite the simulation, network equivalence was introduced, which enables an independent solution of each nonlinear high-order device. Compared with other categories of power semiconductor switch models, these device-level models provide insight into electrothermal interaction, voltage and current stresses, etc., which is critical for power converter design evaluation, as demonstrated in the validations based on the submodule of a modular multilevel converter.

5 Large-Scale Electromagnetic Transient Simulation of DC Grids

5.1 Introduction

High-voltage direct current (HVDC) systems based on the modular multilevel converter (MMC) have gained tremendous attention, and with the extensive application of this technology, multiterminal DC (MTDC) grids can be formed by interconnecting several point-to-point HVDC converter stations directly or indirectly [164]. Prior to carrying out on-site commissioning tasks, the secondary control and protection strategies of the converter need to be designed and subsequently tested. Electromagnetic transient (EMT) simulation provides a platform with which the framework of appropriate control and protection strategies can be ascertained and the system interactions and performance can be studied; it is thus essential to computer-aided design (CAD) and development of MTDC grids. Currently, the central processing unit (CPU) is the dominant processor for a variety of electrical power system CAD tools, such as Multisim™, PLECS®, PSIM, PSpice®, SaberRD®, and PSCAD™/EMTDC™, where off-line time-domain EMT simulations of power electronic circuits and power systems can be conducted. As the transients of power semiconductor switches draw increasing concern due to the need to develop efficacious control and protection strategies, it is imperative to include high-order nonlinear device-level models to gain adequate insight into the converter operation status, which consequently aids a proper selection of devices before prototype testing [165]. This has been widely seen in small-scale converter CAD scenarios where experimentally validated models are included in packages such as SaberRD® and PSpice® [166–172]. When power converters scale up to withstand higher voltages and larger currents for power system applications, the dozens or even hundreds of power semiconductor switches instantly overwhelm the computational capacity of the CPU because of its sequential processing manner. With numerous circuit nodes linking linear and nonlinear components, the simulation is either subjected to numerical divergence or an extraordinarily long solution time, which accounts for the absence of nonlinear IGBT and diode models of the MMC in MTDC grid studies. Hence, it forces the investigation of other methods such as model-order reduction, with the subsequent loss of device or equipment model details. Among them, MMC models with ideal characteristics, such as the system-level averaged value model, the detailed equivalent model (DEM), and its variants based on the two-state switch model (TSSM) [173–175], are widely adopted by mainstream EMT-type simulation tools such as PSCAD™/EMTDC™ [176–178].

The necessity of time-domain EMT computation of gigantic systems such as an MTDC grid on the graphics processing unit (GPU) is thus manifest. As a more advanced parallel processing platform than the multi-core CPU, the GPU has been employed for massively parallel EMT computation of power systems and power electronic circuits [89, 133, 179–182], and a remarkable speedup is attainable in cases where an explicit homogeneity prevails. Therefore, in this chapter, GPU massive parallelism is explored to expedite the simulation of MTDC grids involving micro-level details. Since the submodules (SMs) containing nonlinear IGBT/diode models contribute the main computational burden, topological reconfiguration (TR) based on a pair of coupled voltage-current sources is carried out to create a fine-grained network from the original system structure by separating all MMC submodules; their similarity enables them to be written as one global function in the GPU programming language CUDA C/C++ [183] and executed in parallel by multiple computational blocks and threads. It should be pointed out that the main challenge to the efficiency of parallel EMT simulation on the GPU is the topological irregularity of the electrical system: many component types appear only in small quantities, for which the CPU outperforms its many-core counterpart, considering that the latter processor has a lower clock frequency. As a result, for a complex system such as the MMC-based MTDC grid, which possesses both homogeneity and inhomogeneity, neither a pure CPU nor a pure GPU implementation achieves maximum efficiency. As a prelude to the heterogeneous high-performance computing introduced in Chap. 6, this chapter concludes with a description of a CPU/GPU co-simulation platform, a general solution for simulating complex device-level electrical systems, noting that nowadays a GPU is quite common in a workstation or personal computer.

5.2 Generic MTDC Grid Fine-Grained Partitioning

Figure 5.1 shows the CIGRÉ B4 DC grid test system [164] containing three DC subsystems (DCS) as a typical testbench for GPU-based EMT simulation. Its details are listed in Appendix A.4.1, and the converter stations are numbered in the figure. The power flowing out of a station is defined as positive. It is impractical to simulate this gigantic system with device-level modeling directly since it corresponds to an admittance matrix of enormous size. Therefore, topological reconfiguration based on three levels of circuit partitioning is applied to different locations of the DC grid, creating a substantial number of identical but independent components that can be mapped to the massively parallel architecture of the GPU.


Fig. 5.1 CIGRÉ B4 DC grid test system

5.2.1 Level-One Partitioning: Universal Line Model

The transmission line linking one piece of electrical equipment with another provides an inherent circuit partitioning method for the power system, owing to the time delay induced by traveling waves. As introduced in Chap. 3, the universal line model (ULM) is able to describe all underground cable and overhead line geometries accurately in the phase domain. The graphic form it takes for EMT computation is the same as for other line models, such as the Norton equivalent circuit shown in Fig. 5.2a, where the history current of an arbitrary terminal is expressed by (3.23). As the equation set (3.23) shows, the history item at terminal k is a function of the current at terminal m and vice versa, meaning the two terminals interact and the traveling wave is manifested by the travel time τ. With I_{hk}, I_{hm}, and the admittance G_Y known, the terminal voltage v_{k/m} can be obtained by solving the circuit where the transmission line or cable is located.



Fig. 5.2 General form of a universal line model: (a) Norton equivalent circuit and (b) the Thévenin equivalent circuit

Then, the terminal current can be calculated from (3.22), where i_k and i_m are used to calculate the incident current, by which the history currents can be updated as in (3.36). To facilitate circuit computation, the Norton equivalent circuit of the ULM is converted to its Thévenin counterpart, as shown in Fig. 5.2b. Consequently, the voltages V_{hk} and V_{hm} participate in the circuit EMT computation, whereas the update of the ULM's parameters is undertaken with the currents I_{hk} and I_{hm}.

5.2.2 Level-Two Partitioning: TLM Link

In the absence of physical transmission lines, circuit partitioning is still possible so long as the circuit section exhibits phenomena that a transmission line has. For instance, when the current variation of an inductor or the voltage change of a capacitor is very small within one simulation time-step, the element can be taken as a lossless transmission line [184], and the general forms in Fig. 5.2 still apply. Compared with the ULM, the link based on the transmission line modeling (TLM) technique is mathematically simpler; e.g., the surge impedance Z and the history item are

Z = \begin{cases} Z_C = \dfrac{\Delta t}{C}, & \text{capacitor}, \\[4pt] Z_L = \dfrac{L}{\Delta t}, & \text{inductor}, \end{cases}    (5.1)

V_{h(k/m)} = 2 v^{i}_{k/m},    (5.2)

where v^{i}_{k/m} is the incident pulse, updated by

v^{i}_{k}(t) = v_m(t - \Delta t) - v^{i}_{m}(t - \Delta t),    (5.3)

v^{i}_{m}(t) = v_k(t - \Delta t) - v^{i}_{k}(t - \Delta t).    (5.4)

The exchange of m and k in the subscripts reflects a transmission time delay of \Delta t, which is therefore set as the simulation time-step in this partitioning scheme.
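As a minimal sketch of how (5.1)–(5.4) translate into per-link state and a per-step update, the following C-style fragment is given; the struct layout and function names are illustrative, not taken from the text.

/* Minimal TLM-link sketch following (5.1)-(5.4): a lumped element is replaced
 * by a stub of surge impedance Z whose history voltage is twice the incident
 * pulse; swapping k and m realizes the one-time-step travel delay. */
struct TlmLink {
    double Z;        /* surge impedance: dt/C (capacitor) or L/dt (inductor) */
    double vk_inc;   /* incident pulse at terminal k */
    double vm_inc;   /* incident pulse at terminal m */
};

/* History (Thevenin) voltages used in the circuit solution of this step. */
static inline double tlm_vh_k(const struct TlmLink *l) { return 2.0 * l->vk_inc; }
static inline double tlm_vh_m(const struct TlmLink *l) { return 2.0 * l->vm_inc; }

/* After the terminal voltages vk, vm have been solved, exchange the pulses. */
void tlm_update(struct TlmLink *l, double vk, double vm)
{
    double vk_new = vm - l->vm_inc;   /* (5.3) */
    double vm_new = vk - l->vk_inc;   /* (5.4) */
    l->vk_inc = vk_new;
    l->vm_inc = vm_new;
}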

5.2.3 MTDC Multi-Level Partitioning Scheme

The apparent irregularity of the CIGRÉ DC grid, a mixture of homogeneity and inhomogeneity, poses a major challenge to kernel-based massive parallelism. To obtain a circuit structure suitable for concurrent GPU execution and a consequent speedup, the DC grid is partitioned from the system level down to the component level. The first level, as introduced, is the natural separation of converter stations by DC transmission lines. As each station connects to a different number of other stations, the configuration of the DC yards also varies. In Fig. 5.3, the partitioning of the four-terminal DCS2 is shown as an example, where ULMs are discretized to separate one station from another. Despite that, from a mathematical point of view, a converter station still corresponds to an admittance matrix with a huge dimension, since both the MMC and the DC yard where the DC circuit breaker is located contain hundreds of nodes. Thus, facilitated by the MMC DC bus, which can be deemed a TLM link, the second level of circuit partitioning is applied to detach them. The third level of topological reconfiguration is introduced to separate the MMC submodules from their arms. It is noticed that some virtual branches are created in the DC yards to enable all of them to have the same configuration. Identical names such as Z_x and V_x are assigned to components at the same position in different DC yards, which enables programming the DC yard as one GPU kernel and storing these signals in arrays. However, distinct circuit topologies lead to different computation algorithms. For example, the DC yards of Cm-B3 and Cm-F1 can be uniformly written as

\mathbf{I}_m = \begin{bmatrix} \sum_{i=1}^{2} Z_{HHBi} + Z_x + Z_y & -Z_{HHB2} - Z_y \\ -Z_{HHB2} - Z_y & Z_T + Z_{HHB2} + Z_y \end{bmatrix}^{-1} \times \begin{bmatrix} V_y + \sum_{i=1}^{2} (-1)^i V_{HHBi} - V_x \\ 2v^{i}_{m} - V_{HHB2} - V_y \end{bmatrix},    (5.5)

where Z_{HHB} and V_{HHB} are the equivalent impedance and voltage contribution of a hybrid HVDC circuit breaker (HHB), Z_{x/y} and V_{x/y} belong to the ULM, and v^{i}_{m} together with the impedance Z_T constitutes the TLM link model of a DC capacitor. On the contrary, in single-line DC yards there is only one actual loop, and the mesh current can be obtained more conveniently by the algebraic equation

I_{m1} = \frac{2v^{i}_{m} - V_{HHB1} - V_x}{Z_T + Z_{HHB1} + Z_x},    (5.6)

since it is obvious that the mesh current I_{m2} = 0.
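Because (5.5) is only a 2 × 2 system per DC yard, it can be solved with a closed-form inverse rather than a general solver. The fragment below is an illustrative sketch; the entries a, b, d and the right-hand side are assumed to have been assembled from Z_{HHB}, Z_x, Z_y, Z_T and the corresponding history voltages as in (5.5).

/* Closed-form solution of the symmetric 2x2 mesh system in (5.5); 'a' and 'd'
 * are the diagonal entries, 'b' the (equal) off-diagonal entry, and rhs1/rhs2
 * the right-hand-side vector. */
void solve_dcyard_mesh(double a, double b, double d,
                       double rhs1, double rhs2,
                       double *Im1, double *Im2)
{
    double det = a * d - b * b;          /* assumed nonsingular */
    *Im1 = ( d * rhs1 - b * rhs2) / det;
    *Im2 = (-b * rhs1 + a * rhs2) / det;
}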



Fig. 5.3 Circuit partitioning of DCS2 by transmission line models

In the overall CIGRÉ B4 DC grid, one station may connect to up to three other stations; e.g., Cb-A1 is connected with Cb-C2, Cb-B1, and Cb-B2. To enable all DC yards to be computed by one kernel and thus achieve high parallelism, rather than by two kernels with lower parallelism, the standard DC yard is given three branches denoted by the subscripts x, y, and z. Thus, the virtual line currents in a double-branch DC yard and a single-branch DC yard are I_z = 0 and I_y = I_z = 0, respectively, whereas the actual branch currents are determined from the mesh currents calculated by either (5.5) or (5.6). Comparing the ULM voltage variable names in Figs. 5.2b and 5.3, the names in the latter figure need to be sorted. For a pair of coupled DC yards, the one with the smaller station number is defined as terminal k. Thus, the station variables V_{x/y/z} and I_{x/y/z} can be mapped to those of the ULM, i.e., v_{k/m} and I_{h(k/m)}. Taking DCS2 for example, between MMC0 and MMC2 is the DC line L_1; thus, V_{y0} = V_{hk1} and V_{y2} = V_{hm1}, where the numbers in the subscripts on the left and right sides denote the MMC number and line number, respectively. By parity of reasoning, for the other two lines the relationship between station variables and line variables is V_{x0} = V_{hk0}, V_{x1} = V_{hm0}, V_{x2} = V_{hk2}, and V_{x3} = V_{hm2}, while the variables in virtual branches receive no effective assignment. On the other hand, updating the ULM's information requires its terminal voltage and current, which obey the same rule. The former is calculated by

v_{k,m} = I_{x/y} \cdot Z_{x/y} + I_{h(x/y)} \cdot Z_{x/y},    (5.7)

and the latter, i_k and i_m, are chosen from either I_x or I_y. The GPU kernel for the DC yard, which also includes the ULM, is designed in Fig. 5.4.


Fig. 5.4 DC yard kernel structure and variable sort algorithm

The relationship between DC yard and ULM variables is realized in CUDA C in the form of a device function, so it can be accessed directly by the global functions. The ULM part contains six kernels, as each of them has a different grid size. After the variable sort process described above, the ULM equations can be processed in the corresponding kernels without distinguishing the terminals, which improves the parallelism. The introduction of redundant branches enables all DC yards to have the same inputs and outputs, facilitating concurrent computation on the GPU.
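The variable sort for DCS2 described above (V_{y0} = V_{hk1}, V_{x0} = V_{hk0}, and so on) can be pictured as a small gather step. The following CUDA C fragment is only an illustrative sketch of such a __device__ mapping for the history currents of DCS2; the array names and the hard-coded index table are assumptions for illustration, not the book's code.

/* Illustrative __device__ mapping in the spirit of Fig. 5.4 for DCS2:
 * DC-yard history currents (Ihisx/Ihisy, indexed by station) are gathered
 * into the ULM terminal arrays (Ihisk/Ihism, indexed by line). */
__device__ void sort_dcy_to_ulm(const double *Ihisx, const double *Ihisy,
                                double *Ihisk, double *Ihism)
{
    Ihisk[0] = Ihisx[0];  Ihism[0] = Ihisx[1];   /* line L0: MMC0 - MMC1 */
    Ihisk[1] = Ihisy[0];  Ihism[1] = Ihisy[2];   /* line L1: MMC0 - MMC2 */
    Ihisk[2] = Ihisx[2];  Ihism[2] = Ihisx[3];   /* line L2: MMC2 - MMC3 */
}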

5.2.4 Level-Three Partitioning: Coupled Voltage-Current Sources

5.2.4.1 MMC Internal Reconfiguration

The detailed models of both the MMC and the HHB contain many nodes, and the computational burden is still high despite level-two partitioning. Therefore, a further reconfiguration is carried out on them; in this part, the MMC is first taken as an example to illustrate the mechanism. As Fig. 5.5a shows, an (N + 1)-level, three-phase MMC can operate as an AC-DC, DC-AC, or DC-DC interface in the multiterminal DC grid. To withstand high voltages, there should be a sufficient number of submodules, usually in the range of dozens to a few hundred. Thus, in EMT simulation, simplification of the converter is necessary even if the submodule employs the TSSM as its switching element.



Fig. 5.5 (a) Three-phase MMC topology and (b) computational structure for one phase with V –I coupling

Otherwise, the speed will be extremely slow due to the large admittance matrix caused by hundreds of nodes. The Thévenin equivalent circuit is an efficient method for eliminating redundant nodes when the switches are idealized; nevertheless, when complex switch models are adopted for device-level information, this method is no longer applicable. The fact that the arm currents alternate at a frequency much lower than the digital simulation rate indicates that the third-level partitioning, which detaches all submodules from their arms by a pair of coupled voltage-current sources (termed V–I coupling) to constitute the individual subsystems given in Fig. 5.5b, will not affect the accuracy. Consequently, the reconfigured MMC has two parts: the submodules, which may be linear or nonlinear, and the remaining MMC frame, which is always linear. Therefore, on the arm side, all voltage couplings V_{pi} are summed up and then participate in the MMC main circuit computation; on the other hand, the submodules and current couplings J_{si} constitute independent sub-circuits and may be subjected to N-R iterations. Thus, at time instant t, the arm and SM sides are solved by Kirchhoff's voltage law and current law,

\mathbf{J}_p(t) = \mathbf{Z}_p^{-1} \cdot \mathbf{V}_p(t),    (5.8)

\mathbf{V}_s(t) = \mathbf{G}_s^{-1} \cdot \mathbf{J}_s(t),    (5.9)


Fig. 5.6 (a) EMT model of a three-phase MMC main circuit and (b) a general controller scheme for various control targets

Here, J_p and J_s are the mesh current vectors on the corresponding sides, and V_p and V_s are the nodal voltages, respectively. The solutions of (5.8) and (5.9) are independent, as their only bond is the interaction that occurs when they exchange arm currents and SM terminal voltages. To be more specific, the arm currents extracted from J_p are sent to the SM sides and become J_{si}; in return, V_s is sent to the arm side and taken as the individual V_{pi}, meaning the solutions of the next time-step rely on the other side:

\mathbf{J}_p(t + \Delta t) = \mathbf{Z}_p^{-1} \cdot f_1\!\left(\mathbf{G}_s^{-1} \mathbf{J}_s(t)\right),    (5.10)

\mathbf{V}_s(t + \Delta t) = \mathbf{G}_s^{-1} \cdot f_2\!\left(\mathbf{Z}_p^{-1} \mathbf{V}_p(t)\right),    (5.11)

where the functions f_1 and f_2 are used to derive the submodule voltages and arm currents. The relative independence gained by introducing a unit time-step delay shortens the simulation time, as the partitioned MMC corresponds to a collection of small matrices that can be processed more efficiently, particularly when the hardware architecture supports parallel computation. Figure 5.6a summarizes the MMC structure for all three types of conversion, i.e., rectifier, inverter, and DC-DC converter. The existence of the two capacitors C_1 and C_2 leads to a stable DC voltage, which justifies the second-level separation by the TLM link. In the MMC main circuit, each arm is composed of cascaded voltage sources and an inductor, and it can be converted into a Norton equivalent circuit, with the equivalent conductance and its companion current given by

G_p = \frac{1}{Z_{Lu,d} + r_{arm}},    (5.12)


J_p = G_p \cdot \left( \sum_{i=1}^{N} V_{pi} + 2 v^{i}_{Lu,d} \right),    (5.13)

where r_{arm} is the resistance associated with the arm inductor. The three-phase transformer model introduced in Chap. 3 is used as the interface between the AC and DC grids. Then, a universal form for an N-node system containing the transformer can be written as

\left[ \mathbf{v}_T \,\middle|\, v_{ext7}, \ldots, v_{extN} \right]^{T} = \left( \begin{bmatrix} \mathbf{G}_T & 0 \\ 0 & 0 \end{bmatrix}_{N \times N} + \mathrm{diag}\!\left[ G_{ext1}, \ldots, G_{extN} \right] \right)^{-1} \cdot \left( \left[ \mathbf{I}_{his} \,\middle|\, \mathbf{0} \right]^{T} + \left[ J_{ext1}, \ldots, J_{extN} \right]^{T} \right),    (5.14)

where the three-phase transformer corresponds to the first six nodes, and the elements with subscript ext are contributed by its surrounding components. The external conductances form a diagonal matrix, which is added to the inherent transformer admittance matrix. Similarly, the current contribution vector from the outer system is combined with that of the transformer. Thus, the interaction between the transformer and its neighboring circuits can be obtained by solving the above equation. For rectifiers and inverters, the MMC frame has eight nodes, six of which are introduced by the transformer, while in the front-to-front MMC-based DC-DC converter the total number of nodes reaches ten since the topology is symmetrical. In either scenario, the MMC frame can be solved instantly without iteration using (5.14), owing to its linearity. In Fig. 5.6b, the MMC controller, which contains two loops, is given. Depending on the actual demand, the outer-loop controller, which is based on the d-q frame, compares various feedback quantities, such as DC and AC voltages and active and reactive powers, with their references. Then, the inverse Park's transformation restores the signals to three phases, and the inner-loop controller employing phase-shift control (PSC) [185] regulates the individual phases. Since the main circuit part is purely system-level, a large time-step in the range of a few microseconds is applicable so that the simulation can proceed faster.
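Before mapping this structure to GPU kernels, the per-time-step flow implied by (5.8)–(5.11) can be sketched as follows. The routines solve_arm_frame(), solve_submodule(), and arm_of_sm() are placeholders for the linear arm-frame solution, the (possibly Newton-Raphson) submodule solution, and the SM-to-arm index mapping; none of them are spelled out in the text, so the sketch only illustrates the one-step-delayed exchange.

/* One decoupled time step of the V-I coupled MMC: the arm side uses last
 * step's SM voltages, the SM side uses last step's arm currents, so the two
 * solves are independent and can run concurrently. */
void solve_arm_frame(const double *Vp, double *armCurrent);  /* Eq. (5.8)  */
double solve_submodule(double injectedArmCurrent);           /* Eq. (5.9)  */
int arm_of_sm(int smIndex);                                  /* SM -> arm  */

void mmc_vi_coupled_step(int nSM, double *Vp /* SM voltages */,
                         double *Js /* injected arm current per SM */)
{
    double armCurrent[6];                    /* six arms of a 3-phase MMC  */
    solve_arm_frame(Vp, armCurrent);         /* arm side, uses old Vp      */

    for (int i = 0; i < nSM; ++i)
        Vp[i] = solve_submodule(Js[i]);      /* SM side, uses old Js       */

    for (int i = 0; i < nSM; ++i)
        Js[i] = armCurrent[arm_of_sm(i)];    /* exchange for the next step */
}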

5.2.4.2 MMC GPU Kernel

As can be seen, extensive symmetries exist in the three-phase MMC. The three phases have an identical topology, and so do the three MMC inner-loop controllers, which contain three parts: generation of carriers, averaging control (AVC), and balancing control (BC), as shown in Fig. 5.7a. In a CPU simulation, several identical algebraic operations have to be conducted repeatedly due to the sequential implementation, e.g., the definition of N carriers, the summation of all DC capacitor voltages, and the IGBT gate voltage V_g generation for 2N SMs, all of which prolong the simulation time. The PSC computational structure on the GPU, however, sees a dramatic simplification: the controller can largely be implemented in parallel. Accordingly,


Fig. 5.7 MMC inner-loop control for single phase: (a) phase-shift control in CPU and (b) massive-thread parallel structure of PSC and SM on GPU

the PSC is composed of several kernels, and the signals transmitted between them are placed in global memory so they can be accessed by the other kernels. After the SM DC voltages are obtained, they are summed up by multiple threads, as indicated in Fig. 5.7b. For one phase, the CPU needs to conduct the add operation (2N − 1) times, while on the GPU the number of sequential operations is

N_{sum} = \log_2(2^M), \quad 2^{M-1} < 2N \le 2^M.    (5.15)

Thus, for an arbitrary MMC level, if 2N is less than 2^M, the extra inputs are padded with zeros so that in the addition process defined in Kernel_0 an even number of variables is always retained until the last operation. The bulk of the averaging control is realized by Kernel_1, which corresponds to one MMC phase. Its output is sent to Kernel_2, where a massive-thread parallel implementation of the balancing control is carried out. It should be pointed out that the repeated definition of the carriers, as in the CPU, can be avoided; instead, they are defined only once in each thread and stored in global memory.
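A generic shared-memory tree reduction captures the idea behind Kernel_0 and (5.15); the kernel below is a standard CUDA pattern written for illustration and is not the authors' exact code. It assumes one block per phase, n2M threads where n2M = 2^M is the next power of two above 2N, and n2M * sizeof(double) bytes of dynamic shared memory (i.e., 2N ≤ 1024).

/* Tree reduction over the 2N capacitor voltages of one phase: inputs beyond
 * 2N are padded with zero so the active count stays even at every level,
 * giving M sequential steps as in (5.15). */
__global__ void sum_vc(const double *vC, double *sumOut, int n2N, int n2M)
{
    extern __shared__ double s[];
    int tid = threadIdx.x;

    s[tid] = (tid < n2N) ? vC[tid] : 0.0;     /* zero padding up to 2^M */
    __syncthreads();

    for (int stride = n2M / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        *sumOut = s[0];                       /* Sigma vC for this phase */
}

/* Example launch for one phase:
 *   sum_vc<<<1, n2M, n2M * sizeof(double)>>>(d_vC, d_sum, 2 * N, n2M);    */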



Fig. 5.8 Hierarchical dynamic parallelism implementation of a three-phase HVDC converter

Finally, Kernel_3 receives the output array V_g, and the submodule calculation is conducted. As in the previous kernels, each thread corresponds to one SM, enhancing the computational efficiency significantly by reducing the number of identical calculations. The output array v_C is written to global memory so that Kernel_0 can read it when a new simulation time-step starts. As an actual MMC has three phases, the above massively parallel structure needs an extension. From a circuit point of view, all six arms share the same configuration, and so do the 6N submodules. With regard to the controller, Kernel_0 and Kernel_1 of the PSC have three copies, each corresponding to one phase, while Kernel_2 is launched as a compute grid of 6N threads on the GPU. Therefore, a new HVDC converter kernel, which contains the PSC kernels (Kernel_0–Kernel_2), the SM kernel, and the MMC main circuit kernel based on (5.14), is constructed using the dynamic parallelism feature, as Fig. 5.8 shows. Compared with the previous GPU computational architecture, launching new compute grids from the GPU, rather than from the host CPU, enables a flexible expansion of the number of HVDC converters to constitute an MTDC system. Meanwhile, the number of threads in each compute grid matches that of the actual circuit part. For example, there is one outer-loop controller and one MMC main circuit, and accordingly the kernels PQC and MMC both invoke only one thread, whereas Kernel_3 for the SMs launches a grid of 2N threads. In the host, after initialization, the HVDC converter kernel is launched by the CPU, which then invokes the six kernels with different CUDA grid scales. The input and output signals of each kernel are stored in global memory.
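A skeleton of this hierarchical launch might look as follows in CUDA C/C++; the kernel names mirror Fig. 5.8, but their bodies, the State structure, and the launch configurations are placeholders rather than the book's code. Dynamic parallelism requires compilation with relocatable device code (nvcc -rdc=true) on a device of compute capability 3.5 or higher; child grids launched by a block into the default stream execute in launch order, and the parent grid does not complete until all of its children have finished.

struct State { /* SM voltages, gate signals, controller states, ... */ };

/* Child kernels corresponding to Fig. 5.8; bodies omitted in this sketch. */
__global__ void kernel0_sum(State *s) { /* Sigma vC per phase           */ }
__global__ void kernel1_avc(State *s) { /* averaging control            */ }
__global__ void kernel2_bc (State *s) { /* balancing control            */ }
__global__ void kernel3_sm (State *s) { /* submodule solution           */ }
__global__ void kernel4_pqc(State *s) { /* outer-loop controller        */ }
__global__ void kernel5_mmc(State *s) { /* MMC main circuit, Eq. (5.14) */ }

/* Parent kernel: one block per converter station launches its child grids. */
__global__ void hvdc_kernel(State *stations, int N /* SMs per arm */)
{
    State *s = &stations[blockIdx.x];

    kernel0_sum<<<3, 2 * N>>>(s);      /* one block per phase            */
    kernel1_avc<<<3, 1>>>(s);
    kernel2_bc <<<1, 6 * N>>>(s);
    kernel3_sm <<<1, 2 * N>>>(s);
    kernel4_pqc<<<1, 1>>>(s);
    kernel5_mmc<<<1, 1>>>(s);
    /* implicit synchronization: parent completes after all children finish */
}

/* Host side (illustrative): hvdc_kernel<<<nStations, 1>>>(d_stations, N);   */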

5.2.4.3 Hybrid HVDC Circuit Breaker Modeling

Figure 5.9a describes a typical structure of the hybrid HVDC breaker, which contains a large number of repetitive circuit units that can potentially yield a high degree of parallelism.


Fig. 5.9 HHB EMT models: (a) the conventional full model and (b) partitioned full model with reduced order

The breaker contains six major parts: the ultrafast disconnector (UFD) and the load commutation switch (LCS) constitute the conducting path under normal operation; the main breaker (MB) and the metal-oxide varistor (MOV) are used to divert the fault current; and the current-limiting inductor L and the residual current breaker (RCB) are installed to inhibit a large DC current after a line fault and to protect the MOV, respectively. The conventional full model, containing as many components as a real HHB, rather than a simplified model with a significantly reduced number of components [186, 187], is preferred in EMT simulation of the MTDC system as it features higher fidelity and gives more details. The consequent issue is that the huge number of nodes leads to an admittance matrix of enormous dimension that is too burdensome to solve, particularly in an MTDC system containing many HHBs. As in the MMC, the V–I coupling is applied as the third-level decoupling method, as shown in Fig. 5.9b; this is reasonable because the current-limiting inductor curbs abrupt current changes in the transmission corridor. Therefore, the HHB unit composed of the inserted current source J_s, the MOV, the LCS, and the MB becomes independent of the DC yard. This physically isolated structure results in many small matrix equations that can be calculated efficiently by parallel cores, and in particular it caters to the massive processing units of the GPU. The MOV plays a significant role in protecting the IGBTs against overvoltage and quenching residual energy in the transmission path. Under low voltage it behaves as a large resistor, while its resistance plummets when the terminal voltage reaches the protection level. Thus, it is a nonlinear component whose V–I characteristic takes the form

v_M = k_M \left( \frac{i_M}{I_{ref}} \right)^{\alpha_M^{-1}} \cdot V_{ref},    (5.16)

where kM and αM are constants, Vref is the protection voltage, and Iref is the corresponding current. Then, the equivalent conductance and the current contribution of its EMT model can be derived by


G_M = \frac{\partial i_M}{\partial v_M} = \frac{\alpha_M I_{ref}}{k_M V_{ref}} \cdot \left( \frac{v_M}{k_M V_{ref}} \right)^{\alpha_M - 1},    (5.17)

I_{Meq}(t + \Delta t) = i_M(t) - G_M v_M(t).    (5.18)

As can be seen, the MOV's conductance still depends on its terminal voltage v_M, which is itself the outcome of the EMT calculation, meaning that an appropriate G_M may not be found immediately. Therefore, the iterative Newton-Raphson method is adopted to obtain convergent results. For a specific application, these nonlinear functions are piecewise linearized to accelerate the simulation. To ensure sufficient accuracy, the overall V–I relation is divided into a few voltage segments so that in each segment G_M can be taken as a constant. The power loss of the LCS is mainly induced by steady-state conduction, whereas its switching loss is negligible because the operation mechanism ensures that the MB is always conducting whenever the LCS is energized [188]; the voltage over the LCS is thus clamped to a low level, creating an electromagnetic environment for zero-voltage switching. Consequently, in IGBT modeling the LCS can be taken purely as a current-dependent resistor, which allows the LCS to be distributed equally among all HHB units by simple algebraic calculations. After partitioning, the HHB leaves the RCB, the current-limiting inductor, and a series of voltage sources, denoted by V_p, in the DC yard, and all nonlinearities are excluded from the DC yard. Applying TLM-stub theory [189], the HHB's contribution to the DC yard, represented by the combination of the impedance Z_{HHB} and voltage V_{HHB}, is given by

HHB_{DCyard} = Z_{HHB} \cdot J_s + V_{HHB} = (R_{CB} + Z_L) \cdot J_s + 2 v^{i}_{L} + \sum_{i=1}^{N_H} V_{pi},    (5.19)

where Z_L and v^{i}_{L} constitute the TLM-stub model of the inductor and R_{CB} represents the RCB's own resistance. On the other hand, the nonlinearities are confined to the small HHB unit, so the simulation efficiency is improved markedly by avoiding repeated calculations of the originally very large circuit to reach a convergent result. Instead, the matrix equation to which the Newton-Raphson method is applied is kept at a low dimension, and depending on the status of the main breaker IGBTs, it has two forms, which can be expressed uniformly as

\mathbf{U}_{HHB} = \mathbf{G}_{HHB}^{-1} \mathbf{J}_{HHB} = \begin{bmatrix} R_{sD}^{-1} + T_{rt} \cdot G_{MB} + G_M + G_{UL} & -R_{sD}^{-1} \\ -R_{sD}^{-1} & R_{sD}^{-1} + G_{Cs} \end{bmatrix}^{-1} \times \begin{bmatrix} J_s - I_{Meq} - (1 - T_{rt}) \cdot I_{MB} \\ 2 v^{i}_{Cs} \cdot G_{Cs} \end{bmatrix},    (5.20)


where T_{rt} is a binary value indicating the steady state and the switching state by 0 and 1, respectively; R_{sD} is the equivalent resistance of the parallel R_s and D in the snubber; and G_{UL} is the overall conductance of the UFD-LCS path. By checking the converged nodal voltages of the HHB unit in each time-step, R_{sD} can be ascertained: it behaves as a diode when U_{HHB}(1) is larger than U_{HHB}(2); otherwise, it is purely R_s. It should be pointed out that the main-breaker IGBT conductance G_{MB} can be multidimensional if a high-order device-level model is adopted; nevertheless, it can still be converted into a two-node structure by applying the nonlinear model equivalence introduced in Chap. 4. When the HHB is initiated following the detection of a line fault, the instantaneous current during the breaking period is estimated by

i_{dc}(t) = \frac{V_{dc}}{R_p} \cdot \left( 1 - e^{-\frac{R_p}{L_p} t} \right) + \frac{P_{dc}(0)}{V_{dc}} \, e^{-\frac{R_p}{L_p} t},    (5.21)

where R_p and L_p denote the equivalent resistance and inductance of the power flow path and P_{dc}(0) is the power before the fault. This indicates that the AC grid at a rectifier station feeds energy to the fault position immediately, while at an inverter station a negative P_{dc}(0) means that the DC current reduces to zero before its AC grid can feed energy to the DC side. At the end of the breaking period t_1^f, the fault current reaches its peak, and the fault clearance period lasting t_2^f takes over. Thus, the number of HHB units N_H can be calculated, assuming a δ% margin is reserved to ensure safe operation, from the following equation:

N_H \cdot V_{CES} (1 - \delta\%) = V_{dc} + L_p \frac{i_{dc}(t_1^f)}{t_2^f},    (5.22)

where V_{CES} is the IGBT's maximum collector-emitter voltage.
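In code, the MOV linearization in (5.16)–(5.18), on which the Newton-Raphson solution of the HHB unit relies, reduces to evaluating a conductance and a companion current at the latest voltage iterate. The device function below is a hedged sketch; the function name and the assumption v_M ≥ 0 are illustrative, and a complete model would also handle polarity and the low-voltage leakage segment of the piecewise-linearized characteristic.

/* MOV companion model following (5.16)-(5.18), written as a __device__
 * helper so the HHB-unit kernel can call it inside its N-R loop. */
__device__ void mov_companion(double vM, double kM, double alphaM,
                              double Vref, double Iref,
                              double *GM, double *IMeq)
{
    double x  = vM / (kM * Vref);                 /* assumes vM >= 0 here */
    double iM = Iref * pow(x, alphaM);            /* inverse of (5.16)    */
    *GM   = (alphaM * Iref / (kM * Vref)) * pow(x, alphaM - 1.0);  /* (5.17) */
    *IMeq = iM - (*GM) * vM;                      /* (5.18)               */
}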

5.2.4.4 HHB GPU Kernel

The GPU computational structure for the HHB contains two parts: the voltage-source side and the HHB units. The former is included in the linear DC yard function, while the latter constitutes an independent kernel with Newton-Raphson iteration, as Fig. 5.10 shows. The key for the GPU to gain a speed advantage over the CPU for the same configuration as Fig. 5.9b is its capability to exploit massive parallelism over all HHB units, rather than computing them sequentially or in a few batches when a multi-core CPU is available. In the HHB unit kernel, the varistor, the LCS-UFD branch, and the IGBT model are realized as CUDA C device functions so that they can be accessed by the kernel directly, and only the MOV function, which causes the nonlinearity, may have to be called multiple times by the iterative Newton-Raphson process in which (5.20) is computed repeatedly until a precise result is obtained.


Fig. 5.10 HHB unit kernel and its EMT calculation in conjunction with DC yard and LPR

However, the update of the variables stored in global memory, as well as the determination of R_{sD}, only takes place once the nodal voltages have converged. It should be pointed out that the HHB is always applied in conjunction with line protection (LPR). Various strategies have been proposed, and most of them rely on the measurement of the DC line voltage and current, such as the voltage-derivative strategy. Thus, its kernel is briefly drawn to illustrate the coordination between the HHB kernel and the protection device. The DC yard of one converter station may have N_L transmission lines, meaning theoretically the same number of HHBs should be installed, and potentially the same number of LPR algorithms as well. As a consequence, the total number of HHB units in one DC yard reaches a significant N_L · N_H. This disparity in kernel sizes underlines the necessity of using dynamic parallelism to cater to the hierarchical structure, and as with the MMC, all kernels for the DC yard are also included in the HVDC kernel.
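The HHB-unit kernel can then be sketched as below, with one thread per unit and the 2 × 2 system of (5.20) iterated until the nodal voltages converge. The HhbUnit structure, the tolerance handling, and the reuse of mov_companion() from the earlier sketch are illustrative assumptions rather than the book's implementation.

struct HhbUnit {
    double U1, U2;            /* nodal voltages of the two-node unit    */
    double Js;                /* injected current from the V-I coupling */
    double Gmb, Imb;          /* main-breaker IGBT companion model      */
    double Gul;               /* UFD-LCS branch conductance             */
    double Gcs, vCs_inc;      /* snubber-capacitor TLM stub             */
    double RsD;               /* snubber Rs/D equivalent resistance     */
    double Trt;               /* 0 = steady state, 1 = switching        */
    double kM, alphaM, Vref, Iref;   /* MOV parameters                  */
};

__global__ void hhb_unit_kernel(HhbUnit *units, int nUnits,
                                double tol, int maxIter)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= nUnits) return;
    HhbUnit *u = &units[id];

    double U1 = u->U1, U2 = u->U2;          /* previous step as initial guess */
    for (int it = 0; it < maxIter; ++it) {
        double GM, IMeq;
        mov_companion(U1, u->kM, u->alphaM, u->Vref, u->Iref, &GM, &IMeq);

        /* assemble and invert the 2x2 system of (5.20) in closed form */
        double g11 = 1.0 / u->RsD + u->Trt * u->Gmb + GM + u->Gul;
        double g12 = -1.0 / u->RsD;
        double g22 = 1.0 / u->RsD + u->Gcs;
        double j1  = u->Js - IMeq - (1.0 - u->Trt) * u->Imb;
        double j2  = 2.0 * u->vCs_inc * u->Gcs;

        double det = g11 * g22 - g12 * g12;
        double U1n = ( g22 * j1 - g12 * j2) / det;
        double U2n = (-g12 * j1 + g11 * j2) / det;

        double err = fabs(U1n - U1) + fabs(U2n - U2);
        U1 = U1n;  U2 = U2n;
        if (err < tol) break;
    }
    u->U1 = U1;                              /* write back after convergence */
    u->U2 = U2;
}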

5.2.5 Kernel Dynamic Parallelism for Large-Scale MTDC Grids

A further expansion of the MTDC grid is possible in real scenarios, such as the Greater CIGRÉ DC grid, which is composed of several interconnected CIGRÉ DC systems, as shown in Fig. 5.11. The hierarchical GPU computational structure introduced above, with the converter-level kernels launched through dynamic parallelism, extends naturally to such a system.


Fig. 5.11 Greater CIGRÉ DC grid consisting of multiple CIGRÉ B4 systems


Fig. 5.12 OpenMP® pseudo-code for multi-core MTDC system CPU simulation
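Figure 5.12 itself is not reproduced here. The fragment below is only a hedged sketch of the kind of OpenMP loop such a multi-core CPU simulation uses, distributing the per-station computations of each time step across CPU threads; station_step() and exchange_line_history() are illustrative placeholders for the per-station EMT solution and the ULM/TLM history exchange.

#include <omp.h>

void station_step(int s, double t);          /* MMC + DC yard of station s */
void exchange_line_history(double t);        /* ULM / TLM history updates  */

void simulate_mtdc(double Tend, double dt, int nStations)
{
    for (double t = 0.0; t < Tend; t += dt) {
        /* stations are independent within one step after partitioning */
        #pragma omp parallel for schedule(static)
        for (int s = 0; s < nStations; ++s)
            station_step(s, t);
        exchange_line_history(t);            /* sequential coupling update */
    }
}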


Fig. 5.14 Subsystem DCS1 results of GPU simulation (top) validated by PSCAD™ /EMTDC™ (bottom). (a) System simultaneous start, (b)–(c) rectifier station power step tests, (d) inverter voltage step test, and (e)–(f) DC line-to-line fault lasting 5 ms

transmission, because the DC current in this scenario halves, causing less voltage drop on the transmission corridor. The inverter voltage step test results are given in Fig. 5.14d, where the pole-to-pole voltage is shown. Before t = 3 s, the DC voltages are kept at approximately 1 p.u., with the rectifier station having a slight margin. Then, both curves drop as the voltage order in the inverter station is altered to 0.8 p.u., and the HVDC system operates under reduced-voltage mode until 3 s later, when the voltages recover as the order steps back up to 1 p.u. Figure 5.14e, f shows the results of a DC line-to-line fault lasting 5 ms, marked as F1 at the rectifier Cm-A1 side in Fig. 5.1. The HHBs are disabled, so the fault current soars to over 11 kA immediately after the fault occurs, followed by damped oscillations lasting dozens of milliseconds. Afterward, the current restores to its pre-fault value; moreover, with 100 mH inductors installed in the DC yards, the fault's instantaneous impact on the converters' DC voltages is negligible. The corresponding off-line simulations are conducted with PSCAD™/EMTDC™, whose virtually identical waveforms prove that the GPU simulation is more efficient while its results are as accurate.


5.2.6.3 MTDC Grid Test Cases

The MTDC system is a promising topology, and currently several projects have been constructed with a few terminals linking each other. The DCS2 subsystem in Fig. 5.1 can be taken as a typical example, since its scale is very close to existing projects as well as those under research and development. Installing HHBs in the MTDC system enhances its resilience to DC line faults, and Fig. 5.15 provides such test results for the four-terminal DC system. Before the line fault at t = 3 s, the DC voltages of all stations are around 1 p.u., with the rectifier stations slightly above their counterparts, as shown in Fig. 5.15a, where the pole-to-pole voltages are drawn. It can be seen that none of them is severely affected by the fault, owing to the proper action of the HHBs. On the contrary, Fig. 5.15b shows that the currents in the DC yards have significant surges at both MMC2 (Cm-F1) and MMC3 (Cm-E1), as the fault F2 occurs between them.


Fig. 5.15 Four-terminal MTDC results of GPU simulation (top) validated by PSCAD™ /EMTDC™ (bottom). (a) DC voltages of all stations, (b) DC line currents, (c) current waveform amplification of Lm1 at Cm-E1 side, (d) detailed actions of Cm-E1 HHB, (e) power export of each station, and (f) power transferred on DC lines


For I_{dc2}, which flows to Cm-E1, the current keeps increasing until the fault is isolated, while the polarity of I_{dc3} is reversed, as the fault forces Cm-E1 to operate as a freewheeling rectifier rather than as an inverter station under normal conditions. The power transfer restores in about 0.5 s, and since Cm-E1 is isolated, MMC1 (Cm-B2) receives all power from the other two terminals, and therefore its current I_{dc1} is doubled. The function of the HHBs on both terminals of the faulted line can be illustrated by the phase relation between their voltages and the line current. On the Cm-E1 side, as in Fig. 5.15c, the current polarity reverses immediately after the fault, and for the next 2 ms it keeps rising as the breaking stage is underway. Then, the current is forced to divert to the MOV, whose voltage is clamped at around 3.4 kV when all IGBTs in the HHB are turned off. Thus, the current begins to drop, with the slope determined by the MOV's protection voltage. From Fig. 5.15d, the specific HHB operation principles can be inferred. Initially, Cm-E1 receives power from Cm-F1, and the UFD-LCS is the main branch that the DC current passes through. When the fault is detected, the LCS turns off, and consequently i_{LCS} drops to zero; the main breaker branch stays on for the next 2 ms, the current diverts to it, and, because of the current-limiting inductor, i_{MB} rises gradually from a negative value to a positive one. Following the turn-off of the MB, the current is again diverted to the MOV, where it is quenched in the form of i_M. Figure 5.15e, f shows the power flow at different positions. Prior to the fault, Cm-B2 and Cm-F1 (the two rectifiers) send approximately 800 MW each to Cb-B2 and Cm-E1; thus the power exchange P_{L1} on L_1 is virtually zero. After the fault is cleared, Cm-F1 is no longer able to send power to Cm-E1; instead, its export goes entirely to Cb-B2. Thus, the power flow on L_1 rises to 800 MW, and alongside the power from MMC0 (Cm-B3), the remaining inverter receives nearly 1600 MW through L_0. Details from PSCAD™/EMTDC™ are also given for validation. It should be pointed out that the simplified HHB model without the snubber is used in the PSCAD™/EMTDC™ simulation package; thus, it cannot reveal phenomena peculiar to the full HHB model, such as the voltage sag over the MOV caused by the snubber. Table 5.3 indicates that the GPU simulation is hugely advantageous over the CPU even for medium-scale MTDC systems. In the default single-CPU mode, it takes 562 s to execute a 1 s simulation of the ±200 kV DCS2, and this value rises dramatically when the MMC level becomes high enough to withstand the rated voltage, reaching almost 10,500 s at 513 levels. In stark contrast, the GPU execution time is similar to its performance for a single HVDC system, even though the scale has doubled. Thus, in this case, the GPU attains a higher speedup, approximately 50 times for a normal four-terminal DC system with a realistic voltage level. On the other hand, there can be up to 12,288 SMs in DCS2 when the voltage level is 513, so the multi-core CPU (MCPU) framework is also tested. Compared with the default mode, it shortens the execution time only by a factor of around two.
The computational capability of MCPU architecture cannot be fully utilized when the MMC voltage level is low, as launching multiple threads would take a significant part of the total time; when the MMC level reaches hundreds, MCPU gains a higher speedup over a single CPU, but still, it is about 20 times slower than GPU.

Table 5.3 CPU and GPU execution times of ±200 kV DCS2 and the CIGRÉ B4 DC system for 1 s simulation

DCS2:
MMC level | texe CPU (s) | texe MCPU (s) | texe GPU (s) | Speedup CPU/MCPU | Speedup CPU/GPU | Speedup MCPU/GPU
5-L   | 561.8    | 387.1  | 77.0  | 1.45 | 7.30  | 5.03
9-L   | 636.4    | 455.5  | 78.2  | 1.40 | 8.14  | 5.82
17-L  | 808.9    | 521.6  | 79.2  | 1.55 | 10.21 | 6.59
33-L  | 1149.6   | 735.6  | 80.8  | 1.56 | 14.23 | 9.10
65-L  | 1772.4   | 941.6  | 89.2  | 1.88 | 19.87 | 10.57
129-L | 2988.0   | 1372.9 | 100.9 | 2.18 | 29.61 | 13.61
257-L | 5577.3   | 2424.1 | 132.8 | 2.30 | 42.00 | 18.25
513-L | 10,427.9 | 4352.0 | 194.4 | 2.40 | 53.64 | 22.39

CIGRÉ B4 DC system:
MMC level | texe CPU (s) | texe MCPU (s) | texe GPU (s) | Speedup CPU/MCPU | Speedup CPU/GPU | Speedup MCPU/GPU
5-L   | 1959.5   | 1010.0 | 75.2  | 1.94 | 26.1 | 13.4
9-L   | 2238.7   | 1033.9 | 77.7  | 2.17 | 28.8 | 13.3
17-L  | 2658.8   | 1069.2 | 79.8  | 2.49 | 33.3 | 13.4
33-L  | 3594.3   | 1080.6 | 82.0  | 3.33 | 43.8 | 13.2
65-L  | 5279.5   | 1507.3 | 92.2  | 3.50 | 57.3 | 16.3
129-L | 8847.7   | 2031.2 | 111.3 | 4.36 | 79.5 | 18.2
257-L | 15,819.6 | 3118.2 | 194.6 | 5.07 | 81.3 | 16.0
513-L | 29,939.6 | 5724.9 | 334.3 | 5.23 | 89.6 | 17.1


As the scale of the DC grid enlarges, the speedup also increases. In the CIGRÉ DC system, it takes the CPU thousands of seconds to compute 1 s of results even though the MMCs have only five voltage levels, and the time soars to nearly 30,000 s when the MMCs have 513 levels. The situation is slightly improved by adopting the MCPU; however, it still requires a few thousand seconds. Meanwhile, the GPU simulation time remains roughly the same as for DCS2, although the scale has been nearly quadrupled. As a result, it gains a speedup ranging from 26 to 90, much higher than the MCPU simulation, which only achieves a speedup of 2–5 times.

Some tests are also carried out to show that the GPU is the more efficient platform for studying the CIGRÉ DC test system. A power reversal is conducted by ordering the output power of the DC-DC converter Cd-E1 to ramp from −200 MW to 400 MW, and the impact of this single converter's behavior on the overall system is given in Fig. 5.16. Initially, MMC0 and MMC2, as rectifiers in DCS2, release 1.2 GW, and MMC1 and MMC3 receive around 660 MW and 330 MW, respectively. The surplus 200 MW is fed to DCS3, as can be seen from the fact that the combined output power of MMC6, MMC8, and MMC10 is 3.2 GW, while the inverters MMC7 and MMC9 receive approximately 0.2 GW more. During the power ramp, as expected, the output power of all rectifier stations remains virtually constant, and only the inverter MMC5 absorbs a fixed 800 MW since DCS1 is relatively isolated from its counterparts. In DCS2, the power MMC3 receives almost triples after the process, while MMC1 is only slightly affected during the process and restores afterward. Meanwhile, as power flows from DCS3 to DCS2, the power received by MMC7 and MMC9 reduces to around 1.55 GW and 1.24 GW, respectively, and their sum shows a deficiency of 400 MW compared with that provided by the rectifiers in that subsystem. The above process and its impact on the CIGRÉ DC system are validated by PSCAD™/EMTDC™.

With regard to the Greater CIGRÉ DC grid, the leverage that the GPU holds is expected to be even larger. Take the 101-level MMC for example: when the number of CIGRÉ DC systems rises from two to eight, the CPU and multi-core CPU take four times and 2.8 times longer, respectively, to compute, while the GPU time increases by less than 1.4 times. Thus, compared with the single-CPU mode, the GPU simulation achieves a speedup of about 90–270; in contrast, the multi-core CPU only achieves a speedup of approximately 7–11, as shown in Table 5.4.

The GPU's performance in simulating DC systems with both the TSSM and the TCFM is summarized in Fig. 5.17. All three figures share the trait that the GPU takes only slightly longer to compute when the switch model shifts to the TCFM, regardless of the MMC level, while both the CPU and MCPU frameworks witness a dramatic rise even on the logarithmic axes, which accounts for the fact that device-level semiconductor models are rarely used in CPU-based large system simulation. Meanwhile, it demonstrates that with the TSSM intended only for system-level simulation, the GPU is still able to attain over a dozen times of speedup, let alone with the more complex switch model, which showcases a much higher speedup. The adoption of the GPU greatly alleviates the computational burden caused by the model complexity, making the involvement of device-level models in system-level simulation feasible.


Fig. 5.16 CIGRÉ DC grid power reversal simulation by GPU (left) and PSCAD™ /EMTDC™ (right)

Table 5.4 CPU and GPU execution times of the Greater CIGRÉ DC system for 1 s simulation

Number of CIGRÉ systems | SMs | texe CPU (s) | texe MCPU (s) | texe GPU (s) | Speedup CPU/MCPU | Speedup CPU/GPU | Speedup MCPU/GPU
2 | 13,200 | 14,549 | 1995 | 162.1 | 7.3  | 89.8  | 12.3
3 | 19,800 | 22,264 | 2810 | 179.9 | 7.9  | 123.8 | 15.6
4 | 26,400 | 30,249 | 3103 | 195.6 | 9.7  | 154.6 | 15.9
5 | 33,000 | 36,963 | 4067 | 200.9 | 9.1  | 184.0 | 20.2
6 | 39,600 | 44,391 | 4284 | 208.4 | 10.4 | 213.0 | 20.6
7 | 46,200 | 52,792 | 4868 | 210.0 | 10.8 | 251.4 | 23.2
8 | 52,800 | 60,121 | 5538 | 221.9 | 10.9 | 270.9 | 25.0



Fig. 5.17 GPU performance in simulation of different DC systems with IGBT TSSM and TCFM. (a) HVDC with HHB, (b) DCS-2, and (c) CIGRÉ B4 DC system

5.3 General Nonlinear MMC Parallel Solution Method

Under the circumstance that nonlinear switch models are involved in constructing the MMC-based converter station, the Newton-Raphson method is normally required to obtain convergent results. Thus, considering the high dimension of the admittance matrix resulting from the high-order nonlinear model and the MMC circuit, a general massively parallel algorithm is introduced in this section.

5.3.1 Massive-Thread Parallel Implementation of Newton-Raphson Method

5.3.1.1 Algorithm

To find the solution of a nonlinear system F(X) consisting of k unknown variables, utilizing the Jacobian matrix J_F(X) for linear approximation gives the following equation:

0 = F(X_n) + J_F(X_n)(X_{n+1} − X_n),    (5.23)

where the Jacobian matrix J_F(X) is a k × k matrix of first-order partial derivatives of F given as

J_F = dF/dX = \begin{bmatrix} \partial F_1/\partial X_1 & \cdots & \partial F_1/\partial X_k \\ \vdots & \ddots & \vdots \\ \partial F_k/\partial X_1 & \cdots & \partial F_k/\partial X_k \end{bmatrix}.    (5.24)

Solving for the root of F(X) is numerically replaced by solving (5.23) for (X_{n+1} − X_n) and updating X_{n+1} from X_n. The solution process is repeated until the difference ||X_{n+1} − X_n|| is below a tolerable threshold. According to Kirchhoff's current law (KCL), the sum of currents leaving each node is zero, i.e.,

I(V) = 0.    (5.25)

Applying (5.23)–(5.25) results in

J_{V_n}(V_{n+1} − V_n) = −I_n.    (5.26)

Taking the nonlinear physics-based IGBT model as an example, the voltage vector for the solution of KCL at each node (collector, gate, anode, drain, and emitter) is given as

V_n = [v_c^{(n)}, v_g^{(n)}, v_a^{(n)}, v_d^{(n)}, v_e^{(n)}]^T,    (5.27)

and applying the Newton-Raphson method results in

G_IGBT^{(n)} (V_{n+1} − V_n) = −I_n,    (5.28)

where the conductance matrix G_IGBT^{(n)} is the Jacobian matrix J_{V_n} and the right-hand side vector is given as

−I_n = [i_c^{(n)}, i_g^{(n)}, i_a^{(n)}, i_d^{(n)}, i_e^{(n)}]^T.    (5.29)

5.3.1.2 Massive-Thread Parallel Implementation

In the massive-thread parallel module for the Newton-Raphson method, four kernels and Algorithm 5.1 are involved, as shown in Fig. 5.18. First, Kernel0 and Kernel1 update I_n, which equals the sum of currents leaving the nodes, and the Jacobian matrix J_F(X_n), which is the conductance matrix, using current units (CUs) and Jacobian matrix units (JMs), respectively. All parameters are copied to shared memory for Kernel2 to solve equation (5.23) by the Gaussian elimination method. Then, V_{n+1} is updated and checked against the convergence criteria in Kernel3 with voltage units (VUs). The iterations continue until V converges or the maximum number of iterations is reached.

Algorithm 5.1 Newton-Raphson kernel
procedure N-R ITERATION
  Iterative loop:
    Calculate F(X_n)                                  ▷ Kernel0
    Calculate J_F(X_n) from X_n                       ▷ Kernel1
    Copy J_F(X_n) and F(X_n) into shared memory
    Solve ΔX_n in J_F(X_n)·ΔX_n = −F(X_n)             ▷ Kernel2
    Update X_{n+1} ← X_n + ΔX_n                       ▷ Kernel3
    Calculate ||ΔX_n||
    Store ||ΔX_n|| into global memory
    if ||ΔX_n|| < ε then t ← t + Δt
    else go to Iterative loop


Fig. 5.18 Massive-thread parallel implementation of Newton-Raphson method
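To make the mapping of Algorithm 5.1 onto CUDA concrete, the following minimal sketch assigns one GPU thread to one nonlinear IGBT unit and performs the complete N-R loop—residual and Jacobian evaluation, Gaussian elimination, update, and convergence check—in thread-local memory. The kernel layout and the toy device model are illustrative assumptions, not the simulator's actual code; in the text the four stages are split across Kernel0–Kernel3 with shared memory, whereas here they are fused into one kernel for brevity.

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define N_NODES 5      // collector, gate, anode, drain, emitter
#define MAX_ITER 50
#define TOL 1e-6

// Placeholder device model: a toy nonlinear system standing in for the IGBT
// equations of Sect. 5.3.1; replace with the real conductance/current stamps.
__device__ void buildJacobianAndResidual(const double *V, double *J, double *F)
{
    for (int i = 0; i < N_NODES; ++i) {
        F[i] = exp(0.1 * V[i]) - 1.0 + V[i];                       // toy residual
        for (int j = 0; j < N_NODES; ++j)
            J[i * N_NODES + j] = (i == j) ? 0.1 * exp(0.1 * V[i]) + 1.0 : 0.0;
    }
}

__global__ void newtonRaphsonKernel(double *Vall, int *iterCount, int nDevices)
{
    int dev = blockIdx.x * blockDim.x + threadIdx.x;
    if (dev >= nDevices) return;

    double V[N_NODES], J[N_NODES * N_NODES], F[N_NODES], dV[N_NODES];
    for (int i = 0; i < N_NODES; ++i) V[i] = Vall[dev * N_NODES + i];

    int it = 0;
    double norm = 1.0;
    while (it < MAX_ITER && norm > TOL) {
        buildJacobianAndResidual(V, J, F);           // Kernel0/Kernel1 role
        // Gaussian elimination (no pivoting) solving J * dV = -F: Kernel2 role
        for (int i = 0; i < N_NODES; ++i) dV[i] = -F[i];
        for (int k = 0; k < N_NODES; ++k)
            for (int r = k + 1; r < N_NODES; ++r) {
                double m = J[r * N_NODES + k] / J[k * N_NODES + k];
                for (int c = k; c < N_NODES; ++c)
                    J[r * N_NODES + c] -= m * J[k * N_NODES + c];
                dV[r] -= m * dV[k];
            }
        for (int r = N_NODES - 1; r >= 0; --r) {
            for (int c = r + 1; c < N_NODES; ++c) dV[r] -= J[r * N_NODES + c] * dV[c];
            dV[r] /= J[r * N_NODES + r];
        }
        norm = 0.0;                                   // Kernel3 role: update + check
        for (int i = 0; i < N_NODES; ++i) { V[i] += dV[i]; norm += dV[i] * dV[i]; }
        norm = sqrt(norm);
        ++it;
    }
    for (int i = 0; i < N_NODES; ++i) Vall[dev * N_NODES + i] = V[i];
    iterCount[dev] = it;                              // later reused for VTS control
}
```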



G_SM^orig =
\begin{bmatrix}
G_{1,1} & G_{1,2} & G_{1,3} & G_{1,4} & G_{1,5} & G_{1,6} & G_{1,7} & G_{1,8} & G_{1,9} & G_{1,10} & G_{1,11}\\
G_{2,1} & G_{2,2} & G_{2,3} & G_{2,4} & G_{2,5} & G_{2,6} & & & & & \\
G_{3,1} & G_{3,2} & G_{3,3} & G_{3,4} & G_{3,5} & G_{3,6} & & & & & \\
G_{4,1} & G_{4,2} & G_{4,3} & G_{4,4} & G_{4,5} & G_{4,6} & & & & & \\
G_{5,1} & G_{5,2} & G_{5,3} & G_{5,4} & G_{5,5} & G_{5,6} & & & & & \\
G_{6,1} & G_{6,2} & G_{6,3} & G_{6,4} & G_{6,5} & G_{6,6} & & & & & G_{6,11}\\
G_{7,1} & & & & & & G_{7,7} & G_{7,8} & G_{7,9} & G_{7,10} & G_{7,11}\\
G_{8,1} & & & & & & G_{8,7} & G_{8,8} & G_{8,9} & G_{8,10} & G_{8,11}\\
G_{9,1} & & & & & & G_{9,7} & G_{9,8} & G_{9,9} & G_{9,10} & G_{9,11}\\
G_{10,1} & & & & & & G_{10,7} & G_{10,8} & G_{10,9} & G_{10,10} & G_{10,11}\\
G_{11,1} & & & & & G_{11,6} & G_{11,7} & G_{11,8} & G_{11,9} & G_{11,10} & G_{11,11}
\end{bmatrix}    (5.30)

G_SM =
\begin{bmatrix}
G_{1,1} & G_{1,2} & G_{1,3} & G_{1,4} & G_{1,5} & G_{1,6} & G_{1,7} & G_{1,8} & G_{1,9} & G_{1,10} & G_{1,11}\\
G_{2,1} & G_{2,2} & G_{2,3} & G_{2,4} & G_{2,5} & G_{2,6} & & & & & \\
G_{3,1} & G_{3,2} & G_{3,3} & G_{3,4} & G_{3,5} & G_{3,6} & & & & & \\
G_{4,1} & G_{4,2} & G_{4,3} & G_{4,4} & G_{4,5} & G_{4,6} & & & & & \\
G_{5,1} & G_{5,2} & G_{5,3} & G_{5,4} & G_{5,5} & G_{5,6} & & & & & \\
G_{6,1} & G_{6,2} & G_{6,3} & G_{6,4} & G_{6,5} & G_{6,6} & & & & & \\
G_{7,1} & & & & & & G_{7,7} & G_{7,8} & G_{7,9} & G_{7,10} & G_{7,11}\\
G_{8,1} & & & & & & G_{8,7} & G_{8,8} & G_{8,9} & G_{8,10} & G_{8,11}\\
G_{9,1} & & & & & & G_{9,7} & G_{9,8} & G_{9,9} & G_{9,10} & G_{9,11}\\
G_{10,1} & & & & & & G_{10,7} & G_{10,8} & G_{10,9} & G_{10,10} & G_{10,11}\\
G_{11,1} & & & & & & G_{11,7} & G_{11,8} & G_{11,9} & G_{11,10} & G_{11,11}
\end{bmatrix}    (5.31)

5.3.2 Block Jacobian Matrix Decomposition for MMC

Applying the physics-based IGBT and power diode models introduced in Chap. 5, each SM has five nodes for the IGBT and one more internal node for the power diode, which leads to the Jacobian matrix in (5.30).


5.3.2.1 Matrix Update Using a Relaxation Algorithm

Although the sparse matrix G_SM^orig is irregular, it can be reshaped by eliminating the two elements G_{6,11} and G_{11,6}, brought about by the capacitance, using the relaxation algorithm. After the transformation, the linearized equation of a submodule, given as

G_SM^{orig(n)} · ΔV^{n+1} = −I^{orig(n)},    (5.32)

is updated to

G_SM^{n} · ΔV^{n+1} = −I^{n},    (5.33)

where G_SM is given in (5.31) and I^{n} comes from I^{orig(n)}, whose 6th and 11th elements are adjusted by G_{6,11} v_{11}^{n} and G_{11,6} v_{6}^{n}, respectively, using known values from the previous iteration. Although an outer Gauss-Jacobi loop over the Newton-Raphson iterations is required to guarantee the convergence of v^{n}, the updated G_SM results in a better bordered block pattern, which benefits the parallel implementation on the GPU.

5.3.2.2 Partial LU Decomposition for MMC

For a k-SM MMC, each cascade node connects to two nonlinear IGBT-diode units. Therefore, the complete structure of the Jacobian matrix G_MMC is shown in Fig. 5.19a. The connection on node 1 and node 11 of every G_SM is relaxed to decouple the large Jacobian matrix. Thus, the updated Jacobian matrix G*_MMC is shown in Fig. 5.19b, and the Newton-Raphson linearized equation, given as

G_MMC^{n} · ΔV^{n+1} = −I^{n},    (5.34)

is also updated as

G_MMC^{*(n)} · ΔV^{n+1} = −I^{*(n)},    (5.35)

where −I^{*(n)} is −I^{n} adjusted by previous iteration values. Processing the LU decomposition for all A matrices in G*_MMC yields

l_j · u_j = A_j    (j = 1, 2, ..., 2k).    (5.36)

Then, the border vectors f_j and g_j in Fig. 5.20 can be found according to the following equations:

f_j · u_j = c_j
l_j · g_j = d_j.    (5.37)

Since l_j and u_j are lower and upper triangular matrices, f_j and g_j can be solved with forward and backward substitutions. The elements at the connecting nodes, h_i (i = 1, 2, ..., k) in Fig. 5.20, are calculated with e_i in Fig. 5.19, as


Fig. 5.19 Jacobian matrices for the MMC circuit: (a) original G_MMC; (b) updated G*_MMC using relaxation


Fig. 5.20 Partial LU decomposition for G∗MMC

h_i = e_i − f_{2i−1} · g_{2i−1} − f_{2i} · g_{2i}.    (5.38)

The partial LU decomposition can be computed in parallel due to the decoupled structure of the updated Jacobian matrix G*_MMC, including the LU decomposition of A_j, the calculation of the border vectors f_j and g_j, and the updating of the connecting node elements h_i.
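A possible CUDA realization of the per-block stage of this partial LU decomposition is sketched below: one thread factorizes one diagonal block A_j (a 5 × 5 size is assumed) and solves (5.37) for its border vectors, while a second kernel evaluates (5.38). Block size, memory layout, and kernel split are illustrative assumptions rather than the book's implementation.

```cuda
#include <cuda_runtime.h>

#define B 5   // assumed block size (nodes per IGBT-diode unit)

__global__ void blockPartialLU(double *A, double *c, double *d,
                               double *f, double *g, int nBlocks)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= nBlocks) return;
    double *Aj = A + j * B * B;          // holds l_j (unit lower) and u_j in place
    double *cj = c + j * B, *dj = d + j * B;
    double *fj = f + j * B, *gj = g + j * B;

    // In-place LU factorization: A_j = l_j * u_j (no pivoting)
    for (int k = 0; k < B; ++k)
        for (int r = k + 1; r < B; ++r) {
            double m = Aj[r * B + k] / Aj[k * B + k];
            Aj[r * B + k] = m;                        // l_j entry
            for (int col = k + 1; col < B; ++col)
                Aj[r * B + col] -= m * Aj[k * B + col];
        }
    // Border row f_j from f_j * u_j = c_j (sweep over columns of u_j)
    for (int i = 0; i < B; ++i) {
        double s = cj[i];
        for (int k = 0; k < i; ++k) s -= fj[k] * Aj[k * B + i];
        fj[i] = s / Aj[i * B + i];
    }
    // Border column g_j from l_j * g_j = d_j (unit-diagonal forward substitution)
    for (int i = 0; i < B; ++i) {
        double s = dj[i];
        for (int k = 0; k < i; ++k) s -= Aj[i * B + k] * gj[k];
        gj[i] = s;
    }
}

// Connecting-node update of Eq. (5.38), one thread per connecting node i
// (0-based blocks 2i and 2i+1 correspond to the book's 2i-1 and 2i).
__global__ void updateConnectingNodes(const double *f, const double *g,
                                      const double *e, double *h, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= k) return;
    double s = e[i];
    for (int n = 0; n < B; ++n)
        s -= f[(2 * i) * B + n] * g[(2 * i) * B + n]
           + f[(2 * i + 1) * B + n] * g[(2 * i + 1) * B + n];
    h[i] = s;
}
```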

5.3.2.3 Blocked Forward and Backward Substitutions

After obtaining the semi-lower and semi-upper triangular matrices utilizing partial LU decomposition, blocked forward and backward substitutions are utilized to obtain the solution of the linear system. For the equation L U · x = b, defining

U · x = y,    (5.39)

it can be obtained that

L · y = b.    (5.40)

The blocked forward substitution shown in Fig. 5.21a is used to solve for y in (5.40). All elements of y 2i−1 and most elements of y 2i except for the last one can be solved directly. Then, the last element of the previous block, y 2i−2 , can be obtained with solved elements in y 2i−1 and y 2i ; and so, the missing element of y 2i can be obtained from the solution of the next block. Fortunately, y 2k in the last block can be fully solved since it has no overlap border vector f . Because of the decoupled structure of L, the blocked forward substitution can be accomplished by GPU-based parallelism. Similarly, the solution of (5.39) can be obtained by the blocked backward substitution shown in Fig. 5.21b with y j . First, all connecting elements hi can be solved in parallel; second, the border vector g j can be eliminated by updating the right-side vector; finally, backward substitution is implemented on all blocked U j in parallel. In the MMC circuit, the size of the Jacobian matrix grows with the number of output voltage levels. Instead of solving a system containing a (10k + 1) × (10k + 1) large Jacobian matrix, the 5 × 5 block perfectly accommodates the parallel scheme of GPU with its limited shared memory to reduce the data transmission cost.

5.3.3 Parallel Massive-Thread Mapping

As shown in Fig. 5.22 and Algorithm 5.2, four kernels are involved in the parallel module for the partial LU decomposition method and blocked linear solution in MMC. Kernel0 updates the matrix equation using the relaxation algorithm, and the original GMMC matrix is reshaped to G∗MMC which is output to Kernel1 . To calculate


Fig. 5.21 Blocked linear MMC system solver: (a) blocked forward substitution; (b) blocked backward substitution

the semi-lower and semi-upper triangular matrices L and U in an MMC circuit containing k SMs, there are three steps inside Kernel1, listed as follows:

• Normal LU decomposition in each block to get L_j and U_j, where j = (1, 2, ..., 2k).
• Backward substitution to obtain f_j from U_j and c_j, and forward substitution to obtain g_j from L_j and d_j in each block.
• Update U_j with e_i, where i = (1, 2, ..., k).

Kernel2 has two steps as follows:


Fig. 5.22 Massive-thread parallel implementation of partial LU decomposition

Algorithm 5.2 Partial LU decomposition for MMC
procedure SOLVE MATRIX EQUATION FOR MMC CIRCUIT USING PARTIAL LU DECOMPOSITION
  Produce semi-upper and semi-lower triangular matrices:
    Update matrix equation using relaxation algorithm       ▷ Kernel0
    Block LU decomposition to get L_j and U_j               ▷ Kernel1
    Backward and forward substitution for f_j and g_j
    Update U_j with e_i
  Blocked forward substitution:                             ▷ Kernel2
    Forward substitution in block
    y_j update
  Blocked backward substitution:                            ▷ Kernel3
    Compute h_i directly
    Subtract effect of h_i to update y_j
    Backward substitution in block

• Forward substitution in block for y_j.
• Update y_j with the border vector f_j.

And Kernel3 has the following three steps:

• Calculate h_i.
• Update y_j with h_i.
• Backward substitution in block, to obtain the final result x_j.

5.3.4 Predictor-Corrector Variable Time-Stepping Scheme

A variable time-stepping scheme is used for the numerical solution of the transient nonlinear system response to ensure both efficiency and accuracy. In most power

[Flowchart of the predictor-corrector VTS scheme: initialize t, Δt, and the LTE setting; the predictor stage updates t and forms a forward Euler approximation V_pred; the corrector stage applies the backward Euler approximation with N-R iterations for V_corr until the error between the latest two iterations falls below the tolerance; the LTE is then evaluated and, if unacceptable, Δt is reset and the step repeated.]
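The following host-side sketch illustrates the predictor-corrector flow for a single scalar state dv/dt = f(v): forward Euler prediction, backward Euler correction with N-R iterations, and an LTE-driven time-step update. The example dynamics, the LTE estimate, and the step-update rule are generic illustrative choices rather than the exact formulas of this scheme.

```cuda
#include <cmath>
#include <cstdio>
#include <algorithm>

// Toy dynamics standing in for the nonlinear circuit state equation.
static double f(double v) { return -100.0 * v + 5.0 * std::sin(377.0 * v); }

int main() {
    double t = 0.0, tEnd = 0.1, dt = 1e-6, v = 1.0;
    const double dtMin = 1e-8, dtMax = 1e-4, tol = 1e-6, eps = 1e-9;
    while (t < tEnd) {
        double vPred = v + dt * f(v);                       // predictor: forward Euler
        double vCorr = vPred;                               // corrector: backward Euler + N-R
        for (int it = 0; it < 50; ++it) {
            double F    = vCorr - v - dt * f(vCorr);
            double dFdv = 1.0 - dt * (f(vCorr + 1e-8) - f(vCorr)) / 1e-8; // numeric Jacobian
            double dv   = -F / dFdv;
            vCorr += dv;
            if (std::fabs(dv) < eps) break;                 // iteration error below tolerance
        }
        double lte   = 0.5 * std::fabs(vCorr - vPred);      // local truncation error estimate
        double dtNew = dt * std::min(4.0, std::max(0.25, std::sqrt(tol / (lte + 1e-30))));
        if (lte > tol && dt > dtMin) {                      // reject step, reset dt
            dt = std::max(dtMin, dtNew);
            continue;
        }
        v = vCorr; t += dt;                                 // accept step
        dt = std::min(dtMax, std::max(dtMin, dtNew));
    }
    std::printf("final v = %g at t = %g\n", v, t);
    return 0;
}
```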

Fig. 6.1 Hybrid FTS-VTS scheme for electric power system simulation: (a) centralized system structure and (b) time instant synchronization

6.2.3 VTS-Based MMC Models

In this section, the three VTS schemes are applied to the MMC as a demonstration of expediting large-scale system simulation. Taking the power semiconductor switch as a two-state resistor is prevalent in EMT-type tools, but when device-level behaviors and higher accuracy are pursued, the NBM becomes a better choice. Regardless of the switch model adopted, a heavy computational burden is always present, given the large number of submodules required to yield a high output voltage level.

6.2.3.1 TSSM-Based MMC VTS Model

The TSSM is the simplest model for power semiconductor switches: both turn-on and turn-off actions are completed instantaneously, and the transition between the two distinct states lasts merely one time-step. Since device specifics are omitted, system-level results are of interest when this model is applied in power converters. Therefore, switching is not taken as a criterion for time-step control, nor is the N-R iteration, due to the absence of nonlinearity. Nonetheless, discrete events are still a criterion for indicating state shifts in other components such as the transmission line. The LTE, as a general method, applies to the MMC, as it contains many capacitors and six arm inductors. Therefore, for a DC grid based on the ideal switch, both the LTE and event-correlated criteria can be utilized. Take the partitioned two-node HBSM in Fig. 6.2, for instance. It has the following matrix equation:

\begin{bmatrix} G_C + R_1^{-1} & -R_1^{-1} \\ -R_1^{-1} & R_1^{-1} + R_2^{-1} \end{bmatrix} \cdot \begin{bmatrix} U_1(t) \\ U_2(t) \end{bmatrix} = \begin{bmatrix} I_{Ceq}(t) \\ J_s(t - \Delta t) \end{bmatrix}    (6.9)

for computing the nodal voltages, where G_C and I_Ceq are the discrete-time companion model of the submodule capacitor; R_1 and R_2, respectively, represent the equivalent resistance of the upper and lower switch; and J_s(t − Δt) indicates the arm current injection with one time-step delay. The calculated capacitor voltage U_C(t), equal to U_1(t), is then compared with its predicted value, which is calculated by (6.6), where

F_m = (1/C)·(G_C · U_C(t − (n + 1 − m)Δt) − I_Ceq(t − (n + 1 − m)Δt)),    (6.10)

is applicable to all capacitors. The subscripts m = (n + 1, n, n − 1, n − 2) denote values at the current and previous time-steps.
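For a single TSSM-based HBSM, the 2 × 2 system (6.9) can be solved in closed form; a minimal CUDA device-function sketch, with an assumed argument list, is shown below together with the capacitor history term of (6.10).

```cuda
// Sketch: solve (6.9) by Cramer's rule for one partitioned HBSM and return the
// capacitor history term of (6.10). Argument list and naming are assumptions.
__device__ void hbsmTssmStep(double GC, double R1inv, double R2inv,
                             double ICeq, double Js, double C,
                             double *U1, double *U2, double *Fm)
{
    // [GC+R1inv   -R1inv    ] [U1]   [ICeq]
    // [-R1inv   R1inv+R2inv ] [U2] = [Js  ]
    double a = GC + R1inv, b = -R1inv, c = -R1inv, d = R1inv + R2inv;
    double det = a * d - b * c;
    *U1 = ( d * ICeq - b * Js) / det;
    *U2 = (-c * ICeq + a * Js) / det;
    // History term F_m = (GC*UC - ICeq)/C of Eq. (6.10), with UC = U1
    *Fm = (GC * (*U1) - ICeq) / C;
}
```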

6.2.3.2 MMC Main Circuit VTS Model

The MMC main circuit is always the same after circuit partitioning, regardless of submodule topology. It contains five nodes after conversion into the Norton equivalent circuit, i.e., three nodes (#1–#3) on the AC side and two more (#4–#5) on the DC side, as shown in Fig. 6.2. The matrix equation for this universal part contains


Fig. 6.2 Partitioned MMC EMT model

G = G_ext + \begin{bmatrix} 2G_Σ · I_{3×3} & [−G_Σ]_{3×2} \\ [−G_Σ]_{2×3} & (3G_Σ + G_CC) · I_{2×2} \end{bmatrix},    (6.11)

J = [−J_{ΣAu} + J_{ΣAd}, −J_{ΣBu} + J_{ΣBd}, −J_{ΣCu} + J_{ΣCd}, J_{ΣAu} + J_{ΣBu} + J_{ΣCu} + I_{C1eq}, −J_{ΣAd} − J_{ΣBd} − J_{ΣCd} − I_{C2eq}] + J_ext,    (6.12)

where the matrices G_ext and J_ext represent elements contributed by the external AC and DC grids the MMC connects to, G_CC and I_{Cieq} (i = 1, 2) denote the DC bus capacitors in case they exist, and G_Σ and J_Σ are the companion model of an MMC arm, where the subscripts u and d stand for the upper and lower arm, respectively. Then, the arm currents can be derived after solving the nodal voltages, as expressed below:

I_arm = { G_Σ (U_4 − U_k) − J_{Σu},   (upper arm)
        { G_Σ (U_k − U_5) − J_{Σd},   (lower arm)    (6.13)

where k = 1, 2, 3 represents the node number on the AC side. For variable time-stepping control, the predicted arm currents are calculated similarly by (6.6) with the history terms of an inductor expressed as

F_m = (1/L)·U_L(t − (n + 1 − m)Δt),    (6.14)

where m = (n + 1, n, n − 1, n − 2) are the time-steps.


6.2.3.3 NBM-Based VTS Submodule Model

The nonlinear behavioral model improves the MMC accuracy at the cost of computational burden. As can be seen, an HBSM has eight nodes when the type II reduced IGBT/diode model containing five nodes, shown in Fig. 4.6, is used. Moreover, an iterative N-R method is required for convergent results. Thus, computing this type of submodule is significantly more time-consuming than with the TSSM and other ideal models, which is the reason it is rarely seen in system-level simulation. The admittance matrix of the NBM-based HBSM is constructed as

G_{8×8} = \begin{bmatrix} G_C & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} G_{T5×5} & 0 \\ 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 \\ 0 & G_{T4×4} \end{bmatrix},    (6.15)

and the current contribution takes the form of

J = [I_Ceq  0_{1×3}  J_s  0_{1×3}] + [J_{T1×5}  0] + [0  J_{T1×4}],    (6.16)

where G_T together with J_T are elements of the nonlinear IGBT/diode model. The full-bridge SM (FBSM)-based MMC is another common topology, which can block the DC fault when all of its IGBTs are turned off. Compared with its half-bridge counterpart, it has 13 nodes. Its matrix equation formulation follows the same principle, e.g., the negative terminal of the current source J_s is taken as the virtual ground and therefore not counted as an unknown node.

Raising the time-step is a common way to speed up FTS simulation, but the inclusion of nonlinear submodules implies that the achievable time-step cannot be too large if the twofold goals of keeping the results convergent and maintaining high accuracy are to be attained. Considering that the computation of transient stages is prone to numerical divergence, a typical time-step of 10 ns is adopted in the FTS simulation so that the switching transients can be captured properly. In contrast, the VTS schemes realize these two goals simultaneously by adding any of the three types of algorithms, along with their main motive of expediting the simulation.

As a universal approach to systems that comprise reactive components, the LTE is one choice for time-step regulation, and the procedure is the same as illustrated for the TSSM-based VTS model. According to (4.16), the capacitor voltage v_Cge is one of the main factors responsible for the IGBT's transient behavior; thus, it is selected in the LTE scheme for dynamic time-step adjustment. Its EMT simulation outcome based on a one-step integration approximation can be obtained immediately following the solution of the submodule matrix equation. On the other hand, the prediction is calculated by the fourth-order AM formula (6.6), and the history terms F_m (m = n + 1, n, n − 1, n − 2) are calculated in the same manner as (6.10). Meanwhile, a proper judgment of events can also be utilized, e.g., v_Cge is an indicator of switching behavior: when the IGBT turns on or off, it approaches the gate voltage, making dv/dt nonzero; otherwise, its derivative is nearly 0. Therefore, the time-step can be regulated by comparing the dv/dt value with a number of preset


thresholds. The N-R iteration count is the most convenient criterion for nonlinear systems. For the solution of the submodule matrix equation (6.7), the steady state takes the fewest iterations and can tolerate a time-step of up to 500 ns, which is set as the upper limit in this case. On the other hand, the transient stage requires more iterations and is prone to divergence if the time-step is too large; thus, the lower limit is 10 ns. The time-step can then be mapped to the number of N-R iterations: the more iterations, the smaller the time-step becomes. Since the basic unit for applying the VTS scheme is one submodule, there will be many local time-steps within an MMC, which is consequently treated as a hybrid FTS-VTS system. Among all the time-steps produced by the various submodules in the same converter, the minimum value should obviously be chosen as the localized time-step for the entire MMC, because it satisfies the convergence and accuracy requirements of all (6N + 1) subsystems created by circuit partitioning, including the linear main circuit.
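A minimal sketch of this iteration-count-to-time-step mapping is given below; the 10 ns and 500 ns limits come from the text, while the piecewise-linear shape of the map and the iteration thresholds are assumptions for illustration.

```cuda
// Map the N-R iteration count of one submodule to a candidate time-step, then
// take the minimum over all submodules of the MMC (assumed thresholds).
__host__ __device__ double dtFromIterations(int kNR)
{
    const double dtMin = 10e-9, dtMax = 500e-9;
    if (kNR <= 2)  return dtMax;                       // steady state
    if (kNR >= 10) return dtMin;                       // heavy transient
    return dtMax - (dtMax - dtMin) * (kNR - 2) / 8.0;  // assumed linear ramp
}

double mmcTimeStep(const int *kNR, int nSubmodules)
{
    double dt = 500e-9;
    for (int i = 0; i < nSubmodules; ++i) {
        double d = dtFromIterations(kNR[i]);
        if (d < dt) dt = d;        // the minimum over submodules governs the MMC
    }
    return dt;
}
```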

6.2.4 VTS-Based MMC GPU Kernel Design

The GPU program of the NBM-based MMC with the N-R iteration VTS scheme is specified in Fig. 6.3 as an example. The outputs of all functions—regardless of device or global—are stored in the GPU global memory so that they can be accessed by other kernels or device functions. The kernels are designed according to the number of functions the partitioned MMC has. Among them, the SM kernel is the most complex part. The IGBT/diode model is programmed as a GPU device function which can be called directly by the SM kernel. Their outputs are organized according to (6.15) and (6.16). The N-R iteration of the matrix equation (6.7) repeats until all nodal voltages converge, and the final iteration count K_NRi of that submodule is stored in global memory so that it can be read by


Fig. 6.3 Nonlinear behavioral SM kernel with VTS scheme


Fig. 6.4 GPU SIMT architecture for VTS simulation

the VTS module, which first extracts the maximum iteration number K_NR and then produces a proper time increment for the next calculation. It is noticed that not all threads launched by the same kernel execute exactly identical instructions, e.g., the number of N-R iterations conducted by the SM kernel varies among SMs; therefore, synchronization of all threads is implemented at the end. An NVIDIA GPU with a compute capability of 3.5 or higher allows dynamic parallelism to be utilized for time-domain circuit simulation, as Fig. 6.4 shows. The EMT program is first initiated on the CPU, and since variables defined there cannot be accessed by the GPU directly, they are copied to the global memory via the PCIe 3.0 interface. A GPU kernel activated by the CPU is also able to launch its own kernels. For simulation of the MTDC grid, the entire system is programmed as kernel0, and within it, kernel1–kernelN represent various circuit components. When any of them is launched with a blocks each containing b threads, exactly a × b physical components are calculated concurrently. In addition, device functions are also introduced following the kernels to deal with variable time-stepping. At the end of the simulation, all needed data are copied back to the CPU via the PCIe bus for analysis.
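The dynamic-parallelism layout of Fig. 6.4 can be sketched as follows; the kernel names, sizes, and compilation line are assumptions, and the component kernel bodies are left as placeholders.

```cuda
// Sketch of a parent "system" kernel launching component kernels from the device.
// Compile (assumed): nvcc -arch=sm_60 -rdc=true vts_dp.cu
#include <cuda_runtime.h>
#include <cstdio>

__global__ void mmcKernel(int nSM)        // component kernel: one thread per SM
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nSM) { /* solve one submodule for this time-step */ }
}

__global__ void lineKernel(int nLines)    // component kernel: transmission lines
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nLines) { /* update travelling-wave history */ }
}

__global__ void systemKernel(int nSM, int nLines)   // plays the role of kernel0
{
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        // Child grids launched into the same stream execute in order.
        mmcKernel<<<(nSM + 255) / 256, 256>>>(nSM);
        lineKernel<<<(nLines + 255) / 256, 256>>>(nLines);
    }
}

int main()
{
    const int nSM = 12288, nLines = 8;
    double t = 0.0, tEnd = 1.0, dt = 20e-6;
    while (t < tEnd) {                    // the CPU only steers the time loop
        systemKernel<<<1, 1>>>(nSM, nLines);
        cudaDeviceSynchronize();          // parent completes only after its children
        t += dt;
    }
    std::printf("done at t = %g s\n", t);
    return 0;
}
```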

6.2.5 VTS Simulation Results and Validation

6.2.5.1 System Setup

Figure 6.5a shows the testing system with two HVDC links integrated with offshore wind farms (OWFs); by connecting them, e.g., between buses B3 and B4, a four-terminal DC system is formed. MMC1 and MMC2 are rectifiers that provide a stable AC voltage for the OWFs while simultaneously receiving energy. MMC3 and MMC4 operate as inverters regulating the converter-side DC voltages Vdco3 and Vdco4, respectively, as shown by the general control scheme in the d-q frame for the converter stations in Fig. 6.5b. After the inverse Park's transformation, the control signals v_abc are decoupled and sent to the individual MMC inner-loop controllers using the phase-shift control (PSC) strategy. The angle reference θ is derived from the AC grid on the inverter side, while it is an integral of the targeted frequency on the rectifier side. To curb a rapid current rise in case of a DC fault, a DC inductor Z_L is installed at each pole between the MMC and the DC bus connecting the transmission line, which adopts Bergeron's traveling wave model [105], and transformers are required on the rectifier side for wind energy integration. Each OWF is modeled as an aggregation of 100 doubly-fed induction generators (DFIGs). Once the line TL3 connects both HVDC links, the power exchange between them can be estimated, under the circumstance that both inverters have the same DC voltage reference V*_dc, as the current I_exc, which is approximately

I_exc = 2 Z_L (I_dc1 − I_dc2) / (Z_TL + 4 Z_L),    (6.17)

where Z_TL is the line impedance and I_dc1 and I_dc2 are the currents on the neighboring lines TL1 and TL2.

As can be seen from Fig. 6.5a, when applying the hybrid FTS-VTS scheme to the MTDC grid, the system-level models, including the transmission line and the OWF, are calculated with a fixed time-step of 20 μs, and the MMCs, each of which adopts an individual VTS scheme, fall within a time-step range of 10 ns to 500 ns. To test the performance of point-to-point HVDC, MMC1 has the option of connecting to a stiff AC grid, and in this case the overall converter station is taken as the subsystem VTS1. The VTS simulations are conducted on both the CPU and GPU under the 64-bit Windows® 10 operating system on the 2.2 GHz Intel® Xeon E5-2698 v4 CPU with 192 GB RAM. The device-level and system-level results are validated by SaberRD® and PSCAD™/EMTDC™, respectively. The former commercial tool has its own inherent VTS scheme, and the same maximum time-step of 500 ns is set, while the latter, as a system-level EMT-type solver, can use a much larger fixed time-step, and therefore a typical value of 20 μs is chosen.

6.2.5.2 VTS in Device- and System-Level Simulation on CPU

In Fig. 6.6, all three types of VTS schemes are tested in simulating a nonlinear single-phase nine-level MMC fed with an 8 kV DC bus voltage and switched at 1 kHz.


Fig. 6.5 MMC-based MTDC grid with wind farm integration: (a) system configuration and (b) MMC controller

For the convenience of results validation by SaberRD® , the IGBT module BSM 300GA 160D is selected, and the maximum time-step is 500 ns. The output voltages are shown on the left, which are virtually the same, albeit the time-stepping schemes are different. The time-step variation in a zoomed 0.5 ms segment is shown on the right. As can be seen, the three schemes yield different time-steps, but all of them give an apparent regulation in the time-step under different stages. Under steady state, the LTE and N-R iteration count give the maximum time-step, while the eventbased criterion can hardly produce a fixed 500 ns. In contrast, the time-step changes dramatically under transient states, when it repeatedly reaches the minimum 10 ns to keep the simulation convergent and capture the transients accurately. The efficiency of the schemes in computing low-level MMCs for a 100 ms duration by CPU is summarized in Table 6.1. With circuit partitioning, MMCs having more than nine voltage levels can be computed, which SaberRD® is unable to achieve. The N-R iteration method has the highest efficiency, gaining a speedup Sp1 and Sp2 of 18 and 16 times over SaberRD® and FTS for 5-L MMC, respectively. In Fig. 6.7, device-level results are given from the 9-L MMC with its time-step being controlled by N-R iteration count. Figure 6.7a, b gives the IGBT turn-on and turn-off waveforms, which show that the density of points is higher during the transient stage, and it is also varying, meaning the MMC is computed at a variable frequency. The diode reverse recovery waveforms in Fig. 6.7c also demonstrate the same phenomenon. The power loss eventually induces junction temperature



Fig. 6.6 VTS schemes for nonlinear MMC simulation (left, output voltage; right, time-steps in zoomed-in area): (a) SaberRD® results, (b) event-correlated criterion, (c) LTE, and (d) N-R iteration count

variation, as shown in Fig. 6.7d. The temperature of the lower IGBT/diode Tvj2 surges to over 100 °C immediately after the converter is started, but it is still within the normal operating region. On the other hand, the temperature of the upper IGBT/diode Tvj1 is much lower, and finally they all settle at around 30 °C. SaberRD® verifies these results by showing identical waveforms. Meanwhile, the LTE criterion is suitable for accelerating MMC-HVDC systems employing the most prevalent TSSM. With a relative error of the arm currents below 5% and a time-step range of 1 μs to 30 μs, speedups of around 30 are gained in Table 6.2 for the two types of MMCs.


Table 6.1 Comparison of VTS schemes' efficiency on CPU

          Execution time (s)                                    Speedup
MMC level | SaberRD | FTS  | Event | LTE   | N-R   | Sp1 (SaberRD/N-R) | Sp2 (FTS/N-R)
5-L       | 247     | 218  | 28.1  | 29.3  | 13.8  | 18                | 16
7-L       | 367     | 303  | 44.6  | 40.3  | 20.0  | 18                | 15
9-L       | 608     | 557  | 65.1  | 50.4  | 42.5  | 14                | 13
17-L      | −       | 653  | 183.6 | 132.9 | 60.1  | –                 | 11
33-L      | −       | 1102 | 555.2 | 305.3 | 142.6 | –                 | 8

Fig. 6.7 IGBT/diode nonlinear behavioral model VTS control: (a) IGBT turn-on, (b) IGBT turnoff, (c) diode reverse recovery, and (d) junction temperatures

6.2.5.3 Nonlinear MTDC System Preview on GPU

The VTS-NBM MMC model is also applied in system studies of the ±100 kV four-terminal DC grid given in Fig. 6.5a, and the parameters are listed in the Appendix. In Fig. 6.8, results of a long-term pole-pole fault with a resistance of 1 Ω occurring at t = 5 s in HVDC Link1 are given. To show the response of a typical point-to-point HVDC system as well as the importance of device-level switch models in determining the fault current, the DC line TL3 is disconnected, and the AC sides of both converters are supported by a stiff grid with a line-line voltage of 135 kV. Immediately after detecting the fault, all IGBTs are blocked. However, with the HBSM topology, as given in Fig. 6.8a, the DC system still interacts with the AC grid,

Table 6.2 TSSM-based HVDC system speedups with 1 s duration

          CPU HBSM texe (s)            CPU FBSM texe (s)
MMC level | FTS   | VTS  | Sp3 | FTS   | VTS   | Sp4
51-L      | 23.2  | 0.91 | 25  | 80.3  | 2.80  | 29
101-L     | 41.6  | 1.53 | 27  | 155.3 | 5.50  | 28
201-L     | 77.1  | 3.67 | 21  | 310.2 | 10.63 | 29
401-L     | 149.4 | 5.33 | 28  | 628.7 | 21.03 | 29

because the freewheeling diodes operate as a rectifier. Thus, a residual line-line voltage of 25 kV is observed, and also a DC current of 9.4 kA. The fact that the residual current depends on the resistance of the switch leads to various values in PSCAD™/EMTDC™, while with the NBM the current is definitive. As can be seen, since a fixed on-state resistance of 0.4 mΩ yields results closer to that of the nonlinear IGBT/diode pair, it is set as the default in the system-level simulation tool. In Fig. 6.8b, the blocking capability of the FBSM-MMC enables the DC line-line voltages and currents to eventually remain at 0. PSCAD™/EMTDC™ shows similar results, where a distinct fixed switch resistance leads to slightly different values. The IGBT junction temperatures during the protection process are given in Fig. 6.8c. With the HBSM, the upper IGBT junction temperature Tvj1 drops after it is blocked; however, the lower IGBT witnesses a dramatic rise in junction temperature Tvj2 to over 140 °C since its freewheeling diode is conducting a large current, meaning other protection measures are required. On the other hand, the IGBT junction temperatures in the FBSM are low since the submodule is able to block the fault completely.

As a comparison, a DC fault in the HBSM-MMC-based HVDC Link1 with the rectifier supported by OWF1 was also tested, and the results are shown in Fig. 6.9. Before the fault, the AC-side voltage of MMC1 has a peak phase voltage of around 90 kV, which fits the control target V*_gd, as given in Fig. 6.9a. Then, it reduces to around 0 after the MMC and the converters within the DFIGs block following the DC fault at t = 5 s. On the DC side, the rectifier has a slight margin over its counterpart in voltage under normal operation. However, once both MMCs are blocked after the fault, their relationship is reversed: the voltage on the DC side of MMC3 turns out to be higher since it is operating as a rectifier, while MMC1 loses the stiff grid voltage support on its AC side under this scenario, as shown in Fig. 6.9b. Figure 6.9c shows the DC currents. Idc3 rises to around 10 kA after the fault, while Idc1 decreases to 0 due to the absence of a stiff AC grid. These results are validated by PSCAD™/EMTDC™ simulation, which produces identical trends in the waveforms.

In Fig. 6.10, the impact of wind speed on the MTDC system is shown. Starting at t = 12 s, the wind speed at OWF1 rises linearly from 8 m/s to 11 m/s in 1 s, while the reverse is true for OWF2. It is observed that the voltage at Grid 1 maintains stability due to the proper functioning of MMC1, and so does the voltage at OWF2, both of which are close to sinusoidal waveforms. Due to the stronger wind, the current Igrid1 fed by OWF1 more than doubles, while its counterpart has the opposite trend. As for a single wind turbine, the power of a DFIG at OWF1 increases from


Fig. 6.8 HVDC Link1 supported by stiff AC grid pole-pole fault ((a, b) left, GPU implementation; right, PSCAD™ /EMTDC™ ): (a) HBSM-MMC response, (b) FBSM-MMC response, and (c) IGBT junction temperatures of HBSM (left) and FBSM (right)


Fig. 6.9 OWF-sustained HVDC Link1 pole-pole fault response by HBSM-MMC (left, GPU implementation; right, PSCAD™ /EMTDC™ ): (a) Grid 1 AC voltage, (b) DC voltages, and (c) DC currents

approximately 750 kW to 2.0 MW, while those at OWF2 show exactly the opposite output. The variations in wind speed also affect the power flow in the DC grid and the DC line voltage as well. The power delivered by MMC1 and MMC2 has the same trend as a single DFIG in the respective OWFs, other than the fact that the values are 100 times larger, and the power received by the two inverters also exchanges position. Meanwhile, minor perturbations are witnessed in the DC voltages, but due to the voltage regulation function of the inverter stations, they recover immediately. The DC line currents are also given, which show that on line TL3 the current I_exc is not 0 and its numerical relationship with I_dc1 and I_dc2 fits well with (6.17).

Table 6.3 summarizes the execution times of different NBM-based MMCs with the two time-stepping schemes on both processors, where the CPU execution times with FTS are estimated by running a shorter period using 10 ns as the time-step. Tested under a switching frequency of 200 Hz, the NR-based VTS scheme helps the CPU achieve over 30 times speedup for both HBSM and FBSM MMCs. The GPU


Fig. 6.10 MTDC system dynamics with wind farms (left, GPU implementation; right, PSCAD™ /EMTDC™ )

Table 6.3 Execution time (texe) of a 4-T NBM-based DC system for 0.1 s duration

HBSM:
MMC level | CPU FTS (s) | CPU VTS (s) | Sp5 | GPU FTS (s) | GPU VTS (s) | Sp6 | Sp7
51-L      | 11,539      | 349         | 33  | 554         | 23.2        | 24  | 497
101-L     | 24,442      | 821         | 30  | 550         | 22.1        | 25  | 1106
201-L     | 50,484      | 1574        | 32  | 548         | 26.4        | 21  | 1912
401-L     | 102,084     | 3177        | 32  | 574         | 75.0        | 7.7 | 1361

FBSM:
MMC level | CPU FTS (s) | CPU VTS (s) | Sp8 | GPU FTS (s) | GPU VTS (s) | Sp9 | Sp10
51-L      | 25,471      | 571         | 45  | 1181        | 79          | 15  | 322
101-L     | 51,421      | 1138        | 45  | 1155        | 133         | 8.7 | 387
201-L     | 94,309      | 2201        | 43  | 1355        | 154         | 8.8 | 612
401-L     | 186,410     | 4409        | 42  | 1203        | 197         | 6.1 | 946

gains a speedup of 6 to 15 times with the FBSM-MMC and about 8 to 24 times for HBSM-MMCs with various voltage levels. Therefore, the VTS scheme implemented on the GPU is able to attain a dramatic speedup over the CPU with the FTS scheme, e.g., for the two types of MMCs, the speedups Sp7 and Sp10 reach approximately 2000 and 1000 times, respectively. It is noticed that with an increase of the MMC voltage level, the speedup that the VTS method gains decreases, especially in the case of the HBSM-based MMC on the GPU. The asynchronous switching behaviors of the IGBTs account for this phenomenon: the occurrence of a switching process in an IGBT of any submodule brings down the overall time-step of the entire MMC, since the ultimate time-step is determined by the minimum value; therefore, a higher-level MMC has a higher chance of a switching process being underway somewhere, which forces the adoption of a small time-step and consequently leads to a lower speedup.

6.3 Heterogeneous CPU-GPU Computing

The arbitrary composition of AC-DC grids under EMT and electromechanical dynamic studies implies that maximum simulation efficiency cannot always be achieved by a single type of processor: the GPU only shows an advantage in computing large, homogeneous electric power systems, while the CPU is better at handling systems of small scale or dominant inhomogeneity. Therefore, taking the photovoltaic (PV) system as an example, the heterogeneous CPU-GPU co-simulation [214] is introduced in this section.

Since a PV plant spans a wide region, the irradiation every panel receives can vary significantly, and it is quite common to encounter partial shading [215–217]. Theoretically, the total output power of a PV station can be calculated precisely if each PV panel is taken individually. However, computing hundreds of thousands or even millions of PV models using the Newton-Raphson iteration method for solving the nonlinear transcendental i–v equation poses a remarkable challenge to the capacity of the CPU [218]. Consequently, the GPU takes over tasks that would otherwise be a computational issue for the CPU.


Fig. 6.11 PV unit model: (a) equivalent circuit and (b) EMT model

6.3.1 Detailed Photovoltaic System EMT Model

Lumping all solar panels together is a general method used in system-level studies; however, it hinders a thorough investigation of the dynamic response of the system under study. Starting from a basic PV unit, a flexible model combining detailed arrays and an aggregated unit representing a massive number of PV modules is introduced first.

6.3.1.1 Basic PV Unit

Figure 6.11a shows the single-diode equivalent circuit of a basic PV unit whose photoelectric effect is represented by the irradiance-dependent current source [219]

I_ph = (S_irr / S*_irr) · I*_ph · (1 + α_T (T_K − T*_K)),    (6.18)

where variables with the superscript * are references, S_irr is the solar irradiance, α_T the temperature coefficient, and T_K the absolute temperature. In addition, the model also comprises the shunt and series resistors R_p and R_s, respectively, and the antiparallel diode D, which has the following exponential i–v characteristic:

i_D(t) = I_s · (e^{v_D(t)/V_T} − 1),    (6.19)

where I_s is the saturation current and V_T denotes the thermal voltage. After discretization using partial derivatives for EMT computation, the nonlinear diode yields an equivalent conductance G_D and current I_Deq, expressed as

G_D = ∂i_D(t)/∂v_D(t) = (I_s/V_T) · e^{v_D(t)/V_T},    (6.20)

I_Deq = i_D(t) − G_D · v_D(t).    (6.21)

Therefore, with all of its components represented by current sources and conductances or resistors, the PV unit can be converted into the most concise two-node Norton equivalent circuit, as shown in Fig. 6.11b, where

J_PVeq = (I_ph − I_Deq) / (G_D R_s + R_s R_p^{-1} + 1),    (6.22)

G_PV = (G_D + R_p^{-1}) / (G_D R_s + R_s R_p^{-1} + 1).    (6.23)
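Equations (6.19)–(6.23) translate directly into a small per-panel routine; the parameter struct and function name below are assumptions for illustration.

```cuda
#include <math.h>

struct PvUnit { double Is, VT, Rs, Rp, Iph, vD; };

// Linearize the diode at the present voltage and fold the result into the
// two-node Norton equivalent G_PV, J_PVeq of Eqs. (6.22)-(6.23).
__host__ __device__ void pvNorton(const PvUnit &p, double *GPV, double *JPVeq)
{
    double iD   = p.Is * (exp(p.vD / p.VT) - 1.0);        // Eq. (6.19)
    double GD   = (p.Is / p.VT) * exp(p.vD / p.VT);       // Eq. (6.20)
    double IDeq = iD - GD * p.vD;                         // Eq. (6.21)
    double den  = GD * p.Rs + p.Rs / p.Rp + 1.0;
    *JPVeq = (p.Iph - IDeq) / den;                        // Eq. (6.22)
    *GPV   = (GD + 1.0 / p.Rp) / den;                     // Eq. (6.23)
}
```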

6.3.1.2 Scalable PV Array Model

In large-scale PV plants, a substantial number of panels are arranged in an array in the centralized configuration in order to deliver sufficient energy to the inverter that the whole array connects to. For an arbitrary PV array with N_p parallel strings, each containing N_s series panels, as shown in Fig. 6.12a, the equivalent circuit in Fig. 6.11a is still applicable in describing its i–v characteristics, expressed by

i_PV(t) = N_p I_ph − N_p I_s · ( e^{(v_PV(t) + N_s N_p^{-1} R_s i_PV(t)) / (N_s V_T)} − 1 ) − G_p ( i_PV(t) R_s + N_p N_s^{-1} v_PV(t) ),    (6.24)

when the lumped model is adopted. This transcendental equation cannot be discretized directly in the same manner as (6.20) and (6.21). Therefore, using the same method, each component in the Thévenin or Norton equivalent circuit of the lumped PV model is calculated separately and then aggregated, leading to the equivalent conductance and current G_PVary and J_PVary with a form similar to (6.22) and (6.23). A major shortcoming of the lumped model is that the characteristics of PV panels subjected to various conditions—most notably the solar irradiance—cannot be revealed, for example, when the irradiance exhibits the normal distribution

f(S_irr | μ, σ²) = (1/(√(2π) σ)) · e^{−(S_irr − μ)² / (2σ²)},    (6.25)

where μ is the mean value of the distribution and σ denotes the standard deviation. Therefore, the performance of each PV panel needs to be considered to achieve the highest simulation fidelity, meaning an N_p × N_s array corresponds to an admittance matrix virtually four times larger in dimension undergoing the Newton-Raphson iteration, which makes the simulation extremely slow. The Norton equivalent circuit lays the foundation for a simple solution: based on (6.22) and (6.23), by first merging all N_s panels in every string and then the N_p branches, the equivalent circuit of an array can be derived as


Fig. 6.12 An arbitrary array of PV panels: (a) PV array of Np × Ns panels and (b) the scalable EMT model

G_PVary = \sum_{i=1}^{N_p} ( \sum_{j=1}^{N_s} G_PV^{-1}(i, j) )^{-1},    (6.26)

J_PVary = \sum_{i=1}^{N_p} [ ( \sum_{j=1}^{N_s} G_PV^{-1}(i, j) )^{-1} · \sum_{j=1}^{N_s} J_PV(i, j) G_PV^{-1}(i, j) ],    (6.27)

where i, together with j, indexes an arbitrary PV panel. Nevertheless, aggregating millions of PV panels in one or a few plants still constitutes a tremendous computational burden on the CPU, because (6.20)–(6.27), other than (6.24), need to be calculated repeatedly at every time instant. Noticing that it is not necessary to distinguish every panel from every other, as a large proportion of the panels normally operate under virtually the same conditions, a scalable PV array model is derived to utilize the low computational burden achieved by the lumped model while retaining as much individuality of the remaining panels as possible. In Fig. 6.12b, a flexible number of N_p1 × N_s panels in the N_p × N_s array are modeled in detail, and consequently, the lumped model is applied to the remaining N_p2 = N_p − N_p1 strings. As an outcome of this hybrid modeling method, the overall EMT model of a PV array can be obtained by summation:

G_PVary = G_PVary^{N_p2} + \sum_{i=1}^{N_p1} ( \sum_{j=1}^{N_s} G_PV^{-1}(i, j) )^{-1},    (6.28)

J_PVary = J_PVary^{N_p2} + \sum_{i=1}^{N_p1} [ ( \sum_{j=1}^{N_s} G_PV^{-1}(i, j) )^{-1} · \sum_{j=1}^{N_s} J_PV(i, j) G_PV^{-1}(i, j) ],    (6.29)

where G_PVary^{N_p2} and J_PVary^{N_p2} are the lumped equivalents of the remaining N_p2 strings.

The selection of N_p1 is determined primarily by the requirement on simulation accuracy, since a zero N_p1 means the scalable model degenerates into the purely lumped model, while more PV panels are depicted individually when N_p1 approaches N_p. Therefore, N_p1 is defined as a variable in the program to leave sufficient room for adjustment, so that a trade-off between simulation efficiency and the extent of information to be revealed can be made. The combination of lumped and discrete PV parts reduces the computational work on the processors; in the meantime, since both models share virtually identical equations, it is not necessary to distinguish them when parallel processing is carried out.
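A sketch of how (6.28)–(6.29) could be evaluated—one GPU thread per detailed string, followed by a host-side combination with the lumped remainder—is shown below; the data layout and kernel split are assumptions.

```cuda
// Aggregate the Np1 detailed strings of Ns panels (row-major layout assumed):
// series panels add resistances and Thevenin voltages; parallel strings add
// conductances and Norton currents, as in Eqs. (6.26)-(6.29).
__global__ void aggregateStrings(const double *GPV, const double *JPV,
                                 double *Gstr, double *Jstr, int Np1, int Ns)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= Np1) return;
    double sumRinv = 0.0, sumVth = 0.0;
    for (int j = 0; j < Ns; ++j) {
        double g = GPV[i * Ns + j];
        sumRinv += 1.0 / g;                   // series: resistances add
        sumVth  += JPV[i * Ns + j] / g;       // Thevenin voltages add
    }
    Gstr[i] = 1.0 / sumRinv;                  // string conductance
    Jstr[i] = sumVth / sumRinv;               // string Norton current
}

// Host-side combination with the lumped part (after copying Gstr, Jstr back).
void combineArray(const double *Gstr, const double *Jstr, int Np1,
                  double Glump, double Jlump, double *GPVary, double *JPVary)
{
    double G = Glump, J = Jlump;
    for (int i = 0; i < Np1; ++i) { G += Gstr[i]; J += Jstr[i]; }
    *GPVary = G; *JPVary = J;
}
```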

6.3.2 PV-Integrated AC-DC Grid

Figure 6.13 shows the integrated AC-DC grid comprising the following three parts: the AC grid based on the IEEE 39-bus system; the multi-terminal DC grid, where stations MMC5 and MMC6 are inverters while the other four are rectifiers; and the four PV farms, each having a capacity of 500 MW. Transient stability is of concern in the IEEE 39-bus system, and therefore dynamic simulation is conducted; in contrast, EMT simulation is required to reveal the exact behavior of the PV farms as well as the DC grid.

6.3.2.1 AC Grid

As introduced in Chap. 2, transient stability analysis of the AC grid is conducted based on a set of differential-algebraic equations given as (2.1) and (2.2), with an initialization set in (2.3) to enable the dynamic simulation to start properly. The differential equation describes the dynamics of the synchronous generators using the ninth-order model. It should be discretized prior to the solution, as it takes the form of

x(t) = x(t − Δt) + (Δt/2)·(f(x, V, t) + f(x, V, t − Δt)),    (6.30)

when the second-order trapezoidal rule is applied, where the vector x contains the nine generator states given in (2.26). As explained, the first two variables, the rotor angle and angular speed derivative, are used in the motion equation, the flux ψ is used to describe the rotor electrical circuit, and the voltages v1,2,3 appear in the excitation system.
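For reference, a generic trapezoidal update of the form (6.30) can be sketched as follows; the fixed-point handling of the implicit term and the omission of the network voltages V from the state function are simplifications for illustration.

```cuda
#include <vector>
#include <functional>

using Vec = std::vector<double>;

// One trapezoidal step x(t) = x(t-dt) + dt/2 * (f(x(t)) + f(x(t-dt))),
// iterating a few times on the implicit evaluation of f at the new state.
Vec trapezoidalStep(const Vec &x, double dt,
                    const std::function<Vec(const Vec &)> &f)
{
    Vec fOld = f(x), xNew = x;
    for (int it = 0; it < 10; ++it) {
        Vec fNew = f(xNew);
        for (size_t i = 0; i < x.size(); ++i)
            xNew[i] = x[i] + 0.5 * dt * (fNew[i] + fOld[i]);
    }
    return xNew;
}
```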


Fig. 6.13 IEEE 39-bus system integrated with DC grid connected to PV plants

Following the solution of the differential equation, the algebraic equation representing the network can also be solved in conjunction with the generator’s stator equations

\begin{bmatrix} I_m \\ I_r \end{bmatrix} = \begin{bmatrix} Y_{mm} & Y_{mr} \\ Y_{rm} & Y_{rr} \end{bmatrix} \begin{bmatrix} V_m \\ V_r \end{bmatrix},    (6.31)

where m is the number of synchronous generator nodes and r is the number of the remaining nodes in the network.

6.3.2.2 Multi-Terminal DC System

The DC grid terminals employ the modular multilevel converter, which has two prevalent models, i.e., the averaged-value model and the detailed equivalent model. The former type is preferred in power flow analysis for its simplicity, while the latter provides full details, especially in the case of DC faults, when the diode freewheeling effect cannot be revealed by its counterpart. It is time-consuming to simulate the MMC in full detail due to the repetitive computation of the remarkable number of submodules included for safe operation as well as the large admittance matrix it presents. When a half-bridge submodule is under normal operation, its terminal voltage can be found as a function of the upper switch gate signal V_g:

v_{SM,k}(t) = i_{SM,k}(t) · r_on + V_{g,k}(t) ∫_{t_1}^{t_2} (i_{SM,k}(t)/C_{SM,k}) dt,    (6.32)

where k represents an arbitrary submodule, ron is the on-state resistance of a power semiconductor switch, CSM is the submodule DC capacitance, and iSM,k equals the arm current iarm of the arm in which it is located. A methodology for avoiding a large admittance matrix is therefore available, since all the submodules can be taken as voltage sources of value vSM,k, which can be summed conveniently; in the meantime, the investigation of the submodule operating status becomes independent from the overall MMC circuit solution. Specifically, in EMT simulation, a detailed MMC arm takes the form

v_{arm}(t) = \sum_{k=1}^{N} v_{SM,k}(t) + i_{arm}(t) \cdot Z_{Lu,d} + 2 v^i_{Lu,d}(t),    (6.33)

where Z_{Lu,d} is the impedance of the arm inductor Lu,d and v^i_{Lu,d} is the incident pulse in the transmission-line modeling of an inductor [189]. It can be noticed that all submodules are excluded from the arm by v–i coupling to attain a low-dimension admittance matrix, which consequently improves the computational efficiency.

6.3.2.3 PV Plant

As shown in Fig. 6.14a, a large-scale PV plant normally has a capacity of dozens to hundreds of megawatts sustained by a considerable number of inverters. An inverter with a rated power of 1 MW is able to accommodate an array of up to 200 × 25


Fig. 6.14 Schematic of HVDC rectifier station AC side: (a) aggregation of PV inverters and (b) PV inverter controller

1STH-215-P PV panels manufactured by 1 SolTech INC®. Therefore, a group of PV plants with a total rating of thousands of megawatts comprises thousands of PV inverters, which meets the criterion for massive parallelism. Figure 6.14b shows that each PV inverter regulates its own DC voltage exerted on the PV array using maximum power point tracking, which calculates an optimized voltage that yields the maximum output power; this voltage is consequently taken as the reference in the d-q frame controller. Then, the inverter DC-side voltage is found as

V_{PV,DC} = \left( G_C + G_{PVary} \right)^{-1} \left( I_{PV,DC} + J_{PVary} + 2 v_C^{i}(t) \right),    (6.34)

where G_C and v_C^i are the conductance and incident pulse of the capacitor modeled by a lossless transmission line, and V_{PV,DC} and I_{PV,DC} are the PV inverter DC-side voltage and current, respectively. All PV inverter powers are then summed and taken as the input of a rectifier station in the DC grid. Following the solution of the PV inverter DC side, the internal node of the diode in the PV module is updated; its voltage v_D at simulation instant t takes the form

v_D(t) = -I_{PV,b} \cdot \left( G_D + R_p^{-1} \right)^{-1} + V_{PV,b},    (6.35)

Fig. 6.15 Interfaces for integrated AC-DC grid EMT-dynamic co-simulation

with the branch voltage and current expressed as

V_{PV,b} = \frac{I_{ph} - I_{Deq}}{G_D + R_p^{-1}},    (6.36)

I_{PV,b} = J_{PV,b} - V_{PV,DC} \cdot G_{PV,b}.    (6.37)

It should be noted that the branch variables with subscript b may refer to either the lumped or the discrete part of the scalable PV array model.
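For illustration, the per-inverter DC-side update of (6.34)–(6.37) can be collected into one routine. The structure and names below are assumptions made for brevity, the quantities are treated as scalars, and the fragment is a sketch of the computation order rather than the simulator's actual code.

// Sketch of one PV inverter DC-side update, following (6.34)-(6.37).
struct PvBranch {
  double GC, vCi;        // DC capacitor conductance and incident pulse (lossless-line model)
  double GPVary, JPVary; // aggregated PV array conductance and current injection
  double GD, Rp;         // diode equivalent conductance and parallel resistance
  double Iph, IDeq;      // photocurrent and equivalent diode current
  double JPVb, GPVb;     // branch history current and conductance
};

__host__ __device__
void updatePvDcSide(const PvBranch &b, double IPVdc, double &VPVdc, double &vD) {
  VPVdc = (IPVdc + b.JPVary + 2.0 * b.vCi) / (b.GC + b.GPVary);   // (6.34)
  const double VPVb = (b.Iph - b.IDeq) / (b.GD + 1.0 / b.Rp);     // (6.36)
  const double IPVb = b.JPVb - VPVdc * b.GPVb;                    // (6.37)
  vD = -IPVb / (b.GD + 1.0 / b.Rp) + VPVb;                        // (6.35)
}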

6.3.2.4 EMT-Dynamic Co-simulation Interfaces

It would be impractical to treat the AC system undergoing transient stability analysis and its EMT-simulated DC counterpart as one computing object, since the two formulations are not directly compatible. Thus, a power-voltage-based interface is introduced so that the two transient simulations of one integral system can proceed together. As illustrated in Fig. 6.15, since power flow is the principal variable in dynamic simulation, the external DC grid can be taken as a load on the AC bus it connects to and converted into an equivalent admittance by

Y_{DC} = \frac{P_{DC} + j Q_{DC}}{V_{Bus}^2},    (6.38)

where PDC and QDC are the MMC-based inverter AC-side active and reactive powers obtained in the EMT simulation, respectively, and VBus denotes the bus voltage amplitude. Consequently, by taking the DC grid as the admittance YDC, the AC grid constitutes an independent subsystem that undergoes solely transient stability analysis.


On the other hand, the solution of the algebraic equation in the dynamic simulation yields the AC bus voltage Vbus in conjunction with its phase angle θ as a complex variable in the d-q frame, which is, in fact, the input of the inverter station in the EMT simulation. As a result, the two types of simulations conducted separately become interactive, and the co-simulation is realized by exchanging the complex power-voltage signals on the two sides of the AC-DC grid. Although both the MTDC grid and the PV plant run the same type of simulation, linking the MMC with hundreds of PV inverters would lead to a huge electrical system. A pair of coupled voltage-current sources is therefore inserted between them to avoid a heavy computational burden. Since the MMC controls the instantaneous AC voltage Vac, the PV inverters are connected to the voltage source Vac/Nc, where Nc is a coefficient reflecting the turns ratio between the PV inverter transformer and the MMC converter transformer. Each PV inverter then constitutes an independent circuit that can be solved without the participation of its counterparts or the MTDC grid. The instantaneous currents obtained from these solutions are added together and sent to the MMC AC-side current source to enable the DC grid solution, which in turn prepares the AC voltage for the next time-step.
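As a rough illustration of the two interfaces, the fragment below converts the MMC AC-side power into the equivalent admittance of (6.38) and sums the PV inverter currents into the MMC AC-side source; the names and the plain summation are assumptions, not the simulator's actual API.

#include <complex>
#include <vector>

// S-V coupling (dynamic side): the DC grid appears to the AC network as an
// equivalent admittance derived from the MMC AC-side power, following (6.38).
std::complex<double> dcGridAsAdmittance(double Pdc, double Qdc, double Vbus) {
  return std::complex<double>(Pdc, Qdc) / (Vbus * Vbus);
}

// V-I coupling (EMT side): each PV inverter sees the voltage source Vac/Nc and is
// solved independently; its AC current is accumulated into the MMC AC-side source.
double mmcAcSideCurrent(const std::vector<double> &iacPerInverter) {
  double sum = 0.0;
  for (double i : iacPerInverter) sum += i;
  return sum;
}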

6.3.3 Heterogeneous Computing Framework

6.3.3.1 CPU-GPU Program Architecture

Extensive parallelism exists in both the AC and DC grids, especially in the four PV plants of the test system in Fig. 6.13. For example, the AC grid has 10 synchronous generators and 39 buses; the DC grid, depending on the model, may contain thousands of submodules, and it connects to millions of PV panels. While the handling of the PV plants accounts for the major computational burden of simulating the integrated power system, the remaining equipment also contributes significantly. The quantity of each circuit component type determines where, and subsequently how, it is processed. The four PV stations and the detailed MMC model are predominantly homogeneous, and therefore both are allocated to the GPU for parallel processing under the SIMT paradigm, which ensures a particularly efficient implementation. In contrast, the numbers of buses and synchronous generators in the AC grid fall short of massive parallelism, as does the DC grid if the averaged-value model (AVM) is adopted, making the CPU the better option. Therefore, heterogeneous computing of the hybrid AC-DC grid is based on the CPU and GPU architecture, termed host and device. On the device, separating the two PV model types of the scalable array would jeopardize the parallelism, as both may exist in large quantities. The tolerance of the SIMT paradigm to slight differences between the lumped model and the discrete PV model enables their computation by one GPU global function or, terminologically, the same kernel, using the programming language CUDA C++ [208] or CUDA C [183]. Various kernels, distinguished by their thread counts, are designed to describe the PV system and the MMC. For example, in Fig. 6.16, processing all PV panels involves six stages. The first kernel initializes the model parameters, followed by the

Fig. 6.16 CPU-GPU co-simulation program architecture

second kernel, which deals with model discretization. Since both processes apply to every PV panel, whether in lumped or discrete form, both kernels launch a total of NPV × (Np1 · Ns + 1) threads, where NPV denotes the total number of PV inverters. The third kernel is specifically designed for the discrete PV panels, which need to be aggregated, and therefore its total thread number is NPV × Np1. The following two kernels deal with calculations at the level of whole PV inverters, so the thread number shrinks to NPV temporarily before being restored in the last kernel, which updates the circuit information for the next time-step. As can be seen, data exchange between the kernels occurs frequently, and therefore all these variables are stored in global memory to enable convenient access. Although numerous threads are invoked by the same kernel, the content of each thread, including the PV parameters, can differ, and this individuality is achieved by proper identification of the thread. Consequently, the compatibility of various PV module types and their parameters


extends the parallelism to the maximum possible level. In this context, selecting either one or a few PV module types does not affect the computing speed, since the only difference between these two options is the parameter initialization, which occurs only once, and the burden it imposes on the processors is negligible compared with that of the bulk program of the integrated AC-DC grid. Nevertheless, to reflect geographical and atmospheric impacts on the PV farms and the power system, different PV parameters, including irradiance and temperature, are assigned at the initial stage of the co-simulation. Similarly, the detailed MMC model is computed in five steps. The MMC main circuit connecting to the AC grid receives the bus voltage from the dynamic simulation on the CPU, and following its solution, the AC-side current is derived and sent to the d-q frame controller. The phase-shift control strategy is adopted for MMC internal submodule voltage regulation. The averaging (Ave) control, as its first part, seeks the desired mean of the DC capacitor voltages in a phase, while the second part, the balancing control (BC), is in charge of balancing all submodule voltages and has one instance per submodule, so the corresponding kernel invokes a massive number of threads. The AC grid, containing four major parts, is on the other hand processed in a strictly sequential manner on a single CPU core. The solution of the network equations along with the differential equation yields all bus voltages, among which those connected to the HVDC stations are sent to the GPU using CUDA memory copy. Similarly, the power is returned to the host for calculating the admittance matrix of the AC network.
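For illustration only, the launch of the six PV stages described above might look like the following CUDA C++ fragment; the kernel names, argument lists, and the 256-thread block size are assumptions rather than the actual implementation, and the kernels themselves are assumed to be defined elsewhere.

// Thread counts of the six PV stages (NPV inverters, Np1 discrete strings of Ns panels each).
int nPanel = NPV * (Np1 * Ns + 1);   // kernels 1, 2, 6: every lumped or discrete panel
int nDisc  = NPV * Np1;              // kernel 3: discrete panels to be aggregated
int nInv   = NPV;                    // kernels 4, 5: one thread per PV inverter

const int BLK = 256;                 // assumed block size
auto nBlk = [=](int n) { return (n + BLK - 1) / BLK; };

pvKernel1<<<nBlk(nPanel), BLK>>>(dPanel);           // parameter initialization
pvKernel2<<<nBlk(nPanel), BLK>>>(dPanel);           // model discretization
pvKernel3<<<nBlk(nDisc),  BLK>>>(dPanel, dString);  // aggregate the discrete strings
pvKernel4<<<nBlk(nInv),   BLK>>>(dString, dInv);    // inverter DC-side solution
pvKernel5<<<nBlk(nInv),   BLK>>>(dInv, dPower);     // per-inverter output power
pvKernel6<<<nBlk(nPanel), BLK>>>(dPanel, dInv);     // update for the next time-step
cudaDeviceSynchronize();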

6.3.3.2 Co-simulation Implementation

Sharing the computational burden with the CPU, which deals with part of the system, distinguishes the introduced heterogeneous computing from previous pure-CPU or pure-GPU execution architectures. Nevertheless, the initialization still takes place on the host, where all variables for both processors are defined. Once the co-simulation starts, all CPU functions and GPU kernels are executed in a largely sequential manner; e.g., the kernels of the PV plants are invoked one after another, followed by those of the MMC, and later the CPU functions on the host, completing one cycle before returning to the PV kernels on the GPU, as given in Fig. 6.17. The two processors, as can be noticed, do not interact unless information is exchanged using the CUDA command cudaMemcpy. However, the two programs share a common timeline, since the co-simulation runs on the host-device framework and the GPU kernels can access the time instant defined on the CPU as an incremental value without memory copy. Meanwhile, the fact that the AC grid dynamic simulation tolerates a much larger time-step than the EMT simulation enables a multi-rate implementation; i.e., the program on the CPU runs at a time-step of 10 ms, 200 times larger than that of the EMT simulation. Therefore, data exchange between the two processors only takes place when the CPU program executes. After the simulation reaches the end, the variables of interest are gathered for system analysis.
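A minimal sketch of this multi-rate host-device loop is given below, assuming an EMT time-step of 10 ms/200 = 50 µs; the stepping functions, buffer sizes, and names are placeholders, and only cudaMemcpy and cudaMalloc are taken from the actual CUDA runtime API.

// Multi-rate co-simulation loop (sketch). emtStepOnGpu()/dynStepOnCpu() are placeholders.
const int    RATIO = 200;            // dynamic step / EMT step
const double dtDyn = 10e-3;          // 10 ms dynamic (CPU) time-step
const double dtEmt = dtDyn / RATIO;  // 50 us EMT (GPU) time-step
const int    NIF   = 2;              // number of AC-DC interface buses (assumed)

double hPQ[2 * NIF], hBusV[2 * NIF]; // host-side P/Q and bus-voltage buffers
double *dPQ, *dBusV;                 // device-side counterparts
cudaMalloc(&dPQ,   sizeof(hPQ));
cudaMalloc(&dBusV, sizeof(hBusV));

for (long n = 0; n * dtEmt < tMax; ++n) {
  emtStepOnGpu(n * dtEmt, dBusV, dPQ);   // PV and MMC kernels, one EMT step

  if (n % RATIO == 0) {                  // dynamic time-step boundary
    cudaMemcpy(hPQ, dPQ, sizeof(hPQ), cudaMemcpyDeviceToHost);       // MMC P, Q to host
    dynStepOnCpu(n * dtEmt, hPQ, hBusV); // AC grid transient stability step
    cudaMemcpy(dBusV, hBusV, sizeof(hBusV), cudaMemcpyHostToDevice); // bus voltages to device
  }
}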

Fig. 6.17 Heterogeneous CPU-GPU computation implementation

It can be seen that a generic heterogeneous computing framework is designed for the AC-DC grid EMT-dynamic co-simulation. The processing algorithm, along with the program architecture, is not reliant on any specific type of GPU or CPU, since all the functions involved in the design are fundamental to a variety of CPU-GPU platforms.

6.3.4 EMT-Dynamic Co-simulation Results

The CPU-GPU co-simulation of the AC-DC grid involving detailed massive PV panels is conducted on a 64-bit operating system with 80 Intel® Xeon® E5-2698 v4 CPU cores, 192 GB memory, and an NVIDIA® Tesla® V100 GPU. Commercial simulation tools, including MATLAB/Simulink® and DSATools™ for EMT and dynamic simulation, respectively, are used for validation. An experimentally verified PV model in the former tool [220, 221] is adopted for comparison with the scalable model, while the latter tool has been extensively used and widely relied on for power system planning and design.


Fig. 6.18 Basic PV module i–v and P–v characteristics with (a) different cell temperatures and S = 1000 W/m² and (b) various irradiance levels and Tc = 25 °C

6.3.4.1 PV Array

The accuracy of the scalable PV model is tested using two configurations, i.e., 1 × 1 and 2 × 25, considering that the computational capability of the off-line EMT-type solver would soon be overwhelmed if the array kept expanding. Figure 6.18a shows that, when a single PV module is under constant irradiance, a lower temperature leads to a larger current at the maximum power point and consequently a higher power; the i–v relationships under different irradiance levels are quantified in Fig. 6.18b, which also demonstrates that the model performs identically to the Simulink® model. In the 2 × 25 array, one string of 25 series panels is represented by the lumped PV model with a constant irradiance of 1000 W/m² and a temperature of 25 °C, while the other string is formed by cascaded discrete modules to reveal the impact of environmental conditions. Figure 6.19a shows the i–v characteristics of the array when the 25 discrete panels have a temperature distribution from 25 °C to 49 °C with a linear increment of 1 °C. It indicates that the output current, and consequently the

Fig. 6.19 Performance of two PV branches under various (a) cell temperatures and S = 1000 W/m², (b) irradiance levels and Tc = 25 °C, and (c) temperatures and irradiance levels

power, of the overall two-string array equals neither the upper nor the lower limit obtained when all panels have a uniform temperature of 25 °C or 49 °C. In Fig. 6.19b, the 25 discrete PV panels are subjected to irradiance from 520 to 1000 W/m² with an increment of


Fig. 6.20 Performance of two PV branches under extreme conditions: (a) low-temperature operation and (b) low-irradiance operation

20 W/m² per panel. It shows that the actual output power is close to the lower limit obtained when all panels in the second branch receive 520 W/m²; nevertheless, the latter can by no means represent the former when both irradiance and temperature vary, as proven in Fig. 6.19c, which exhibits a minimum gap of 0.6 kW, a difference that is amplified significantly once hundreds of thousands of branches are taken into consideration in AC grid transient stability analysis. The scalable PV model performance under extreme conditions was also tested for accuracy validation. In Fig. 6.20a, the same 2 × 25 PV array is subjected to two low ambient temperatures, i.e., −10 °C and −20 °C. The maximum output power remains below 300 W when the irradiance is 20 W/m². A dramatic increase of two orders of magnitude is observed when one of the branches receives solar irradiance ranging from 520 to 1000 W/m². Figure 6.20b shows that the adopted model still behaves the same as the off-line simulation tool under low irradiance and normal temperature, and once again the low output powers prove that temperature variation has a less significant impact on the total output of a PV array than irradiance. The exact match to the off-line simulation tool’s outcomes


Table 6.4 Simulation speed comparison between GPU and CPU

Np1:Np2    Module no.     GPU       CPU        MC-CPU     Sp1   Sp2
1:199      4.48 × 10^5    18.7 s    224 s      34.7 s     12    1.9
10:190     8.8 × 10^5     47.3 s    2320 s     373 s      49    7.9
50:150     2.8 × 10^6     223 s     11,502 s   2231 s     52    10
100:100    5.2 × 10^6     482 s     22,642 s   4226 s     47    8.8
150:50     7.6 × 10^6     946 s     31,486 s   10,386 s   33    11
199:1      9.952 × 10^6   1139 s    39,292 s   13,800 s   34    12

in all the above tests indicates that the scalable PV model has identical performance to a real module, meaning it can be used for simulating a much larger system, whereas commercial off-line simulation tools fall short of doing so due to the extraordinary computational burden. The performance of the aforementioned processors in carrying out a 10 s simulation of 2000 MW PV plants with a massive number of panels is summarized in Table 6.4. Even when the ratio of discrete to lumped PV branches is 1:199, i.e., there are 448,000 equivalent circuits, the GPU still gains speedups of 12 (denoted Sp1) and 1.9 (denoted Sp2) over the default single-core CPU and the multi-core CPU, respectively, let alone when the number of equivalent circuits reaches approximately ten million at a ratio of 199:1. With a medium level of detail, the GPU achieves a speedup of around 50 times over the prevalent CPU simulation. Even when the multi-core CPU approach, yet to be widely exploited for circuit computing, is adopted using OpenMP®, the Tesla® V100 GPU is still about 10 times more efficient. A decent speedup of up to 5 times over the 80-core CPU is maintained even if a more common platform with 12 Intel® Xeon® E5-2620 CPU cores, 16 GB memory, and an NVIDIA® GTX 1080 GPU takes over the co-simulation, in which case the computing durations for the six combinations of Np1 and Np2 are 31.2 s, 152.3 s, 686.8 s, 1348 s, 2035 s, and 2675 s, respectively. It should be noted that although the GPU is always more efficient than the CPU as well as its multi-core configuration, regardless of the proportion of discrete PV panels, the smallest possible Np1 should always be selected to avoid unnecessary numerical operations, especially when a large number of PV panels have virtually identical operating conditions, including irradiance and temperature.

6.3.4.2 AC-DC Grid Interaction

The penetration of PV energy into the power system means that AC grid transient stability is highly vulnerable to momentary changes of weather. If the solar irradiance of all ten million PV panels in the four stations conforms to a normal distribution with a mean of μ = 1000 W/m² and a standard deviation of σ = 100, the actual output power of a station is approximately 430 MW, which differs greatly from the over 530 MW obtained under a uniform average irradiance, meaning the scalable model enables derivation of the exact output power of a PV station so that its impact on the AC-DC grid can be correctly studied. In contrast, the transient stability analysis results will be erroneous and misleading for power system planning and


Fig. 6.21 The impact of solar irradiance on AC grid stability: (a) power flow on Buses 20 and 39 and frequency response of Generators 5 and 10 with Buses 20 and 39 (b) having no action, (c) unloading in 3 s, and (d) unloading in 5 s

operation if the lumped model is adopted, as the 100 MW disparity will be further amplified with the integration of more PV stations into the grid. When the mean μ reduces from 1000 W/m² to 700 W/m² and 500 W/m² between 20 s and 25 s at Plants 3 and 4, their output powers drop to 270 MW and 160 MW, respectively. As a result, the inverter station MMC5 witnesses a roughly 430 MW reduction, while the power at MMC6 only sees a momentary fluctuation before it recovers within 10 s, as given in Fig. 6.21a. Figure 6.21b shows that the AC grid is unstable, with the frequency decreasing below the minimum of 59.5 Hz. Therefore, 255 MW and 195 MW of load are shed from Buses 20 and 39 as one option to maintain normal operation. Figure 6.21c, d proves that the frequency can recover to 60 Hz if the action is taken within 3 s, while it remains below 60 Hz if the load shedding occurs 5 s later; nevertheless, it is still within the operational range of the AC grid. When the output power of all PV stations decreases at a rate of 0.2 MW/s due to sunset, the frequencies of the synchronous generators drop. Shedding load in the


Fig. 6.22 AC grid dynamics under the impact of PV output power and 150 ms fault: (a, b) Generator 5 (left) and 10 (right) frequencies and (c) generator angles (left) and Bus 20 and 39 load (right)

AC grid is effective in restoring the grid frequency to within the ±0.5 Hz perturbation range. Figure 6.22a shows that, in Case 1, Buses 20 and 39 shed 70 MW and 81.3 MW of load at 100 s and 306 s, respectively. However, if a three-phase fault occurs on Bus 21 at t = 200 s, the frequency exceeds the upper limit of 60.5 Hz.


Fig. 6.23 Absolute generator frequency (left) and rotor angle (right) errors between heterogeneous computing and TSAT simulation: (a) Case 1 and (b) Case 2

Therefore, around 57 MW and 50 MW of load are shed at 100 s and 226 s to contain the negative impact, which is given in Fig. 6.22b as Case 2. Figure 6.22c gives the load condition on Buses 20 and 39 and the synchronous generators' rotor angles in Case 2. The good agreement with DSATools/TSAT™ indicates the accuracy of the heterogeneous co-simulation methodology. In Fig. 6.23, the absolute frequency and rotor angle errors of all ten synchronous generators are drawn to demonstrate the accuracy of the EMT-dynamic co-simulation using heterogeneous computing. It indicates that, in both cases, the maximum error appears when the three-phase fault occurs. Nevertheless, the maximum frequency errors are merely 21.0 mHz and 17.1 mHz for the two cases, while the rotor angle errors do not exceed −1.3° and −1.1°. Therefore, the insignificant errors demonstrate the validity of the introduced modeling and computing method, meaning the heterogeneous parallel CPU-GPU co-simulation platform can also be used for power system study, operation, and planning at a much faster speed, since its results closely match those of DSATools™.


Table 6.5 AC-DC grid simulation speed comparison

Np1:Np2    PV No.         CPU-GPU   CPU        MC-CPU     Sp1   Sp2
1:199      4.48 × 10^5    31.4 s    238 s      53.0 s     7.6   1.7
10:190     8.8 × 10^5     67.3 s    2340 s     403 s      35    6.0
50:150     2.8 × 10^6     274 s     11,575 s   2360 s     42    8.6
100:100    5.2 × 10^6     564 s     22,759 s   4344 s     40    7.7
150:50     7.6 × 10^6     1054 s    31,614 s   10,547 s   30    10
199:1      9.952 × 10^6   1279 s    39,492 s   14,011 s   31    11

Table 6.5 summarizes the execution time the platform requires to perform a 10 s simulation of the overall AC-DC grid, where the CPU times are estimated based on a shorter duration. The inclusion of the less parallel AC-DC grid slightly reduces the speedups over the pure CPU cases compared with the previous table, since the PV modules account for the major computational burden. Similarly, the co-simulation is still faster than the 80-core CPU when the NVIDIA® GTX 1080 GPU is utilized for solving the entire system, since tests show that even the last case, i.e., Np1:Np2 = 199:1, can be completed within 2886 s, while the corresponding figures in Table 6.5 are around 40,000 s and 14,000 s for the single- and multi-core CPU, respectively.

6.4 Adaptive Sequential-Parallel Simulation

Section 6.3 introduced heterogeneous high-performance computing of a PV-integrated AC-DC grid using a CPU and a GPU, each assigned a specific part of the system. Since the scale of the simulated power system varies, a fixed assignment of tasks does not always guarantee the maximum possible computational efficiency; more often than not, it could reduce that efficiency. This section introduces the concept of adaptive sequential-parallel processing (ASP2) which, with freedom of mode selection, serves as a universal computational framework covering virtually all prevalent methods [222], i.e., sequential, parallel, and hybrid processing.

6.4.1 Wind Generation Model Reconfiguration

A typical DFIG system contains components with various numerical forms in terms of transient modeling. For instance, the induction machine is generally modeled by a state-space equation, while differential equations apply to the two transformers. Computational structure optimization by internal decoupling is carried out to make the different model types compatible in the same simulation, in addition to reducing the processing burden induced by the large-dimension admittance matrix that would otherwise be established from the electrical nodes of the overall system. The prevalent EMT model formulas of the induction machine, transformer, and power converter imply that the DFIG can be internally partitioned into four


Fig. 6.24 Doubly-fed induction machine computational boundary definition

subsystems that can be computed independently, as illustrated in Fig. 6.24. For an arbitrary subsystem, the impact of its neighboring counterparts can be represented as equivalent current or voltage sources on the boundaries, and the specific source type conforms to the general principle that a lumped high-order model has higher priority than discrete low-order components, since the latter offer better flexibility in terms of Thévenin-Norton transformation.

6.4.1.1 Induction Generator Model

The original fifth-order state-space equation of an induction machine takes the form of (5.41)–(5.45) in Chap. 5. Applying the α–β transformation, the vectors, uniformly represented by X, are organized as (5.43). Therefore, when partitioning the DFIG, the three-phase stator and rotor voltages Vs and Vr composing the input vector u are retained within the boundary of the induction machine, and consequently it can be computed in its original form. Following the solution, the stator and rotor currents in vector I are restored to the three-phase domain and sent to the neighboring transformers. Regarding the coefficients, while the input matrix B is a 4 × 4 identity matrix and the output matrix C is purely based on the magnetizing, stator, and rotor inductances, the state matrix A reflects the electromechanical coupling since it contains the electrical angular velocity ωr in addition to the stator and rotor electrical parameters. The fifth differential equation, describing the electromechanical interaction, can be expressed as follows after introducing the friction factor F:

\frac{d\omega_r(t)}{dt} = -\frac{P}{J} F \omega_r + \frac{P}{J} \left( T_e(t) - T_m(t) \right),    (6.39)

where the electromagnetic torque is obtained from (5.46) and the mechanical torque Tm is a nonlinear function of ωr and the wind speed vw [223]:

T_m = \frac{1}{2} \rho \pi r_T^3 v_w^2 F(r_T, v_w, \omega_r),    (6.40)

where ρ denotes the air density and rT the wind turbine radius. The model equations should be discretized before EMT simulation. At time t, using the trapezoidal rule, a general update can be obtained for an nth-order differential equation:

X(t) = \left( I - \frac{A \Delta t}{2} \right)^{-1} \left[ \left( I + \frac{A \Delta t}{2} \right) X(t - \Delta t) + \frac{B \Delta t}{2} \left( u(t) + u(t - \Delta t) \right) \right],    (6.41)

where Δt denotes the simulation time-step. The above equation is also applicable to (6.39), when A, u, and the identity matrix I reduce to the coefficient of ωr, (Te(t) − Tm(t)), and the constant 1, respectively. Hence, the induction machine simulation commences with solving the state-space equation, followed by the electrical torque Te alongside its mechanical counterpart, and the present time-step ends with the derivation of ωr, which is an essential element of the state matrix A.
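For instance, applying (6.41) to the scalar equation (6.39) gives the one-step update below; this minimal C++ sketch uses illustrative names and assumes that Te and Tm have already been evaluated at both time points.

// One trapezoidal step of (6.39), i.e., the scalar form of (6.41) with
// a = -(P/J)*F, b = P/J, and u = Te - Tm.
double stepRotorSpeed(double omegaPrev, double TePrev, double TmPrev,
                      double TeNow, double TmNow,
                      double P, double J, double F, double dt) {
  const double a = -(P / J) * F;
  const double b =  P / J;
  const double uPrev = TePrev - TmPrev;
  const double uNow  = TeNow  - TmNow;
  return ((1.0 + 0.5 * a * dt) * omegaPrev + 0.5 * b * dt * (uNow + uPrev))
         / (1.0 - 0.5 * a * dt);
}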

6.4.1.2 Three-Phase Transformer

Though the two transformers are separated in the boundary definition owing to their indirect connection, they share an identical model, which is described by the relation between the terminal voltage vT and current iT in the differential equation (3.78). Following the inclusion of the stator and rotor voltages in the induction machine subsystem, the corresponding currents Is and Ir are left for the two transformers. On the other winding, as the transformer has higher priority than a voltage-source converter (VSC) regarding boundary source type selection, the converter-side three-phase currents Igc and IR are taken, respectively, as the inputs of the grid-interface transformer and the rotor-side transformer when establishing the matrix equation. Discretization of the transformer model by applying the universal form (6.41) leads to Eqs. (3.82)–(3.85). In the DFIG system, the current vector iT is known since it is formed by the boundary currents, and the nodal voltage vector vT can be solved by multiplying the inverse equivalent admittance matrix G_T^{-1} with (iT − Ihis). Then, the nodal voltages are sent to the other subsystems, e.g., Vs and Vr to the induction machine, grouped as vector u after the α–β transformation, for their solution in the next time-step.

6.4.1.3 DFIG Converter System

The rotor-side VSC (RSVSC) and grid-side VSC (GSVSC) are formed by discrete low-order reactive components and switches and thus have the lowest priority in terms of boundary source determination. The converter system after DFIG internal partitioning is given in Fig. 6.25, where Vgc and VR are the three-phase boundary voltage sources on the grid side and rotor side. The power semiconductor switches T1–T6 and S1–S6 can be treated as two-state resistors, i.e., a small resistance and a high impedance representing the ON and OFF states, respectively. When establishing


Fig. 6.25 Decoupled DFIG converter system model

the admittance matrix of the converter system, the resistances are converted into admittances. Node reduction is performed to reduce the dimension of the admittance matrix. Fundamental EMT theory allows elimination of the internal nodes D–F by merging Vgc and the three-phase inductor Lgc. Meanwhile, the voltages at nodes A–C are exactly VR, already solved in the transformer by (3.82), and can thus be excluded from the following nodal voltage equation:

V_{conv} = G_{conv}^{-1} J_{conv},    (6.42)

where the vectors Vconv and Jconv denote the nodal voltages and equivalent current injections and Gconv is the admittance matrix of the converter system. The remaining five nodal voltages can then be solved efficiently. The two controllers are involved in the nodal voltage solution indirectly by determining the conductances of the switches in the converter system. The RSVSC controller receives three rotor signals and the stator voltage to generate the switch gate pulses, and the GSVSC controller decides the status of the corresponding switches based on the DC bus voltage it regulates, along with the grid-side three-phase voltage and current [224].
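As a minimal sketch of the two-state resistor treatment, the routine below stamps one switch into the reduced converter admittance matrix; the ON/OFF conductance values and the 5 × 5 layout are assumptions, and a practical solver would refresh Gconv and refactorize only when a gate signal actually changes.

// Two-state resistor model of a power semiconductor switch (sketch).
constexpr double G_ON  = 1.0 / 1e-3;  // ON-state conductance (assumed ron = 1 mOhm)
constexpr double G_OFF = 1.0 / 1e6;   // OFF-state conductance (assumed 1 MOhm)

// Stamp one switch between nodes i and j of the reduced 5x5 converter matrix;
// a negative index denotes the ground node and is skipped.
void stampSwitch(double Gconv[5][5], int i, int j, bool gateOn) {
  const double g = gateOn ? G_ON : G_OFF;
  if (i >= 0) Gconv[i][i] += g;
  if (j >= 0) Gconv[j][j] += g;
  if (i >= 0 && j >= 0) { Gconv[i][j] -= g; Gconv[j][i] -= g; }
}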

6.4.2 Integrated AC-DC Grid Modeling

Figure 6.26 shows an AC-DC grid integrated with wind farms. The AC grid undergoing dynamic simulation is based on the IEEE 39-bus system, whose Buses 20 and 39 connect to DCS3 and DCS2 of the benchmark CIGRÉ B4 DC grid, which is analyzed by EMT simulation due to the significant impact the HVDC converters have on the entire grid. Since different modeling approaches are adopted, an EMT-DS interface is required to make them compatible and consequently form an electromagnetic-electromechanical transient co-simulation. The wind farm composed of an array of DFIGs is modeled in detail to reveal the exact impact of

Fig. 6.26 Comprehensive AC-DC grid integrated with wind farms

wind speeds on its overall output, and it can also be scaled down or even lumped. Various contingencies such as F1–F3 can be applied in the interacting grids for a comprehensive study. The AC grid electromechanical dynamic modeling has been illustrated in Chap. 2. In the comprehensive AC-DC grid, the IEEE 39-bus system provides bus voltages for the DC converters, and buses with the same number on the two sides are connected by a zero-impedance line (ZIL), which acts as the interface between the DS and EMT simulations. Solution of the AC network equation (2.12) yields bus voltages in p.u. and their phase angles, V_{i,r} ∠ θ_{i,r}, which are then converted into time-domain sinusoidal sources V_{i,r} V_{rt} sin(ωt + θ_{i,r}) by multiplying by the voltage rating denoted as Vrt. In the meantime, transient computation of the DC grid yields the power injected by the converters under those exact voltages, so that the AC counterpart can proceed with the network solution in the next time-step. The DC grid modeling, since the focus is on the system transient response, does not require the high-order nonlinear switch models narrated in Chap. 5 and is consequently carried out at the system level, as described below.

6.4.2.1 Detailed EMT Modeling

Figure 6.27 shows the full scale of an HBSM-based MMC operating as a DC grid terminal, where hundreds of submodules are usually deployed to sustain high voltages. A remarkable fidelity can be achieved when the exact number of SMs in the converter is taken into account, while the modeling of the IGBT module, especially its antiparallel freewheeling diode, further affects the computational accuracy of a detailed MMC. As mentioned, for system-level transient study, the IGBT can generally be modeled as a combination of a switch whose ON/OFF state is controlled by the gate signal g, the forward voltage drop represented by Vf, and the on-state resistance ron. The freewheeling diode consists of the p-n junction voltage Vj, the on-state resistance, and an ideal diode D0 representing its unidirectional conduction. When S1 is turned on and its complementary switch S2 is in the OFF state, the capacitor is either charging or discharging, depending on the direction of the arm current; otherwise, the SM is bypassed by S2. Therefore, when the MMC is under normal operation, the SM status can be summarized as

v_{SM} = i_{SM} r_{on} + \frac{g_1}{C} \int i_{SM} \, dt + \left( g_1 - \mathrm{sgn}(i_{SM}) \right)^2 V_f + \left( g_1 + \mathrm{sgn}(i_{SM}) - 1 \right) V_j,    (6.43)

where g1 is a binary denoting the ON/OFF state of switch S1 by 1 and 0, and the sign function sgn(·) yields 1 and 0 for positive and negative values, respectively. Treated as a time-varying voltage source, the derived vSM decouples all SMs from the remaining circuit and therefore dramatically reduces the computational burden of a detailed MMC, which, from a mathematical point of view, would otherwise contain a huge number of electrical nodes contributed by the cascaded submodules.
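Since every submodule evaluates (6.43) independently, the update maps naturally onto one CUDA thread per SM. The kernel below is a simplified sketch with an assumed data layout and a simple Euler update of the capacitor voltage; it ignores blocked or faulted submodule states.

// One thread per submodule: evaluate (6.43). Arrays are indexed by global SM number.
__global__ void smVoltageKernel(const double *iSM, const int *g1,
                                double *vC, double *vSM,
                                double ron, double C, double Vf, double Vj,
                                double dt, int nSM) {
  int k = blockIdx.x * blockDim.x + threadIdx.x;
  if (k >= nSM) return;

  const int    g   = g1[k];
  const double i   = iSM[k];
  const int    sgn = (i >= 0.0) ? 1 : 0;        // sgn() as defined in the text

  if (g) vC[k] += dt * i / C;                   // capacitor inserted: integrate i/C

  const double vIgbt  = (g - sgn) * (g - sgn) * Vf;  // forward-drop term of (6.43)
  const double vDiode = (g + sgn - 1) * Vj;          // junction-voltage term of (6.43)
  vSM[k] = i * ron + g * vC[k] + vIgbt + vDiode;     // submodule terminal voltage
}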

Fig. 6.27 Three-phase MMC topology and detailed submodule circuit model

Each of the six arms is then turned into a Thévenin equivalent circuit, and discretizing the arm inductor yields the following formula:

v_{arm}(t) = \left( Z_{Lu/d} + R_{Lu/d} \right) i_{arm}(t) + \sum_{k=1}^{N} v_{SM,k}(t - \Delta t) + 2 v^i_{Lu/d}(t),    (6.44)

where Z_{Lu,d} and v^i_{Lu,d} denote the impedance and incident pulse, respectively, obtained by treating an inductor as a section of lossless transmission line [189], and R_{Lu,d} is the parasitic resistance of the arm inductor. By converting the arms into their Norton equivalent circuits, only five essential nodes are left on the AC and DC ports, as shown in the figure, and consequently the MMC circuit yields a 5 × 5 admittance matrix. The solution of the corresponding matrix equation then gives the arm currents, which are also the submodule terminal currents iSM, and the computation repeats by calculating the submodule voltages in the next time-step. It can be inferred from Fig. 6.26 that two types of converter control schemes, i.e., grid-connected (GC) and grid-forming (GF), exist in the DC grid. As mentioned, the solution of (2.12) in the AC grid dynamic simulation provides the grid-connected MMCs with the voltage Vi,r and phase angle reference θi,r. A phase-locked loop (PLL) is included to track the grid phase for the controller, which regulates the DC voltage or active power on the d-axis and the AC bus voltage or reactive power on the q-axis. A grid-forming inverter, on the other hand, generates voltage and phase angle


Fig. 6.28 General d-q frame GC/GF MMC controller

references for the wind farms to support the operation of the induction machines. The d- and q-axes control the AC voltage magnitude and reactive power, respectively, and the control target can be on either side of the transformer, i.e., the high-voltage (HV) or medium-voltage (MV) bus. Though the specific control quantities vary, they share an identical scheme, as shown in Fig. 6.28. Thus, a GF-MMC provides a stable voltage for an array of W × H DFIGs:

v_{GF}(t) = \sqrt{V_{gd}^2 + V_{gq}^2} \cdot \sin\left( \int \omega_0 \, dt + k \frac{2\pi}{3} \right), \quad k = 0, 1, 2,    (6.45)

where Vgd and Vgq are the three-phase voltages in the d-q frame and ω0 is the desired angular frequency. With the low voltage (LV) boosted by grid-interface transformers, all DFIGs are connected to a common MV bus. The availability of vGF thus enables the solution of the induction machines' state-space equations, and following the conversion of the output vector into the three-phase domain, the currents are aggregated on the MV bus and fed into the GF-MMC for its computation. In EMT simulation, the transmission lines or cables linking a WF to the MMC provide a natural partitioning scheme following discretization, and therefore all DFIGs are physically independent and can be computed in parallel. As the distance between turbines is normally a few hundred meters, the transmission delay is considered identical; in other words, the transmission from all WTs to the MV/HV transformer and vice versa is completed within the same time-step. At an arbitrary grid interface, the transmission line, which adopts the traveling wave model, is involved in the formation of the admittance matrix and history current:

V_T(t) = \left[ G_T + \begin{pmatrix} \frac{1}{Z_C} I_{3\times 3} & 0_{3\times 3} \\ 0_{3\times 3} & 0_{3\times 3} \end{pmatrix} \right]^{-1} \left[ I_{his}(t - \Delta t) + I_m(t - \Delta t), \; -I_s(t) - I_{gc}(t) \right]^{T},    (6.46)


where ZC is the characteristic impedance of the line and Im represents the three-phase history current of the transmission line at the LV/MV transformer side. Similarly, the line elements also participate in the formation of an 8 × 8 MMC matrix equation along with the HV/MV transformer.

6.4.3 Adaptive Sequential-Parallel Processing

6.4.3.1 Heterogeneous CPU-GPU Processing Boundary Definition

The scale of EMT simulation varies from case to case, and the computational burden may differ significantly even for the same system owing to the flexibility of choosing components with distinct model complexities. The resulting uncertainty in computational burden is handled by adaptive sequential-parallel processing, which automatically shifts the transient analysis between the CPU and GPU, termed host and device, respectively. Since either processor is individually capable of processing any number of components, the ASP2 can operate as heterogeneous sequential-parallel computing, pure CPU execution, or pure GPU implementation, and the quantity of a certain type of subsystem, i.e., a component or a group of components that can be written as one function in the programming language, is the major criterion for mode selection. In the comprehensive AC-DC grid, the MMC submodules and DFIGs are the potential sources of massively parallel computing. Grid connection and the pursuit of high power quality propel full-scale MMC modeling with a sufficient number of submodules, alongside the demand for very detailed study of converter internal phenomena. Based on its element quantities, the MMC can be internally divided into two parts, as illustrated in Fig. 6.29. A three-phase high-voltage-oriented MMC contains hundreds or even thousands of submodules, and this quantity, further multiplied by the number of terminals in the DC grid, caters to the concept of massive parallelism and suits the SIMT paradigm. While only one outer-loop GC/GF controller is required for each MMC, the inner-loop counterparts differ dramatically: the average voltage control (AVC) and balancing control (BC) correspond to each phase and each submodule, respectively. Therefore, the BC blocks are allocated within a potential parallel processing region (PPPR), and the AVC is also included there to facilitate reading the capacitor voltage from each submodule. The remaining parts, including the MMC frame and the GC/GF controller, as well as the DC lines linking a station to its counterparts, are processed on the CPU, as their numbers in a regular DC grid are inadequate to support massively parallel processing. The HV/MV transformer is solved along with the MMC frame and is thus separated from the WF, with the transmission line between them providing an inherent boundary. For an arbitrary phase, the DFIG-side transmission lines have a simultaneous history current update; in contrast, their MMC-side counterpart is a sum of all secondary-side quantities, as given by

Fig. 6.29 Sequential and potential parallel processing boundary definition in the wind farm-integrated AC-DC grid

I_m[i](t) = -I_k(t - \Delta t) - \frac{2}{Z_C} V_k(t - \Delta t), \quad i \in [0, 3WH),    (6.47)

I_k(t) = -\sum_{k = 3i + 0/1/2}^{3WH} \left( I_m[k](t - \Delta t) + \frac{2}{Z_C} V_m[k](t - \Delta t) \right), \quad i \in [0, WH).    (6.48)
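The per-line update of (6.47) is independent for every DFIG-side phase conductor and therefore fits one CUDA thread each, while the aggregation of (6.48) can stay on the host; the sketch below assumes a turbine-major array layout with phase index k mod 3 and plain arrays, neither of which is prescribed by the text.

// One thread per DFIG-side phase conductor: history current update of (6.47).
// Vk, Ik hold the three MMC-side phase quantities of the previous time-step.
__global__ void lineHistoryKernel(const double *Vk, const double *Ik,
                                  double *Im, double Zc, int n3WH) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n3WH) return;
  int ph = i % 3;                          // assumed phase ordering
  Im[i] = -Ik[ph] - 2.0 * Vk[ph] / Zc;
}

// Host-side aggregation of (6.48): per phase, sum the wind-farm-side quantities of
// the previous step into the MMC-side current source.
void mmcSideCurrents(const double *Im, const double *Vm, double *IkNew,
                     double Zc, int WH) {
  for (int ph = 0; ph < 3; ++ph) {
    double sum = 0.0;
    for (int t = 0; t < WH; ++t) {
      int k = 3 * t + ph;
      sum += Im[k] + 2.0 * Vm[k] / Zc;
    }
    IkNew[ph] = -sum;
  }
}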

The concurrency of the DFIG computation is therefore ensured by (6.47), and the sum operation is conducted on the CPU along with other processes to form an EMT-dynamic co-simulation on a heterogeneous sequential-parallel computing architecture. The essence of ASP2 is that the CPU-GPU processing boundary is not fixed and shifts according to the computational burden induced by the system scale in conjunction with the prevalence of homogeneity or its opposite. For instance, when a low-level MMC is configured, or the WF is largely aggregated, the CPU outperforms its many-core counterpart in handling equipment originally in the PPPR, and therefore the ASP2 eventually regresses to conventional CPU simulation to maintain high efficiency. On the other hand, if the numbers of MMC converter stations, synchronous generators, and AC buses keep increasing, the hybrid simulation will ultimately evolve into pure massively parallel processing. Figure 6.30a, b compares the processing efficiency of the CPU and GPU to ascertain an approximate boundary, based on 10 s simulations of the optimized AC-DC grid on a server with 192 GB RAM, a 20-core Intel Xeon® E5-2698 v4 CPU, and an NVIDIA Tesla® V100 GPU. Figure 6.30a shows that the CPU is more efficient in

Fig. 6.30 CPU and GPU computational efficiency comparison and threshold identification for (a) MMC and (b) WT, and ASP2 computational efficiency under various (c) MMC levels and (d) WT numbers

simulating the DC grid when the MMC voltage level is 5; nevertheless, its many-core counterpart gains a high speedup at voltage levels practical for an HVDC grid, e.g., 10 times with 401-level MMCs. Estimation based on linear interpolation indicates that the crossover occurs near the 15-level MMC, and therefore neighboring levels ranging from 9 to 21 are also tested to obtain an accurate threshold Nth. Since the computational capability of the GPU is below its limit in these cases, the parallel execution time falls within a small range of 29 to 35 s despite the different


levels. Based on 70 samples in total, a box plot is drawn to extract the median value, which is 30.6 s, and subsequently the threshold can be ascertained as 15 levels in the horizontal bar graph following comparisons with the CPU execution time. Similarly, it can be found that the GPU enables a remarkable speedup over the CPU for wind turbines, with a threshold of around 100 turbines. In both cases, the threshold can be set slightly higher than the statistically obtained value since the CPU is still efficient there. The computational efficiency of the ASP2 framework, represented by the speedup over pure CPU execution, is depicted in Fig. 6.30c, d, which indicates that with more cores to share the computational burden, the time the GPU requires to complete its tasks remains relatively stable and thus the speedup climbs. With the five wind farms in Fig. 6.26 aggregated as five separate DFIGs, the GPU's speedup rises with the MMC level, and it fluctuates around 20 when 8000 WTs account for the major computational burden, as shown in Fig. 6.30c. Meanwhile, a growing WT number leads to an increasingly evident GPU performance enhancement in Fig. 6.30d. The adaptability of the framework is manifested by the fact that its efficiency curve is always above or overlapping those of the CPU and GPU. As an optimum solution, the ASP2 can flexibly switch among its three processing modes to maintain the maximum possible efficiency: regression to pure CPU execution when the GPU's speedup is below 1, as in the 5-level MMC and 5-WT scenario; reliance on the GPU's massive parallelism for high-level MMCs and large groups of WTs, when the ASP2 efficiency curve overlaps with that of the GPU; and joint CPU-GPU processing, with efficiency higher than that of either processor alone, when the system possesses both substantial homogeneity and inhomogeneity.

6.4.3.2 Adaptive Sequential-Parallel Processing Framework

In the adaptive heterogeneous transient analysis of the WF-integrated hybrid AC-DC grid, the various block types located in the PPPR are designed as CUDA C++ kernels for massively parallel implementation in the single-instruction multiple-thread manner. In the meantime, they are also written as C++ functions, along with the remaining components, for potential sequential execution. The EMT-dynamic co-simulation is always initiated on the CPU, where the variables of the C++ functions and GPU kernels are defined in the host and device memory, respectively. The ASP2 distinguishes itself from single-processor computation by identifying, according to the thresholds, the tasks that should be assigned to the GPU before analyzing the system, as shown in Fig. 6.31; in this case, variables corresponding to kernel inputs and outputs are copied to CUDA memory via PCIe®, the common channel for signal exchange between the two processors. The co-simulation formally commences with the AC grid transient stability analysis on the CPU, and as implied by (2.13), the generators are solved first after establishing the Jacobian matrix, followed by the network solution. Though no definitive computing boundary exists between the AC and DC grids, the MMC d-q frame controllers always operate on the CPU, one after another, until all NS stations are completed; under the ASP2 scheme, the output modulation signals denoted by mabc are copied to CUDA memory, and starting from the inner-loop controller,

Fig. 6.31 Adaptive sequential-parallel processing scheme of WF-integrated AC-DC grid on heterogeneous CPU-GPU computing architecture

the program enters the parallel computing stage, albeit the implementation of the different kernels is still sequential. The number of threads invoked by each kernel using a single CUDA C++ command is exactly the actual component quantity, and this concurrency ensures a remarkable efficiency improvement. Each AVC and BC corresponds to an MMC phase and submodule, respectively, and consequently their thread numbers are 3NS and 3NS · 2N. As each SM proceeds independently, the output voltage vSM is summed with its counterparts in another kernel before control returns to the CPU temporarily to handle the MMC linear part, which derives the medium voltage for the wind farms. Then, the process moves onto the GPU again so that the DFIG kernels can eventually yield the injection current into the DC grid. In the meantime, the CPU counts the simulation time, and the program either continues with the next time-step or ends. On the other hand, if the scale of either the MMC or the DFIG group (or both) is insufficient, the CPU takes charge of that computation, and memory copy between the two processors' physical layers is no longer necessary for that equipment. All of its components corresponding to CUDA kernels are instead implemented as C++ functions executed repeatedly in a sequential manner rather than with SIMT-based massive parallelism. It can be seen from the ASP2 framework that all AC-DC grid components and control sections are implemented as C++ functions or CUDA C++ kernels, and this modularized style prevailing in time-domain simulations facilitates allocating a component type either to the parallel processing region or outside it, based on its quantity, during the program initialization stage.
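The mode selection itself can be as simple as comparing component counts against the thresholds identified in Fig. 6.30; the fragment below is an illustrative sketch in which the structure, names, and default threshold values are assumptions. If neither flag is set, the framework regresses to pure CPU execution, and if the AC grid itself reached massive scale, the same dispatch logic would push it onto the GPU as well.

// Per-equipment allocation at initialization (sketch). Components whose flag is set
// are dispatched to the CUDA kernels of the PPPR; the rest run as C++ functions on the CPU.
struct Allocation {
  bool mmcOnGpu;   // submodule, AVC, and BC kernels on the device
  bool wtOnGpu;    // DFIG kernels on the device
};

Allocation allocateTasks(int mmcLevel, int nTurbines,
                         int NthMmc = 15, int NthWt = 100) {  // thresholds from Fig. 6.30
  Allocation a;
  a.mmcOnGpu = mmcLevel  > NthMmc;
  a.wtOnGpu  = nTurbines > NthWt;
  return a;   // everything else (AC grid, outer-loop controllers) stays on the CPU
}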

306

6 Heterogeneous Co-simulation of AC-DC Grids with Renewable Energy

0.25

α=6, β=10, γ=5

0.15 0.1 0.05

4 α=6, β=10, γ=0 Histogram

Proposed method PSCAD

220 WF Power (MW)

11 9 8

0.2

Density

260

14

180

γ=5 m/s t=15s vw=(8+γ) m/s

γ=0

vw=(9+γ) m/s

140

Weibull Distribution

100

vw=(10+γ) m/s

γ=(t-10) m/s 2

6

4

8

10

(a)

12

16

14

vw (m/s)

60

8

10

12

60.2

γ=5 m/s γ=(t-10) m/s Remedied by HVDC

60.1 γ=0

Weibull Distribution 60 0

10

20

30

40

(c)

18

16

t (s)

Proposed method vw=(8+γ) m/s TSAT 62

γ=5 m/s γ=(t-10) m/s

61

vw=(9+γ) m/s

γ=0

60

vw=(10+γ) m/s

Gen5, 10 0

14

(b)

63 Generator frequency (Hz)

Generator frequency (Hz)

0

Gen5, 10 50

60

t (s)

59

0

10

20

30

40

(d)

50

60

t (s)

Fig. 6.32 Impact of WF modeling on AC-DC grid dynamics: (a) wind speed, (b) OWF4–5 output power, and Generator 5 and 10 frequency under (c) Weibull distribution and (d) unified wind speeds

6.4.4

EMT-Dynamic Co-simulation Results

The introduced detailed modeling methodology provides an insight into intricate interactions within the WF-integrated AC-DC grids that are otherwise unavailable in general modeling and simulation where identical components are aggregated or averaged. The consequent computational burden is, as indicated in Fig. 6.30, alleviated by the adaptive sequential-parallel processing. In the heterogeneous co-simulation, the EMT and dynamic simulations adopt a time-step of 20 μs and 5 ms, respectively. Thus, for result validation, PSCAD™ /EMTDC™ and DSAToolsTM /TSAT simulations are conducted under the same time-steps, i.e., 20 μs and 5 ms. Distributed over a wide area, the turbines are subjected to different wind speeds, and Fig. 6.32a gives the probability density function by Weibull distribution: D(vw ; α, β, γ ) =

α β



vw − γ β

α−1 e

−( vwβ−γ )α

, vw ≥ γ

(6.49)

6.4 Adaptive Sequential-Parallel Simulation

307

where α, β, and γ are shape, scale, and location parameters, respectively. Starting at t = 10 s, the wind intensifies at a rate of 1 m/s in the following 5 s, i.e., γ rises from 0 to 5 m/s, and the function is discretized as the histogram to calculate the exact output power of two 100-DFIG wind farms OWF4 and OWF5, as shown in Fig. 6.32b. The huge disparities among these curves indicate that adopting a uniform wind speed, either the median from the box plot or its neighboring values, falls short of precise power calculation under the entire operation stage. Hence, the AC grid exhibits distinct dynamic responses in Fig. 6.32c,d. The converter station Cm-B2 controlling the DC voltage witnesses a slight power injection increase into Bus 39, and the consequent surplus induces a slow frequency ramp, which leaves sufficient time for the HVDC to take remedial actions to enter a new stable system operating point. In contrast, the simulation yields incorrect generator frequency response without distinguishing each DFIG, as it will induce either a dramatic increase or decrease. Figure 6.33 depicts the AC grid stability following internal contingencies. The three-phase fault at Bus 21 incurs a significant voltage drop at Generators 4–7, along with large rotor angle and power oscillations, yet the AC grid can restore steadystate operation once it is eliminated 50 ms later, and its impact on the DC system is negligible. At t = 60 s, Bus 39 has a sudden 100 MW load increase, and the DC grid is ordered to take remedial actions. Figure 6.33d–f demonstrates that with a corresponding 100 MW step by the HVDC station Cm-B2, the IEEE 39-bus system stabilizes instantly. However, it is unstable if the same amount of power ramp occurs in 20 s as the grid frequency keeps decreasing, an approximate 104 MW is required in power ramp Scheme 2 to maintain stability. The EMT-dynamic co-simulation exhibits a good agreement with the selected TSAT results in Fig. 6.33. For a comprehensive analysis of the modeling accuracy, the deviations of quantities in the AC grid are given in Fig. 6.34. The terminal voltage, rotor angle, and frequency of all generators are identical to DSAToolsTM /TSAT simulation results since the errors are virtually negligible. The active power of the generator on Bus 39 has around 8% and 9% mismatches shortly after the three-phase fault on Bus 21 and power injection step of HVDC converter Cm-B2 on Bus 39 at t = 40 s and 60 s, respectively. Different modeling methodologies account for this phenomenon, since in the AC-DC grid co-simulation, the controller of Cm-B2 has a response time, and in TSAT, the analysis is merely applied to the AC grid where equivalent loads are used to represent the converter stations. It, therefore, indicates better fidelity of the EMT-dynamic co-simulation which is compatible with versatile models, and the results can be used for accurate dynamic security assessment of complex power systems. Figure 6.35 shows the impact of DC-side faults on system stability. When the middle of the DC line touches the ground for 1 ms, the fault F2 causes minor perturbations to the DC voltages, and its impact on AC grid transient stability is insignificant since the generators are able to maintain at 60 Hz, as given in Fig. 6.35a,b. In contrast, the MMC internal fault F3, depending on the percentage of faulty submodules in an arm, could have a remarkable disturbance to both AC and DC grids, as depicted in Fig. 6.35c–f. A proportion below 10% indicates a

308

6 Heterogeneous Co-simulation of AC-DC Grids with Renewable Energy

1.1

80 70

1.0

0.7

Gen4, 5, 7

39

40

41

(a)

Gen3 & Gen6 Gen2

TSAT

20

42

43

-10 20

t (s)

Gen10 (Swing bus) 30

0.8

60

(b)

80

70

90 t (s)

150ms

-200

Gen3, 4, 6

0.6

HVDC power step HVDC power ramp Scheme 1

-250

0.4

-300

0.2

Gen1 30

40

50

60.2

60

(c)

70

80

90

t (s)

3-phase fault at Bus 21 HVDC power ramp Scheme 2

-350 20

40

60

TSAT

3-phase fault at Bus 21 30

40

50

60

(e)

70

80

90

Generator Frequency (Hz)

Remedy by HVDC power step

60.1

60

80

60.2

Gen5, 10 frequency Generator Frequency (Hz)

50

-150

Gen10 Gen9

59.9 20

40

-100

1.0

20

Gen1

0

TSAT

1.2 Generator Power (p.u.)

Gen6

TSAT

1.4

0

40 30 10

0.6 0.5

Rotor Angle (º)

Gen3

50

MMC Power Pdc1 (MW)

Vgen (p.u.)

0.9 0.8

Gen5 & Gen9

60

Gen10

t (s)

100

(d)

120

140 t (s)

Gen5, 10 frequency

60.1

60

HVDC power ramp Scheme 2

59.9

3-phase fault at Bus 21 Power ramp Scheme 1

59.8 20

40

60

80

(f)

100

120

140 t (s)

Fig. 6.33 AC grid-side contingency: (a)–(c) generator voltage, rotor angle, and output power, (d) Cm-B2 power, and (e, f) generator frequency

continuation of the normal operation of the entire hybrid grid; nevertheless, its further increase results in a more vulnerable system. The DC capacitor voltages VC on the remaining submodules have to rise to sustain the DC grid voltage, accompanied by a higher chance of equipment breakdown; yet the ripples enlarge, and the DC grid voltages are subjected to severer oscillations, and so are the transferred powers. Meanwhile, the AC grid will also quickly lose stability, e.g., it takes around 1 min when half of all submodules in an MMC arm are internally

6.5 Summary

309

0.02

0.6

DVgen (%)

0 -0.01 -0.02

Bus 21 fault

-0.03 -0.04 39

40

41

10

t (s)

1%

Bus 21 fault

6

Gen10 -1%

Gen1-10

2

0 -0.2 -0.4 -0.6 20

20

30

40

50

60

(c)

70

80

Bus 21 fault 30

40

50

60

(b)

90

t (s)

20

70

80

90 t (s)

2×10-4% Gen1-10 0 Bus 21 fault

0

HVDC power step

-2 -4

Gen1-10



0

0.2

-2× 10 -3

DPgen (%)

43

Gen10

8

4

(a)

42

HVDC power step

0.4

Dfgen (%) 6 4 8 10 -3 ×10 -3 ×10 -3 ×10 -3

0.01

Rotor Angle Deviation (%)

Gen1-10

HVDC power step 30

40

50

60

(d)

70

80

90 t (s)

Fig. 6.34 Deviations of the heterogeneous ASP2-based EMT-dynamic co-simulation of the ACDC grid

short-circuited. Thus, from an operation perspective, the MMC should be blocked once the faulty SM percentage in an arbitrary arm exceeds 10% to protect equipment from over-current.

6.5

Summary

This chapter introduced methods that accelerate electromagnetic transient simulation and dynamic simulation other than solely using massive parallelism facilitated by graphics processors. Variable time-stepping schemes properly utilize distinct requirements on the density of results under the steady-state and transient stage, and their scopes were identified and demonstrated via tests on an MMC-based DC grid following a general categorization. The N-R iteration count is specific to nonlinear circuits and systems, while the event-correlated criterion and LTE are universal methods regardless of the linearity of electrical systems they apply to. Due to the coexistence of both homogeneity and inhomogeneity in a system, a hybrid FTS-VTS scheme is explored to mitigate the computational burden of the overall system, and the simulation con-

6 Heterogeneous Co-simulation of AC-DC Grids with Renewable Energy

60 Gen1-10 frequency (Hz) 60 .0 .0 02 04

310

Cm-E1

1.05

Cm-B3

1.0

0.95 60

62

64

66

68

2.0

70

(a)

72

74

76

78

40%

VC (p.u.)

1.0

1.2 pu

0.75

20%

0.5 0

40

50

60

-0.8

70

(c)

-1.2

30% 40%

68

40

a

c d e f

70

(b)

72

74

76

78

t (s)

b

a: 2% b: 10% c: 20%

0.85 20 30

40

50

60

60.8

50%

-1.6 20 30

66

60.9

-1.4 -1.5

t (s)

Gen5, 9 Frequency (Hz)

Cm-B2 Pdc1 (p.u.)

90 100 110

2% 10% 20%

-1.3

80

MMC: Cm-B2

-0.9

-1.1

64

0.9

1.0 pu

2%

1.0

0.95

10%

0.25

-1.0

62

1.05

1.25 30%

30

60

t (s)

1.75 50% 1.5

20

Gen 1-10

60

Cm-B2

Cm-B2 Vdc (p.u.)

Vdc (p.u.)

Cm-F1

70

(d)

80

60

70

(e)

80

90 100 110

t (s)

90 100 110

t (s)

≈ 60 s

60.7 60.6 60.5

50%

60.4

40%

60.3 60.2

30%

60.1

2%-20%

60 50

d: 30% e: 40% f: 50%

59.9 20 30

40

50

60

70

(f)

80

90 100 110

t (s)

Fig. 6.35 AC-DC grid response to DC-side contingency: (a, b) converter power and generator frequencies under DC line fault F2 and (c)–(f) SM capacitor voltage, MMC DC voltage and power, and generator frequencies under Cm-B2 internal fault F3

ducted on different processors gains significant speedups compared with the fixed time-step scheme. The execution time of an MTDC system indicates that systemlevel EMT simulation involving complex nonlinear device-level IGBT/diode models is feasible with a combination of VTS and massive parallelism. Meanwhile, the heterogeneous CPU-GPU computing approach provides an efficient solution for electrical systems integrated with numerous elements. The

6.5 Summary

311

two types of processors are designed to be in charge of tasks that best suit them. In the EMT-dynamic co-simulation of the PV-integrated AC-DC grid, the SIMT paradigm of the GPU is utilized to expedite the computation of millions of PV modules; the CPU, on the other hand, is more efficient in performing a dynamic simulation of the AC grid. With a remarkable speedup over pure CPU calculation using the same model, it is feasible to obtain the exact output power of PV plants by the GPU parallel processing. Therefore, the exact environmental impact on the scattered renewable generations and the power system transient stability can be studied efficiently in the co-simulation. The adaptive sequential-parallel processing, deemed as a computational framework that covers all prevalent computing architectures, applies to the detailed and efficient transient analysis of all modern electric power systems. The concept of a flexible CPU-GPU processing boundary yields a generic heterogeneous transient analysis methodology which maintains the highest possible simulation efficiency regardless of the system scale following a combination of merits of available computational hardware resources, i.e., the massive parallelism capability of manycore GPU and the excellent single-thread processing feature of CPU.

7

Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids

7.1

Introduction

The EMT program which simulates the temporary electromagnetic phenomena in the time domain such as voltage disturbances, surges, faults, and other transient behaviors in the power system is essential for modern power system design and analysis [106]. The simulation often requires detailed models with high computation complexity to accommodate large-scale power systems. The algorithms of mainstream EMT tools are highly optimized using sparse matrix methods and fine-tuned power system models, but the performance is bound by the sequential programming based on the CPU. To get through the bottleneck, parallel computing became a popular option to speed up large-scale EMT simulation. As seen in Chap. 3, by partitioning a large system into smaller parts, the rate of parallelism and the speed of convergence increased for nonlinear systems [133]. Both direct and iterative spatial domain decompositions were proposed to partition the system. Based on these domain decomposition methods, various parallel EMT off-line or real-time programs were implemented on different multi-core CPU, many-core graphics processing unit (GPU), field-programmable gate array (FPGA), and multiprocessor system-on-chip (MPSoC) architectures [133, 225, 226]. However, so far the time axis has not been explored for implementing parallelism. Parallel-in-time (PiT) algorithms have a history of at least 50 years [227] and are now widely used in many research fields to solve complex simulation problems [228,229]. In the 1990s, PiT methods were proposed to solve power system dynamic problems [76]. After 1998, there had been few works on PiT simulation until the past 5 years. The Parareal algorithm [230], which solves the initial value problems iteratively using two ordinary differential equation (ODE) integration methods, has become one of the most widely studied PiT integration methods for its ability to solve both linear and nonlinear problems [227]. PiT-based EMT simulation is challenging mainly due to the following reasons: (1) Traditional time-domain simulation of electric circuits is based on nodal © The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 V. Dinavahi, N. Lin, Parallel Dynamic and Transient Simulation of Large-Scale Power Systems, https://doi.org/10.1007/978-3-030-86782-9_7

313

314

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids

analysis, which often yields a DAE with an index equal to or greater than one [231]. The implicit DAE system can only be solved by backward differentiation formulas (BDF), and many kinds of ODE-only PiT methods cannot be used. Although there are some works that can convert DAE to state-space form [232], the matrix size increases dramatically compared to that in nodal analysis. (2) Although Parareal can solve nonlinear DAEs according to the aforementioned research works, traditional EMT model implementations often have a fixed timestep assumption which needs adaption to be used with the Parareal algorithm. (3) In power systems, the most ubiquitous components are transmission lines which are modeled in EMT simulation with lossless traveling wave model, which brings delay differential equations (DDEs) to the system [233], and the dependency on past states creates additional convergence problems. This chapter introduces a component-based system-level PiT power system EMT simulation algorithm based on the Parareal algorithm, which is implemented on the multi-core CPU with objectoriented C++ programming. The PiT-based EMT simulation has the following features: 1. Based on highly abstracted component class, the system architecture is flexible and extensible to integrate different kinds of traditional EMT models of power system equipment into the PiT algorithm and maintain all the advantages from nodal analysis. 2. Support for delay differential equations. With a modified interpolation strategy, the convergence speed increases so that transmission line models are able to work with the Parareal algorithm. 3. Reusing solver workers and workspace to reduce memory usage and decrease the overhead caused by object allocation in the Parareal iterations. The detailed modular multilevel converter (MMC) model, which can have thousands of state variables and complex internal structures, brings challenges to the Parareal-based PiT algorithm. Based on the analysis of traditional Parareal implementation of detailed modeled MMCs, this chapter proposes the hybrid PiT algorithm to handle MMCs with the device-level representation of submodules. Moreover, a propagation delay-based method is proposed to connect PiT grids to conventional or other PiT grids via simple queue hub and adapters. The case studies for the hybrid PiT and PiS simulation include a 201-level three-phase MMC and the CIGRÉ B4 DC grid test system.

7.2

Parallel-in-Time Modeling

There are different kinds of iterative and direct parallel-in-time methods, but only some of the iterative methods can solve nonlinear ODE problems. The Parareal algorithm [230] can be derived as a multi-grid method to solve ODE problems; however, it can be shown that the Parareal algorithm is not limited to the solution of ODE problems and thus is suitable for PiT EMT simulation.

7.2 Parallel-in-Time Modeling

7.2.1

315

Parareal Algorithm

This algorithm decomposes the simulation time [t0 , tend ] into N smaller subintervals Ik = [Tj −1 , Tj ], where start time T0 = t0 and end time TN = tend . At every time point Tj , the system has a unique solution for its state variables U j produced by a fine solution operator F(Tj , Tj −1 , U j −1 ). So, for N time intervals, the following nonlinear equations can be established:

W (U ) :=

⎧  ⎪ ⎪ ⎪U 1 − F T1 , T0 , U 0 ) ⎪ ⎪ ⎨U 2 − F T2 , T1 , U 1 ) ⎪ ⎪ ⎪ ⎪ ⎪ ⎩U

N −1

= 0, = 0, .. .

(7.1)

− F(TN −1 , TN −2 , U N −2 ) = 0,

where U 0 is determined as the known initial value. The system (7.1) can be solved by Newton’s method. For every U j ∈ U = {U 1 , U 2 , ..., U n−1 }, only two entries are nonzero. The individual update formula for each U j is given by  (k) (k−1)  U j = F Tj , Tj −1 , U j −1 +

∂F  (k−1)   (k) (k−1)  Tj , Tj −1 , U j −1 U j −1 − U j −1 ∂U

(7.2)

where k represents the iteration number. F The ∂∂U item in (7.2) can be approximated by  ∂F  (k−1) Tj , Tj −1 , U j −1 ∂U   (k) (k−1) F Tj , Tj −1 , U j −1 − F(Tj , Tj −1 , U j −1 ) ≈ . (k) (k−1) U j −1 − U j −1

(7.3)

Notice that (7.3) is differentiated by index k not j since it is the derivative of U . To parallelize the computation, this item is approximated by a cheaper method G in sequential, which can produce a close approximation to F, giving     (k) (k) F Tj , Tj −1 , U j −1 ≈ G Tj , Tj −1 , U j −1 ,     (k−1) F Tj , Tj −1 , U (k−1) ≈ G T . , T , U j j −1 j −1 j −1

(7.4)

Substituting the F entries in (7.3) with the approximation in (7.4), the following equation can be derived:

316

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids

(0)

U0

Coarse prediction (0) U1

(0)

(0)

(0)

U2

U3

U4

T2

T3

T4

t T0

T1

(k-1)

U0

Parallel operation U (k-1)

T0 (k)

U0

T0

Refined states

(a) (k-1)

(k-1)

(k-1)

1

U2

U3

U4

T1

T2

T3

T4

(b) (k)

(k)

(k)

(k)

U1

U2

U3

U4

T1

T2

T3

T4

(c) (0) Fig. 7.1 Progression of steps in the Parareal algorithm: (a) initialize U (0) j which equals to Gj ; (k) (k) (k−1) k (b) produce fine-grid solution Fjk ; and (c) refine U (k) j with U j = Gj + Fj − Gj

  (k) (k−1) U j = F Tj , Tj −1 , U j −1     (k) (k−1) + G Tj , Tj −1 , U j −1 − G Tj , Tj −1 , U j −1 ,

(7.5)

which becomes a quasi-Newton method [234]. The Parareal algorithm can solve nonlinear full-implicit DAE problems in the condition of having a good approximation method G to predict the states of F so that it can satisfy the assumptions in (7.4). The G and F methods are called coarse and fine solution operator, and the time points they work at form the coarse-grid and fine-grid, respectively. As shown in Fig. 7.1, the coarse operator makes initial predictions for the fine-grid; then the fine operator takes U as initial values and works in parallel to populate the finegrid results; next, the coarse operator refines U states by making predictions based on new fine-grid solutions using (7.5). When the U stops changing, the fine-grid solutions are equal to the ones sequentially computed by the fine operator.

7.2 Parallel-in-Time Modeling

7.2.2

317

Component Models

Normally, the nodal equations of a circuit have a DAE index greater than zero [231], which means only implicit ODE methods can solve the equations. To solve the circuit with the nodal analysis method, the general form of linear circuit DAE discretized by trapezoidal rule can be reinterpreted as Y v n+1 = s n+1 − i hist n+1 ,

(7.6)

i hist n+1 = g(x n , v n ),

(7.7)

x n+1 = h(i hist n+1 , v n+1 ),

(7.8)

where Y is the admittance matrix; v is node voltage vector; s n+1 is the source injection; i hist n+1 is the equivalent current injections determined by (7.7); and x is historical terms vector for components. The x and v form the global system state vector U = {v, x}. The nonlinear circuit has a varying Y matrix and additional current injections when the Newton-Raphson method is applied to linearize it at each time-step. The equations are formed from different components in the power system. Most components in EMT simulation are treated as an equivalent conductance in parallel with a current source, which is convenient for nodal analysis. However, the behavior of these components differs very much from each other. It is more convenient to integrate their states on their local space, and the only common states they share are the node voltages.

7.2.2.1 Inductor/Capacitor Models These components can be discretized by trapezoidal method in the general form [112] hist in+1 = Geq vn+1 + in+1 hist in+1 = in + Geq vn .

(7.9)

where GL eq =

2L C t , Geq = . t 2C

(7.10)

Geq is the discretized equivalent admittance for the inductor L or the capacitor C, in+1 is the total current going through the inductor or capacitor component in the step n + 1, and ihist is the historical current source injected to the right hand side of (7.7). The current i and v are the state variables of LC components.

318

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids

In traditional fixed-step EMT simulation, the state variable i is substituted and omitted by current source i hist because the Geq remains unchanged between steps n and n + 1, which gives hist = inhist + 2Geq vn . in+1

(7.11)

However, this only applies to fixed time-stepping. Since there are two different time-steps in Parareal algorithm, the form of (7.9) must be applied to all transient components. All fixed time-stepping assumptions must be avoided.

7.2.2.2 Transformer Model The transformer for EMT simulation with winding resistance R and leakage inductance L is represented as v = Ri + L

di . dt

(7.12)

where R and L are n × n matrices and v and i are n × 1 vectors of winding voltages and currents. Using trapezoidal discretization, (7.12) can be written as i n+1 = Geq v n+1 + i hist n+1 , i hist n+1 = Geq v n − H i n ,

(7.13)

where     2L 2L −1 . Geq = R + , H = Geq R − t t The nonlinear saturation is modeled with a compensation current source is on the secondary winding (Fig. 7.2). The nonlinear relationship between flux λ and saturation compensation current is is given by a nonlinear function is = im (λ); details can be found in [112]. The λ is the integral of node voltage v over a timestep:  λ(t) = λ(t − t) +

t

v(t) dt.

(7.14)

t−t

Transformer states {i; λ} are chosen to participate in the Parareal iteration.

7.2.2.3 Generator Model Indirect approaches are widely used in the EMT program to interface the synchronous machines, for example, the Norton current source representation of machines implemented in PSCAD™ /EMTDC™ . However, this kind of model has weak numerical stability [235] and causes oscillations in Parareal iterations.

7.2 Parallel-in-Time Modeling Fig. 7.2 Admittance-based transformer model with saturation

319

Secondary winding

Primary winding

Linear admittance

.$

Exciter

vf

w

1+sT

if Te

vabc

Nonlinear saturation

Exciter 1+sT% 1+sT& VW

-+

Vref

AC System

Mechanical

Fig. 7.3 Synchronous generator representation using variable time-stepping machine model

Therefore, a machine model from [236], which can be used in variable time-stepping methods, is used in the current PiT simulation. In this work, the synchronous generators are represented by the machine model with one kd winding and two kq windings. The relationship between voltages and currents can be expressed as v um (t) = R um i um (t) − ψ um (t) = Lum i um (t),

d ψ (t) + u(t) dt um

(7.15) (7.16)

 T  T where v um = vd , vq , v0 , vf , 0, 0, 0 , i um = id , iq i0 , if , ikd , ikq1 , ikq2 , T   T ψ um = ψd , ψq , ψ0 , ψf , ψkd ψkq1 , ψkq2 , u = −ωψq , ωψd , 0, 0, 0, 0, 0 , Rum = diag Rd , Rq , R0 , Rf , Rkd , Rkq1 , Rkq2 , and Lum is the leakage inductance matrix. The machine is represented by a Thévenin voltage source and a resistance (Fig. 7.3). Further details can be found in [236]. Since the mechanical state ω changes slower than electrical ones, the numerical stability is better than the models that make relaxations or assumptions on electrical state variables. In the Parareal iteration, the state vector for this model is {v um , i um , ψ um , ω}.

320

7.2.3

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids

Transmission Line Model

The transmission lines are the most common components in a power system. However, the models are based on the traveling wave theory which means that the solutions are dependent on a range of past states. This brings DDEs to the EMT power system simulation [233]. Taking the lossless line as an example, the equations to update historical current source are given as hist im (t) = −2Gvk (t − τ ) − ikhist (t − τ ) hist ikhist (t) = −2Gvm (t − τ ) − im (t − τ ).

(7.17)

hist is the receiving-end current source, i hist where τ is the transmission delay, im k is the sending-end current source, and G is the characteristic conductance of the transmission line. Details can be found in [112]. Since the equations are in continuous-time domain, to work with discrete-time integration, linear interpolation is used to obtain the approximation between two discrete time points near t − τ . However, the traditional method faces bottleneck under the PiT scenario for two reasons. First, although the Parareal algorithm can solve nonlinear DAE problems by predicting states at certain time points in coarse-grid and refine them in fine-grid, it is difficult to do such thing for a transmission line because the historical states for the fine-grid transmission lines do not exist in coarse-grid. Using interpolation to predict the fine-grid history vectors cannot reflect the transient waveform of the discrete system with a smaller time-step. In this case, all transmission line history data should be prepared before Parareal iteration. Currently, limiting the time window of iteration is the only way to avoid this dependency issue. Second, limiting the time window is still not enough. The traditional transmission line uses linear interpolation to approximate the historical value at t − τ . However, the approximated historical values are inconsistent with the fine-grid ones. As shown in Fig. 7.4b, the traditional interpolation’s inconsistency causes an error between coarse-grid and fine-grid values so that the prediction in the next window fails to meet the assumption in (7.4) and causes deviations. To solve these problems, a fine-grid reinforced transmission line model implementation is proposed, which is shown in Fig. 7.4a. Unlike conventional line models, where each line has its history vector to cover only one delay cycle, a transmission line in fine-grid and coarse-grid share the same memory for historical data. The computation is set to a time window which is smaller than the transmission delay τ , so that accurate historical data are prepared from the previously completed window, which avoids the data dependency issue. To improve the coarse-grid prediction, the coarse-grid transmission lines must read data from the data points in fine-grid history vectors, which requires a conversion between two time-steps to get the data index, given by



jf ine

t + = f loor jcoarse δt



  t τ , −1 δt t

(7.18)

7.3 Parareal Application to AC System EMT Simulation

Previous window

Current window

321

Prediction plot Voltage (kV)

i khist (t-τ)

Traditional interpolation

hist

im

(t)

Previous window

t-τ Δt

t

Fine-grid reinforced hist i m (t) interpolation

i khist (t-τ)

(a)

Time (s)

(b)

Fig. 7.4 (a) Interpolation scheme. (b) Prediction result comparison

where jf ine is the converted index in fine-grid history vector for coarse-grid transmission line, jcoarse is the original index in t step simulation, δt is the finegrid time-step, and the index is always truncated to an integer.

7.3

Parareal Application to AC System EMT Simulation

Unlike other research works on PiT simulation, which focus on specific differential equations, the power system EMT simulation needs to handle different kinds of equations and configurations. Also, the Parareal algorithm requires the ability to restart the simulation at an arbitrary time to perform the iterations. Therefore, it is necessary to model the power system on a higher-level abstraction. A componentbased system architecture is proposed to handle the PiT complexity flexibly and elegantly and serves as the fundamental element to build the PiT simulation program.

7.3.1

Component-Based System Architecture

A circuit is an undirected graph topological relationship between different components, where the components contain all the edge information of the graph. Therefore, the circuit can be represented with a vector of components. As shown in Fig. 7.5, the object-oriented concept is used to model the circuit system and components. Major properties of the circuit class contain the system matrix, state vectors, and a container for heterogeneous dynamic time-varying components. All the components inherit from abstract base component class TransientComponent so that the circuit object is able to call individual component functions with standard interface, initalize(), assemble_mat(), update_i(), and update_hist(), with

322

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids

Circuit Class (Worker) System Matrix

Initialize()

System History State Vector

Assemble_mat()

Update_i()

Node Voltage Vector Update_hist()

Transient Components RLCs

Transformers

Get_hist_nums()

Set_hist_index()

Transmission Lines Set_hist_vector()

Generators

Get_hist_vector()

Fig. 7.5 Circuit class architecture for the PiT-based EMT simulation program

object polymorphism features. The get_hist_nums(), set_hist_index(), set_hist_i(), and get_hist_vector() functions are interface to gather or scatter state variables between individual components to the system history state vector. The algorithm flowchart is shown in Fig. 7.6. Before solving the circuit, each component initializes its variables and equivalent conductance according to the time-step. Then, they assemble the global system matrix according to their branch information in the graph. With the matrix formed, the circuit is ready to be solved. With this architecture, the complexity of power system EMT models is encapsulated into two functions of circuit class, the init() and step(). The component-based objectoriented architecture has the following advantages: 1. High flexibility: It separates the individual component model and system-level computation logic with the application of TransientComponent. That means users can focus on describing model behaviors without caring about the structure of the global system. It is easy to substitute models with different object-oriented interfaces. By correctly implementing PiT interfacing functions, the components automatically become available for the PiT solution. 2. High scalability: The Circuit class serves as a generic solver for different kinds of systems. The number of components and matrix sizes can be arbitrary. For small systems, the implementation uses a single thread, which achieves the optimal performance, but for large-scale systems that have thousands of components, parallel class functions can be used to increase the performance and seamlessly integrated into the user’s application code. 3. Modular design: The Circuit class is not only composed by decoupled modules but also designed as the basic building block of a PiT algorithm based on Parareal, which significantly reduces the implementation difficultly and human errors.

7.3 Parareal Application to AC System EMT Simulation

323

Start

Parse system data file, determine time-step dt.

Circuit::init()

Call initialize() on each component.

Circuit::step() Assemble system matrix by calling assemble_mat() on each component.

Factor the matrix using LU decomposition.

Calling update_i() to form up right hand side.

Time += dt. Perform forward and backward substitution to get nodal solutions.

Update componnents’ own historical states by calling update_hist().

N

Time == end_time?

Y End Fig. 7.6 Flowchart of PiT-based EMT simulation program

7.3.2

Fixed Algorithm Implementation

The basic version of Parareal algorithm is by using multiple circuit instances proposed in the previous section as workers. There is one coarse-grid worker, which is initialized with a larger time-step as a predictor, and many fine-grid workers with a smaller time-step. Their workspace is initialized before the beginning of computation. The coarse- and fine-grid workers communicate with shared memory

324

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids Output

Parareal Initialization K=0, U=U0 Initial state

Predict

Coarse Vector G(k) K

K+1

K+2

Worker k

Worker k+1

Worker k+2

... …

Fine-grid subdivisions K = iteration index N=total workers

Coarse worker

N-2

N-1

Worker N-1

Worker N

... Assemble final results

Reload state vectors

Solve fine-grid steps

k=k+1

Updated Solve for k+1 states generation U(k+1) states

Fine-grid subdivisions

N-2



N-3

N-2

N-1

Updated states U(k+1)

N-1

N-1

N-1

N-2

N-2

N-2

...

...

Coarse worker

Partial states F - G

...

K+2

Keep it Inplace

Coarse states (k+1) G

...

Last fine-grid solution of subdivisions F K+1

(k+1)

...

already converged

True False Converged?

K+2

K+3

K+3

K+3

K+1

K+2

K+2

K+2

Coarse Vector G(k) K+1

K+2

...

N-2

N-1

Parallel Operation

Sequential Update

Fig. 7.7 Detailed procedures in the PiT-based EMT simulation

space. The algorithm working on a fixed full simulation time range is called the fixed Parareal algorithm in this work. Suppose there are N coarse time intervals with initial time T0 and state vector U0 , and k denotes the index number of Parareal iterations, then the algorithm is composed of the following four stages:

7.3.2.1 Initialization The first coarse worker initializes with the system states U 0 and k = 0 at t = 0; then the coarse worker generates initial guess with coarse time-step t and stores the N solutions into an array called Gk in Fig. 7.7, where k = 0. This initial prediction process is only called once to start a full cycle of Parareal iterations in the designated simulation time range. 7.3.2.2 Parallel Operation After G(k) is prepared, the fine-grid workers load the initial solution according to their subdivision’s position and initialize all variables in parallel. Notice that in each iteration, only N − k threads are launched to reduce overhead, since the first subdivision generates a solution which is guaranteed to converge. After initialization, fine-grid workers work on their workspace to simulate m fine-grid steps, where m · δt = t. The states we care about in Parareal iterations are located at the points on coarse-grid. The (m)th step states at each subdivision are extracted into the state vector F (k+1) except the worker N because there is no (N )th state in coarse prediction. But this is not a problem because the (N )th state can converge once the (N − 1)th state is converged.

7.3 Parareal Application to AC System EMT Simulation

325

According to (7.5), it is possible to exploit the parallelism by merging the F (k+1) − G (k) into the parallel operation stage. All workers execute P = F (k+1) − G(k) , which is shown as red in Fig. 7.7, in parallel after their solutions are done, which reduces the sequential overhead caused by the large-scale system. There is an exception that the (k + 1)th solution is already converged, so it is skipped and can be used directly in this and next stage. After this step, all threads synchronize and wait for SequentialUpdate of coarse states.

7.3.2.3 Sequential Update Now the computation U (k+1) = G(k+1) − P is performed so that all states are (k+1) corrected by new prediction. Obviously, Gj in Fig. 7.7 can only be processed in (k+1)

(k+1)

sequential depending on U j −1 . The process starts from computing Gk+2 . There is no mistake to start with (k + 2)th solution because the k state is already known and (k + 1)th state is guaranteed to converge. So the square with text (k + 1)th is (k+1) (k+1) loaded into coarse worker first and produces Gk+2 . Then U k+2 is computed so (k+1)

(k+1)

(k+1)

(k+1)

that it is ready to use U k+2 to compute U k+3 , etc. Finally, all U k+1 to U N −1 are updated, and it is the time to check the convergence. This is done by error =

N −1  j =k+1

(k+1)

U j

(k)

− Uj (k)

U j

,

(7.19)

which computes the system states’ relative distance between (k + 1)th and (k)th generation at each coarse time point and sums them up. If the error is larger than the tolerance, the index (k)th increases by one and goes back to the parallel operation with newly updated states. If the computation converged, it is the time to generate final result.

7.3.2.4 Output In this stage, all fine-grid workers use the converged coarse stated to fill all fine-grid steps in the output solution vector, which is the assembly procedure in Fig. 7.7.

7.3.3

Windowed Algorithm Implementation

The fixed algorithm would waste much computation effort to work on a long time range for large systems, especially when there are many transmission lines, which require previous historical data to propagate traveling waves. Additionally, the memory consumption for a fixed algorithm is unacceptable since it needs the fulllength memory allocation to work. Therefore, a windowed version is adapted from the fixed algorithm. The windowed algorithm only launches several parallel workers and works on a small time window near the start point. To keep the implementation simple, there is no overlap between two consecutive windows. The workers’ group first starts working on a known initial state and workspace. Once the solution for the current window is finished, the worker in charge of the last subdivision transfers the

326

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids

Program time

Simulation time Worker 1

Worker 2

Worker 3

Transfer States

Worker 1

Worker 2

Worker 3

Fig. 7.8 Example of windowed algorithm with three parallel workers Table 7.1 Performance comparison of different test cases with fixed Twindow = 200 μs, t = 40 μs, and δt = 1 μs for a 500 ms duration using five threads

IEEE-9 Parareal (s) 4.019 Parallel LU (s) 4.756 Sequential (s) 4.507 Parareal speedup 1.12 Theoretical speedup 1.88

IEEE-39 25.233 46.641 44.579 1.77 2.18

IEEE-118 283.393 385.024 591.015 2.09 2.24

data to the first worker, and all worker’s simulation time and indices are reset for the next time window. Figure 7.8 shows an example of how the algorithm works, the simulation time is the time for system simulation, and the program time is the actual time consumed by executing. The convergence and parallel efficiency increase a lot in this way. Moreover, the memory allocation for the windowed algorithm can be limited to one time window.

7.3.4

Case Studies

A CPU-based PiT EMT program was implemented in C++ with Intel® Threading Building Block (Intel® TBB) and Intel® Math Kernel Library (Intel® MKL). Tests are performed on the IEEE-9, IEEE-39, and IEEE-118 test systems to verify the PiT results against the sequential ones. In addition, the PiT performance is compared to a traditional spatial parallel computing implementation which utilized Intel® MKL’s highly optimized parallel lower-upper (LU) decomposition algorithm. The parameters of test cases are from [237]. The same thread number and algorithm configuration are used to compare the performance under the same condition. The time window and other parameters are shown in Table 7.1. For the IEEE-9 system, the minimal delay of transmission lines is 246 μs, which is enough for the 200 μs time window. But for the other test systems, there are some short transmission lines. The transmission line length in IEEE-39 is scaled up to 60–100 km so that it can fit the time window and get reasonable speedup. For the

7.3 Parareal Application to AC System EMT Simulation

327

IEEE-118 case, the transmission lines below 60 km are simplified to multiple PI sections, and the remaining 61 lines are modeled with the traveling wave model. Without the simplification, although good results can be obtained, the PiT algorithm falls back to the sequential program. A three-phase-to-ground fault happens at 0.3 s, Bus 8 in IEEE-9, and 0.3 s, Bus 14 in IEEE-39 case. A fault happens at 0.2 s in the IEEE-118 case at Bus 38. The fault resistor size is Rf ault = 0.01, Rclear = 1M. To achieve stable and correct results, the error tolerance of a whole time window is set to 0.01, which means the relative error sum of coarse steps cannot exceed 1%. Parallel workers are fixed to five for PiT cases, and they use the sequential LU algorithm. As shown in Fig. 7.9, the PiT simulation results coincide with those from the traditional method, and the zoomed-in views show the expected high accuracy. The results are verified with PSCAD™ /EMTDC™ . To evaluate the performance, the theoretical workload is defined by the simulation time consumption without any overhead from Parareal iterations or thread synchronizations. To achieve speedup, the PiT workload must be smaller than the sequential one to get speedup. The workload ratio can be obtained by Tpar /Tseq , where Tpar is the parallel workload and Tseq is the sequential workload, and the theoretical speedup is the reciprocal of workload ratio. Also, the parallel efficiency is computed to evaluate the utilization of parallel processors. The efficiency is the ratio of the speedup to the number of threads. Normally, a small-scale system cannot benefit from parallel computing due to the thread launching and synchronization overhead. As shown in Table 7.1, the parallel LU cannot achieve speedup under the IEEE-9 (matrix size 27 × 27) and IEEE-39 (matrix size 117 × 117) case, while in IEEE-118 (matrix size 354 × 354) case, the speedup is obvious. In contrast, the Parareal algorithm can get speedup in all three cases because the speedup is mainly determined by the temporal factors, which is more obvious in Table 7.5. Table 7.2 shows the performance of the IEEE-118 simulation with the same finegrid time-step δt = 1 μs and different coarse-grid time-steps. The PiT results are compared to sequential, parallel LU, and theoretical speedup. The sequential and parallel LU simulation uses the same time-step δt = 1 μs. All PiT simulation’s error tolerances are set to 0.01 per time window, so their results are on the same accuracy level. From Table 7.4, the best performance case is when the time window is 200 μs. In this case, it is 2.08× faster than the sequential one, while the parallel efficiency is 42%. The speedups are close to the theoretical ones, and the overhead is around 10%, indicating that the implementation is highly efficient. The actual speedups and parallel efficiency are higher than the MKL parallel LU case except for the 400 μs case. When t = 80 μs, the theoretical speed decreases. This is understandable as the minimum transmission line delay is 200 μs. The 400 μs time window exceeds the limit a lot so that the transmission lines cannot get historical data at the beginning, so it requires more iterations to converge. When t = 10 μs, each fine-grid worker only takes ten steps, and the coarse worker takes five steps, which means the minimum workload ratio is (5 + 10 + 4 + 10)/50 = 0.58. In this way, although it can converge in one Parareal iteration, the efficiency is limited

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids IEEE-39 Bus 4

IEEE-118 Bus 30

PSCAD Voltage (kV)

PSCAD Voltage (kV)

PSCAD Voltage (kV)

Parareal Voltage (kV)

Parareal Voltage (kV)

IEEE-9 Bus 7

Parareal Voltage (kV)

328

Time (ms) (b)

Time (ms) (c)

IEEE-39 Bus 14

IEEE-118 Bus 38

Parareal Current (kA) PSCAD Current (kA)

PSCAD Current (kA)

PSCAD Current (kA)

Parareal Current (kA)

Parareal Current (kA)

Time (ms) (a)

IEEE-9 Bus 8

Time (ms) (f)

IEEE-39 Zoomed-in

IEEE-118 Zoomed-in

Current (kA)

Current (kA)

Current (kA)

Voltage (kV)

Voltage (kV)

Time (ms) (e)

Voltage (kV)

Time (ms) (d) IEEE-9 Zoomed-in

Time (ms) (g)

Time (ms) (h)

Time (ms) (i)

Fig. 7.9 Simulation results of three-phase voltages: (a) Bus 7 in IEEE-9; (b) Bus 4 in IEEE-39; (c) Bus 30 in IEEE-118. Simulation results of fault currents: (d) Bus 8 in IEEE-9; (e) Bus 14 in IEEE-39; (f) Bus 38 in IEEE-118. Zoomed-in comparison: (g) voltages and currents of IEEE-9; (h) voltages and currents of IEEE-39; (i) voltages and currents of IEEE-118

7.3 Parareal Application to AC System EMT Simulation

329

Table 7.2 Comparison of sequential, parallel LU, and PiT IEEE-118 simulation with various Twindow , t, and fixed δt = 1 μs for a 300 ms duration using five threads Configuration 400 μs, 80 μs 200 μs, 40 μs 125 μs, 25 μs 50 μs, 10 μs Parallel LU Sequential

Average iterations 2.98 1.99 1.96 1.00 – –

Simulation time (s) 236.783 170.034 180.244 218.468 230.530 354.108

Speedup 1.50 2.08 1.97 1.62 1.54 1.00

Theoretical speedup 1.59 2.24 2.10 1.74 – –

Efficiency 30% 42% 39% 32% 31% –

Table 7.3 Performance comparison of various thread numbers with fixed t = 40 μs and δt = 1 μs for a 500 ms duration Thread Theoretical speedup number IEEE-9 IEEE-39 IEEE-118 16 2.27 1.43 1.93 12 1.87 1.74 1.91 8 2.18 1.74 1.83 6 2.14 2.21 2.55 5 2.14 2.18 2.24 4 1.79 1.79 1.83

Execution time (s) IEEE-9 IEEE-39 IEEE-118 8.63 52.79 480.76 7.52 39.26 408.86 4.40 34.50 340.75 3.77 25.16 256.80 3.63 25.23 287.95 4.08 29.52 334.33

Actual speedup IEEE-9 IEEE-39 0.52 0.84 0.60 1.13 1.03 1.29 1.20 1.76 1.25 1.76 1.11 1.50

IEEE-118 1.23 1.45 1.73 2.30 2.05 1.77

by the workload distribution between coarse- and fine-grid. Therefore, appropriate time-steps and worker assignment are significant to achieve practical speedup using the PiT method. Table 7.3 shows the performance of all test cases with different thread numbers. The t and δt remain constant for all thread numbers and test cases, so the time window size changes with the thread number. The different test cases show similar theoretical speedups. If the coarse prediction can match the fine-grid results on a wider range, the speedup can be higher. However, in power system EMT simulations, the delay of transmission lines restricts the time window size, so utilizing more threads may not get better performance. The exception is the IEEE9 case, which gets the best theoretical speedup with 16 threads. In general, the speedup of the Parareal method is mainly affected by the system’s time-domain characteristics such as the model’s time constants and simulation time-steps rather than the system’s scale. Although the time-domain characteristics are dominant, the overheads of multithread synchronization and error evaluations are noticed in the smallest test case. The actual speedup is further away from the theoretical ones in the IEEE-9 case compared to larger cases, indicating much room for improvement in small-scale systems. When the thread number equals 16, the CPU cannot finish all the jobs in parallel, so the speedup goes down significantly for all cases. The optimal thread number for these cases is five or six, which gives a time window of 200–240 μs. This is exactly the minimal transmission line delay set for all the test cases. This

330

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids

indicates that the transmission line delays are the main factors to affect the best speedup we can get. Exceeding this boundary causes inaccurate predictions, so the speedup drops down. Therefore, the treatment of the delays in transmission lines is significant for PiT power system EMT simulation. For IEEE-118 case, the best speedup is 2.30×, and the parallel efficiency is 38.3%, which is still higher than parallel LU decomposition.

7.4

Parallel-in-Time EMT Modeling of MMCs for DC Grid Simulation

The modular multilevel converter (MMC) has gained tremendous attention in high-voltage direct current (HVDC) transmission systems and multi-terminal DC (MTDC) system [164]. Electromagnetic transient (EMT) simulation is essential to the design and operation of MMC systems. MMCs have a large number of states due to the multilevel nature and require very small time-steps to accurately reflect the characteristics of nonlinear power electronic devices. This leads to high computation efforts and long execution time. To accelerate the simulation of MMCs, various acceleration solutions were proposed: for system-level simulation, [238] used degrees of freedom reduction technique to reduce internal nodes so that the computation effort is reduced; for device-level full-scale simulation, [226] utilized graphics processing unit (GPU) parallel computing to accelerate large-scale DC grid simulations with detailed device-level switch models; and [239, 240] use the field-programmable gate array (FPGA) and heterogeneous multiprocessor systemon-chip (MPSoC) to perform real-time MMC simulation. Although these parallel computing solutions achieved remarkable speedup, they are based on the physical topology and the partitioning of the system; the parallel efficiency is limited by the spatial structure of systems and the overhead of thread synchronization required by each small time-step. Parallel-in-time (PiT) methods may provide a new solution to overcome these limitations. Furthermore, device-level modeling of the MMC is desired to help engineers select the proper power switches and verify the control scheme and IGBT performance in more realistic scenarios. Besides, it is currently not practical to solve the whole power system using the PiT method, because all system elements must be converted to PiT models. Therefore, a practical approach to connect PiT systems to conventional systems is needed. In this section, first the Parareal implementations of the detailed ideal switch model and the transient curve-fitting IGBT model from [226] for device-level simulation are carried out. Based on the two implementations, a new PiT method is developed [241] by combining the ideal switch and transient curve-fitting model, which can avoid iterations to significantly improve the performance with flexible accuracy control by adjusting the coarse predictor. For large-scale system simulation, the traditional parallel-in-space (PiS) techniques are applied along with the PiT method, and an integration method based on transmission line propagation delays is proposed to integrate the PiT system into a conventional power system EMT simulation. Then, decoupled by the transmission lines, a parallel-in-time-and-

7.4 Parallel-in-Time EMT Modeling of MMCs for DC Grid Simulation

331

space (PiT+PiS) implementation is also presented to deal with large-scale systems with multiple MMC stations. The methods are tested on the CIGRÉ B4 DC grid test system composed of 11 201-level MMCs.

7.4.1

Modular Multilevel Converter Modeling

7.4.1.1 Three-Phase MMC Modeling This work utilizes the V -I decoupling method proposed in [242] and also described in Chap. 5 to build the MMC model. As shown in Fig. 7.10, the submodules of the MMC are solved as individual sub-circuits with the arm current of the previous time-step as the injection iarm (n − 1). This method is valid when the time constant of the submodule is much larger than the simulation time-step. The capacitors of submodules are often large, and the simulation time-step is only a few microseconds. This relaxation decouples the submodules from main circuit topology with a series of voltage sources as shown in Fig. 7.10a. However, pure voltage sources cannot be used in nodal analysis directly. To avoid this problem, the arm inductor can be merged into the arm and becomes the reduced equivalent model shown in Fig. 7.10b. This can be done by the following transformation: the arm equivalent circuit in Fig. 7.10b can be expressed by

iarm(n-1)

v1

SM1

v1

varm =∑ vsm

SM1 vsm

v2 SMN

Submodule

iLhist

v1-varm=v2

ieqhist=iLhist+varmGL

p

iLarm

Arm S1

n iLarm

S2

Node 1

Geq=GL

(b)

va vb vc

S1

S2

SMN

vsm N

(a)

Reduced Model

vdcp

SM1

1 vsm

Geq v3

GL

v3 Arm Equivalent

hist ieq

Node 3,4,5

n vdc

Node 2

(c)

Fig. 7.10 (a) v-i coupling method to simplify the submodules. (b) MMC arm equivalent circuit derived by DOF reduction. (c) Three-phase MMC equivalent circuit used in nodal analysis

332

7 Parallel-in-Time EMT and Transient Stability Simulation of AC-DC Grids



⎤ ⎡ ⎤ ⎡ ⎤ 0 0 0 v1 i1 ⎣ ⎦ ⎣ ⎦ ⎣ Gv = 0 GL −GL v2 = i2 ⎦ = i, 0 −GL GL v3 i3

(7.20)

where G is the admittance matrix and v and i are the nodal voltages and current injection of the 3x3 element. The constraint v2 = v1 − varm is expressed as ⎡ ⎤ ⎡ v1 1 ⎣v2 ⎦ = v = T vˆ + g = ⎣1 0 v3

⎡ ⎤ ⎤ 0 0   v 1 + ⎣−varm ⎦ , 0⎦ v3 1 0

(7.21)

ˆ which reduces the degree of freedom v2 . So the reduced new admittance matrix G ˆ and current injection i become ˆ = T T GT , iˆ = i − Gg. G

(7.22)

Since i1 = 0, i2 = −iLhist , andi3 = iLhist , the final element equation for the combined MMC arm is       v1 GL varm + iLhist GL −GL = , (7.23) −GL GL v3 −GL varm + iLhist which becomes the two-node element in Fig. 7.10b. This element can be used to construct the single-phase or three-phase MMC topology shown in Fig. 7.10c which yields a 5 × 5 element used in nodal analysis. An object-oriented design is used to implement this element. All these fundamental modules are wired up in a monolithic top-level MMC class in Fig. 7.11a for the final use in the EMT simulation program. To control the DC voltages or power, an upper-level controller module is used to measure the required signals and generate three-phase modular index mabc . The single-phase MMC module receives the upper-level control signals and generates gate signals by the lower-level drive control system. The nearest-level-modulation (NLM) scheme [243] is applied. Based on the proposed MMC architecture, two kinds of the half-bridge submodules (HBSMs) are investigated in this work.

7.4.1.2 Ideal Switch Model The HBSM has two IGBTs and two diodes which generate four valid states of the conduction shown in Fig. 7.12. In normal operation, the control signal S2 S1 = 01 means the SM is inserted and the capacitor voltage appears in the main circuit; S2 S1 = 10 means the capacitor is bypassed and the capacitor stops charging or discharging; S2 S1 = 00 is the blocking state which may be used for charging up the capacitor in the initialization stage. For the ideal switch model, the on-state resistor of the diode and IGBT can be the same. Therefore, only two pre-known conduction

7.4 Parallel-in-Time EMT Modeling of MMCs for DC Grid Simulation

ω ω

333

mabc

Fig. 7.11 (a) Top-level module of the three-phase MMC. (b) Upper-level controller for DC voltages or PQ power control. (c) Single-phase MMC module implementation

State Iarm >0

S2S1=01

S2S1=10

S2S1=00

S1

S1

S1

S2

S2

S2

S1

S1

S1

S2

S2

S2

Iarm