Stochastic Methods for Modeling and Predicting Complex Dynamical Systems: Uncertainty Quantification, State Estimation, and Reduced-Order Models 3031222482, 9783031222481



English Pages 207 [208] Year 2023


Table of contents :
Preface
Acknowledgements
Contents
1 Stochastic Toolkits
1.1 Review of Basic Probability Concepts
1.1.1 Random Variable, Probability Measure, and Probability Density Function (PDF)
1.1.2 Gaussian Distribution
1.1.3 Moments and Non-Gaussian Distributions
1.1.4 Law of Large Numbers and Central Limit Theorem
1.2 Stochastic Processes
1.3 Stochastic Differential Equations (SDEs)
1.3.1 Itô Stochastic Integral
1.3.2 Itô's Formula
1.3.3 Fokker-Planck Equation
1.4 Markov Jump Processes
1.4.1 A Motivating Example of Intermittent Time Series
1.4.2 Finite-State Markov Jump Process
1.4.3 Master Equation
1.4.4 Switching Times
1.5 Chaotic Systems
2 Introduction to Information Theory
2.1 Shannon's Entropy and Maximum Entropy Principle
2.1.1 Shannon's Intuition from the Theory of Communication
2.1.2 Definition of Shannon's Entropy
2.1.3 Shannon's Entropy in Gaussian Framework
2.1.4 Maximum Entropy Principle
2.1.5 Coarse Graining and the Loss of Information
2.2 Relative Entropy, Quantifying the Model Error and Additional Lack of Information
2.2.1 Definition of Relative Entropy
2.2.2 Relative Entropy in Gaussian Framework
2.2.3 Maximum Relative Entropy Principle
2.3 Mutual Information
2.3.1 Definition of Mutual Information
2.3.2 Mutual Information in Gaussian Framework
2.4 Relationship Between Information and Path-Wise Measurements
2.4.1 Two Widely Used Path-Wise Measurements: Root-Mean-Square Error (RMSE) and Pattern Correlation
2.4.2 Relationship Between Shannon's Entropy of Residual and RMSE
2.4.3 Relationship Between Mutual Information and Pattern Correlation
2.4.4 Why Relative Entropy is Important?
3 Basic Stochastic Computational Methods
3.1 Monte Carlo Method
3.1.1 The Basic Idea
3.1.2 A Simple Example
3.1.3 Error Estimates for the Monte Carlo Method
3.2 Numerical Schemes for Solving SDEs
3.2.1 Euler-Maruyama Scheme
3.2.2 Milstein Scheme
3.3 Ensemble Method for Solving the Statistics of SDEs
3.4 Kernel Density Estimation
4 Simple Gaussian and Non-Gaussian SDEs
4.1 Linear Gaussian SDEs
4.1.1 Reynolds Decomposition and Time Evolution of the Moments
4.1.2 Statistical Equilibrium State and Decorrelation Time
4.1.3 Fokker-Planck Equation
4.2 Non-Gaussian SDE: Linear Model with Multiplicative Noise
4.2.1 Solving the Exact Path-Wise Solution
4.2.2 Equilibrium Distribution
4.2.3 Time Evolution of the Moments
4.3 Non-Gaussian SDE: Scalar Model with Cubic Nonlinearity and Correlated Additive and Multiplicative (CAM) Noise
4.3.1 Equilibrium PDF
4.3.2 Time Evolution of the Moments
4.3.3 Quasi-Gaussian Closure for the Moment Equations Associated with the Cubic Model
4.4 Nonlinear SDEs with Exactly Solvable Conditional Moments
4.4.1 The Coupled Nonlinear SDE System
4.4.2 Derivation of the Exact Solvable Conditional Moments
5 Data Assimilation
5.1 Introduction
5.2 Kalman Filter
5.2.1 One-Dimensional Kalman Filter: Basic Idea of Data Assimilation
5.2.2 A Simple Example
5.2.3 Multi-Dimensional Case
5.2.4 Some Remarks
5.3 Nonlinear Filters
5.3.1 Extended Kalman Filter
5.3.2 Ensemble Kalman Filter
5.3.3 A Numerical Example
5.3.4 Particle Filter
5.4 Continuous-In-Time Version: Kalman-Bucy Filter
5.5 Other Data Assimilation Methods and Applications
6 Prediction
6.1 Ensemble Forecast
6.1.1 Trajectory Forecast Versus Ensemble Forecast
6.1.2 Lead Time and Ensemble Mean Forecast Time Series
6.2 Model Error, Internal Predictability and Prediction Skill
6.2.1 Important Factors for Useful Prediction
6.2.2 Quantifying the Predictability and Model Error via Information Criteria
6.3 Procedure of Designing Suitable Forecast Models
6.4 Predicting Model Response via Fluctuation-Dissipation Theorem
6.4.1 The FDT Framework
6.4.2 Approximate FDT Methods
6.4.3 A Nonlinear Example
6.5 Finding the Most Sensitive Change Directions via Information Theory
6.5.1 The Mathematical Framework Using Fisher Information
6.5.2 Examples
6.5.3 Practical Strategies Utilizing FDT
7 Data-Driven Low-Order Stochastic Models
7.1 Motivations
7.2 Complex Ornstein–Uhlenbeck (OU) Process
7.2.1 Calibration of the OU Process
7.2.2 Compensating the Effect of Complicated Nonlinearity by Simple Stochastic Noise
7.2.3 Application: Reproducing the Statistics of the Two-Layer Quasi-Geostrophic Turbulence
7.3 Combining Stochastic Models with Linear Analysis in PDEs to Model Spatial-Extended Systems
7.4 Linear Stochastic Model with Multiplicative Noise
7.4.1 Exact Formula for Model Calibration
7.4.2 Approximating Highly Nonlinear Time Series
7.4.3 Application: Characterizing an El Niño Index
7.5 Stochastically Parameterized Models
7.5.1 The Necessity of Stochastic Parameterization
7.5.2 The Stochastic Parameterized Extended Kalman Filter (SPEKF) Model
7.5.3 Filtering Intermittent Time Series
7.6 Physics-Constrained Nonlinear Regression Models
7.6.1 Motivations
7.6.2 The General Framework of Physics-Constrained Nonlinear Regression Models
7.6.3 Comparison of the Models with and Without Physics Constraints
7.7 Discussion: Linear and Gaussian Approximations for Nonlinear Systems
8 Conditional Gaussian Nonlinear Systems
8.1 Overview of the Conditional Gaussian Nonlinear System (CGNS)
8.1.1 The Mathematical Framework
8.1.2 Examples of Complex Dynamical Systems Belonging to the CGNS
8.1.3 The Mathematical and Physical Reasonings Behind the CGNS
8.2 Closed Analytic Formulae for Solving the Conditional Statistics
8.2.1 Nonlinear Optimal Filtering
8.2.2 Nonlinear Optimal Smoothing
8.2.3 Nonlinear Optimal Conditional Sampling
8.2.4 Example: Comparison of Filtering, Smoothing and Sampling
8.3 Lagrangian Data Assimilation
8.3.1 The Mathematical Setup
8.3.2 Filtering Compressible and Incompressible Flows
8.3.3 Uncertainty Quantification Using Information Theory
8.4 Solving High-Dimensional Fokker-Planck Equations
8.4.1 The Basic Algorithm
8.4.2 A Simple Example
8.4.3 Block Decomposition
8.4.4 Statistical Symmetry
8.4.5 Application to FDT
8.5 Application: Modeling and Predicting Monsoon Intraseasonal Oscillation (MISO)
8.5.1 MISO Indices from a Nonlinear Data Decomposition Technique
8.5.2 Data-Driven Physics-Constrained Nonlinear Model
8.5.3 Data Assimilation and Prediction
9 Parameter Estimation with Uncertainty Quantification
9.1 Markov Chain Monte Carlo
9.1.1 The Metropolis Algorithm
9.1.2 A Simple Example
9.1.3 Parameter Estimation via MCMC
9.1.4 MCMC with Data Augmentation
9.2 Expectation-Maximization
9.2.1 The Mathematical Framework
9.2.2 Details of the Quadratic Optimization in the Maximization Step
9.2.3 Incorporating the Constraints into the EM Algorithm
9.2.4 A Numerical Example
9.3 Parameter Estimation via Data Assimilation
9.3.1 Two Parameter Estimation Algorithms
9.3.2 Estimating One Additive Parameter in a Linear Scalar Model
9.3.3 Estimating One Multiplicative Parameter in a Linear Scalar Model
9.3.4 Estimating Parameters in a Cubic Nonlinear Scalar Model
9.3.5 A Numerical Example
9.4 Learning Complex Dynamical Systems with Sparse Identification
9.4.1 Constrained Optimization for Sparse Identification
9.4.2 Using Information Theory for Model Identification with Sparsity
9.4.3 Partial Observations and Stochastic Parameterizations
10 Combining Stochastic Models with Machine Learning
10.1 Machine Learning
10.2 Data Assimilation with Machine Learning Forecast Models
10.2.1 Using Machine Learning as Surrogate Models for Data Assimilation
10.2.2 Using Data Assimilation to Provide High-Quality Training Data of Machine Learning Models
10.3 Machine Learning for Stochastic Closure and Stochastic Parameterization
10.4 Statistical Forecast Using Machine Learning
10.4.1 Forecasting the Moment Equations with Neural Networks
10.4.2 Incorporating Additional Forecast Uncertainty in the Machine Learning Path-Wise Forecast Results
11 Instruction Manual for the MATLAB Codes
References
Index

Synthesis Lectures on Mathematics & Statistics

Nan Chen

Stochastic Methods for Modeling and Predicting Complex Dynamical Systems: Uncertainty Quantification, State Estimation, and Reduced-Order Models

Synthesis Lectures on Mathematics & Statistics Series Editor Steven G. Krantz, Department of Mathematics, Washington University, Saint Louis, MO, USA

This series includes titles in applied mathematics and statistics for cross-disciplinary STEM professionals, educators, researchers, and students. The series focuses on new and traditional techniques to develop mathematical knowledge and skills, an understanding of core mathematical reasoning, and the ability to utilize data in specific applications.

Nan Chen
Department of Mathematics, University of Wisconsin–Madison, Madison, WI, USA

ISSN 1938-1743 ISSN 1938-1751 (electronic) Synthesis Lectures on Mathematics & Statistics ISBN 978-3-031-22248-1 ISBN 978-3-031-22249-8 (eBook) https://doi.org/10.1007/978-3-031-22249-8 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

This book is dedicated to my wife, Yingxin, to my daughter, Amy, to my son, (to be named), to my parents, Bangyi and Shan, and to my grandma, Xiujuan.

Preface

Complex dynamical systems are ubiquitous in many areas, including geoscience, engineering, neural science, and material science. Modeling and predicting complex systems are central scientific problems with significant societal impacts. Effective models capture the underlying dynamics and the observed statistics of nature. They also facilitate studying many related vital tasks, such as uncertainty quantification (UQ), data assimilation, and prediction. In addition to accuracy in characterizing essential dynamical and statistical features, the computational efficiency of simulating, analyzing, and predicting these models is extremely important for real-time state estimation and forecast. Therefore, fast computational algorithms, rigorous mathematical theories, and suitable approximations with model reduction need to be incorporated into the data-driven and physics-based modeling approaches that advance the development of qualitative and quantitative models. As many natural phenomena and scientific problems are too complicated to be modeled in full detail, stochastic methods play a crucial role in effectively and efficiently characterizing the key features of these systems and implementing skillful forecasts. Meanwhile, developing appropriate measurements to assess model error and model uncertainty, from both path-wise and statistical viewpoints, is indispensable for understanding the strengths and shortcomings of these models in practice. This book aims to introduce a wide range of stochastic methods for modeling, analyzing, and predicting complex dynamical systems. In particular, improving computational efficiency and facilitating the study of the associated UQ, data assimilation, and prediction problems are highlighted, and these essential issues are taken into account in the model development procedure. Data-driven techniques and physics-based reasoning complement each other in achieving these goals.
The book was written based on several years of work on these topics and on teaching stochastic methods to undergraduate and graduate students. The book is self-contained, and readers can follow it with a minimal mathematical background. It should be of interest to undergraduate students, graduate students, postdocs, researchers, and scientists in pure and applied mathematics, physics, engineering, statistics, climate science, neural science, material science, and many other areas working on modeling, analyzing, and predicting complex dynamical systems. Throughout the book, underlying motivations, extensive discussions, and simple examples are utilized to facilitate the understanding of sophisticated methods and advanced techniques. The applications of stochastic modeling and prediction tools to several significant real-world problems are also presented to demonstrate the use of these methods in practice. To further advance the understanding of these approaches, MATLAB® codes for validating the concepts and reproducing the results from the illustrative examples in the book are provided. The book can be used as a one-semester textbook for a high-level undergraduate or graduate course. It can also serve as a toolkit for interdisciplinary research work.

Madison, WI, USA
September 2022

Nan Chen

Acknowledgements

First, I would like to thank my late advisor Dr. Andrew (Andy) J. Majda, who introduced me to the research areas of stochastic modeling, uncertainty quantification, data assimilation, information theory, and prediction, as well as to applying these tools to improve the understanding of many real-world problems. Andy provided me with the support and motivation to study many exciting topics and helped me understand the interconnections across different areas. Combining rigorous mathematical theory, novel numerical algorithms, qualitative or quantitative models, and critical thinking in understanding nature is what I learned from him, and it is the philosophy behind developing the methods and presenting the results in this book. It is also worth mentioning that some advanced topics and simple illustrative examples in the book are based on my collaborative research with Andy. Next, I express my sincere gratitude to Dr. Reza Malek-Madani, who encouraged and supported my research over many years. Most of the examples involving real-world applications in this book are related to the research work that Dr. Malek-Madani supported, and he also provided many vital suggestions that improved the contents of this book. The book and the related teaching and research components are funded by the Office of Naval Research (ONR) award N00014-21-1-2904 and the ONR Multidisciplinary University Initiative (MURI) award N00014-19-1-2421. Some of the basic materials in the book were based on a topic course I co-lectured with Dr. Michal Branicki, Dr. Dimitris Giannakis, Dr. Di Qi, Dr. Themis Sapsis, and Dr. Xin Tong at the Courant Institute a few years ago. I wish to use this chance to thank them for having worked together on preparing some of the lecture notes, which facilitated the writing of this book. Special thanks to my collaborators, friends, students, and postdocs who helped review the book and provided crucial comments: Dr. Quanling Deng, Dr. Evelyn Lunasin, Mr. Jeffrey Covington, Dr. Changhong Mou, Ms. Yinling Zhang, and Mr. Marios Andreou. I also want to thank the Springer Nature Publisher group, especially Ms. Susanne Filler and Ms. Melanie Carlson, for helping edit the book and providing technical support during the entire process of writing this book.

Finally, I appreciate you, the readers, for all the support for this book. I will continue to appreciate any thoughts, suggestions, or corrections you might have on this book. Please send your comments and feedback to [email protected]. Thanks in advance for taking the time to contribute.


1 Stochastic Toolkits

1.1 Review of Basic Probability Concepts

This section includes a summary of the basic probability concepts that are essential in developing stochastic toolkits for modeling and predicting complex dynamical systems. As the focus is on applied science, non-rigorous, informal definitions are often utilized below to facilitate understanding of the concepts. Detailed and rigorous statements of these concepts can be found in classical probability theory books, for example, [109, 117, 121]. The material presented here mainly concerns continuous random variables, which occur frequently in science and engineering. These concepts can be easily extended to discrete cases.

1.1.1 Random Variable, Probability Measure, and Probability Density Function (PDF)

In this and the subsequent chapters, vectors are often written in bold font while scalars are in plain or italic font.

Definition 1.1 (Random variable) A random variable $X: \Omega \to E$ is a measurable function from the set of possible outcomes $\Omega$ to some set $E$. In many examples presented below, $E = \mathbb{R}^n$, which is the $n$-dimensional real space.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_1


Definition 1.2 (Cumulative distribution function) The cumulative distribution function of the random variable $\mathbf{X} = (X_1, \ldots, X_n)$ is the function $F_{\mathbf{X}}: \mathbb{R}^n \to [0, 1]$ defined by

$$F_{\mathbf{X}}(\mathbf{x}) := \mathbb{P}(\mathbf{X} \leq \mathbf{x}) \quad \text{for all } \mathbf{x} \in \mathbb{R}^n, \tag{1.1}$$

where $\mathbb{P}$ is a probability measure and the notation $\mathbf{X} \leq \mathbf{x}$ should be understood componentwise as $X_1 \leq x_1, X_2 \leq x_2, \ldots, X_n \leq x_n$. If $\{\mathbf{X}_1, \ldots, \mathbf{X}_m\}$ represents a countable collection of $n$-dimensional random variables, then their joint cumulative distribution function is $F_{\mathbf{X}_1, \ldots, \mathbf{X}_m}: \mathbb{R}^{n \times m} \to [0, 1]$ and is given by

$$F_{\mathbf{X}_1, \ldots, \mathbf{X}_m}(\mathbf{x}_1, \ldots, \mathbf{x}_m) := \mathbb{P}(\mathbf{X}_1 \leq \mathbf{x}_1, \ldots, \mathbf{X}_m \leq \mathbf{x}_m) \tag{1.2}$$

for all $\mathbf{x}_i \in \mathbb{R}^n$, $i = 1, \ldots, m$.

Definition 1.3 (Probability density function (PDF)) Suppose that $\mathbf{X}$ is a random variable and $F_{\mathbf{X}}$ is its distribution function. If there exists a nonnegative, (Riemann) integrable function $p: \mathbb{R}^n \to \mathbb{R}$ such that

$$F_{\mathbf{X}}(\mathbf{x}) = \mathbb{P}(\mathbf{X} \leq \mathbf{x}) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_n} p(y_1, \ldots, y_n) \, dy_1 \ldots dy_n,$$

with $\mathbf{x} = (x_1, \ldots, x_n)$, then $p$ is called the probability density function (PDF) of $\mathbf{X}$.

Definition 1.4 (Marginal density) Suppose that $\mathbf{X}: \Omega \to \mathbb{R}^n$ is a random variable with PDF $p(x_1, \ldots, x_n)$. The marginal probability density $\tilde{p}(x_i)$ for $x_i$, $i \in [1, \ldots, n]$, is given by

$$\tilde{p}(x_i) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(y_1, \ldots, x_i, \ldots, y_n) \, dy_1 \ldots dy_{i-1} \, dy_{i+1} \ldots dy_n.$$

Definition 1.5 (Conditional distribution) Given two jointly distributed random variables $x_1$ and $x_2$, the conditional probability distribution of $x_1$ given $x_2$ is the probability distribution of $x_1$ when $x_2$ is known to be a specific value:

$$p(x_1 | x_2) = \frac{p(x_1, x_2)}{p(x_2)}. \tag{1.3}$$

1.1.2 Gaussian Distribution

The Gaussian (or normal) distribution is widely used in practice. A Gaussian random variable $\mathbf{X}$ is characterized by the following PDF:

$$p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n |\boldsymbol{\Sigma}|}} \, e^{-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})}, \tag{1.4}$$


where $\boldsymbol{\mu} \equiv \langle \mathbf{x} \rangle$ is the mean, the symmetric and positive definite matrix $\boldsymbol{\Sigma} \equiv \langle (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \rangle$ is the covariance, $\cdot^T$ is the vector transpose, and $\langle \cdot \rangle := \int_{-\infty}^{\infty} \cdot \, p(\mathbf{x}) \, d\mathbf{x}$ is the statistical average. A Gaussian distribution is uniquely determined by these two statistics and is often denoted as $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. For the one-dimensional case, the standard Gaussian distribution refers to the one with zero mean and unit variance.

Proposition 1.1 (Marginal density of Gaussian) For a Gaussian random variable, the marginal density over a subset of components is obtained by simply dropping the remaining components. For example, given a three-dimensional Gaussian random variable $(x_1, x_2, x_3)^T$ with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ being

$$\boldsymbol{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix},$$

the marginal two-dimensional density for $(x_1, x_3)^T$ remains Gaussian, with the marginal mean $\tilde{\boldsymbol{\mu}}$ and the marginal covariance $\tilde{\boldsymbol{\Sigma}}$ being

$$\tilde{\boldsymbol{\mu}} = \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}, \qquad \tilde{\boldsymbol{\Sigma}} = \begin{pmatrix} \Sigma_{11} & \Sigma_{13} \\ \Sigma_{31} & \Sigma_{33} \end{pmatrix}.$$

Proposition 1.2 (Conditional density of Gaussian) In addition to the marginal density, the conditional density of a Gaussian random variable is also Gaussian. For example, consider a Gaussian random variable $\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2)^T$ with the mean $\boldsymbol{\mu}$ and the covariance $\boldsymbol{\Sigma}$ being of the following form,

$$\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \qquad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix},$$

where $\mathbf{x}_1$ and $\mathbf{x}_2$ are multi-dimensional. The conditional density is then given by

$$p(\mathbf{x}_1 | \mathbf{x}_2) \sim \mathcal{N}(\bar{\boldsymbol{\mu}}, \bar{\boldsymbol{\Sigma}}), \tag{1.5}$$

where

$$\bar{\boldsymbol{\mu}} = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2) \quad \text{and} \quad \bar{\boldsymbol{\Sigma}} = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}. \tag{1.6}$$

One immediate conclusion drawn from (1.6) is that, for a Gaussian distribution, the conditional covariance $\bar{\boldsymbol{\Sigma}}$ is always 'no bigger' than the marginal one $\boldsymbol{\Sigma}_{11}$, since the matrix $\boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}$ is positive semi-definite. When both $\mathbf{x}_1$ and $\mathbf{x}_2$ are scalar variables, the above two covariance matrices degenerate to variances, and the conditional variance is expected to be equal to or less than the marginal variance. In the multi-dimensional case, the comparison can be made by computing the areas of the hyperellipses associated with the covariance matrices.
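The formulas in (1.6) can be checked with a few lines of arithmetic. In the following sketch, both $x_1$ and $x_2$ are scalars, so the covariance blocks reduce to variances; the particular numbers are made up for illustration and are not from the book.

```python
# Scalar illustration of the conditional Gaussian formulas (1.5)-(1.6);
# the numbers below are made up for demonstration.
mu1, mu2 = 1.0, -2.0                      # marginal means of x1 and x2
S11, S12, S21, S22 = 2.0, 0.8, 0.8, 1.0   # covariance entries (scalars here)

def conditional(x2):
    """Conditional mean and variance of x1 given an observed value of x2."""
    mu_bar = mu1 + S12 / S22 * (x2 - mu2)   # mean update in (1.6)
    sigma_bar = S11 - S12 / S22 * S21       # covariance update in (1.6)
    return mu_bar, sigma_bar

mu_bar, sigma_bar = conditional(x2=0.0)
# The conditional variance (1.36) is strictly below the marginal variance 2.0.
print(mu_bar, sigma_bar)
```

Knowing $x_2$ shifts the mean of $x_1$ toward the observation and shrinks its variance, exactly the uncertainty-reduction mechanism discussed above.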


For a Gaussian random variable, it is natural and intuitive to utilize the covariance to characterize its uncertainty. It will be shown rigorously in Chap. 2, from the information theory viewpoint, that the determinant of the covariance is indeed a unique way to quantify the uncertainty of a Gaussian random variable. Therefore, the result in (1.6) indicates that the additional information provided by $\mathbf{x}_2$ facilitates the reduction of the uncertainty of the state variable $\mathbf{x}_1$. It is also worthwhile to remark that if $\mathbf{x}_2$ is regarded as the observational information while $\mathbf{x}_1$ is the forecast from a separate model, then (1.6) captures the essence of data assimilation, in which observational data are incorporated into the model forecast to facilitate the reduction of the uncertainty and the correction of the model forecast error. The details of data assimilation and uncertainty quantification will be presented in Chap. 5.

1.1.3 Moments and Non-Gaussian Distributions

Consider a scalar random variable $X$ with PDF $p(x)$. The $n$-th moment and the $n$-th central moment are defined, respectively, as

$$\mu_n = \mathbb{E}[X^n] = \int_{-\infty}^{\infty} x^n p(x) \, dx, \quad \text{for } n \geq 1, \qquad \tilde{\mu}_n = \mathbb{E}[(X - \mu)^n] = \int_{-\infty}^{\infty} (x - \mu)^n p(x) \, dx, \quad \text{for } n \geq 2. \tag{1.7}$$

The expressions of the moments are sometimes denoted as $\mu_n = \langle x^n \rangle$ for notational simplicity, where again $\langle \cdot \rangle := \int_{-\infty}^{\infty} \cdot \, p(x) \, dx$ stands for the statistical average. The two most widely used moments are the mean and the variance. The mean $\mu$ is the first moment, representing the average value of the random variable, while the variance $\tilde{\mu}_2$ is the second central moment, measuring how far a set of numbers spreads out from the mean. The square root of the variance is named the standard deviation, which has the same unit as the random variable. For multi-dimensional random variables, the second central moment is called the covariance, which measures the joint variability.

For $n \geq 3$, the standardized moments provide more intuitive characterizations of a random variable and the associated PDF. A standardized moment is a normalized central moment (with $n \geq 3$). The normalization is typically a division by an expression of the standard deviation, which renders the moment scale invariant. See Table 1.1 for the definitions, descriptions, and formulae of the third ($n = 3$) and fourth ($n = 4$) standardized moments. They are named skewness and kurtosis, which characterize the asymmetry and the "peakedness" of the PDF associated with the random variable, respectively.

Table 1.1 The definitions of the mean, variance, skewness and kurtosis

Statistics | Description | Moment | Formula
Mean | The central tendency | 1st moment | $\mu$
Variance | Spreading out | 2nd central moment | $\tilde{\mu}_2$
Skewness | The asymmetry | 3rd standardized moment | $\tilde{\mu}_3 / \sigma^3$
Kurtosis | The "peakedness" | 4th standardized moment | $\tilde{\mu}_4 / \sigma^4$

When the skewness is zero, the PDF is symmetric. A positive (negative) skewness commonly indicates that the tail is on the right (left) side of the distribution. On the other hand, a PDF with kurtosis equal to the standard value of 3 has the same tail behavior as the Gaussian distribution. When the kurtosis is smaller (larger) than 3, the random variable produces fewer (more) extreme outliers than the corresponding Gaussian counterpart. Therefore, the associated PDF has lighter (heavier) tails than the Gaussian distribution. Excess kurtosis, defined as the kurtosis minus 3, is sometimes utilized.

Figure 1.1 displays several PDFs with the same mean and variance but different skewness and kurtosis. The dashed black curve in each panel stands for the Gaussian distribution (known as the Gaussian fit), whose skewness and kurtosis are 0 and 3, respectively. As expected, with the increase in skewness, the asymmetry in the PDF becomes more and more significant. Similarly, the increased kurtosis leads to fat tails in the PDF. Note that it is often easier to observe the fat tails when the y-axis (i.e., the probability density) is plotted on a logarithmic scale (see the second row of Fig. 1.1). In such a situation, the profile of a Gaussian distribution becomes a parabola. When the kurtosis exceeds 3, the tails of the

Fig. 1.1 PDFs with different skewness (Panel a) and kurtosis (Panel b). The top row shows the PDFs in the standard coordinates with a linear scale, while the bottom row shows the same PDFs with the y-axis (the probability) plotted on a logarithmic scale, which facilitates the illustration of the tail behavior. The dashed black curves are the Gaussian fits of the non-Gaussian distributions, which become parabolas when the y-axis is plotted on a logarithmic scale. The skewness is zero, and the kurtosis is the standard value of 3, for the Gaussian distribution.


Fig. 1.2 Examples of non-Gaussian systems. Panel a: the El Niño-Southern Oscillation indices, where the Nino 3 index has a positive skewness and a one-sided fat tail while the Nino 4 index is negatively skewed and sub-Gaussian. Panel b: the Modesto irrigation district precipitation, which is positively skewed and fat-tailed. Panel c: the numerical solution of a stochastically coupled FitzHugh-Nagumo (FHN) model, the distribution of the state variable $u$, and the distribution of the time interval $T$ between successive oscillations in $u$.

PDF lie above the associated Gaussian fit. When the kurtosis is less than 3, the tails decrease faster than those of the Gaussian fit, and the distribution is known as sub-Gaussian.

Many complex systems in nature have strong non-Gaussian features, which often come from complex nonlinear interactions between state variables, intermittent instability, rare and extreme events, and regime switching. Figure 1.2 includes some typical examples of non-Gaussian systems: (a) the El Niño-Southern Oscillation (ENSO) indices [6, 207], where the ENSO is a well-known climate phenomenon related to climate change, (b) a precipitation time series, and (c) a stochastically coupled FitzHugh-Nagumo (FHN) model [162], which is a prototype excitable-medium system describing the activation and deactivation of a spiking neuron.
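The sample versions of the standardized moments in Table 1.1 are straightforward to compute. The following sketch is an illustrative aside, not code from the book: it contrasts a Gaussian sample (skewness near 0, kurtosis near 3) with an Exponential(1) sample, which is positively skewed (theoretical skewness 2) and fat-tailed (theoretical kurtosis 9).

```python
import random

# Sample skewness and kurtosis, the 3rd and 4th standardized moments
# of Table 1.1 (illustrative sketch, not code from the book).
random.seed(4)

def standardized_moments(xs):
    """Return (skewness, kurtosis) = (mu3 / sigma^3, mu4 / sigma^4)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n   # variance
    m3 = sum((x - mean) ** 3 for x in xs) / n   # third central moment
    m4 = sum((x - mean) ** 4 for x in xs) / n   # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2

gauss_sample = [random.gauss(0.0, 1.0) for _ in range(100_000)]
expo_sample = [random.expovariate(1.0) for _ in range(100_000)]
print(standardized_moments(gauss_sample))  # near (0, 3)
print(standardized_moments(expo_sample))   # near (2, 9): skewed and fat-tailed
```

The exponential sample reproduces the qualitative signature of the fat-tailed PDFs in Fig. 1.1: a positive skewness and a kurtosis well above the Gaussian value of 3.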

1.1.4 Law of Large Numbers and Central Limit Theorem

The law of large numbers describes the result of performing the same experiment many times. It states that the average of the results obtained from a large number of trials becomes closer and closer to the expected value as more trials are performed. The version presented here assumes finite variance.


Proposition 1.3 (Law of large numbers with finite variance) Suppose that $X_1, X_2, X_3, \ldots$ are independent and identically distributed (i.i.d.) random variables with finite mean $\mathbb{E}[X_1] = \mu$ and finite variance $\mathrm{Var}(X_1) = \sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Then for any fixed $\epsilon > 0$, the law of large numbers implies:

$$\lim_{n \to \infty} \mathbb{P}\left( \left| \frac{S_n}{n} - \mu \right| \geq \epsilon \right) = 0. \tag{1.8}$$

The proof of the law of large numbers (1.8) exploits Chebyshev's inequality, and the details can be found in [121].

Proposition 1.4 (Central limit theorem) Suppose that $X_1, X_2, X_3, \ldots$ are i.i.d. random variables with finite mean $\mathbb{E}[X_1] = \mu$ and finite variance $\mathrm{Var}(X_1) = \sigma^2$. Let $S_n = X_1 + \cdots + X_n$. Then for any fixed $-\infty \leq a \leq b \leq \infty$, the central limit theorem says

$$\lim_{n \to \infty} \mathbb{P}\left( a \leq \frac{S_n - n\mu}{\sigma \sqrt{n}} \leq b \right) = \Phi(b) - \Phi(a) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{y^2}{2}} \, dy,$$

where $\Phi(\cdot)$ is the standard Gaussian cumulative distribution function (1.1). The central limit theorem can be approximately expressed as follows:

$$\frac{S_n}{n} = \mu + \frac{\sigma}{\sqrt{n}} \cdot \frac{S_n - n\mu}{\sigma \sqrt{n}} \approx \mu + \frac{\sigma}{\sqrt{n}} Z, \tag{1.9}$$

where $Z$ is a standard normal variable. The first equality above decomposes $S_n/n$ into a sum of its mean $\mu$ and a random error. For large $n$, this random error is approximately normally distributed with standard deviation $\sigma/\sqrt{n}$. Note that the above approximation explains why $S_n/n$ is a reasonable estimator for $\mu$ and is the basis of the Monte Carlo method. The latter is an important tool in simulating stochastic processes, which will be discussed in Chap. 3.
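The Monte Carlo idea can be illustrated in a few lines. The sketch below uses an assumed Exponential(1) example (not one from the book): the estimator $S_n/n$ targets the true mean $\mu = 1$, and its root-mean-square error shrinks like $\sigma/\sqrt{n}$ (here $\sigma = 1$), consistent with the approximation (1.9).

```python
import random

# Monte Carlo estimation of the mean of Exponential(1) draws; the RMS error
# of S_n / n decays like sigma / sqrt(n). Illustrative sketch, not from the book.
random.seed(0)

def sample_average(n):
    """The estimator S_n / n for the mean of n i.i.d. Exponential(1) draws."""
    return sum(random.expovariate(1.0) for _ in range(n)) / n

def rms_error(n, trials=200):
    """Root-mean-square error of S_n / n over repeated experiments."""
    return (sum((sample_average(n) - 1.0) ** 2 for _ in range(trials)) / trials) ** 0.5

for n in (10, 100, 1000):
    print(n, rms_error(n))  # the error shrinks roughly by sqrt(10) per row
```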

1.2 Stochastic Processes

Definition 1.6 (Stochastic process) A stochastic process is a collection of random variables $\{X_t(\omega) \,|\, t \in \mathbb{R}_+\}$, representing the evolution of some system of random values over time. For each sample point $\omega \in \Omega$, the mapping $t \mapsto X_t(\omega)$ is the corresponding sample path of the stochastic process.

Loosely speaking, a stochastic process $X_t(\omega)$ is a 'random' function of time: for each fixed time $t^*$, $X_{t^*}(\omega)$ is a random variable, and for each fixed sample point $\omega^*$, $X_t(\omega^*)$ is just a function of time.


The Wiener process $W(t)$, also known as the (standard) Brownian motion, is a commonly used stochastic process defined as follows.

Definition 1.7 (Wiener process) A real-valued stochastic process $W(\cdot)$ is called a Wiener process or Brownian motion [72] if

1. $W(0) = 0$ with probability one, i.e., almost surely (a.s.).
2. $W(t)$ is continuous with probability one.
3. $W(t)$ has independent increments with distribution $W(t) - W(s) \sim \mathcal{N}(0, t - s)$ for $0 \leq s < t$.

Due to the independent increments of $W(t)$, the sample paths of a Wiener process are continuous but, with probability one, nowhere differentiable. For each time step $\Delta t = t_i - t_{i-1} \geq 0$, the increment $\Delta W_i \equiv W(t_i) - W(t_{i-1})$ has the following property:

$$\mathbb{E}(\Delta W_i) = 0 \quad \text{and} \quad \mathbb{E}(\Delta W_i^2) = \Delta t_i. \tag{1.10}$$

One can also define a complex Wiener process with independent real and imaginary parts being real Wiener processes. A factor of $1/\sqrt{2}$ is added to keep the variance of the complex Wiener process the same as in the real case:

$$W_c(t) = \frac{1}{\sqrt{2}} \left( W_R(t) + i W_I(t) \right). \tag{1.11}$$
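The three defining properties translate directly into a simulation recipe. The sketch below, an illustration based on Definition 1.7 rather than code from the book, builds a Wiener path from independent $\mathcal{N}(0, \Delta t)$ increments and checks property (1.10) on an ensemble of paths.

```python
import math
import random

# Sample a Wiener path on [0, T] from independent Gaussian increments,
# so that W(t) - W(s) ~ N(0, t - s), consistent with property (1.10).
random.seed(1)

def wiener_path(T=1.0, n_steps=1000):
    """Return the sampled path W(t_0), ..., W(t_n) with W(0) = 0."""
    dt = T / n_steps
    w, path = 0.0, [0.0]
    for _ in range(n_steps):
        w += random.gauss(0.0, math.sqrt(dt))  # increment ~ N(0, dt)
        path.append(w)
    return path

# Check E(W(T)) = 0 and E(W(T)^2) = T over an ensemble of independent paths.
ensemble = [wiener_path()[-1] for _ in range(2000)]
mean_end = sum(ensemble) / len(ensemble)
second_moment = sum(w * w for w in ensemble) / len(ensemble)
print(mean_end, second_moment)  # close to 0 and to T = 1
```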

An important class of stochastic processes is the Markov process, in which the future state is independent of the past, given the present state. Below, instead of providing a rigorous mathematical definition for stochastic processes, the definition of the Markov assumption is formulated in terms of conditional probabilities for stochastic sequences.

Definition 1.8 (Markov property (of stochastic sequences)) Consider a joint probability density, $p(x_0, x_1, x_2, \ldots, x_n, \ldots)$, of a stochastic sequence obtained from the process $X$ by sampling it at an ordered sequence of times $t_0 < t_1 < t_2 < \cdots$. The process with the density $p$ has the Markov property if the conditional density satisfies:

$$p(x_n = y_n \,|\, x_{n-1} = y_{n-1}, \ldots, x_0 = y_0) = p(x_n = y_n \,|\, x_{n-1} = y_{n-1}), \tag{1.12}$$

which implies that the conditional probability density associated with the process $X$ is determined entirely by the state variable $x_{n-1}$ at the most recent time instant $t_{n-1}$ and is not dependent on the earlier history. It is clear that the Wiener process is a Markov process.

Depending on the sample space and the resolution in time, stochastic processes can be divided into the following four categories: the Markov chain (discrete in both time and space), the Markov jump process (discrete in space, continuous in time), the stochastic differential equation (continuous in both time and space), and the stochastic difference equation (discrete in time, continuous in space). Mixed versions of two or more stochastic processes are also seen in modeling many complex systems. The focus of this book will be on the two processes that involve continuous time and commonly appear in natural science, namely stochastic differential equations and Markov jump processes. Analogous tools have been developed for those with discrete time.

1.3 Stochastic Differential Equations (SDEs)

1.3.1 Itô Stochastic Integral

Denote by $W(t)$ a Wiener process. Given a non-anticipating function $G(x, t)$, the Itô stochastic integral is defined as [100]

$$\int_{t_0}^{t} G(x(s), s) \, dW(s) := \text{m.s.} \lim_{n \to \infty} \sum_{j=1}^{n} G(x(t_{j-1}), t_{j-1}) \left( W(t_j) - W(t_{j-1}) \right), \tag{1.13}$$

where $x$ may be a stochastic process, and the limit is taken in the mean square (m.s.) sense, i.e., $\text{m.s.} \lim_{n \to \infty} x_n = x \iff \lim_{n \to \infty} \langle (x_n - x)^2 \rangle = 0$. Here, 'non-anticipating' is also known as 'adapted' (see [146] for a rigorous definition). Roughly speaking, it means that at time $t$, the value of $G(x, t)$ should be known. For example, the process $G(x, t) = W_1$, where $G$ equals the value of a Wiener process at time $1$, is not adapted, since at time $t = 0.5$ the value of $G$ is not available. One of the key reasons for $G$ to be adapted is that the increment of $W_t$ will then be independent of $G$.

It is extremely important to note that the Itô stochastic integral requires the function $G$ on the right-hand side of (1.13) to be evaluated at the left endpoint $t_{j-1}$ of each subinterval $[t_{j-1}, t_j]$. This is very different from the standard integral of a smooth process in calculus, where the limit of the infinite summation converges to a unique value regardless of where the function $G$ is evaluated within $[t_{j-1}, t_j]$. However, since the Wiener process $W(t)$ is nowhere differentiable, the integral takes different values depending on where the function is evaluated. For example, if a centered scheme $\left( G(x(t_{j-1}), t_{j-1}) + G(x(t_j), t_j) \right)/2$ is adopted instead on the right-hand side of (1.13), then the formula is called the Stratonovich stochastic integral [99]. These different integrals have distinct meanings, although one can translate between them. Unless specified otherwise, the Itô stochastic integral (1.13) will be utilized in the remaining part of the book.

Assume $G$ and $H$ are non-anticipating functions, and $f$ is a smooth function of $W$. Some important properties of the Itô calculus are as follows:

- $\int_{t_0}^{t} G(t') [\, dW(t')]^2 = \int_{t_0}^{t} G(t') \, dt'$ (due to $dW^2(t) = dt$ in (1.10)),
- $\left\langle \int_{t_0}^{t} G(t') \, dW(t') \right\rangle = 0$,
- $\left\langle \int_{t_0}^{t} G(t') \, dW(t') \int_{t_0}^{t} H(t') \, dW(t') \right\rangle = \int_{t_0}^{t} \left\langle G(t') H(t') \right\rangle dt'$,
- $df(W(t), t) = \left( \dfrac{\partial f}{\partial t} + \dfrac{1}{2} \dfrac{\partial^2 f}{\partial W^2} \right) dt + \dfrac{\partial f}{\partial W} \, dW(t)$.

Based on the properties of the Wiener process, it can be deduced that

$$\int_{t_0}^{t} W(t') \, dW(t') = \frac{1}{2} \left[ W^2(t) - W^2(t_0) - (t - t_0) \right] \quad \text{and} \quad \left\langle \int_{t_0}^{t} W(t') \, dW(t') \right\rangle = 0.$$

This property of the Itô stochastic integral also points to the fundamental differences with regard to Lebesgue integration.

The Itô stochastic differential equation (SDE) can be introduced with the tool of the Itô stochastic integral in hand. Given the initial condition $X(t_0) = X_0$ and for $t > t_0 > 0$, the SDE is given as follows:

$$\frac{dX(t)}{dt} = A(X(t), t) + B(X(t), t) \dot{W}(t), \tag{1.14}$$

which is known as the physics formulation. Without the second term on the right-hand side, the above formula is an ordinary differential equation (ODE). In (1.14), the notation $\dot{W}(t)$ can be loosely understood as the derivative of a stochastic process, "$dW(t)/dt$". However, such an interpretation is only formal, as the Wiener process is a.s. nowhere differentiable. See [199] for a more rigorous definition of SDEs. Throughout the book, the Wiener process will be utilized in the SDE (1.14), where $\dot{W}(t)$ is also known as the (Gaussian) white noise. In addition to the physics formulation, an alternative way to write an SDE is as follows,

$$dX(t) = A(X(t), t) \, dt + B(X(t), t) \, dW(t), \tag{1.15}$$

which is the notation widely utilized in probability theory and numerical methods. Finally, for all $t$ and $t_0$, the solution of $X$ is given by

$$X(t) = X(t_0) + \int_{t_0}^{t} A(X(s), s) \, ds + \int_{t_0}^{t} B(X(s), s) \, dW(s), \tag{1.16}$$

where $\int_{t_0}^{t} B(X(s), s) \, dW(s)$ is the Itô integral in (1.13). Note that, for notational simplicity, $X(t)$, $W(t)$ and other functions, e.g., $A(X(t), t)$, are sometimes denoted as $X_t$, $W_t$ and $A_t$, respectively, in SDEs.
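As a preview of how (1.15) is integrated in practice, the following sketch uses the standard Euler-Maruyama discretization, which replaces $dt$ by a small step $\Delta t$ and $dW$ by an $\mathcal{N}(0, \Delta t)$ increment. The Ornstein-Uhlenbeck choice $A = -X$, $B = 0.5$ is an illustrative assumption, not an example taken from the book.

```python
import math
import random

# Euler-Maruyama sketch for dX = A(X, t) dt + B(X, t) dW, as in (1.15).
random.seed(2)

def euler_maruyama(A, B, x0, T=5.0, n_steps=1000):
    """Integrate dX = A(X, t) dt + B(X, t) dW on [0, T] starting from x0."""
    dt = T / n_steps
    x, t = x0, 0.0
    for _ in range(n_steps):
        dW = random.gauss(0.0, math.sqrt(dt))  # Wiener increment ~ N(0, dt)
        x = x + A(x, t) * dt + B(x, t) * dW    # drift plus diffusion update
        t += dt
    return x

# Ensemble of end points X(T); for this Ornstein-Uhlenbeck process the
# stationary variance is B^2 / 2 = 0.125, which the sample variance approaches.
ends = [euler_maruyama(lambda x, t: -x, lambda x, t: 0.5, x0=1.0) for _ in range(1000)]
mean = sum(ends) / len(ends)
var = sum((x - mean) ** 2 for x in ends) / len(ends)
print(mean, var)
```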

1.3.2 Itô's Formula

Itô's formula (also known as Itô's lemma) is an identity used to find the differential of a time-dependent function of a stochastic process. It serves as the stochastic calculus counterpart of the chain rule. Itô's formula is a fundamental tool in finding path-wise solutions, deriving the moment equations, and building reduced-order models of many SDEs. See Chap. 4 for concrete examples.

Assume $x_t$ satisfies the following SDE:

$$dx_t = a_t \, dt + b_t \, dW_t, \tag{1.17}$$

where $a_t := a(x(t), t)$ and $b_t := b(x(t), t)$ are nonlinear functions of both $x$ and $t$. Then for a smooth deterministic function $f$, Itô's formula reads

$$\begin{aligned}
df(x_t) &= f'(x_t) \, dx_t + \frac{1}{2} f''(x_t) (\, dx_t)^2 && (1.18a) \\
&= f'(x_t) \left( a_t \, dt + b_t \, dW_t \right) + \frac{1}{2} f''(x_t) \left( a_t \, dt + b_t \, dW_t \right)^2 && (1.18b) \\
&= f'(x_t) \left( a_t \, dt + b_t \, dW_t \right) + \frac{1}{2} f''(x_t) b_t^2 \, dt && (1.18c) \\
&= \left( f'(x_t) a_t + \frac{1}{2} f''(x_t) b_t^2 \right) dt + f'(x_t) b_t \, dW_t. && (1.18d)
\end{aligned}$$

In (1.18a), a Taylor expansion up to the second order is applied (the equality sign is still adopted here for notational simplicity despite the remaining higher-order terms being omitted). Different from the deterministic case, where $(\, dt)^2$ is a higher-order quantity compared with $dt$, the term $(\, dW_t)^2$ is of the same order as $dt$. In fact, according to the basic property of the Wiener process (1.10),

$$dW_t \sim \mathcal{N}(0, dt) \implies dW_t \cdot dW_t = dt \ (\text{variance}).$$

This gives the last term in (1.18c). Therefore, the second-order Taylor expansion is crucial in (1.18), since otherwise the term $\frac{1}{2} f''(x_t) b_t^2 \, dt$ would be missed by a first-order expansion. Itô's formula in (1.18) provides the correct expression for calculating differentials of composite functions that depend on the Wiener process.
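The importance of the Itô correction term can be checked numerically. The sketch below is an illustrative assumption, not code from the book: it simulates $dx_t = x_t \, dW_t$ (so $a_t = 0$, $b_t = x_t$), for which Itô's formula with $f(x) = \log x$ gives $d \log x_t = -\frac{1}{2} dt + dW_t$, hence $\mathbb{E}[\log x_T] = -T/2, $ even though $\mathbb{E}[x_T] = 1$; the naive chain rule, which drops the $\frac{1}{2} f'' b_t^2$ term, would predict $0$ instead.

```python
import math
import random

# Monte Carlo check of the Ito correction term in (1.18) for dx = x dW.
random.seed(3)

def simulate_log_x(T=1.0, n_steps=1000):
    """Euler-Maruyama path of dx = x dW with x(0) = 1; returns log x(T)."""
    dt = T / n_steps
    x = 1.0
    for _ in range(n_steps):
        x += x * random.gauss(0.0, math.sqrt(dt))  # multiplicative noise step
        x = max(x, 1e-12)  # guard against an (extremely rare) negative step
    return math.log(x)

mean_log = sum(simulate_log_x() for _ in range(2000)) / 2000
print(mean_log)  # near -T/2 = -0.5, not 0
```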

1.3.3 Fokker-Planck Equation

The Fokker-Planck equation is a partial differential equation (PDE) that describes the time evolution of the PDF associated with the underlying SDE. The solution of the Fokker-Planck equation is an essential part of data assimilation, ensemble prediction, and uncertainty quantification.

Recall the notation $\langle \cdot \rangle$ introduced in Sect. 1.1.3, which represents the statistical average. Denote by $x$ a scalar random variable with PDF $p(x)$ and by $f$ a smooth function of $x$. Then $\langle f(x) \rangle := \int_{-\infty}^{\infty} f(x) p(x) \, dx$ is the average of $f(x)$ with respect to the PDF $p(x)$. Now assume $x(t)$ satisfies the scalar SDE (1.17). The associated PDF is denoted by $p(x, t)$, which is assumed to vanish at infinity. Then

$$\langle f(x(t)) \rangle := \int_{-\infty}^{\infty} f(x(t)) \, p(x, t) \, dx \tag{1.19}$$

is a function of $t$, where the statistical average is taken at each fixed time instant $t$. Applying the statistical average (1.19) to Itô's formula (1.18) yields

$$\begin{aligned}
\frac{d}{dt} \langle f(x(t)) \rangle &= \left\langle a(x(t), t) \frac{\partial f(x)}{\partial x} + \frac{1}{2} b^2(x(t), t) \frac{\partial^2 f(x)}{\partial x^2} \right\rangle && (1.20a) \\
&= \int_{-\infty}^{\infty} \left( a(x(t), t) \frac{\partial f(x)}{\partial x} + \frac{1}{2} b^2(x(t), t) \frac{\partial^2 f(x)}{\partial x^2} \right) p(x, t) \, dx && (1.20b) \\
&= \int_{-\infty}^{\infty} f(x) \left( -\frac{\partial}{\partial x} \left[ a(x(t), t) p(x, t) \right] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \left[ b^2(x(t), t) p(x, t) \right] \right) dx, && (1.20c)
\end{aligned}$$

where (1.20a) comes from the fact that the statistical average of the last term in (1.18d) is zero according to the definition of the white noise, while (1.20c) follows from integration by parts, utilizing the vanishing of the probability density at infinity. On the other hand, the definition of the statistical average (1.19) also gives

$$\frac{d}{dt} \langle f(x(t)) \rangle = \int_{-\infty}^{\infty} f(x) \frac{\partial p(x, t)}{\partial t} \, dx. \tag{1.21}$$

Thus, in light of (1.20) and (1.21), the Fokker-Planck equation reads

$$\frac{\partial p(x, t)}{\partial t} = -\frac{\partial}{\partial x} \left[ a(x(t), t) p(x, t) \right] + \frac{1}{2} \frac{\partial^2}{\partial x^2} \left[ b^2(x(t), t) p(x, t) \right]. \tag{1.22}$$

It is important to highlight that, despite the nonlinearity in the underlying SDE (1.17) of the state variable $x$, the associated Fokker-Planck equation (1.22) is always a linear equation with respect to the probability density $p(x, t)$.

The Fokker-Planck equation is a convection-diffusion equation. The first term on the right-hand side of (1.22) is named the 'drift' term, representing the overall transport of the solution due to the deterministic part of the underlying SDE. The second term is named the 'diffusion' term, resulting from the randomness of the Brownian motion. These features are intuitively seen by considering the following two special cases of the underlying SDE (1.17). Assume the initial condition $p(x, 0) = \delta(x)$.

- If the underlying process (1.17) is completely deterministic, namely

$$dx = dt, \tag{1.23}$$

then the Fokker-Planck equation has only the drift term (in which case it is also known as the Liouville equation), and the solution reads

$$\frac{\partial p(x, t)}{\partial t} = -\frac{\partial p(x, t)}{\partial x} \implies p(x, t) = \delta(x - t). \tag{1.24}$$


The solution moves deterministically along the straight line $x = t$, which is consistent with solving the underlying system (1.23) directly.

- If the underlying process (1.17) contains only the white noise, namely

$$dx = dW_t, \tag{1.25}$$

then the Fokker-Planck equation has only the diffusion term, and the solution reads

$$\frac{\partial p(x, t)}{\partial t} = \frac{1}{2} \frac{\partial^2 p(x, t)}{\partial x^2} \implies p(x, t) = \frac{1}{\sqrt{2\pi t}} \, e^{-x^2/(2t)}. \tag{1.26}$$

The resulting Fokker-Planck equation in this situation becomes a heat equation. In fact, if multiple random realizations of (1.25) are considered simultaneously, then the PDF of the random walk spreads as time evolves, mimicking the heat diffusion process.

The Fokker-Planck equation can also be defined for multi-dimensional SDEs. Consider an $N$-dimensional process $\mathbf{x}$,

$$d\mathbf{x} = \boldsymbol{\mu}(\mathbf{x}, t) \, dt + \boldsymbol{\sigma}(\mathbf{x}, t) \, d\mathbf{W}_t,$$

where $\boldsymbol{\sigma}(\mathbf{x}, t)$ is an $N \times M$ matrix and $\mathbf{W}_t$ is an $M$-dimensional Wiener process. The associated Fokker-Planck equation reads

$$\frac{\partial p(\mathbf{x}, t)}{\partial t} = -\sum_{i=1}^{N} \frac{\partial}{\partial x_i} \left[ \mu_i(\mathbf{x}, t) p(\mathbf{x}, t) \right] + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\partial^2}{\partial x_i \partial x_j} \left[ D_{ij}(\mathbf{x}, t) p(\mathbf{x}, t) \right],$$

with $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_N)$ and $D_{ij}(\mathbf{x}, t) = \sum_{k=1}^{M} \sigma_{ik}(\mathbf{x}, t) \sigma_{jk}(\mathbf{x}, t)$ being the drift vector and the diffusion tensor, respectively.

It is worthwhile to remark that, as the Fokker-Planck equation (1.22) is a PDE, solving a high-dimensional Fokker-Planck equation with traditional numerical PDE solvers is often computationally unaffordable. In fact, most numerical solvers can only deal with such a PDE in 3- or 4-dimensional situations. However, the dimension of the state variables of many complex systems, such as operational weather and climate models, can be millions or billions in order to resolve the multiscale features of nature. Even for many systems of intermediate complexity, the dimension of the phase space is around a few hundred to a thousand, which is still far beyond the capability of traditional numerical PDE solvers. Many practical approaches have been developed to seek approximate solutions of the Fokker-Planck equation. The Monte Carlo simulation, which runs multiple simulations of the underlying SDE (1.17), is one widely used method. The collection of the resulting random paths often leads to a reasonable approximation of the time evolution of the PDF, provided that the dimension of the system is not too large and the PDF is not strongly non-Gaussian. Another approach is to build reduced-order models, which allow focusing on solving the PDF of only a subset of the state variables, thereby mitigating the computational cost.


In addition, efficient algorithms can be designed for systems with special structures to solve the Fokker-Planck equation in a relatively large dimension. The details of these strategies will be presented in the following chapters.
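The Monte Carlo strategy can be illustrated on the pure-diffusion case (1.25)-(1.26), where the exact answer is known. The sketch below is an illustration, not the book's MATLAB code: an ensemble of simulated random walks is compared against the heat-kernel density (1.26).

```python
import math
import random

# Monte Carlo approximation of the Fokker-Planck solution for dx = dW_t:
# each realization is a random walk built from N(0, dt) increments, and the
# ensemble of end points at time t should be distributed like N(0, t).
random.seed(5)

def brownian_endpoint(t, n_steps=100):
    """One realization of x(t) for dx = dW starting from x(0) = 0."""
    dt = t / n_steps
    return sum(random.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps))

t = 2.0
ensemble = [brownian_endpoint(t) for _ in range(20_000)]

# Compare the empirical mass in [-1, 1] with the analytic value implied by
# the density (1.26): P(|x| <= 1) = erf(1 / sqrt(2 t)).
empirical = sum(1 for x in ensemble if -1.0 <= x <= 1.0) / len(ensemble)
analytic = math.erf(1.0 / math.sqrt(2.0 * t))
print(empirical, analytic)  # the two values agree closely
```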

1.4 Markov Jump Processes

In this section, another simple stochastic process, the Markov jump process, is introduced; it is a powerful technique for modeling many natural phenomena. Unlike SDEs, where the state variables are continuous, Markov jump processes have discrete (and often finite) states. Therefore, the Markov jump process is a concise mathematical modeling tool that is particularly useful for describing phenomena whose diverse features can be categorized into distinct clusters.

1.4.1 A Motivating Example of Intermittent Time Series

Time series or spatiotemporal patterns with intermittent instability are observed in many natural phenomena, such as the Rossby wave with baroclinic instability in the atmosphere and ocean science [174, 204]. The intermittent time series is characterized by irregular switching between quiescent and active phases. Extreme events, non-Gaussian statistics, and chaotic behavior are often associated with the occurrence of intermittency. One typical example of the intermittent time series is shown in the first and second rows of Fig. 1.3, where the PDFs exhibit significant non-Gaussian features with fat tails. In practice, the observed intermittencies at large or resolved scales are often triggered by the perturbations at small or unresolved scales due to the energy transfer via the system’s nonlinearity. As will be seen in the subsequent chapters, a simple linear scalar SDE with a constant noise coefficient (the so-called additive noise) is insufficient to reproduce intermittent time series with non-Gaussian PDFs. One practical way to model a time series with intermittency is to include one additional stochastic process to characterize the damping coefficient of the otherwise linear scalar SDE with additive noise, known as the stochastic parameterization. Physically, this additional stochastic process mimics the behavior of one of the unresolved state variables, which now explicitly interacts with the observed variable via this additional stochastic process. Mathematically, if the original model and this extra process are treated as an entire system, their coupling induces a quadratic nonlinearity that can trigger chaotic and intermittent phenomena. In particular, the switching of the sign of the damping coefficient as time evolves is the underlying mechanism that creates intermittency and non-Gaussian statistics. There are many choices for such a stochastic process. One simple option is a two-state Markov jump process. 
After coupling with the otherwise linear scalar SDE, the two-dimensional model reads


Fig. 1.3 The two-dimensional system (1.27). Column a shows the time series. The first two rows display the real and imaginary parts of the signal u. The third row shows the two-state Markov jump process γ in (1.27b), which randomly switches between two distinct states with γ_− = −0.04 and γ_+ = 2.27. The fourth row shows the energy of u as a function of time, where the energy is defined as E(u) = (Re[u])² + (Im[u])². Column b shows the PDFs and the Gaussian fit of the intermittent time series u, where the probability (x-axis) is plotted on a logarithmic scale

du = [(−γ + iω_u)u + f] dt + σ_u dW_u,   (1.27a)
γ follows a two-state Markov jump process,   (1.27b)

where the oscillation frequency ω_u, the external forcing f, and the stochastic noise coefficient σ_u in (1.27a) are all constant parameters. If the damping γ > 0 is also a constant coefficient, then the statistics of u are Gaussian. In fact, the fundamental solution of the homogeneous part is u(t) = u(0) exp((−γ + iω_u)t), which decays to zero. A similar exponentially decaying effect is exerted on both the deterministic and stochastic forcing, which stabilizes the system without creating intermittency. Now, let γ be a two-state Markov jump process, which randomly switches between two distinct states with γ_− = −0.04 and γ_+ = 2.27. They correspond to the unstable and stable phases of the system, respectively; see the third row of Fig. 1.3. When γ = γ_+, the signal u is strongly damped and has a weak amplitude. In contrast, when γ = γ_−, the real part of the coefficient inside the exponential function of the fundamental solution becomes −γ_− > 0, and therefore the amplitude of u experiences an exponential increase before γ switches back to γ_+. The fourth row of Fig. 1.3 shows the energy of u as a function of time, defined as E(u) = (Re[u])² + (Im[u])². It indicates the growth of the energy when the system lies in the unstable phase γ_−. Therefore, the random switching between the stable and unstable phases provides the mechanism of intermittent instability. In addition, the frequent occurrence of large amplitudes of u in the unstable phase increases the probability of extreme events, which leads to a PDF with fat tails.

As a final remark, the two-dimensional model in (1.27) can be utilized to characterize each spectral (e.g., Fourier) coefficient of a complex stochastic PDE system, allowing for intermittency and extreme events in the spatiotemporal patterns. Some sophisticated applications of Markov jump processes for modeling complex systems include building stochastic multicloud models for tropical convection [148], characterizing the refined structures of the Madden-Julian Oscillation (MJO) and the Monsoon Intraseasonal Oscillation (MISO) [247], parameterizing stochastic wind bursts in a coupled air-sea model for the El Niño-Southern Oscillation (ENSO) [248], and simulating microscopic lattice systems in materials science and biomolecular dynamics [147].
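The intermittent mechanism described above can be illustrated numerically. Below is a minimal sketch (not the book's code) of an Euler-Maruyama simulation of (1.27) in which the damping γ jumps between the two states quoted in the text; the values of ω_u, f, σ_u and the switching rates ν, μ are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of simulating the system (1.27): a complex scalar SDE
# whose damping gamma is a two-state Markov jump process, discretized
# by Euler-Maruyama.  omega_u, f, sigma_u and the rates nu, mu are
# illustrative assumptions; gamma_minus/gamma_plus are from the text.
rng = np.random.default_rng(0)

dt, n = 1e-3, 50_000
omega_u, f, sigma_u = 1.0, 0.5, 0.5
gamma_minus, gamma_plus = -0.04, 2.27
nu, mu = 0.1, 0.5            # rates for s_st -> s_un and s_un -> s_st

u = np.zeros(n, dtype=complex)
gamma = gamma_plus           # start in the stable (strongly damped) state
for k in range(n - 1):
    # Local transition probabilities: jump with probability ~ rate * dt
    if gamma == gamma_plus and rng.random() < nu * dt:
        gamma = gamma_minus
    elif gamma == gamma_minus and rng.random() < mu * dt:
        gamma = gamma_plus
    dW = np.sqrt(dt) * rng.standard_normal()
    u[k + 1] = u[k] + ((-gamma + 1j * omega_u) * u[k] + f) * dt + sigma_u * dW

energy = np.abs(u) ** 2      # E(u) = (Re[u])^2 + (Im[u])^2
```

Plotting `energy` against time reproduces the qualitative behavior of the fourth row of Fig. 1.3: long quiescent stretches punctuated by bursts while γ = γ_−.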

1.4.2 Finite-State Markov Jump Process

Definition 1.9 (Finite-state Markov jump process) A finite-state Markov jump process is a continuous-time stochastic process which, at any time t, takes one of the discrete values in the finite set of states S = {s_i}_{i∈N}, with N = {1, 2, ..., N}. The set S is nontrivial if at least one of the s_i is different from the rest.

In contrast to SDEs, nontrivial Markov jump processes have discontinuous paths and do not satisfy the Fokker-Planck equation. Instead, the probability densities associated with such processes satisfy the so-called Master equation, which is derived from the differential version of the Kolmogorov equation upon the requirement that the process transitions between isolated states. The focus below is on deriving the Master equation for the simplest possible case of a two-state Markov jump process, which is particularly interesting because of its exact solvability.

1.4.3 Master Equation

Consider now the simplest case of a Markov jump process X(t) which, at any time t, takes one of only two values from the set of states S = {s_st, s_un}, where, for simplicity, s_st and s_un are named the stable and unstable states, respectively. Let ν be the rate of change from the stable state s_st to the unstable state s_un. Similarly, denote by μ the rate of change from the unstable state to the stable state. In addition to the Markov property of X(t) in (1.12), the time homogeneity of the process is assumed:

P(X(t) = s_i | X(τ) = s_j) = P(X(t − τ) = s_i | X(0) = s_j),   ∀ t > τ > 0.   (1.28)

Thus, the process X is fully determined by the transition probabilities P(X(t) = s_i | X_0 = s_j), i, j ∈ {st, un}. The rates of change ν and μ define the following local transition probabilities, with Δt being a small time increment:

p_Δt(s_st, s_un) := P(X_{t+Δt} = s_un | X_t = s_st) = νΔt + o(Δt),
p_Δt(s_un, s_st) := P(X_{t+Δt} = s_st | X_t = s_un) = μΔt + o(Δt),
p_Δt(s_st, s_st) := P(X_{t+Δt} = s_st | X_t = s_st) = 1 − νΔt + o(Δt),
p_Δt(s_un, s_un) := P(X_{t+Δt} = s_un | X_t = s_un) = 1 − μΔt + o(Δt).   (1.29)

Using (1.29) and the Markov property (1.12), the probability of finding the process in the unstable state at t + Δt, starting from the stable state at time 0, is

p_{t+Δt}(s_st, s_un) = p_t(s_st, s_un) p_Δt(s_un, s_un) + p_t(s_st, s_st) p_Δt(s_st, s_un)
                     = p_t(s_st, s_un)(1 − μΔt) + p_t(s_st, s_st) νΔt + o(Δt),

where the two paths passing through the unstable and the stable states at time t, respectively, have been taken into account to form the entire probability. After regrouping terms, a finite difference equation is obtained:

[p_{t+Δt}(s_st, s_un) − p_t(s_st, s_un)] / Δt = −μ p_t(s_st, s_un) + ν p_t(s_st, s_st) + o(1).

In the limit Δt → 0, with the assumption that p is at least C¹ with respect to t, one obtains the following differential equation:

∂p_t(s_st, s_un)/∂t = −μ p_t(s_st, s_un) + ν p_t(s_st, s_st),   initial condition: p_t(s_st, s_un)|_{t=0} = 0.

Applying a similar procedure to the remaining transition probabilities yields

∂P̂_t/∂t = P̂_t A,   with   P̂_t|_{t=0} = I,   (1.30)

where I is the 2 × 2 identity matrix,

P̂_t = [ p_t(s_st, s_st)   p_t(s_st, s_un) ;  p_t(s_un, s_st)   p_t(s_un, s_un) ]   and   A = [ −ν   ν ;  μ   −μ ].

Given that (1.30) is a linear system with constant coefficients, the probabilities for the system to be in the stable and unstable regimes are easily found as

P_t(s_st, s_st) = μ/(ν + μ) + [ν/(ν + μ)] e^(−(ν+μ)t),   P_t(s_un, s_st) = [μ/(ν + μ)] (1 − e^(−(ν+μ)t)),
P_t(s_un, s_un) = ν/(ν + μ) + [μ/(ν + μ)] e^(−(ν+μ)t),   P_t(s_st, s_un) = [ν/(ν + μ)] (1 − e^(−(ν+μ)t)),

so that the two-state Markov jump process X(t) approaches a stationary process with


P_eq(s_st, s_st) = P_eq(s_un, s_st) = μ/(ν + μ),
P_eq(s_st, s_un) = P_eq(s_un, s_un) = ν/(ν + μ).   (1.31)
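The equilibrium distribution (1.31) can be verified by a direct Monte Carlo simulation. The sketch below (an illustration, not the book's code) draws exponential residence times and measures the fraction of time the process spends in the stable state; the rate values are arbitrary assumptions.

```python
import numpy as np

# Monte Carlo check of (1.31): simulate a two-state Markov jump process
# by drawing exponential residence times (rate nu to leave the stable
# state, mu to leave the unstable one) and measure the fraction of time
# spent in the stable state.  The rate values are illustrative.
rng = np.random.default_rng(1)
nu, mu = 0.3, 0.7
T = 200_000.0

t, in_stable = 0.0, True
time_in_stable = 0.0
while t < T:
    rate = nu if in_stable else mu
    tau = min(rng.exponential(1.0 / rate), T - t)  # residence time
    if in_stable:
        time_in_stable += tau
    t += tau
    in_stable = not in_stable

empirical = time_in_stable / T
theory = mu / (nu + mu)      # P_eq(s_st) from (1.31)
```

For a long simulation the empirical occupation fraction approaches μ/(ν + μ), independent of the initial state.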

Finally, based on (1.31), the expectation of X(t) at the stationary stage is given by

E[X] = (μ s_st + ν s_un)/(ν + μ).   (1.32)

1.4.4 Switching Times

By definition, the two-state Markov jump process resides in one of the two states s_st or s_un, and the switching rates between these states are ν (for the s_st → s_un transition) and μ (for the s_un → s_st transition). It is then important to understand the time instants when the process switches states. To this end, suppose that the system transitions to the stable state at some time t = t_0. Define the time during which the system stays in the stable state as

T_st = t* − t_0,   t* = inf_{t ∈ [t_0, ∞)} {t : X(t) = s_un}.

Here, t* is the switching time from the stable to the unstable state, and T_st is the residence time in the stable state. Clearly, both t* and T_st are random variables. The probability that the process remains in s_st at time t = t_0 + t is

P(T_st > t) = P(t* > t_0 + t) = P(X(t_0 + t) = s_st | X(t_0) = s_st).

Then, one can write

P(X(t_0 + t + Δt) = s_st | X(t_0) = s_st)
 = P(X(t_0 + t) = s_st | X(t_0) = s_st) P(X(t_0 + t + Δt) = s_st | X(t_0 + t) = s_st)
 = P(X(t_0 + t) = s_st | X(t_0) = s_st) p_Δt(s_st, s_st)   (1.33)
 = P(X(t_0 + t) = s_st | X(t_0) = s_st) (1 − νΔt + o(Δt)),

where the first equality is due to the Markov property of the process (1.12), the second equality is due to the homogeneity (1.28), and the last one follows from (1.29). In the limit Δt → 0, the following Master equation is reached:

(d/dt) P(X(t_0 + t) = s_st | X(t_0) = s_st) = −ν P(X(t_0 + t) = s_st | X(t_0) = s_st),   (1.34)

the solution of which is

P(X(t_0 + t) = s_st | X(t_0) = s_st) = e^(−νt).   (1.35)


This implies that T_st satisfies an exponential distribution with mean 1/ν, and thus

P(T_st < t) = 1 − e^(−νt).   (1.36)
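The exponential residence-time law (1.36) can be checked empirically. In a small-step discretization, the number of steps before the first jump out of the stable state is geometric with success probability νΔt, so the residence time is approximately Δt times a geometric random variable; the sketch below uses an illustrative rate value.

```python
import numpy as np

# Empirical check of (1.36): the residence time T_st in the stable
# state is exponential with mean 1/nu.  The number of small steps
# before the first jump is geometric with success probability nu*dt
# (cf. (1.29)), so T_st ~ dt * Geometric(nu*dt).  nu is illustrative.
rng = np.random.default_rng(2)
nu, dt, n_samples = 0.5, 1e-3, 100_000

steps = rng.geometric(nu * dt, size=n_samples)  # steps until first jump
T_st = dt * steps                               # approximate residence times

mean_T = T_st.mean()                 # theory: 1/nu = 2
frac_below_1 = (T_st < 1.0).mean()   # theory: 1 - exp(-nu)
```

Both the sample mean and the empirical distribution function match the exponential law with rate ν as Δt → 0.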

Similarly, the time T_un the system spends in the unstable regime before switching to the stable one is a random variable with the exponential distribution function

P(T_un < t) = 1 − e^(−μt).   (1.37)

1.5 Chaotic Systems

The model in (1.27) provides a simple example of a so-called chaotic system. The path-wise solutions of such systems are sensitive to the initial conditions and are predictable only for a short period. In other words, even without stochastic noise, the forecast trajectory will soon become very distinct from the truth if the initial value is slightly perturbed. The underlying reason is that the system is intermittently unstable and has a positive Lyapunov exponent, a quantity that characterizes the rate of separation of infinitesimally close trajectories. While the rigorous definition and analysis of the Lyapunov exponent require additional preliminaries and are not the focus of this book, an intuitive explanation of the chaotic behavior via trajectory separation is as follows. Assume that a perturbation δu(0) is added to the initial condition u(0). Further, assume γ = γ_− < 0 from the initial time up to a certain instant t. Then, omitting the stochastic noise in (1.27), the true solution and the solution with the perturbed initial condition are u(t) = u(0) exp((−γ_− + iω_u)t) and u_δ(t) = (u(0) + δu(0)) exp((−γ_− + iω_u)t), respectively. Their difference describes the separation of the two trajectories:

δu(t) = δu(0) exp((−γ_− + iω_u)t).   (1.38)

Since γ_− < 0, the difference δu(t) between the two trajectories experiences an exponential increase up to time t. This indicates an exponentially fast separation of the two trajectories, even with a small initial gap. In the presence of stochastic noise (or when the system becomes more turbulent), the separation of the trajectories is expected to become even stronger and faster, which indicates a barrier to practical path-wise forecasts. It is worthwhile to introduce the Lorenz 63 (L63) model:

dx/dt = σ(y − x),
dy/dt = x(ρ − z) − y,
dz/dt = xy − βz.   (1.39)


Fig. 1.4 Simulation of the L63 model. Panel a: two different simulations of the L63 model (1.39) with parameters σ = 10, ρ = 28, and β = 8/3; the initial values of the two simulations differ only slightly. Panel b: the phase plots of the L63 model

The L63 model was proposed by Lorenz in 1963 [169] as a simplified mathematical model for atmospheric convection. In (1.39), x is proportional to the rate of convection, y to the horizontal temperature variation, and z to the vertical temperature variation. The constants σ, ρ, and β are system parameters proportional to the Prandtl number, the Rayleigh number, and certain physical dimensions of the layer itself [241]. The L63 model is a well-known chaotic system explaining the butterfly effect; interestingly, the shape of the solution plotted in phase space also resembles a butterfly. The L63 model is widely used as a testbed for numerical algorithms that predict chaotic signals.

Panel (a) of Fig. 1.4 shows the chaotic behavior of the L63 model with the classical parameters σ = 10, ρ = 28, and β = 8/3 [169]. It is seen that, even with a tiny difference at the initial time instant, the trajectories of the two simulations (blue and red curves) diverge quickly from each other and become completely distinct after about 5 time units. Therefore, forecasting the model trajectories is an intrinsically challenging problem. Panel (b) shows the phase plots of the L63 model, which exhibit the classic 'butterfly' attractor structure that has become an iconic image of chaotic systems.

In practice, the perfect initial condition is seldom known, and large uncertainty and bias often exist at the initial stage of the forecast. Therefore, obtaining as accurate an initial state as possible is an essential task for extending the skillful forecast of chaotic and turbulent systems. Data assimilation, which combines models with noisy observations, is a widely utilized tool to improve the initialization of the forecast and will be discussed in Chap. 5. In addition, due to the inherent difficulties in forecasting a single trajectory and the intrinsic uncertainties, the path-wise forecast may not be the most appropriate choice for predicting these complex systems.
An alternative approach, known as the ensemble forecast, characterizes the time evolution of the PDF and quantifies the forecast uncertainty; it is often a more suitable forecast method for chaotic or turbulent systems. Chapter 6 contains the details of the ensemble forecast and related topics such as internal predictability and model error. Overall, there is a strong need for developing stochastic models and methods to handle chaotic and turbulent systems, the details of which will be discussed in the remainder of this book.

2 Introduction to Information Theory

2.1 Shannon's Entropy and Maximum Entropy Principle

Shannon's entropy is a simple yet powerful measure for quantifying the uncertainty of a random variable and has wide applications in modeling complex systems.

2.1.1 Shannon's Intuition from the Theory of Communication

Shannon's intuition for the entropy concept originated from representing a "word" in a message as a sequence of binary digits of length n. Denote by A_{2^n} the set of all words of length n, which has 2^n = N elements. Equivalently, the amount of information needed to characterize one element is n = log₂ N. See the following simple example with n = 7 and N = 128:

1 0 1 0 1 0 1
0 1 1 1 0 1 0
...

Following the same argument, the amount of information needed to characterize an element of any set A_N is n = log₂ N for general N. Now consider a set A = A_{N_1} ∪ ··· ∪ A_{N_k}, where the sets A_{N_i} are pairwise disjoint, with A_{N_i} having N_i elements. Set p_i = N_i/N, where N = Σ_i N_i. See the following simple illustration.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_2


Example: A = A_{N_1} ∪ A_{N_2} with N = 24, N_1 = 2³ = 8, N_2 = 2⁴ = 16, so that p_1 = 8/24 and p_2 = 16/24.

Assume an element of A is known to belong to some A_{N_i}. To completely determine this element, log₂ N_i additional units of information are needed. Therefore, the average amount of information needed to determine an element is given by

Σ_i (N_i/N) log₂ N_i = Σ_i (N_i/N) log₂((N_i/N) · N) = Σ_i p_i log₂ p_i + log₂ N,   (2.1)

where the 'average' is in the sense of the probability that the element belongs to a certain A_{N_i}. Recall that log₂ N is the information needed to determine an element of the set A if it is unknown to which A_{N_i} the element belongs. Thus, the corresponding average lack of information is −Σ_i p_i log₂ p_i.

2.1.2 Definition of Shannon's Entropy

Following Shannon's intuition, the formal definition of entropy is as follows.

Definition 2.1 (Shannon's entropy (discrete case)) Let p be a finite, discrete probability measure on the sample space A = {a_1, ..., a_n},

PM_n(A) = { p = Σ_{i=1}^n p_i δ_{a_i},  p_i ≥ 0,  Σ_{i=1}^n p_i = 1 }.

Shannon's entropy S(p) of the probability p is defined as

S(p) = S(p_1, ..., p_n) = −Σ_{i=1}^n p_i ln p_i.   (2.2)

Shannon’s entropy is unique under certain assumptions [176]. To provide a connection between the entropy and the amount of uncertainty of a system, let us compute Shannon’s entropy for different cases in Fig. 2.1.


Fig. 2.1 Systems with different probabilities for computing Shannon’s entropy

(a) S(p) = −(0.5 ln 0.5 + 0.5 ln 0.5) = 0.6931
(b) S(p) = −(1/3 ln(1/3) + 1/3 ln(1/3) + 1/3 ln(1/3)) = 1.0986
(c) S(p) = −(0.2 ln 0.2 + 0.8 ln 0.8) = 0.5004
(d) S(p) = −(0.1 ln 0.1 + 0.9 ln 0.9) = 0.3251
(e) S(p) = −1 ln 1 = 0
(f) S(p) = −lim_{n→∞} n · (1/n) ln(1/n) = lim_{n→∞} ln n = ∞
First, if a random guess is made, then the chance of obtaining the correct state is 50% for Case (a) and 33% for Case (b). This means the answer for Case (b) is more uncertain than that for Case (a), and the corresponding Shannon entropies behave consistently with the uncertainty. Next, although Cases (a), (c), and (d) all have only two choices, the system has an increasing tendency to reach a_2 from (a) to (c) to (d). In other words, the system becomes more and more deterministic, and both the associated uncertainty and Shannon's entropy decrease. The limiting situation is Case (e), where the answer is definite without any uncertainty and the corresponding Shannon entropy is zero. Case (f) shows the other extreme, where Shannon's entropy increases to infinity as the number of states, and hence the uncertainty, grows without bound. Therefore, one can conclude that Shannon's entropy increases with the uncertainty.

Shannon's entropy can also be generalized to continuous measures.

Definition 2.2 (Shannon's entropy (continuous case)) Let ρ(x) be a continuous probability density, which satisfies the two requirements ρ(x) ≥ 0 and ∫ρ(x) dx = 1. Then the continuous form of Shannon's entropy is given by

S(ρ) = −∫ ρ ln(ρ) dx.   (2.3)
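The discrete definition (2.2) is straightforward to evaluate numerically. The sketch below computes Shannon's entropy for the distributions of Cases (a)-(e) of Fig. 2.1, using the convention 0 ln 0 = 0.

```python
import numpy as np

# Shannon's entropy (2.2) evaluated for the discrete distributions of
# Cases (a)-(e) in Fig. 2.1.
def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # the convention 0 ln 0 = 0
    return -np.sum(p * np.log(p))

cases = {
    "a": [0.5, 0.5],
    "b": [1 / 3, 1 / 3, 1 / 3],
    "c": [0.2, 0.8],
    "d": [0.1, 0.9],
    "e": [1.0],
}
values = {k: shannon_entropy(v) for k, v in cases.items()}
```

The computed values reproduce the numbers listed above: ln 2 ≈ 0.6931 for (a), ln 3 ≈ 1.0986 for (b), and 0 for the deterministic Case (e).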

2.1.3 Shannon's Entropy in Gaussian Framework

Proposition 2.1 (Shannon's entropy in the Gaussian framework) If ρ ∼ N(μ, R) is an n-dimensional Gaussian distribution, then Shannon's entropy has the explicit form

S(ρ) = (n/2)(1 + ln 2π) + (1/2) ln det(R).   (2.4)

It is clear from (2.4) that the uncertainty is entirely determined by the determinant of the covariance and is independent of the mean. This mathematical statement is consistent with intuition. A 'larger' covariance means a wider range of possible values of the state variable. In contrast, two PDFs that differ only by a shift of the mean state have the same profile and should carry an identical amount of uncertainty.

The derivation of (2.4) is straightforward. Consider, for example, the one-dimensional case. Let

ρ(x) = (1/√(2πR)) exp(−(x − μ)²/(2R)).   (2.5)

Then the explicit formula (2.4) is obtained by plugging (2.5) into the general definition (2.3):

−∫ ρ(x) ln ρ(x) dx = −∫ ρ(x) [−(1/2) ln(2πR) − (x − μ)²/(2R)] dx
 = (1/2) ln(2πR) + (1/(2R)) ∫ (x − μ)² ρ(x) dx = (1/2) ln(2π) + (1/2) ln R + 1/2.

In practice, the formula in (2.4) is often utilized to compute the uncertainty of the Gaussian fit of a non-Gaussian PDF. Although the resulting value does not equal the actual uncertainty of the non-Gaussian PDF, it is usually sufficient for a qualitative estimate in many applications. Two direct applications of (2.4) will be shown in Chaps. 6 and 9, for studying the predictability of a complex system and for learning the underlying dynamics via a causality-based data-driven approach, respectively.
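The one-dimensional formula (2.4) can be verified by a direct numerical quadrature of (2.3). In the sketch below the mean and variance are arbitrary illustrative values; shifting the mean leaves the entropy unchanged, as the formula predicts.

```python
import numpy as np

# Numerical check of (2.4) in one dimension: the entropy of N(mu, R)
# equals (1/2)(1 + ln(2*pi)) + (1/2) ln R.  The mean and variance
# below are arbitrary illustrative values.
mu, R = 3.0, 2.5
x = np.linspace(mu - 30.0, mu + 30.0, 600_001)
dx = x[1] - x[0]
rho = np.exp(-(x - mu) ** 2 / (2 * R)) / np.sqrt(2 * np.pi * R)

entropy_numeric = -np.sum(rho * np.log(rho)) * dx   # Riemann sum of (2.3)
entropy_formula = 0.5 * (1 + np.log(2 * np.pi)) + 0.5 * np.log(R)
```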

2.1.4 Maximum Entropy Principle

In addition to appearing in the theory of communication, entropy is also a well-known concept in thermodynamics, where it is commonly associated with the amount of disorder or chaos. Specifically, the second law of thermodynamics (in Rudolf Clausius' statement) indicates that the entropy of the universe tends to a maximum. This famous physical law is consistent with the notion of entropy from information theory, which decreases as more information is included. These facts provide an extremely useful constrained optimization criterion for modeling complex systems based on incomplete knowledge of nature: the most plausible statistical state of a complex system is the one that satisfies the given constraints and meanwhile maximizes the entropy.

The formal definition of the maximum entropy principle is as follows. First, with a given probability measure p ∈ PM(A), various statistical measurements can be made with respect to p. The expected value, or statistical measurement, of f with respect to p is given by

⟨f⟩_p = Σ_{i=1}^n f(a_i) p_i,   (2.6)

where the notation ⟨·⟩_p in (2.6) is the same statistical average as the one defined in Sects. 1.1.2-1.1.3, except that the probability in (2.6) is in discrete form. A practical issue is to look for the least biased probability distribution consistent with certain statistical constraints,

C_L = { p ∈ PM(A) | ⟨f_j⟩_p = F_j, 1 ≤ j ≤ L }.

Definition 2.3 (Maximum entropy principle) The least biased probability distribution p_L given the constraints C_L is the maximum entropy distribution,

max_{p ∈ C_L} S(p) = S(p_L),   p_L ∈ C_L.   (2.7)

Note that the maximum entropy principle (2.7) also implies that the uncertainty decreases as more information is included,

S(p_L) ≤ S(p_{L′}),   for L′ < L,

as the search space becomes narrower. The following examples show how to apply the maximum entropy principle to find the most probable (least biased) distribution. They are also instructive for understanding the features of the resulting distributions under different situations.

Example. Find the least biased probability distribution p on A = {a_1, ..., a_n} with no additional constraint. First, it is worthwhile to highlight that although 'no additional constraint' is imposed, the total probability being 1 is always an intrinsic constraint in applying the maximum entropy principle. Therefore, the optimization problem becomes:

Maximize   S(p_1, ..., p_n) = −Σ_{i=1}^n p_i ln p_i,
Subject to   Σ_{i=1}^n p_i = 1   and all p_i ≥ 0.


The method of Lagrange multipliers can be applied to solve such a constrained optimization problem. Construct the Lagrange function:

L = −Σ_{i=1}^n p_i ln p_i + λ Σ_{i=1}^n p_i,

i=1

where λ is the Lagrange multiplier associated with the constraint that the total probability is 1. The optimum p* is reached by taking partial derivatives with respect to the state variables (as well as the Lagrange multiplier) and finding the zeros, namely ∂L/∂p_i = 0 for i = 1, ..., n, which results in ln p_i* + 1 − λ = 0 for i = 1, ..., n. This implies that all the probabilities p_i* are equal. Then, in light of the total probability being 1, the solution reads p_i* = 1/n, which is a uniform distribution! This is intuitive, as the uniform distribution is the 'fairest' assignment if no additional information is available.

Example. Find the least biased probability distribution p on A = {a_1, ..., a_n} subject to the r + 1 constraints (r + 1 ≤ n):

F_j = ⟨f_j⟩_p = Σ_{i=1}^n f_j(a_i) p_i,   j = 1, ..., r,   Σ_{i=1}^n p_i = 1.

Following the procedure in the previous example, the method of Lagrange multipliers leads to the optimal solution p_i* = exp(−Σ_{j=1}^r λ_j f_j(a_i) − (λ_0 + 1)), which belongs to the exponential family. The constraint that the probabilities p_i* sum to 1 can be utilized to eliminate the multiplier λ_0:

p_i* = exp(−Σ_{j=1}^r λ_j f_j(a_i)) / Σ_{i=1}^n exp(−Σ_{j=1}^r λ_j f_j(a_i)).

However, the remaining Lagrange multipliers λ_j have to be determined through the constraint equations, which are, in general, non-trivial to solve analytically. Numerical or approximate solutions are often utilized to seek the maximum entropy solution in such a situation.

The maximum entropy principle holds for the continuous case as well.

Example. The least biased density p with the constraints

ā = ∫ a p(a) da,   σ² = ∫ (a − ā)² p(a) da,

is the Gaussian distribution

p*(a) = (1/(√(2π) σ)) exp(−(a − ā)²/(2σ²)).   (2.8)


The derivation requires knowledge of the variational derivative [176]. The result in (2.8) supports the well-known idea from the central limit theorem that Gaussian densities are the universal distributions with given first and second moments.

Example. The least biased probability distribution p subject to the r + 1 constraints

F_j = ⟨f_j⟩_p = ∫ f_j(a) p(a) da,   j = 1, ..., r,   ∫ p(a) da = 1,

is given by

p*(a) = exp(−Σ_{j=1}^r λ_j f_j(a)) / ∫ exp(−Σ_{j=1}^r λ_j f_j(a)) da.   (2.9)

The maximum entropy principle has many vital applications. In practice, the first few moments (usually the leading two or sometimes the leading four) are relatively easy to obtain from data, whereas the accuracy of measuring higher-order moments often suffers from noise. Therefore, the maximum entropy principle can be utilized to construct an approximate PDF based on these measurements. Similarly, although the Fokker-Planck equation (1.22) is a PDE and is computationally expensive to solve, the moment equations associated with the same underlying SDE are simply ODEs, which are much easier to solve and are amenable to higher-dimensional systems. Therefore, the solution from the moment equations can be utilized to approximate the time evolution of the PDFs with the help of the maximum entropy principle.
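The numerical determination of the Lagrange multipliers mentioned above can be illustrated with a single moment constraint. The sketch below uses the exponential-family form (2.9) on the illustrative state set A = {1, ..., 6} with the assumed target mean 4.5, finding the multiplier by bisection; both the state set and the target are examples, not from the book.

```python
import numpy as np

# Sketch of the maximum entropy principle with one moment constraint:
# on the illustrative set A = {1,...,6}, find p* maximizing entropy
# subject to <a>_p = 4.5.  By the exponential-family form (2.9),
# p_i* ∝ exp(-lam * a_i); the multiplier lam is found by bisection,
# since the constrained mean is monotonically decreasing in lam.
a = np.arange(1, 7, dtype=float)
target_mean = 4.5

def mean_for(lam):
    w = np.exp(-lam * a)
    return (w / w.sum()) @ a

lo, hi = -5.0, 5.0          # mean_for(lo) > target_mean > mean_for(hi)
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if mean_for(mid) > target_mean:
        lo = mid            # mean still too large: increase lam
    else:
        hi = mid

lam = 0.5 * (lo + hi)
p_star = np.exp(-lam * a) / np.exp(-lam * a).sum()
```

Since the target mean exceeds the uniform mean 3.5, the multiplier is negative and the resulting distribution tilts toward the larger states while remaining as 'flat' as the constraint allows.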

2.1.5 Coarse Graining and the Loss of Information

Coarse graining originated in statistical physics for processes that involve averaging over small scales. Nowadays, coarse graining appears widely in modeling and numerically solving complex systems, where low-resolution mesh grids are exploited to reduce computational costs. However, the feedback from the small scales to the resolved large scales often plays a vital role in nonlinear systems. Thus, the loss of the small-scale information may lead to significant errors in solving the coarse-grained models. Parameterizations or closure models are often utilized to compensate for the otherwise ignored small-scale contributions, which will be discussed in the following few chapters. On the other hand, exploiting the first few moments to approximate the PDF via the maximum entropy principle can also be regarded as a coarse-graining procedure, as the more refined information in the higher-order moments is ignored in the reconstructed PDF. The review paper [150] includes several interesting topics related to coarse-graining and Shannon's entropy.

Intuitively, coarse-graining means that information is lost, so that the entropy should increase. The following one-dimensional example provides a rigorous mathematical justification for such an argument. Let λ_i < λ_{i+1} for integers i, −∞ < i < ∞, with equidistant spacing Δλ, and let ρ(λ) be a probability measure. Define the discrete coarse-grained probability measure p by

p = Σ_i ρ_i δ_{λ_i},   with   ρ_i = ∫_{λ_i}^{λ_{i+1}} ρ(λ) dλ.   (2.10)

The following general inequality for Δλ ≤ 1 shows that p contains less information than ρ(λ) due to the coarse-graining,

S(p) ≥ S(ρ) + ln((Δλ)^(−1)),   (2.11)

where S(p) = −Σ_i ρ_i ln ρ_i and S(ρ) = −∫ ρ ln ρ dλ. The result in (2.11) can be shown by first making use of the mean value theorem,

(ln(x) − ln(y))/(x − y) = 1/ξ ≥ 1/x,   for x > ξ > y,

and then rewriting it in the form x ln(x) ≥ x ln(y) + x − y. Now set y = ρ_i/Δλ and x = ρ(λ) and integrate over the interval (λ_i, λ_{i+1}):

∫_{λ_i}^{λ_{i+1}} ρ(λ) ln ρ(λ) dλ ≥ ρ_i ln ρ_i − ρ_i ln(Δλ).

Summing these inequalities over i arrives at (2.11) (note the negative sign in Shannon's entropy).
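The inequality (2.11) can be illustrated numerically by coarse-graining a standard Gaussian density onto bins of width Δλ; the bin width below is an illustrative choice, and the bin probabilities are approximated by the midpoint rule.

```python
import numpy as np

# Numerical illustration of (2.11): coarse-grain a standard Gaussian
# density onto bins of width dlam and compare the entropy of the bin
# probabilities with the differential entropy of the density.  The bin
# width is an illustrative assumption.
dlam = 0.1
edges = np.arange(-10.0, 10.0 + dlam, dlam)
centers = 0.5 * (edges[:-1] + edges[1:])

# Bin probabilities rho_i approximated by the midpoint rule
rho_i = np.exp(-centers ** 2 / 2) / np.sqrt(2 * np.pi) * dlam
rho_i = rho_i / rho_i.sum()

S_p = -np.sum(rho_i * np.log(rho_i))
S_rho = 0.5 * (1 + np.log(2 * np.pi))   # differential entropy of N(0,1)
bound = S_rho + np.log(1.0 / dlam)      # right-hand side of (2.11)
```

For small Δλ the discrete entropy S(p) sits just above the bound S(ρ) + ln((Δλ)^(−1)), with the gap shrinking as Δλ → 0.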

2.2 Relative Entropy, Quantifying the Model Error and Additional Lack of Information

Shannon's entropy provides a systematic way of measuring the uncertainty, or the lack of information, in a single probability distribution. In many practical problems, the interest instead lies in measuring the lack of information in one distribution compared with another. Two typical situations are:

1. measuring the model error (i.e., the lack of information) in a given imperfect model p^M(x) compared with nature (i.e., the perfect model) p(x), and
2. assessing the information gained with the help of additional observations, p(x|y), compared with the direct model output p(x), where y stands for the observation.

The relative entropy is introduced to handle these issues.

2.2.1 Definition of Relative Entropy

Let us start with measuring the lack of information in the imperfect model p^M(x) compared with the perfect one p(x). Continuous measures are adopted here. Following the argument from Shannon's entropy, the ignorance (lack of information) about x in the perfect model is I(p(x)) = −ln(p(x)). The average ignorance, i.e., Shannon's entropy, is

⟨I(p(x))⟩ = −∫ p(x) ln(p(x)) dx.   (2.12)

Next, the ignorance about x in the imperfect model is I(p^M(x)) = −ln(p^M(x)). The model's expected ignorance is

⟨I(p^M(x))⟩ = −∫ p(x) ln(p^M(x)) dx,   (2.13)

where the outcomes are actually generated by p. The difference between these two expected ignorances leads to the concept of the relative entropy:

P(p, p^M) := ⟨I(p^M(x))⟩ − ⟨I(p(x))⟩ = ∫ p(x) ln(p(x)) dx − ∫ p(x) ln(p^M(x)) dx = ∫ p(x) ln(p(x)/p^M(x)) dx.   (2.14)

Note that the integrand in (2.13) is not p^M(x) ln(p^M(x)). This is because, even though an imperfect model is used to measure the information −ln(p^M(x)), the actual probability of x appearing is always p(x), which is independent of the choice of the model. In other words, the role of the model is to provide the lack of information for each event x, while the underlying distribution of the occurrence of x is objective regardless of the model used. As a remark, if p^M(x) ln(p^M(x)) were utilized in (2.13), then its difference with (2.12) would be the Shannon entropy difference, which has several deficiencies in quantifying the lack of information; see [272] for a comparison of the relative entropy and the Shannon entropy difference.

The relative entropy P(p, p^M), also known as the Kullback-Leibler (KL) divergence or information divergence [153, 154], is an objective metric for model error that measures the expected increase in ignorance, or lack of information, about the system incurred by using p^M when the outcomes are generated by p. It has two important features. First, P(p, p^M) is positive unless p = p^M, and it increases monotonically with the model error. Second, P(p, q) is invariant under a general nonlinear change of variables. This is crucial in practice, as different communities often use different coordinates and units; it also means that the relative entropy is a dimensionless measurement.


In addition to quantifying the model error in the imperfect model p^M with respect to the perfect one p, the relative entropy can also be used to measure the gain of information in the posterior distribution p(x|y) compared with the prior p(x). Here, the terms posterior and prior distributions come from data assimilation, where the posterior distribution refers to the estimation of the state obtained by combining the model forecast with the available observations, while the prior distribution simply comes from running the forecast model. In such a situation, the relative entropy reads P(p(x|y), p(x)). Note that the relative entropy is not symmetric in its two inputs, as exchanging them completely changes the physical interpretation, and the relative entropy does not obey the triangle inequality.

2.2.2 Relative Entropy in Gaussian Framework

Proposition 2.2 (Relative entropy in Gaussian framework) When both $p \sim \mathcal{N}(m_p, R_p)$ and $q \sim \mathcal{N}(m_q, R_q)$ are $n$-dimensional Gaussian distributions, the relative entropy $\mathcal{P}(p, q) = \int p \ln(p/q)$ has the following explicit formula,
$$
\mathcal{P}(p, q) = \frac{1}{2}(m_p - m_q)^{\mathsf{T}} R_q^{-1} (m_p - m_q) + \frac{1}{2}\left[\mathrm{tr}\big(R_p R_q^{-1}\big) - n - \ln \det\big(R_p R_q^{-1}\big)\right], \qquad (2.15)
$$
where $\cdot^{\mathsf{T}}$ is the vector transpose, $\mathrm{tr}(\cdot)$ is the matrix trace, and $\det(\cdot)$ is the determinant. On the right-hand side of (2.15), the first term is called the 'signal', measuring the lack of information in the mean weighted by the model covariance. The second term is called the 'dispersion', which involves the covariance ratio and quantifies the uncertainty in the covariance. It is seen that even in the Gaussian framework, the uncertainty reduction is not merely in the variance, which is different from Shannon's entropy (2.4).
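As a concrete illustration, formula (2.15) is straightforward to evaluate numerically. The sketch below (the function name and NumPy usage are ours, not from the book) computes the signal and dispersion parts separately:

```python
import numpy as np

def gaussian_relative_entropy(m_p, R_p, m_q, R_q):
    """Relative entropy P(p, q) between N(m_p, R_p) and N(m_q, R_q), Eq. (2.15):
    a 'signal' term from the means plus a 'dispersion' term from the covariances."""
    m_p = np.atleast_1d(np.asarray(m_p, dtype=float))
    m_q = np.atleast_1d(np.asarray(m_q, dtype=float))
    R_p = np.atleast_2d(np.asarray(R_p, dtype=float))
    R_q = np.atleast_2d(np.asarray(R_q, dtype=float))
    n = m_p.size
    Rq_inv = np.linalg.inv(R_q)
    dm = m_p - m_q
    signal = 0.5 * dm @ Rq_inv @ dm
    ratio = R_p @ Rq_inv
    dispersion = 0.5 * (np.trace(ratio) - n - np.log(np.linalg.det(ratio)))
    return float(signal + dispersion)

# Identical distributions carry zero relative entropy.
print(gaussian_relative_entropy([0.0], [[1.0]], [0.0], [[1.0]]))  # 0.0
```

Consistent with the first feature above, the output is zero only when the two Gaussians coincide and positive otherwise.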

2.2.3 Maximum Relative Entropy Principle

Similar to the maximum entropy principle in Definition 2.3, the maximum relative entropy principle provides a useful tool for constrained optimization. Its additional feature is to include prior information in the constrained optimization. For convenience of the statement, discrete probabilities are utilized here; the concept can be easily generalized to the continuous case.

Definition 2.4 (Maximum relative entropy principle) The least biased probability measure $p^*$, given all the constraints $\mathcal{C}$ and some prior information $p^0$, is the one that satisfies


$$
\max_{p \in \mathcal{C}} \mathcal{S}(p, p^0) = \mathcal{S}(p^*, p^0), \qquad \text{where } \mathcal{S}(p, p^0) = -\sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{p_i^0}\right).
$$

There are two special situations.

Proposition 2.3 If there is no prior knowledge, then the maximum relative entropy principle yields the same probability distribution $p^*$ as the maximum entropy principle. In fact, without external bias, $p^0 = (1/n) \sum_{i=1}^{n} \delta_{a_i}$. Then the relative entropy $\mathcal{S}(p, p^0)$ differs from the entropy $\mathcal{S}(p)$ by a constant, $\mathcal{S}(p, p^0) = \mathcal{S}(p) - \ln n$.

Proposition 2.4 If there is no additional constraint but certain prior information $p^0$ exists, then the probability distribution predicted by the maximum relative entropy principle must be the prior information itself, $p^* = p^0$. In this case, the only restriction is $\sum_{i=1}^{n} p_i = 1$. Utilizing the method of Lagrange multipliers, componentwise it yields $\ln\big(p_i^*/p_i^0\big) + (1 + \lambda) = 0$ with $i = 1, \ldots, n$, which implies that the ratio $p_i^*/p_i^0$ is a constant independent of $i$ and thus $p^* = p^0$.
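The constant shift in Proposition 2.3 is easy to verify numerically. The following sketch (the helper names are ours) checks that $\mathcal{S}(p, p^0) = \mathcal{S}(p) - \ln n$ when the prior is uniform:

```python
import math

def entropy(p):
    """Discrete Shannon entropy S(p) = -sum_i p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def relative_entropy(p, p0):
    """S(p, p0) = -sum_i p_i ln(p_i / p0_i), the quantity maximized in
    the maximum relative entropy principle."""
    return -sum(pi * math.log(pi / p0i) for pi, p0i in zip(p, p0) if pi > 0)

n = 4
p = [0.1, 0.2, 0.3, 0.4]
uniform = [1.0 / n] * n
# With an unbiased (uniform) prior, S(p, p0) = S(p) - ln n (Proposition 2.3).
print(relative_entropy(p, uniform), entropy(p) - math.log(n))
```

Note also that $\mathcal{S}(p, p^0) \le 0$ with equality exactly at $p = p^0$, which is the content of Proposition 2.4.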

2.3 Mutual Information

2.3.1 Definition of Mutual Information

The mutual information (also known as transinformation) of two random variables measures the variables' mutual dependence. It is defined by the relative entropy between the joint distribution and the product of the two marginal ones,
$$
M(X, Y) = \mathcal{P}\big(p(x, y),\, p(x) p(y)\big) = \int_X \int_Y p(x, y) \ln \frac{p(x, y)}{p(x) p(y)} \, dx \, dy. \qquad (2.16)
$$
According to the fundamental property of the relative entropy, it is clear that $M(X, Y) > 0$ unless $X$ and $Y$ are independent random variables. Mutual information is linked with other information measurements. In fact, (2.16) can be rewritten as:

$$
\begin{aligned}
M(X, Y) &= \int_X \int_Y p(x, y) \ln \frac{p(x|y)}{p(x)} \, dx \, dy \\
&= -\int_X \int_Y p(x, y) \ln p(x) \, dx \, dy + \int_X \int_Y p(x, y) \ln p(x|y) \, dx \, dy \\
&= -\int_X p(x) \ln p(x) \, dx - \left( -\int_X \int_Y p(x, y) \ln p(x|y) \, dx \, dy \right) \\
&= S(p(x)) - S(p(x|y)) := S(X) - S(X|Y),
\end{aligned}
$$


Fig. 2.2 Schematic illustration of the relationship between mutual information and other information measurements in (2.17)

where $S(X|Y) := -\int_X \int_Y p(x, y) \ln p(x|y) \, dx \, dy$ is the conditional entropy. Further define the joint entropy as $S(X, Y) := -\int_X \int_Y p(x, y) \ln p(x, y) \, dx \, dy$. The following equalities link the mutual information with these information measurements. See Fig. 2.2 for an intuitive illustration of these relationships.
$$
\begin{aligned}
M(X, Y) &= M(Y, X), \qquad M(X, X) = S(X), \\
M(X, Y) &= S(X) - S(X|Y), \\
M(X, Y) &= S(Y) - S(Y|X), \\
M(X, Y) &= S(X) + S(Y) - S(X, Y).
\end{aligned} \qquad (2.17)
$$

2.3.2 Mutual Information in Gaussian Framework

Proposition 2.5 (Mutual information in Gaussian framework) Consider a $2n$-dimensional Gaussian random variable $(\mathbf{x}, \mathbf{y})^{\mathsf{T}}$, where both $p(\mathbf{x}) \sim \mathcal{N}(m_x, R_x)$ and $p(\mathbf{y}) \sim \mathcal{N}(m_y, R_y)$ are $n$-dimensional Gaussian distributions and their cross-covariance is denoted by $R_{xy}$. Making use of the definition of the mutual information (2.16) and the expression of the relative entropy in the Gaussian framework (2.15), the mutual information is given as follows,
$$
M(X, Y) = \frac{1}{2} \ln \frac{\det(R_x) \det(R_y)}{\det(R)}, \qquad \text{with } R = \begin{pmatrix} R_x & R_{xy} \\ R_{yx} & R_y \end{pmatrix}. \qquad (2.18)
$$
As with Shannon's entropy, the mutual information for Gaussian random variables depends only on the covariance part.
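A minimal numerical sketch of (2.18) (the function name is ours), which extracts the diagonal blocks of the joint covariance and combines their determinants:

```python
import numpy as np

def gaussian_mutual_info(R):
    """Mutual information of a jointly Gaussian (x, y), Eq. (2.18):
    M = 0.5 * ln( det(Rx) det(Ry) / det(R) ), where R is the full 2n x 2n
    covariance matrix and Rx, Ry are its diagonal n x n blocks."""
    R = np.atleast_2d(np.asarray(R, dtype=float))
    n = R.shape[0] // 2
    Rx, Ry = R[:n, :n], R[n:, n:]
    return 0.5 * np.log(np.linalg.det(Rx) * np.linalg.det(Ry) / np.linalg.det(R))

# Scalar case with unit variances and correlation r: M = -0.5 * ln(1 - r^2).
r = 0.8
R = np.array([[1.0, r], [r, 1.0]])
print(gaussian_mutual_info(R))  # approximately 0.51
```

For uncorrelated blocks the determinants factor exactly and the mutual information vanishes, matching the independence statement below (2.16).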

2.4 Relationship Between Information and Path-Wise Measurements

Path-wise measurements are often utilized to examine the similarity between two trajectories, with applications in assessing the skill of data assimilation, prediction, and many other tasks. Information measurements tightly relate to the path-wise ones, but the former also quantify additional information beyond the latter’s scope [37].

2.4.1 Two Widely Used Path-Wise Measurements: Root-Mean-Square Error (RMSE) and Pattern Correlation

Among different path-wise measurements, the root-mean-square error (RMSE) and the pattern correlation (Corr) are the two that are most widely used in practice [135]. Given two one-dimensional time sequences $x(t_i)$ and $y(t_i)$ with $i = 1, \ldots, I$, these measurements are defined as:
$$
\mathrm{RMSE} = \left( \frac{\sum_{i=1}^{I} \big( y(t_i) - x(t_i) \big)^2}{I} \right)^{1/2}, \qquad (2.19a)
$$
$$
\mathrm{Corr} = \frac{\sum_{i=1}^{I} \big( y(t_i) - \mathrm{mean}(y) \big)\big( x(t_i) - \mathrm{mean}(x) \big)}{\sqrt{\sum_{i=1}^{I} \big( y(t_i) - \mathrm{mean}(y) \big)^2} \sqrt{\sum_{i=1}^{I} \big( x(t_i) - \mathrm{mean}(x) \big)^2}}, \qquad (2.19b)
$$
where $\mathrm{mean}(x)$ and $\mathrm{mean}(y)$ are the time averages of the sequences $x$ and $y$, respectively. In many practical situations, the quantity of interest is the anomaly of the time sequences (or time series), which implies that the mean values have already been removed.
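The two measurements in (2.19) translate directly into a few lines of code; the sketch below (function names are ours) assumes equal-length sequences:

```python
import math

def rmse(x, y):
    """Root-mean-square error between two equal-length sequences, Eq. (2.19a)."""
    I = len(x)
    return math.sqrt(sum((yi - xi) ** 2 for xi, yi in zip(x, y)) / I)

def pattern_corr(x, y):
    """Pattern correlation of two equal-length sequences, Eq. (2.19b)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((yi - my) * (xi - mx) for xi, yi in zip(x, y))
    den = math.sqrt(sum((yi - my) ** 2 for yi in y) * sum((xi - mx) ** 2 for xi in x))
    return num / den

x = [0.0, 1.0, 2.0, 3.0]
y = [0.5, 1.5, 2.5, 3.5]               # same pattern, shifted by a constant
print(rmse(x, y), pattern_corr(x, y))  # 0.5 and 1.0
```

The example illustrates the complementary nature of the two scores: a constant offset produces a nonzero RMSE yet a perfect pattern correlation.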

2.4.2 Relationship Between Shannon's Entropy of Residual and RMSE

Define $U = x - y$ as the residual of two one-dimensional time sequences $x = \{x_1, x_2, \ldots\}$ and $y = \{y_1, y_2, \ldots\}$, and $\rho(U)$ the associated PDF. Recall Shannon's entropy for a Gaussian random variable (2.4),
$$
S(\rho) = \frac{n}{2}(1 + \ln 2\pi) + \frac{1}{2} \ln \det(R), \qquad (2.20)
$$
which depends only on $R$, the variance of $U$. The variance $R$ is exactly the square of the RMSE in (2.19a). The result here indicates that as $x$ and $y$ become more and more different at each instant, the variance of the associated residual time sequence increases. As a result, the uncertainty of the residual grows together with the RMSE. Therefore, in the Gaussian framework, Shannon's entropy is the surrogate for the path-wise RMSE.
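To make the surrogate relationship concrete, the following sketch (the function name is ours) evaluates the scalar version of (2.20) with $R = \mathrm{RMSE}^2$ and confirms the monotone dependence:

```python
import math

def residual_entropy(rmse_value):
    """Shannon entropy (2.20) of a scalar (n = 1) Gaussian residual whose
    variance R equals the squared RMSE: S = (1 + ln(2 pi))/2 + ln(R)/2."""
    return 0.5 * (1.0 + math.log(2.0 * math.pi)) + 0.5 * math.log(rmse_value ** 2)

for e in (0.5, 1.0, 2.0):
    print(e, residual_entropy(e))  # the entropy increases monotonically with the RMSE
```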

36

2 Introduction to Information Theory

It is worth highlighting that, for non-Gaussian cases, Shannon's entropy remains skillful in capturing the uncertainty that may result from higher-order moments. However, according to the definition of the RMSE in (2.19a), only the square term of $x - y$ is involved. In other words, the RMSE provides no information beyond the second-order moment (the variance). Therefore, one should be cautious about drawing conclusions from the RMSE when the underlying sequences have extreme events and other non-Gaussian features.

2.4.3 Relationship Between Mutual Information and Pattern Correlation

Consider again two scalar time sequences, denoted by $x$ and $y$. Recall the mutual information in (2.18). When $R_x$ and $R_y$ are both scalars, (2.18) can be simplified as
$$
M(X, Y) = \frac{1}{2} \ln \left( \frac{R_x R_y}{R_x R_y - R_{xy}^2} \right) = \frac{1}{2} \ln \left( \frac{1}{1 - r^2} \right), \qquad (2.21)
$$
where $r = R_{xy} / \sqrt{R_x R_y}$ is the correlation coefficient. Clearly, $r$ is exactly the pattern correlation of the anomaly time series in (2.19b). Since $1 - r^2$ appears in the denominator inside the logarithm, the mutual information becomes larger as the absolute value of the pattern correlation between $x$ and $y$ increases. Thus, in the Gaussian framework, the mutual information is the surrogate for the path-wise pattern correlation. Similar to Shannon's entropy of the residual, the mutual information can represent the 'dependence' of two non-Gaussian variables, while the pattern correlation by design only takes into account the second-order statistics.
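A quick sketch of the scalar formula (2.21) (the function name is ours) shows how the mutual information grows with the correlation coefficient:

```python
import math

def gaussian_mi_scalar(r):
    """Scalar Gaussian mutual information, Eq. (2.21): M = 0.5 * ln(1/(1 - r^2))."""
    return 0.5 * math.log(1.0 / (1.0 - r * r))

for r in (0.0, 0.5, 0.9, 0.99):
    print(r, gaussian_mi_scalar(r))  # monotone increasing in |r|, diverging as |r| -> 1
```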

2.4.4 Why Relative Entropy is Important?

The path-wise measurements have been adopted as simple and effective criteria in many applications. However, one should be careful in drawing conclusions from results based on path-wise measurements. In fact, it has been shown in the previous subsections that path-wise measurements do not consider the information in higher-order statistics. Therefore, the interpretation of extreme events and other non-Gaussian features from path-wise measurements can be biased. On the other hand, even for time series with Gaussian statistics, path-wise measurements may not always be the optimal choice. Figure 2.3 illustrates two different forecast time series (red) with the same truth (blue). The underlying dynamics of the truth and the models for the forecast are both linear models with Gaussian noise, which lead to Gaussian PDFs. Comparing the two forecasts makes it intuitive that the one in Case (b) looks 'more accurate'. However, suppose the RMSE and pattern correlation are the only quantification criteria. In that case, the forecast skill in the two cases is comparable (Case (a) even has a slightly smaller RMSE):

Case (a): RMSE = 1.2375, Corr = 0.68488, Relative Entropy = 0.66197
Case (b): RMSE = 1.3786, Corr = 0.71401, Relative Entropy = 0.022488

Fig. 2.3 Comparison of two different forecast time series (red) with the same truth (blue) and the associated PDFs. The underlying dynamics of the truth and the models for the forecast are both linear models with Gaussian noise, which lead to Gaussian PDFs. The values of the skill scores are listed in the title of each panel

The issue in Case (a) is the severe underestimation of the amplitude in the forecast time series. Therefore, relative entropy is a natural criterion for assessing the forecast skill with respect to the amplitude. Clearly, the relative entropy in Case (b) is much smaller than that in Case (a). This simple example implies that the three information criteria (Shannon's entropy of the residual, the mutual information, and the relative entropy) should all be taken into consideration for a better evaluation of the forecast skill, or at least the relative entropy should be used together with the two path-wise measurements in nearly Gaussian scenarios.

3 Basic Stochastic Computational Methods

3.1 Monte Carlo Method

The Monte Carlo method is a widely used numerical technique for finding the solutions to various complex problems that are hard to solve analytically. The basic idea of the Monte Carlo method is to numerically approximate the desired quantity via repeated sampling of random variables, reaching convergence in a statistical sense as the number of samples becomes large. The Monte Carlo method often aims at recovering a statistical quantity of a stochastic system, such as a certain moment or the PDF, which itself is deterministic. On the other hand, such a random sampling technique is also applicable to handling many deterministic systems. The Monte Carlo method is one of the fundamental approaches to solving the time evolution of the statistics associated with SDEs, which will be presented in Sects. 3.2 and 3.3. This section provides the essential idea of the Monte Carlo method based on a simple example.

3.1.1 The Basic Idea

The goal is to approximate some quantity $r$ associated with a complex system. Suppose a random variable $X$ can be easily generated with a low computational cost, and the expected value of $X$ equals $r$. Then create a sequence of i.i.d. random variables $X_1, X_2, X_3, \ldots$ with the same distribution as $X$. The law of large numbers indicates that the average value of these random variables, namely
$$
\hat{X}_n = \frac{X_1 + \cdots + X_n}{n},
$$
will approach $r$ as $n \to \infty$. Moreover, the central limit theorem allows us to reach quantitative estimates on the difference $\hat{X}_n - r$ and provides confidence intervals.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_3

3.1.2 A Simple Example

The simple example utilized here to illustrate the idea of the Monte Carlo simulation is numerical integration, which was one of the earliest motivations for developing the Monte Carlo method. An analytic solution is usually not available when the integrand is complicated. In contrast, the Monte Carlo method can provide a suitable numerical approximation if the number of samples is sufficiently large. Let $-\infty < a < b < \infty$ and suppose our goal is to integrate a function $f$ over the interval $[a, b]$. The integral can be rewritten as an expectation of a uniform random variable $U$ over $[a, b]$, denoted by $\mathrm{Unif}(a, b)$,
$$
\int_a^b f(x) \, dx = (b - a) \int_a^b f(x) \frac{1}{b - a} \, dx = (b - a)\, \mathbb{E}[f(U)], \qquad (3.1)
$$
where $p(x) := 1/(b - a)$ inside the integral is the PDF of the uniform distribution. Thus, the integral can be estimated by repeatedly sampling from a $\mathrm{Unif}(a, b)$ random variable, applying $f$, and then taking the average (expectation). Let us consider a simple concrete example by computing the following integral
$$
I = \int_0^1 x^2 \, dx, \qquad (3.2)
$$
the true solution of which is $I = 1/3$. In light of the general algorithm in (3.1), a sequence of i.i.d. random numbers satisfying the uniform distribution over $[0, 1]$ is generated. Taking the square of these random numbers and then averaging them leads to the Monte Carlo approximation of the solution. One simulation using different numbers of samples $N$ is provided as follows:

N = 10,     100,    1000,   10000,  100000
I = 0.2000, 0.4000, 0.3330, 0.3285, 0.3346    (3.3)

which intuitively shows that the Monte Carlo solution converges to the truth. Yet, the values in (3.3) will be different if another experiment with different random numbers is carried out. Large differences are more likely to occur when N is small. See the uncertainty next to Panel (a) in Fig. 3.1. Therefore, a sufficiently large N is often required in practice to reduce such a sampling error. The uncertainty estimate as a function of N will be discussed in the next subsection.
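The experiment behind (3.3) can be reproduced with a few lines of code; the sketch below (the function name and the fixed random seed are ours) implements Method 1 from (3.1):

```python
import random

def mc_integral(f, a, b, N, seed=0):
    """Method 1, Eq. (3.1): estimate the integral of f over [a, b] as
    (b - a) times the sample average of f(U_i), U_i ~ Unif(a, b) i.i.d."""
    rng = random.Random(seed)
    return (b - a) * sum(f(rng.uniform(a, b)) for _ in range(N)) / N

# Estimates of I = int_0^1 x^2 dx = 1/3 for increasing sample sizes N.
for N in (10, 100, 1000, 10000):
    print(N, mc_integral(lambda x: x * x, 0.0, 1.0, N))
```

The exact printed values depend on the seed, but the estimates cluster ever more tightly around $1/3$ as $N$ grows, mirroring (3.3).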


Fig. 3.1 Schematic illustration of the three methods in approximating the numerical integration (3.2) using Monte Carlo approaches. The uncertainty next to Panel (a) shows the histogram of the estimates from 1000 repeated independent experiments, each containing N Monte Carlo samples

It is worth highlighting that there is a large degree of freedom in designing suitable Monte Carlo algorithms. In the context of numerical integration, the method in (3.1) is only one of many practical approaches; for convenience of the statement, it is named Method 1 and is schematically illustrated in Panel (a) of Fig. 3.1. Another Monte Carlo method for solving the same problem is shown in Panel (b) of the figure, named Method 2. It is based on the geometric interpretation of the integral, which represents the area under the curve $y = x^2$. Consider a two-dimensional random sample point $(x, y)$, where $x$ and $y$ are generated from two independent uniform distributions. The chance of this two-dimensional random sample point being located under the function curve equals the ratio between the value of the integral and the total area. Following such an argument, after generating a sequence of two-dimensional i.i.d. uniform random numbers $(x, y)$, count the total number of points $N_1$ that satisfy $y < x^2$. Taking the ratio between $N_1$ and $N$ and multiplying by the total area (which is one here) leads to an approximation of $I$. Note that one can design an even simpler method that exploits the Monte Carlo idea by combining it with deterministic methods. Beginning with the equipartition of the interval $[0, 1]$ into $N$ subintervals, one can generate a uniformly distributed random number $x_i$ within each subinterval. Then use a rectangle approximation for the function value within that interval, where the rectangle's width is $1/N$ while the height is $y_i = x_i^2$. See Panel (c) for the illustration of this idea, which is named Method 3. As $N$ becomes large, the randomness disappears, and the method approaches the standard Riemann-sum definition of the integral, which is expected to converge to the truth.
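Methods 2 and 3 can be sketched as follows (function names are ours; the fixed seeds are for reproducibility only):

```python
import random

def mc_hit_or_miss(N, seed=0):
    """Method 2: estimate I = int_0^1 x^2 dx as the fraction of uniform
    points (x, y) in the unit square that fall under the curve y = x^2,
    multiplied by the total area (= 1 here)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(N):
        x, y = rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)
        if y < x * x:       # the point lies below the curve
            hits += 1
    return hits / N

def mc_stratified(N, seed=0):
    """Method 3: one uniform node x_i per subinterval [i/N, (i+1)/N),
    then a rectangle rule with width 1/N and height x_i^2."""
    rng = random.Random(seed)
    return sum(rng.uniform(i / N, (i + 1) / N) ** 2 for i in range(N)) / N

print(mc_hit_or_miss(100000), mc_stratified(1000))  # both near 1/3
```

Method 3 converges much faster than Methods 1 and 2 for a fixed budget, since the randomness is confined to ever-smaller subintervals.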

3.1.3 Error Estimates for the Monte Carlo Method

Denote by $p(x)$ the PDF of a scalar variable $x$. The expectation of $x$ is given by
$$
\mu_x := \mathbb{E}(x) = \int_{-\infty}^{\infty} x\, p(x) \, dx.
$$

With N i.i.d. samples x1 , . . . , x N , a reasonable estimator for μx is the sample mean,


$$
\bar{x} := \frac{1}{N} \sum_{i=1}^{N} x_i. \qquad (3.4)
$$

However, since the Monte Carlo simulation involves a finite number of random draws of the inputs, different runs will lead to different results of $\bar{x}$ in (3.4). As has been seen in Fig. 3.1, the variability in the results of $\bar{x}$ (i.e., how much the mean estimate varies from one set of random draws to another) depends on $N$. A natural question is: on average, how accurate is $\bar{x}$ as an estimate of $\mu_x$? To answer this, take the expectation of the difference between the truth and the estimator, $\bar{x} - \mu_x$,
$$
\mathbb{E}(\bar{x} - \mu_x) = \mathbb{E}(\bar{x}) - \mu_x = \mathbb{E}\left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) - \mu_x = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}(x_i) - \mu_x. \qquad (3.5)
$$
Since the $x_i$'s in (3.5) come from a random sampling of the inputs when using the Monte Carlo method, $\mathbb{E}(x_i) = \mu_x$, which leads to $\mathbb{E}(\bar{x} - \mu_x) = 0$. This result shows that, on average, the error in using $\bar{x}$ to approximate $\mu_x$ is zero. When an estimator gives an expected error of zero, it is called an unbiased estimator. Next, to quantify the variability in $\bar{x}$, the variance of $\bar{x} - \mu_x$ is computed,
$$
\mathrm{Var}(\bar{x} - \mu_x) = \mathrm{Var}(\bar{x}) - \mathrm{Var}(\mu_x) = \mathrm{Var}(\bar{x}) = \mathrm{Var}\left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = \frac{1}{N^2} \mathrm{Var}\left( \sum_{i=1}^{N} x_i \right),
$$
where the variance of $\mu_x$ is zero as it is a constant. Since the Monte Carlo method draws i.i.d. random samples, the variance of the sum of the samples $x_i$ is the sum of their variances, which leads to
$$
\mathrm{Var}(\bar{x} - \mu_x) = \frac{1}{N^2} \sum_{i=1}^{N} \mathrm{Var}(x_i) = \frac{1}{N^2}\, N \sigma_x^2 = \frac{\sigma_x^2}{N}.
$$
The standard error of the estimator, which is the square root of the variance, decreases with the square root of the sample size, $\sqrt{N}$, which is precisely the case shown in Fig. 3.1. In other words, to decrease the variability in the mean estimate by a factor of 10 requires a factor of 100 increase in the number of Monte Carlo trials. For a large sample size $N$, the central limit theorem can be applied to approximate the distribution of $\bar{x}$. Specifically, it says that for large $N$, the distribution of $\bar{x} - \mu_x$ will approach a normal distribution with mean 0 and variance $\sigma_x^2 / N$. In practice, the actual value of $\sigma_x$ is unknown. Therefore, to compute the error estimates or provide a confidence interval, the sample variance is often utilized as an approximation. This introduces additional uncertainty in the quality of the estimate, which could be significant for small sample sizes.
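The $1/\sqrt{N}$ scaling of the standard error can be observed empirically. The sketch below (the function name is ours) repeats the sample-mean estimator of $\mathbb{E}[U^2] = 1/3$ many times and measures the spread across repetitions; quadrupling $N$ roughly halves the spread:

```python
import random, statistics

def sample_means(N, trials, seed=0):
    """Return `trials` independent realizations of the N-sample Monte Carlo
    estimator (3.4) for E[U^2] with U ~ Unif(0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.uniform(0.0, 1.0) ** 2 for _ in range(N)) / N
            for _ in range(trials)]

for N in (100, 400, 1600):
    spread = statistics.stdev(sample_means(N, 500, seed=N))
    print(N, spread)   # the spread shrinks roughly like 1/sqrt(N)
```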

3.2 Numerical Schemes for Solving SDEs

Numerical Schemes for Solving SDEs

Recall the SDE in (1.15),
$$
dX(t) = A(X(t), t) \, dt + B(X(t), t) \, dW(t), \qquad (3.6)
$$
the exact solution of which from time $t_n$ to $t_{n+1}$ is given by
$$
X(t_{n+1}) = X(t_n) + \int_{t_n}^{t_{n+1}} A(X(s)) \, ds + \int_{t_n}^{t_{n+1}} B(X(s)) \, dW(s). \qquad (3.7)
$$

3.2.1 Euler-Maruyama Scheme

The Euler-Maruyama method is one of the most widely used numerical schemes for solving SDEs. Denote by $X_{n+1}$ the numerical solution of $X(t_{n+1})$. Let $\Delta t = t_{n+1} - t_n$. The Euler-Maruyama scheme reads [151]:
$$
X_{n+1} = X_n + A(X_n) \Delta t + B(X_n) \Delta W_n, \qquad (3.8)
$$
where the left endpoint is utilized in $B(X_n)$, as in Itô's calculus. The first two terms on the right-hand side of (3.8) are precisely the Euler scheme for solving the ODE. The last term is the numerical approximation of the stochastic part. Here, $\Delta W_n = \sqrt{\Delta t} \cdot Z$, where $Z \sim \mathcal{N}(0, 1)$ is drawn from a standard Gaussian random variable.

The next task is to explore the convergence of the Euler-Maruyama scheme. In other words, it is important to understand in what sense $|X_n - X(t_n)| \to 0$ as $\Delta t \to 0$. Among different definitions of convergence for sequences of random variables, the two most common and useful concepts in numerical SDEs are [200]:

1. The weak convergence (or, loosely speaking, the error of the mean):
$$
e^{\mathrm{weak}}_{\Delta t} = \sup_{0 \le t_n \le T} \big| \mathbb{E}[\Phi(X_n)] - \mathbb{E}[\Phi(X(t_n))] \big|,
$$
where $\Phi$ is a given deterministic function. Weak convergence means $e^{\mathrm{weak}}_{\Delta t} \to 0$ as $\Delta t \to 0$, and it captures the average behavior. In practice, $\mathbb{E}[\Phi(X_n)]$ is estimated by exploiting Monte Carlo simulation over $N \gg 1$ paths, which carries a sampling error of order $1/\sqrt{N}$. A numerical method is said to have weak convergence of order $p$ if $e^{\mathrm{weak}}_{\Delta t} \le K \Delta t^p$ for all $0 < \Delta t \le \Delta t^*$.

2. The strong convergence (or, loosely speaking, the mean of the error):
$$
e^{\mathrm{strong}}_{\Delta t} = \sup_{0 \le t_n \le T} \mathbb{E}\big[ |X_n - X(t_n)| \big],
$$


which measures exactly the path-wise behavior. A numerical method is said to have strong convergence of order $p$ if $e^{\mathrm{strong}}_{\Delta t} \le K \Delta t^p$ for all $0 < \Delta t \le \Delta t^*$. Strong convergence often implies weak convergence. In general, the Euler-Maruyama scheme has weak order $p = 1$ and strong order $p = 1/2$, but for certain SDEs the strong order can be enhanced (see the next subsection).
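A minimal sketch of the Euler-Maruyama scheme (3.8) (the function name and the Ornstein-Uhlenbeck test case are ours):

```python
import math, random

def euler_maruyama(A, B, x0, T, dt, seed=0):
    """Euler-Maruyama scheme (3.8): X_{n+1} = X_n + A(X_n) dt + B(X_n) dW_n,
    with dW_n = sqrt(dt) * Z and Z ~ N(0, 1). Returns the full discrete path."""
    rng = random.Random(seed)
    n_steps = int(round(T / dt))
    x, path = x0, [x0]
    for _ in range(n_steps):
        dW = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = x + A(x) * dt + B(x) * dW
        path.append(x)
    return path

# Ornstein-Uhlenbeck example with additive noise: dX = -a X dt + sigma dW.
a, sigma = 1.0, 0.5
path = euler_maruyama(lambda x: -a * x, lambda x: sigma, x0=1.0, T=5.0, dt=1e-3)
print(len(path), path[-1])   # 5001 recorded points; the endpoint is random
```

With $B \equiv 0$ the scheme reduces to the deterministic forward Euler method, which is a convenient sanity check.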

3.2.2 Milstein Scheme

In the Euler-Maruyama scheme (3.8), the second term on the right-hand side is of order $\Delta t$ while the third term is of order $\sqrt{\Delta t}$. Therefore, the main source of the error comes from the stochastic part. A natural idea to improve the accuracy is to use a higher-order expansion to approximate the stochastic part. The following method with a second-order expansion of the stochastic part is called the Milstein scheme [192],
$$
X_{n+1} = X_n + A(X_n) \Delta t + B(X_n) \Delta W_n + \frac{1}{2} B(X_n) B'(X_n) \big( \Delta W_n^2 - \Delta t \big), \qquad (3.9)
$$
where $B'$ is the derivative of $B$. Instead of presenting the full rigorous derivation of the Milstein scheme, the following example illustrates the basic idea. Consider the SDE named the geometric Brownian motion,
$$
dX(t) = \mu X(t) \, dt + \sigma X(t) \, dW(t), \qquad (3.10)
$$
where $\mu$ and $\sigma$ are both constants. Applying Itô's formula yields $d \ln X(t) = \big( \mu - \frac{1}{2} \sigma^2 \big)\, dt + \sigma \, dW(t)$. The solution of (3.10) is then given by
$$
\begin{aligned}
X_{n+1} &= X_n \exp\left( \int_t^{t + \Delta t} \left( \mu - \frac{1}{2} \sigma^2 \right) dt + \int_t^{t + \Delta t} \sigma \, dW(t) \right) \\
&\approx X_n \left( 1 + \mu \Delta t - \frac{1}{2} \sigma^2 \Delta t + \sigma \Delta W_n + \frac{1}{2} \sigma^2 (\Delta W_n)^2 \right) \\
&= X_n + A(X_n) \Delta t + B(X_n) \Delta W_n + \frac{1}{2} B(X_n) B'(X_n) \big( (\Delta W_n)^2 - \Delta t \big),
\end{aligned}
$$
where $A(X_t) = \mu X_t$ and $B(X_t) = \sigma X_t$ as in (3.6). In general, the Euler-Maruyama method has weak order $p = 1$ and strong order $p = 1/2$ on appropriate SDEs, while the Milstein method has $p = 1$ for both weak and strong orders. Yet, when the diffusion coefficient is a constant, $B'(X_n) \equiv 0$ and therefore the correction term $\frac{1}{2} B(X_n) B'(X_n) (\Delta W_n^2 - \Delta t)$ vanishes. In such a situation, the Euler-Maruyama method coincides with the Milstein method and has $p = 1$ for both weak and strong orders.
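For the geometric Brownian motion (3.10), where $B(x) = \sigma x$ and $B'(x) = \sigma$, the Milstein scheme (3.9) can be sketched as follows (the function name is ours):

```python
import math, random

def milstein_gbm(mu, sigma, x0, T, dt, seed=0):
    """Milstein scheme (3.9) applied to geometric Brownian motion
    dX = mu X dt + sigma X dW, so that B(x) = sigma x and B'(x) = sigma."""
    rng = random.Random(seed)
    x, n_steps = x0, int(round(T / dt))
    for _ in range(n_steps):
        dW = math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x = (x + mu * x * dt + sigma * x * dW
               + 0.5 * (sigma * x) * sigma * (dW * dW - dt))   # Milstein correction
    return x

print(milstein_gbm(mu=0.05, sigma=0.2, x0=1.0, T=1.0, dt=1e-3))
```

Setting $\sigma = 0$ removes both the noise and the correction term, and the scheme reduces to forward Euler for $dx = \mu x \, dt$.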


The Euler-Maruyama method is adopted in many practical applications where the SDEs contain only additive noise (i.e., $B$ is a constant). Due to its simple form, it is also widely used for systems with multiplicative noise (i.e., $B$ depends on $X$), although it is less accurate than the Milstein method. Higher-order numerical schemes for the stochastic part are seldom utilized because of the high computational cost. It is also worthwhile to point out that, for many complex turbulent systems, a high-order numerical scheme (e.g., the fourth-order Runge-Kutta method) is needed for solving the deterministic part to ensure numerical stability, while the Euler-Maruyama or Milstein method can take care of the stochastic part.

3.3 Ensemble Method for Solving the Statistics of SDEs

The Euler-Maruyama or the Milstein method provides a single trajectory of a given SDE, which depends on the random noise generated at each numerical integration step. To numerically obtain the statistics of the SDE, the Monte Carlo method can be combined with such a procedure to create a number of trajectories. This is achieved by repeatedly running the SDE numerically with different random numbers and is called the ensemble method. An ergodic system is a dynamical system in which the proportion of time spent in a particular state equals the probability that the system will be found in that state at a random moment. If a stochastic system is ergodic, then the time average of its properties equals the average over the entire phase space (the ensemble average). In other words, the long-term behavior of the statistics (namely the equilibrium statistics) can be computed by collecting the points along a single long trajectory. This significantly reduces the computational cost for complex systems, since otherwise repeated simulations with $N$ trajectories would need to be carried out to reach the equilibrium statistics. Many complex systems in practice are ergodic. Therefore, the time average is often utilized to obtain the critical statistical quantities of interest. While the rigorous definition of ergodicity is not the focus of this book, the following examples provide some intuition about ergodic systems. The first example is the following linear model with additive noise,
$$
dx(t) = \big( -a x(t) + f(t) \big) \, dt + \sigma \, dW_A(t). \qquad (3.11)
$$

Let $a = \sigma = 1$ and $f \equiv 0$. As is seen in the first row of Fig. 3.2, the PDF computed based on a long trajectory (blue) is consistent with the theoretical value of the equilibrium PDF, which can also be computed numerically by exploiting a large number of samples at a long time. Therefore, the system is ergodic. Column (c) includes 10 different realizations. These trajectories, at any instant far from the initial condition, cover the entire phase space that the trajectory will pass through. The second example is a nonlinear model with multiplicative noise,
$$
dx(t) = \big[ F + a x(t) + b x^2(t) - c x^3(t) \big] \, dt + \big( A - B x(t) \big) \, dW_C(t) + \sigma \, dW_A(t), \qquad (3.12)
$$


Fig. 3.2 Model trajectory and equilibrium PDF of different SDEs. The first row and the third row correspond to the linear model (3.11) with the forcing being f ≡ 0 and f(t) = sin(t/100), respectively. The second row shows the results from the nonlinear model (3.12). Column (a) shows a long trajectory, which is utilized to compute the equilibrium PDF shown in blue in Column (b). The trajectory consists of 2000 units in total. The model reaches the statistical equilibrium or the attractor after a few units; the first 100 units are discarded in computing the PDF. The red curve in Column (b) is the theoretical value, which can also be computed numerically by exploiting many samples for a long time. Note that for the third row, there is no equilibrium since f(t) is not a constant, but the red curve shows the PDF for a long run at a certain time instant when the system reaches the statistical attractor. Column (c) includes ten different realizations of each model

the details of which will be discussed in Sect. 4.3. By choosing the parameters $a = 4$, $b = 2$, $c = 1$, $F = 0.1$, $A = 1$, $B = -1$ and $\sigma = 1$, the model exhibits a regime-switching feature, where the trajectory roughly jumps between two states. Because of this, the associated PDF is bimodal. Despite the sudden switches in the path-wise behavior, the system is ergodic. The random switches among multiple trajectories allow them to cover the entire phase space (see Column (c)); therefore, the proportion of time that the model spends at a specific value equals the probability that it will be found at that value at any random moment. Finally, let us consider again the system (3.11) but adopt a time-periodic forcing $f(t) = \sin(t/100)$. Now the long-term PDF is no longer fixed but has a time-periodic behavior following the forcing, named the statistical attractor. The time average in such a case does not equal the ensemble average, as the latter changes in time. The third row in Column (b) only shows the PDF at a specific time on the statistical attractor (red curve), which is very different from the PDF computed from the time average. Column (c) shows that, at each fixed time instant, the ensemble of trajectories cannot cover the entire phase space. For example, at $t = 140$, none of the trajectories shown in the figure is close to $x = -5$, which is clearly in the phase space and has a non-negligible probability. Such a system is thus not ergodic.
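The ergodicity of the linear model (3.11) with $f \equiv 0$ can be checked numerically: the time-averaged statistics of one long trajectory should match the Gaussian equilibrium distribution $\mathcal{N}(0, \sigma^2/(2a))$. A sketch (the function name is ours; Euler-Maruyama discretization):

```python
import math, random

def ou_time_average_stats(a, sigma, T, dt, burn, seed=0):
    """Time-average mean and variance of one long trajectory of (3.11) with
    f = 0, dx = -a x dt + sigma dW. For this ergodic model they should match
    the equilibrium (ensemble) values 0 and sigma^2 / (2a)."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    n_steps = int(round(T / dt))
    for n in range(n_steps):
        x += -a * x * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        if n * dt > burn:          # discard the transient before the attractor
            samples.append(x)
    m = sum(samples) / len(samples)
    v = sum((s - m) ** 2 for s in samples) / len(samples)
    return m, v

m, v = ou_time_average_stats(a=1.0, sigma=1.0, T=2000.0, dt=0.01, burn=100.0)
print(m, v)   # close to the equilibrium values 0 and 0.5
```

This mirrors the first row of Fig. 3.2: a single 2000-unit trajectory, with the first 100 units discarded, recovers the equilibrium statistics.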

3.4 Kernel Density Estimation

Kernel Density Estimation

Kernel density estimation (KDE) is a non-parametric method to estimate the PDF of a random variable using a finite number of samples [88, 124, 230]. Different from the direct numerical construction of a PDF based on a histogram, which is usually rough, the KDE provides a smoothed PDF, which facilitates analysis and applications. Let $x_1, x_2, \ldots, x_N$ be $N$ i.i.d. samples drawn from an unknown PDF $p(x)$. The KDE aims at estimating $p(x)$ as follows:
$$
\hat{p}_h(x) = \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i) = \frac{1}{N h} \sum_{i=1}^{N} K\left( \frac{x - x_i}{h} \right). \qquad (3.13)
$$
In (3.13), $K(\cdot)$ is the kernel function, which is non-negative and predetermined. The parameter $h > 0$ is called the bandwidth, which controls the degree of smoothness. Note that, since the choice of the kernel and its bandwidth affects the accuracy of $\hat{p}_h(x)$, there is always a trade-off between the error (the so-called bias) and the smoothness of the estimator (the so-called variance). The Gaussian kernel is one of the simplest kernels utilized in practice. In the one-dimensional case, it reads:
$$
K\left( \frac{x - x_i}{h} \right) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x - x_i)^2}{2 h^2}}. \qquad (3.14)
$$
In multi-dimensional cases, the bandwidth is related to the covariance matrix. Given the profile of the kernel function, the most common optimality criterion used to select the bandwidth $h$ is the mean integrated squared error (MISE):

$$
\mathrm{MISE}(h) = \mathbb{E}\left[ \int \big( \hat{p}_h(x) - p(x) \big)^2 \, dx \right]. \qquad (3.15)
$$
The calculation of the MISE involves the bias-variance tradeoff, the details of which can be found in [95, 261]. Note that the true density $p(x)$ in (3.15) is unknown in practice. Therefore, various approximate methods are used to compute the MISE in (3.15). In particular, if Gaussian kernels are used to approximate univariate data and the underlying density being estimated is Gaussian, then the optimal choice for $h$ is
$$
h = \left( \frac{4 \hat{\sigma}^5}{3 N} \right)^{1/5} \approx 1.06\, \hat{\sigma}\, N^{-1/5}, \qquad (3.16)
$$
where $\hat{\sigma}$ is the standard deviation of the samples. The formula in (3.16) is often known as the rule-of-thumb bandwidth estimator. While this rule of thumb is easy to compute, it should be used cautiously, as it can yield widely inaccurate estimates when the density is not close to Gaussian. One appropriate method for optimizing the bandwidth that does not impose



Fig. 3.3 Approximation of different PDFs by the KDE. The first two rows show the approximated PDFs with a different number of samples N for recovering the standard Gaussian distribution and a Gamma distribution, respectively. The third row (Panels (e)–(f)) compares two KDE methods for a bimodal distribution with N = 1000

any requirement for the profile of the underlying PDF is the so-called "solve-the-equation plug-in" approach [32]. Figure 3.3 shows the approximation of different PDFs by the KDE. The KDE approximates a standard Gaussian distribution in the first row, where the bandwidth is determined using the rule-of-thumb criterion (3.16). With N = 10 well-distributed samples, the KDE already provides a good fit to the truth. When N = 1000, the resulting PDF becomes much more robust. The second row shows the approximation of a Gamma distribution,

p(x) = (1/(b^a Γ(a))) x^{a−1} e^{−x/b},

with shape parameter a = 2 and scale parameter b = 2, where Γ(a) = ∫_0^∞ z^{a−1} e^{−z} dz is the Gamma function. The Gamma distribution has a one-sided fat tail and is positively skewed. With a sufficiently large number of samples, the KDE method can still provide a smooth PDF with reasonable accuracy. In particular, the tail of the distribution is smoothed by the KDE, which is often hard to achieve using a standard histogram plot. Finally, the third row (Panels (e)–(f)) compares the two KDE methods for a bimodal distribution. It is clear that even with N = 1000 samples, the KDE with the rule-of-thumb criterion still brings about a large error because the PDF is oversmoothed. In contrast, the KDE estimator using the "solve-the-equation plug-in" approach recovers the truth accurately. In fact, as is shown in Panel (f), the bandwidth associated with the "solve-the-equation plug-in" approach is much smaller than that from the rule-of-thumb criterion.
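The Gaussian-kernel KDE with the rule-of-thumb bandwidth (3.16) takes only a few lines. Below is a minimal sketch (the helper names and the test data are illustrative, not from the text):

```python
import numpy as np

def silverman_bandwidth(samples):
    """Rule-of-thumb bandwidth (3.16): h = 1.06 * sigma_hat * N^(-1/5)."""
    return 1.06 * np.std(samples, ddof=1) * len(samples) ** (-0.2)

def kde(x_grid, samples, h=None):
    """Evaluate p_hat(x) = (1/(N h)) * sum_i K((x - x_i)/h), Gaussian kernel (3.14)."""
    if h is None:
        h = silverman_bandwidth(samples)
    u = (x_grid[:, None] - samples[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel
    return K.sum(axis=1) / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.standard_normal(1000)                  # draw from a standard Gaussian
x_grid = np.linspace(-5.0, 5.0, 401)
p_hat = kde(x_grid, samples)
p_true = np.exp(-0.5 * x_grid**2) / np.sqrt(2.0 * np.pi)
max_err = np.max(np.abs(p_hat - p_true))             # small for N = 1000
```

Since the underlying density here is Gaussian, the rule of thumb is near-optimal; for the bimodal example of Fig. 3.3 the same code would oversmooth.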

4 Simple Gaussian and Non-Gaussian SDEs

4.1 Linear Gaussian SDEs

An SDE is called a linear Gaussian system if it meets the following three conditions: (1) the initial data are Gaussian distributed, (2) the model has linear dynamics, and (3) the stochastic noise is Gaussian. A linear Gaussian system has a Gaussian distribution at every time instant. The first and the third conditions require the initial state and the external input of the system to be Gaussian, while the second condition guarantees that Gaussianity is preserved under linear transformations. The simplest linear Gaussian SDE is the following real-valued scalar model,

dx_t = (−a x_t + f_t) dt + σ dW_t,    (4.1)

where W_t is the Wiener process and the subscript t means the variable is a function of time. In (4.1), a and σ are both constants, with −a < 0 representing the damping. The forcing f_t can either be a constant or a time-dependent deterministic function.
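A sample path of (4.1) can be generated with the Euler–Maruyama scheme of Chapter 3. A minimal sketch with illustrative parameter values (not taken from the text):

```python
import numpy as np

# Euler–Maruyama simulation of dx = (-a*x + f) dt + sigma dW, model (4.1).
a, f, sigma = 1.0, 1.0, 1.0        # illustrative constants
dt, n_steps = 1e-3, 10_000
rng = np.random.default_rng(42)

x = np.empty(n_steps + 1)
x[0] = 0.0
for k in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal()          # Wiener increment ~ N(0, dt)
    x[k + 1] = x[k] + (-a * x[k] + f) * dt + sigma * dW
```

After a transient of a few units of 1/a, the trajectory fluctuates around the equilibrium mean f/a.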

4.1.1 Reynolds Decomposition and Time Evolution of the Moments

This subsection aims at deriving the moment equations of SDEs using the so-called Reynolds decomposition method [4, 196]. The solution of the moment equations can be utilized to reconstruct the time evolution of the PDF in light of the maximum entropy principle, serving as an efficient approximate solution of the Fokker-Planck equation, which is, in general, challenging to solve. For a linear Gaussian SDE, the PDF is uniquely determined by the mean and the covariance. Therefore, the reconstructed PDF using these two statistics is

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_4


exact. In the following, the scalar SDE in (4.1) with a constant forcing f will be utilized to present the general idea.

Recall that x_t at each fixed time instant t is a random variable. The Reynolds decomposition separates its mean from the fluctuation part,

x_t = ⟨x_t⟩ + x_t',    (4.2)

where ⟨·⟩ represents the statistical average for a random variable (see Sect. 1.1.2). It is also called the ensemble average, as x_t can be regarded as a collection of many Monte Carlo samples. At each fixed time instant t, the ensemble mean ⟨x_t⟩ in (4.2) is a deterministic number, while x_t' remains a random variable whose mean is shifted to zero, namely ⟨x_t'⟩ = 0, and the variance of x_t equals ⟨(x_t')²⟩. As a quick remark, it is often easier to solve the equations of the raw moments instead of the centralized moments. Nevertheless, the latter can be recovered from the former. For the variance, the following simple relationship allows it to be inferred from the second-order raw moment ⟨x_t²⟩:

(x_t')² = (x_t − ⟨x_t⟩)² = x_t² + ⟨x_t⟩² − 2 x_t ⟨x_t⟩,
⟨(x_t')²⟩ = ⟨x_t²⟩ + ⟨x_t⟩² − 2⟨x_t⟩² = ⟨x_t²⟩ − ⟨x_t⟩².

To write down the evolution equation of the mean, it suffices to apply the ensemble average to the entire model (4.1). Since a and f are constants, they remain unchanged, while x_t becomes ⟨x_t⟩ using (4.2). The standard Gaussian white noise has zero mean and thus vanishes. Therefore, the mean equation is a deterministic ODE,

d⟨x_t⟩ = (−a⟨x_t⟩ + f) dt.    (4.3)

Denote by ⟨x_0⟩ the average of the initial ensemble. The solution of (4.3) is given by

⟨x_t⟩ = ⟨x_0⟩ e^{−at} + (f/a)(1 − e^{−at}).    (4.4)

Next, subtracting (4.3) from (4.1) gives an equation for the fluctuation part x_t',

dx_t' = −a x_t' dt + σ dW_t.

To form an equation for the variance ⟨(x_t')²⟩, an intermediate step of finding the equation of (x_t')² is essential. Since x_t' is a stochastic process, Itô's formula has to be applied to obtain the correct expression for the time evolution of (x_t')², which yields

d(x_t')² = 2x_t' dx_t' + (dx_t')² = (−2a(x_t')² + σ²) dt + 2σ x_t' dW_t.    (4.5)

Taking the ensemble average of (4.5) leads to

d⟨(x_t')²⟩ = (−2a⟨(x_t')²⟩ + σ²) dt.    (4.6)

Denote by ⟨(x_0')²⟩ the variance of the initial condition. The solution of (4.6) is

⟨(x_t')²⟩ = ⟨(x_0')²⟩ e^{−2at} + (σ²/2a)(1 − e^{−2at}).    (4.7)

As a final remark, despite the randomness in the underlying system (4.1), the associated moment equations are always deterministic, as the moments are statistical averages of the underlying random variable. Closed analytic formulae are generally not available for more complicated SDEs (details will be shown below). Nevertheless, standard ODE solvers are often sufficient to provide accurate solutions to the moment equations, which can then be combined with the maximum entropy principle to reach an approximate solution of the Fokker-Planck equation.
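The analytic mean and variance, (4.4) and (4.7), can be checked directly against a Monte Carlo ensemble. A minimal sketch with illustrative parameters and a Gaussian initial ensemble:

```python
import numpy as np

# Compare the closed-form mean (4.4) and variance (4.7) of model (4.1)
# with an Euler–Maruyama Monte Carlo ensemble at time t.
a, f, sigma = 1.0, 2.0, 0.5
t = 0.75
x0_mean, x0_var = -1.0, 0.04           # Gaussian initial ensemble statistics

mean_exact = x0_mean * np.exp(-a * t) + (f / a) * (1 - np.exp(-a * t))
var_exact = x0_var * np.exp(-2 * a * t) + sigma**2 / (2 * a) * (1 - np.exp(-2 * a * t))

rng = np.random.default_rng(1)
n_ens, dt = 100_000, 1e-3
x = x0_mean + np.sqrt(x0_var) * rng.standard_normal(n_ens)
for _ in range(int(t / dt)):
    dW = np.sqrt(dt) * rng.standard_normal(n_ens)
    x += (-a * x + f) * dt + sigma * dW

mean_mc, var_mc = x.mean(), x.var()    # agree with the formulas up to sampling error
```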

4.1.2 Statistical Equilibrium State and Decorrelation Time

Recall the solutions of the mean (4.4) and the variance (4.7) associated with the real-valued scalar SDE (4.1) (assuming f is a constant). As time evolves, although each ensemble member still fluctuates due to the unpredictable white noise, the solutions of the moments converge to constant values. So does the PDF. Such a long-term behavior is called the statistical equilibrium, which appears in many linear and nonlinear SDEs with constant forcings. When the forcing is a time-periodic function, the long-term behavior of the PDF is often time-periodic as well. In such a situation, the quasi-stationary solution is called the attractor. Since a > 0, the term e^{−at} vanishes as t → ∞. The equilibrium solutions of the mean (4.4) and the variance (4.7) are thus given by x̄_∞ = f/a and Var(x)_∞ = σ²/(2a), respectively. However, the information on the equilibrium distribution is not sufficient to uniquely determine the scalar SDE (4.1). From the algebraic viewpoint, there are three parameters in (4.1) to be determined, while the equilibrium mean and the equilibrium variance provide only two conditions. From the dynamical-systems point of view, the equilibrium statistics do not include any temporal information about the system; thus, different dynamical models may have the same equilibrium distribution. Figure 4.1 shows simulations of the scalar SDE (4.1) with three different sets of parameters, which lead to precisely the same PDFs. However, the three trajectories look very distinct from each other. The temporal structure and dependence between nearby points in Case (1) are quite clear. The temporal structure becomes weaker in Case (2), while the time series in Case (3) looks similar to white noise. The temporal structure of the scalar SDE (4.1) can be described by a single parameter a, which represents the damping rate of the system. The inverse of a, i.e., 1/a, is called the decorrelation time, which measures the system's memory.
Fig. 4.1 Simulations of the scalar SDE (4.1) with three different sets of parameters. Case (1): a = 0.25, f = 0.25, σ = 0.5. Case (2): a = 1.0, f = 1.0, σ = 1.0. Case (3): a = 4.0, f = 4.0, σ = 2.0. The associated PDFs and the autocorrelation functions (ACFs) of the three cases are also illustrated in Columns (b) and (c)

When a is large, the signal will be strongly damped, and the memory of the signal will be short, since the signal is dominated by random noise. This is Case (3) in Fig. 4.1. On the other hand, when a is small, the decorrelation time, or the system's memory, is long, indicating long predictability. See Case (1). The general definition of the decorrelation time is as follows:

Definition 4.1 (Decorrelation time) The decorrelation time of a stationary process is given by

τ_corr = ∫_0^∞ R(s) ds,    (4.8)

where R(s) is the autocorrelation function (ACF),

R(s) = E[(x_t − x̄_∞)(x_{t+s} − x̄_∞)] / Var(x)_∞.    (4.9)

The ACF at a time lag s measures the cross-covariance between points with a fixed shift s in the temporal direction. It is usually a decaying function of time, as the correlation becomes weaker when the lag increases. For the scalar SDE (4.1), the ACF is the exponential function R(s) = exp(−as), and the associated decorrelation time is τ_corr = 1/a. The ACFs shown in Column (c) of Fig. 4.1 are consistent with the temporal dependence observed from the time series in Column (a).
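The ACF (4.9) and the decorrelation time (4.8) are easy to estimate from a long trajectory. A minimal sketch for model (4.1), where the truth is R(s) = exp(−as) and τ_corr = 1/a (parameter values are illustrative):

```python
import numpy as np

# Estimate the ACF and decorrelation time of model (4.1) from one long path.
a, f, sigma = 1.0, 1.0, 1.0
dt, n_steps, burn = 0.01, 200_000, 5_000
rng = np.random.default_rng(7)

noise = sigma * np.sqrt(dt) * rng.standard_normal(n_steps)
x = np.zeros(n_steps)
for k in range(1, n_steps):
    x[k] = x[k - 1] + (-a * x[k - 1] + f) * dt + noise[k]
x = x[burn:]                                     # discard the transient

xc = x - x.mean()
max_lag = 500                                    # resolve lags up to 5/a time units
acf = np.array([np.mean(xc[:xc.size - k] * xc[k:])
                for k in range(max_lag)]) / np.mean(xc**2)
tau = np.sum(acf) * dt                           # Riemann sum approximating (4.8)
```

The estimated tau should be close to 1/a = 1, up to sampling error from the finite trajectory.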

4.1.3 Fokker-Planck Equation

The Fokker-Planck equation associated with the scalar SDE (4.1) (see (1.22)) is

∂p/∂t = −∂/∂x[(−ax + f) p] + (σ²/2) ∂²p/∂x².    (4.10)

The stationary solution of (4.10) satisfies ∂p/∂t = 0; plugging this into (4.10) and integrating the resulting equation once yields (x − f/a) p + (σ²/2a) ∂p/∂x = c₁. The boundary conditions p → 0 and ∂p/∂x → 0 as x → ±∞ imply that c₁ = 0. Thus, after another integration, the stationary solution of (4.10) is Gaussian,

p(x) = N₀ exp(−(1/2) (x − f/a)²/(σ²/2a)).    (4.11)

It can be seen from (4.11) that the equilibrium mean x̄_∞ = f/a and the equilibrium variance Var(x)_∞ = σ²/2a are consistent with the results from the moment equations (4.4) and (4.7). On the other hand, solving the time evolution of the PDF directly from the Fokker-Planck equation is more difficult, while using the moment equations together with the maximum entropy principle significantly facilitates the calculation.

4.2 Non-Gaussian SDE: Linear Model with Multiplicative Noise

Consider the following linear scalar SDE, in which the coefficient of the Wiener process W_b is a function of the state variable, known as multiplicative noise,

dx(t) = [−a x(t) + f] dt + b x(t) dW_b(t) + c dW_c(t).    (4.12)

In (4.12), W_b and W_c are independent scalar Wiener processes. Despite the linear dynamics, part of the external forcing of the system is given by the time series b x(t) dW_b(t), the distribution of which is non-Gaussian. The multiplicative noise triggers the non-Gaussian distribution of the state variable x.

4.2.1 Solving the Exact Path-Wise Solution

Due to the linear nature of the system, the exact path-wise solution of (4.12) can be obtained analytically following a two-step procedure similar to solving a non-homogeneous ODE. In deriving the path-wise solution, all the parameters are allowed to be deterministic functions of time: a := a(t), f := f(t), b := b(t) and c := c(t).

1. Solve the general solution of the homogeneous equation associated with (4.12),

dx(t) = −a x(t) dt + b x(t) dW_b(t).    (4.13)

2. Find the special solution of the non-homogeneous equation.


In Step 1, the correct way to solve the homogeneous equation (4.13) is as follows. Set f(x) = ln x. Applying Itô's formula yields

d ln x(t) = dx/x − (1/2) (dx)²/x² = −(a + (1/2) b²) dt + b dW_b(t),    (4.14)

where the term (1/2)b² comes from the second-order expansion in Itô's formula. Denote by x(t₀) the initial condition. The solution of (4.14) is then given by

x_h(t) = x(t₀) exp(−∫_{t₀}^t (a(s) + (1/2) b²(s)) ds + ∫_{t₀}^t b(s) dW_b(s)) := x(t₀) Φ(t).    (4.15)

Next, Step 2 aims at finding the solution of the original non-homogeneous equation (4.12). Similar to solving a linear ODE, applying the method of variation of parameters together with Itô's formula yields

d(x Φ⁻¹) = (dx) Φ⁻¹ + x d(Φ⁻¹) + (dx)(d(Φ⁻¹)),    (4.16)

where

d(Φ⁻¹) = −Φ⁻² dΦ + Φ⁻³ (dΦ)² = [(a + b²) dt − b dW_b(t)] Φ⁻¹.    (4.17)

In light of (4.12) and (4.17), the three terms on the right-hand side of (4.16) become

(dx) Φ⁻¹ = [−ax dt + bx dW_b(t)] Φ⁻¹ + [f dt + c dW_c(t)] Φ⁻¹,
x d(Φ⁻¹) = [ax dt + b²x dt − bx dW_b(t)] Φ⁻¹,  and  (dx)(d(Φ⁻¹)) = −b²x dt Φ⁻¹.

Plugging these results back into (4.16) yields

d(x Φ⁻¹) = [f dt + c dW_c(t)] Φ⁻¹,

which leads to the solution of the non-homogeneous equation:

x_s(t) = Φ(t) (∫_{t₀}^t f(s)/Φ(s) ds + ∫_{t₀}^t c(s)/Φ(s) dW_c(s)).    (4.18)

The final solution is x(t) = x_h(t) + x_s(t), combining (4.15) and (4.18).
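A quick sanity check of the homogeneous solution (4.15) with constant coefficients: since E[exp(b W_t − b²t/2)] = 1, the ensemble mean of x_h(t) must reduce to x(t₀) e^{−at}, in agreement with the mean equation d⟨x⟩ = −a⟨x⟩ dt for f = c = 0. A minimal sketch with illustrative parameter values:

```python
import numpy as np

# Monte Carlo check of x_h(t) = x0 * exp(-(a + b^2/2) t + b W_b(t)), eq. (4.15).
a, b, x0, t = 1.0, 0.5, 2.0, 1.0
rng = np.random.default_rng(3)

W_t = np.sqrt(t) * rng.standard_normal(500_000)      # W_b(t) ~ N(0, t), sampled directly
x_h = x0 * np.exp(-(a + 0.5 * b**2) * t + b * W_t)

mean_mc = x_h.mean()
mean_exact = x0 * np.exp(-a * t)                     # multiplicative noise has zero mean effect
```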

4.2.2 Equilibrium Distribution

Now assume the model parameters a, f, b and c are all constants. The Fokker-Planck equation associated with (4.12) is given by

∂_t p(x, t) = −∂_x[(−ax + f) p(x, t)] + (1/2) ∂_x²[(b²x² + c²) p(x, t)].    (4.19)

The stationary solution of (4.19) gives the equilibrium distribution. Dropping the left-hand side of (4.19) and integrating once, the resulting equation reads

∂_x[(b²x² + c²) p_eq(x)] = 2(−ax + f) p_eq(x),    (4.20)

which can easily be integrated again to yield

p_eq(x) = (N₀/A(x)) exp(∫_{x₀}^x B(x′)/A(x′) dx′),

where N₀ is the normalization constant while

A(x) = b²x² + c²,  B(x) = 2(−ax + f).

Depending on the choice of the parameters, there are at least the following four cases:

(1) a > 0, b ≠ 0, c ≠ 0. The model has non-zero additive and multiplicative noises. The associated PDF can be non-Gaussian with fat algebraic tails:

p_eq(x) = N₀/(b²x² + c²)^{1+a/b²} exp((2f/(|b||c|)) arctan(|b|x/|c|)).    (4.21)

(2) a > 0, b = 0, c ≠ 0. The model has only additive noise and therefore the PDF is Gaussian:

p_eq(x) = N₀ exp(−(a/c²)(x − f/a)²).    (4.22)

(3) a > 0, b = 0, c = 0. The model has no noise, and the system is therefore a deterministic ODE with solution:

p_eq(x) = δ(x − f/a).    (4.23)

(4) a > 0, b ≠ 0, c = 0. The equilibrium distribution does not exist at x = 0:

p_eq(x) = (N₀/x^{2(1+a/b²)}) exp(−2f/(b²x)).    (4.24)

Figure 4.2 shows the model simulations with different parameters. The non-Gaussian features with skewed PDFs and fat tails are illustrated in Columns (a) and (b), respectively. Column (c) shows the case with a Gaussian equilibrium PDF.
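The closed form (4.21) can be cross-checked numerically: up to normalization, p_eq must equal (1/A(x)) exp(∫ B/A dx) by construction from (4.20). A minimal sketch with illustrative parameters:

```python
import numpy as np

# Cross-check of (4.21) against direct quadrature of p_eq ∝ (1/A) exp(∫ B/A dx),
# with A = b^2 x^2 + c^2 and B = 2(-a x + f).
a, b, c, f = 5.0, 1.0, 0.3, 0.9
x = np.linspace(-3.0, 3.0, 4001)
dx = x[1] - x[0]

A = b**2 * x**2 + c**2
B = 2.0 * (-a * x + f)
# cumulative trapezoid rule for the exponent integral, starting at x[0]
integral = np.concatenate(([0.0],
    np.cumsum(0.5 * (B[1:] / A[1:] + B[:-1] / A[:-1]) * np.diff(x))))
p_quad = np.exp(integral) / A

# closed-form expression (4.21), up to the normalization constant
p_formula = (b**2 * x**2 + c**2) ** (-(1 + a / b**2)) \
    * np.exp(2 * f / (abs(b) * abs(c)) * np.arctan(abs(b) * x / abs(c)))

p_quad /= np.sum(p_quad) * dx        # normalize both on the grid
p_formula /= np.sum(p_formula) * dx
max_diff = np.max(np.abs(p_quad - p_formula))
```

The two normalized densities agree to high accuracy, confirming that the arctan/power-law structure of (4.21) follows directly from (4.20).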

Fig. 4.2 Simulations of Case (1) and Case (2) for the linear model (4.12). The PDFs are computed based on the analytic formulae in (4.21) and (4.22), and the probabilities are shown on a logarithmic scale. The three columns use (a = 5, b = 1, c = 0.3, f = 0.9), (a = 5, b = 1, c = 0.5, f = 0.0), and (a = 5, b = 0, c = 0.5, f = 0.6), respectively

4.2.3 Time Evolution of the Moments

Assume that b and c are constants and the forcing f is a deterministic, bounded function of time. Further assume the damping is time-periodic, a(t) = a₀ + â(t), with period T and a finite positive mean a₀, 0 < a₀ ≡ (1/T) ∫_{t₀}^{t₀+T} a(s) ds, which guarantees the mean stability. It will be shown in the following that the moments of (4.12) exist up to order n provided that c ≠ 0 and

−n a₀ + (1/2) n(n − 1) b² < 0.    (4.25)

To write down the equation of the n-th moment, apply Itô's formula to xⁿ,

dxⁿ = n x^{n−1} dx + (n(n−1)/2) x^{n−2} (dx)².    (4.26)

Replacing dx on the right-hand side of (4.26) by (4.12) yields

dxⁿ = n x^{n−1} [(−a x(t) + f) dt + b x(t) dW_b(t) + c dW_c] + (n(n−1)/2) x^{n−2} (b²x² + c²) dt
    = [(−n a(t) + (n(n−1)/2) b²) xⁿ + n f(t) x^{n−1} + (n(n−1)/2) c² x^{n−2}] dt
      + b n xⁿ dW_b + n x^{n−1} c dW_c.    (4.27)

Taking the ensemble average of (4.27) leads to

d⟨xⁿ(t)⟩ = [(−n a(t) + (n(n−1)/2) b²) ⟨xⁿ(t)⟩ + n f(t) ⟨x^{n−1}(t)⟩ + (n(n−1)/2) c² ⟨x^{n−2}(t)⟩] dt,

which after integration yields

⟨xⁿ(t)⟩ = ⟨x₀ⁿ⟩ e^{J_n(t₀,t)} + ∫_{t₀}^t e^{J_n(s,t)} [n f(s) ⟨x^{n−1}(s)⟩ + (1/2) n(n−1) c² ⟨x^{n−2}(s)⟩] ds,    (4.28)

where

J_n(s, t) = −n ∫_s^t a(t′) dt′ + (1/2) n(n−1) b² (t − s).    (4.29)

In the above formulae, ⟨x(t)^{n−k}⟩ = 1 for n = k, k = 1, 2, and ⟨x(t)^{n−k}⟩ = 0 for n − k < 0, k = 1, 2. It is seen that (4.25) provides the stability condition that forces the exponential functions to have a negative power. If the parameter b is fixed, then the stability of the higher-order moments requires a stronger damping a₀.
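For constant coefficients, the first two moment equations obtained from (4.27) with n = 1, 2 close on themselves and can be integrated with any standard ODE solver. A minimal forward-Euler sketch with illustrative parameters; the equilibrium values follow from setting the right-hand sides to zero:

```python
import numpy as np

# Moment equations of (4.12) with constant coefficients, from (4.27):
#   d<x>/dt   = -a <x> + f
#   d<x^2>/dt = (-2a + b^2) <x^2> + 2 f <x> + c^2
# Stability for n = 2 requires -2a + b^2 < 0, consistent with (4.25).
a, b, c, f = 5.0, 1.0, 0.5, 1.0
dt, n_steps = 1e-3, 20_000
m1, m2 = 0.0, 0.0                        # initial condition x(0) = 0 (deterministic)
for _ in range(n_steps):
    m1, m2 = (m1 + (-a * m1 + f) * dt,
              m2 + ((-2 * a + b**2) * m2 + 2 * f * m1 + c**2) * dt)

m1_eq = f / a                                     # equilibrium mean
m2_eq = (2 * f * m1_eq + c**2) / (2 * a - b**2)   # equilibrium second raw moment
```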

4.3 Non-Gaussian SDE: Scalar Model with Cubic Nonlinearity and Correlated Additive and Multiplicative (CAM) Noise

Consider another scalar non-Gaussian SDE with a slightly increased complexity,

dx(t) = [F + a x(t) + b x²(t) − c x³(t)] dt + (A − B x(t)) dW_C(t) + σ dW_A(t),    (4.30)

where the coefficient c in front of the cubic damping is set to be c > 0 as a necessary condition for preventing a finite-time blowup of the solution. The nonlinear model in (4.30) is a canonical model for low-frequency atmospheric variability and was derived based on stochastic mode reduction strategies [178]. It was applied in a regression strategy for data from a prototype climate model [94] to build one-dimensional stochastic models for low-frequency patterns such as the North Atlantic Oscillation (NAO). Note that the model in (4.30) has both correlated additive and multiplicative (CAM) noise (A − Bx) dW_C as well as an extra uncorrelated additive noise σ dW_A. The nonlinearity interacting with the noise allows rich dynamical features in the model, such as strongly non-Gaussian PDFs and multiple attractors. Unlike the model (4.12) with linear dynamics, there is no explicit path-wise solution for such a nonlinear system.

4.3.1 Equilibrium PDF

The Fokker-Planck equation associated with the model (4.30) is given by

∂p/∂t = −∂/∂x[(F + ax + bx² − cx³) p] + (1/2) ∂²/∂x²[((A − Bx)² + σ²) p].

The equilibrium PDF can be solved by setting ∂p/∂t = 0 and using the vanishing boundary conditions of the probability and its derivatives at infinity. Depending on the CAM noise, the equilibrium PDF has different forms.

• When A ≠ 0 and B ≠ 0, the equilibrium PDF p_eq(x) is given by

p_eq(x) = [N₀/((Bx − A)² + σ²)^{a₁}] exp(d arctan((Bx − A)/σ)) · exp(−(c₁x² + b₁x)/B⁴),    (4.31)

where the first two factors carry the fat algebraic tails while the last factor is the Gaussian part, N₀ is the normalization constant, and the associated constants are given by

a₁ = 1 − (−3A²c + aB² + 2AbB + cσ²)/B⁴,  b₁ = 2bB² − 4cAB,  c₁ = cB²,
d₁ = (A²bB − A³c + AaB² + B³F)/B⁴,  d₂ = (6cA − 2bB)/B⁴,  d = 2d₁/σ + d₂σ.

• When A = B = 0, the equilibrium PDF p_eq(x) is given by

p_eq(x) = N₀ exp((2/σ²)(Fx + (a/2)x² + (b/3)x³ − (c/4)x⁴)),    (4.32)

where N₀ is again the normalization constant.

Figures 4.3 and 4.4 show the equilibrium PDFs of the cubic model (4.30) in different dynamical regimes (namely, with different sets of parameters) with A ≠ 0, B ≠ 0 and with A = B = 0, respectively. Various non-Gaussian features are exhibited. In particular, the cubic nonlinearity allows a bimodal distribution, which is not seen in the linear model with multiplicative noise (4.12). The bimodal distribution results from the balance between the two stable equilibria associated with the deterministic part of (4.30) after the random noise is injected into the system. The cubic nonlinearity is essential in triggering the bimodality of the PDF. It is also interesting to see that fat tails do not appear when A = B = 0. This is because, according to (4.32), the highest-order term in the exponential function is −cx⁴ (with c > 0), which means the tails of the PDF decay faster than those of a Gaussian distribution, and fat tails are thus suppressed.

Fig. 4.3 Equilibrium PDFs of the cubic model (4.30) in different dynamical regimes (namely, with different sets of parameters) for A ≠ 0 and B ≠ 0; the panels are (a) the nearly Gaussian regime, (b) the highly skewed regime, (c) the fat-tailed regime, and (d) the bimodal regime. These PDFs are computed using the formula in (4.31)

Fig. 4.4 Equilibrium PDFs of the cubic model (4.30) in different dynamical regimes (namely, with different sets of parameters) for A = 0 and B = 0; the panels are (a) the nearly Gaussian regime, (b) the highly skewed regime, (c) the sub-Gaussian regime, and (d) the bimodal regime. These PDFs are computed using the formula in (4.32)

4.3.2 Time Evolution of the Moments

The derivation of the moment equations associated with the cubic model (4.30) can follow a procedure similar to that described in Sect. 4.2.3. The moment equation of the mean is obtained by taking the ensemble average of the state variable x, and that of the second-order moment can be reached with the help of Itô's formula:

d⟨x⟩ = (F + a⟨x⟩ + b⟨x²⟩ − c⟨x³⟩) dt,
d⟨x²⟩ = ⟨2x dx + (dx)²⟩
      = [(A² + σ²) + (2F − 2AB)⟨x⟩ + (2a + B²)⟨x²⟩ + 2b⟨x³⟩ − 2c⟨x⁴⟩] dt.    (4.33)

It is seen that knowing the second- and third-order moments is a prerequisite for solving the moment equation of the mean. However, obtaining the information of the second/third moment requires the knowledge of the fourth/fifth-order moment. In general, for the cubic model (4.30), the time evolution of the n-th moment ⟨xⁿ⟩ depends on the higher-order moments ⟨x^{n+1}⟩ and ⟨x^{n+2}⟩. Thus, the system of the moment equations is never closed! This is an intrinsic difficulty for any nonlinear system, in which the nonlinearity automatically makes the order of the moments on the right-hand side of the moment equation higher than the one on the left-hand side.


Thus, approximations must be made to close the system of the moment equations. This is typically done by the so-called closure methods, which approximate the higher-order moments as functions of the low-order ones. Then the closed system consisting of the leading few moments is utilized to approximate the time evolution of the PDF. Such a procedure leads to a reduced-order system of the statistical moments, because the dimension of the moment equations is reduced from infinity to a small number. Therefore, it is often named the statistical reduced-order model. One simple and practically useful closure method is the quasi-Gaussian closure. It only includes the evolution of the mean and variance. The third and other odd-order central moments are assumed to be zero in these two equations, while the fourth and other even-order central moments are expressed via the variance in light of the relationship of the moments for a Gaussian distribution. Specifically, the identity relating the fourth-order central moment to the variance is given by

⟨x′⁴⟩ = 3(⟨x′²⟩)²,    (4.34)

which follows from the kurtosis ⟨x′⁴⟩/(⟨x′²⟩)² of a Gaussian distribution being 3. The focus below is to develop the quasi-Gaussian closure for the cubic model. The procedure can be generalized to many other nonlinear systems.
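The Gaussian moment identity (4.34) is straightforward to verify empirically; a minimal sketch with arbitrary (illustrative) Gaussian samples:

```python
import numpy as np

# Empirical check of (4.34): for a Gaussian random variable the fourth central
# moment equals 3 times the variance squared, i.e., the kurtosis is 3.
rng = np.random.default_rng(11)
z = 1.7 + 0.8 * rng.standard_normal(2_000_000)   # mean and std chosen arbitrarily

zc = z - z.mean()
m2 = np.mean(zc**2)                              # variance
m4 = np.mean(zc**4)                              # fourth central moment
kurtosis = m4 / m2**2                            # should be close to 3
```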

4.3.3 Quasi-Gaussian Closure for the Moment Equations Associated with the Cubic Model

Let us start by applying the Reynolds decomposition x = ⟨x⟩ + x′ to the cubic model (4.30),

d⟨x⟩ + dx′ = [a(⟨x⟩ + x′) + b(⟨x⟩ + x′)² − c(⟨x⟩ + x′)³ + F] dt + [A − B(⟨x⟩ + x′)] dW_C + σ dW_A
           = [a(⟨x⟩ + x′) + b(⟨x⟩² + x′² + 2⟨x⟩x′) − c(⟨x⟩³ + 3⟨x⟩²x′)
              − c(3⟨x⟩x′² + x′³) + F] dt + [A − B⟨x⟩ − Bx′] dW_C + σ dW_A.    (4.35)

Taking the ensemble average of (4.35) yields

d⟨x⟩ = [a⟨x⟩ + b⟨x⟩² + b⟨x′²⟩ − c⟨x⟩³ − 3c⟨x⟩⟨x′²⟩ + F] dt,    (4.36)

where the term c⟨x′³⟩ has been dropped due to the quasi-Gaussian closure approximation, in which all the odd central moments, except the mean, are set to zero. To compute the variance, Itô's formula is utilized, which gives d⟨x′²⟩ = ⟨2x′ dx′⟩ + ⟨(dx′)²⟩. The equation of the fluctuation can be derived in light of (4.30) and (4.36),

dx′ = [ax′ + 2b⟨x⟩x′ − 3c⟨x⟩²x′ − cx′³ + bx′² − b⟨x′²⟩ − 3c⟨x⟩x′² + 3c⟨x⟩⟨x′²⟩] dt
      + [A − B⟨x⟩ − Bx′] dW_C + σ dW_A,

and thus

2x′ dx′ = 2[ax′² + 2b⟨x⟩x′² − 3c⟨x⟩²x′² − cx′⁴ + bx′³ − b⟨x′²⟩x′ − 3c⟨x⟩x′³ + 3c⟨x⟩⟨x′²⟩x′] dt
          + 2[Ax′ − B⟨x⟩x′ − Bx′²] dW_C + 2x′σ dW_A.

Taking the ensemble average yields

⟨2x′ dx′⟩ = 2(a + 2b⟨x⟩ − 3c⟨x⟩² − 3c⟨x′²⟩)⟨x′²⟩ dt,    (4.37)

where the quasi-Gaussian closure, in which ⟨x′³⟩ = 0 and the fourth-order central moment is replaced via the variance using (4.34), has been applied. On the other hand,

⟨(dx′)²⟩ = ⟨(A − B⟨x⟩ − Bx′)² + σ²⟩ dt = [A² + B²⟨x⟩² − 2AB⟨x⟩ + B²⟨x′²⟩ + σ²] dt.    (4.38)

Combining (4.37) and (4.38) yields

d⟨x′²⟩ = {[2(a + 2b⟨x⟩ − 3c⟨x⟩² − 3c⟨x′²⟩) + B²]⟨x′²⟩ + A² + B²⟨x⟩² − 2AB⟨x⟩ + σ²} dt.    (4.39)

Equations (4.36) and (4.39) thus provide the quasi-Gaussian approximation of the moment equations associated with the cubic nonlinear scalar model (4.30).

Figure 4.5 compares the time evolution of the quasi-Gaussian closure model (4.36) and (4.39) (red) with the truth (blue) in four different dynamical regimes. It is seen from Columns (a)–(c) that the quasi-Gaussian closure model captures the time evolution of both the mean and the variance, not only in the nearly Gaussian regime but also in the regimes with strong skewness and fat tails. In contrast, if the dependence on the third- and fourth-order moments is completely omitted in the equations of the mean and variance (black dashed curve; the so-called bare truncation model), then a significant error appears. Such a comparison indicates the important role played by the closure. On the other hand, caution should be taken when applying the quasi-Gaussian closure model to the case where the true dynamics have a bimodal distribution (see Column (d)). Depending on the initial condition, the solution of the quasi-Gaussian closure model converges to two different values (see the red and the green curves), which correspond to the two peaks of the distribution, whereas the actual mean of the distribution lies between them. It is also worthwhile to point out that the solution of the bare truncation model blows up very quickly, indicating the importance of including the contribution from the higher-order moments in driving the mean and the variance dynamics.


Fig. 4.5 Using the quasi-Gaussian (qG) closure model (4.36) and (4.39) to approximate the moments of the cubic model (4.30). Top row: the time evolution of the PDF from a Monte Carlo simulation of the cubic model (4.30) using 1500 samples. Middle and bottom rows: the time evolution of the mean and variance. Here, the blue curve shows the true solutions (up to the sampling error) from the Monte Carlo simulation; the red curve shows the results from the quasi-Gaussian closure model (4.36) and (4.39) and the black dashed curve shows those by completely omitting the third and fourth order moments (the so-called bare truncation model). In Column (d), the green curve shows another quasi-Gaussian closure solution starting from a different initial condition. The regimes here are the same as those in Fig. 4.3
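A minimal sketch of this type of comparison in a weakly non-Gaussian regime (all parameter values here are illustrative, not the regimes of Fig. 4.5): integrate the closure equations (4.36) and (4.39) by forward Euler and compare their statistical steady state with an Euler–Maruyama ensemble of the cubic model itself.

```python
import numpy as np

# Quasi-Gaussian closure (4.36), (4.39) vs. Monte Carlo for the cubic model (4.30).
F, a, b, c = 0.0, -1.0, 0.0, 1.0       # illustrative, weakly non-Gaussian regime
A, B, sigma = 0.0, 0.0, 0.3
dt = 1e-3

# Closure ODEs for the mean m = <x> and variance V = <x'^2>
m, V = 0.5, 0.2                        # arbitrary initial statistics
for _ in range(50_000):
    dm = F + a * m + b * (m**2 + V) - c * (m**3 + 3 * m * V)
    dV = (2 * (a + 2 * b * m - 3 * c * m**2 - 3 * c * V) + B**2) * V \
         + A**2 + B**2 * m**2 - 2 * A * B * m + sigma**2
    m, V = m + dm * dt, V + dV * dt

# Euler–Maruyama ensemble of the cubic SDE itself, run to statistical equilibrium
rng = np.random.default_rng(5)
x = np.full(20_000, 0.5)
for _ in range(5_000):
    dW_C = np.sqrt(dt) * rng.standard_normal(x.size)
    dW_A = np.sqrt(dt) * rng.standard_normal(x.size)
    x += (F + a * x + b * x**2 - c * x**3) * dt + (A - B * x) * dW_C + sigma * dW_A
```

In this weakly non-Gaussian regime the closure steady state (m, V) matches the Monte Carlo mean and variance closely; in a bimodal regime it would not.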

4.4 Nonlinear SDEs with Exactly Solvable Conditional Moments

Nonlinear SDE systems (i.e., multi-dimensional SDEs) provide rich dynamical and statistical features that advance a more realistic characterization of nature beyond scalar SDEs. In particular, complex nonlinear interactions between different state variables can be studied by exploiting these SDE systems. Another reason for introducing SDE systems follows the argument in Sect. 1.4.1, where the additional degrees of freedom represent the hidden variables that trigger the observed intermittency and provide appropriate nonlinear feedback from the unresolved states to the resolved ones. However, different from the simple scalar SDEs, it is very challenging to find closed analytic formulae for the marginal or joint non-Gaussian distributions of most nonlinear SDE systems, even the equilibrium ones. Nevertheless, there are many complex systems in which the conditional statistics are analytically solvable. Such a desirable feature significantly facilitates mathematical analysis and numerical simulations. The general framework of such nonlinear SDE systems and their applications to uncertainty quantification, data assimilation, and prediction will be presented in Chap. 8. The focus below is to provide insights on understanding and deriving the exactly solvable conditional moments of a family of nonlinear SDE systems, exploiting the tools that have been developed so far.

4.4.1 The Coupled Nonlinear SDE System

Consider the following coupled SDEs with state variables (u, γ) [34]:

du = [r(γ, t) u + l(γ, t)] dt + σ_u(γ, t) dW_u(t),
dγ = F(γ, t) dt + σ_γ(γ, t) dW_γ(t),    (4.40)

where r (γ , t), l(γ , t), σu (γ , t), F(γ , t) and σγ (γ , t) are all nonlinear functions of γ and time t, while Wu (t) and Wγ (t) are independent Wiener processes. The probability of the state variables vanishes at infinity. For simplicity, the state variables u and γ here are treated as scalars, but the framework also applies to vector fields. The system (4.40) is overall nonlinear due to both the above nonlinear functions of γ and the nonlinear interaction between r (γ , t) and u. Thus, the marginal distributions p(u), p(γ ) as well as the joint distribution p(u, γ ) can be highly non-Gaussian. The model in (4.40) mimics many turbulent features in nature. One example is to generate intermittent time series of u. This can be achieved by allowing the function r to alternate between positive and negative values, as the mechanism presented in Sect. 1.4.1. The model in (4.40) also provides a systematic framework to understand the role of the hidden process γ in triggering the observed intermittency. The perfect dynamics of γ can be highly complicated, while an approximate process of γ might be sufficient to provide accurate statistical feedback to u. Then quantifying the gap in the response of u with different processes of γ becomes practically essential for building parsimonious models, known as stochastic parameterization (see Sect. 7.5). Another example of (4.40) is the turbulent diffusion of passive tracers, where u is the large-scale zonal mean jet while γ represents one of the Fourier modes of the passive tracer and the small-scale shear flow [176].
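As a hedged illustration of how (4.40) can generate intermittency, take γ to be an Ornstein–Uhlenbeck process and let r(γ) = −1 + γ, so that u is intermittently unstable whenever γ exceeds 1. All functional choices and parameter values below are hypothetical examples, not taken from the text:

```python
import numpy as np

# One illustrative instance of the coupled system (4.40):
#   du     = (-1 + gamma) u dt + 0.5 dW_u        (r flips sign with gamma)
#   dgamma = -0.5 gamma dt + dW_gamma            (OU hidden process, var = 1)
dt, n_steps = 1e-3, 200_000
rng = np.random.default_rng(9)
dWu = np.sqrt(dt) * rng.standard_normal(n_steps)
dWg = np.sqrt(dt) * rng.standard_normal(n_steps)

u = np.zeros(n_steps)
g = np.zeros(n_steps)
for k in range(1, n_steps):
    u[k] = u[k - 1] + (-1.0 + g[k - 1]) * u[k - 1] * dt + 0.5 * dWu[k]
    g[k] = g[k - 1] - 0.5 * g[k - 1] * dt + dWg[k]

# Intermittent bursts of u show up as a kurtosis well above the Gaussian value 3.
kurt = np.mean((u - u.mean())**4) / np.var(u)**2
```

The bursts of u occur precisely during excursions of γ above 1, mimicking the mechanism of Sect. 1.4.1.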

4.4.2 Derivation of the Exact Solvable Conditional Moments

Denote by p the PDF of (4.40). The associated Fokker-Planck equation is given by

$$
\frac{\partial p}{\partial t} = -\frac{\partial}{\partial \gamma}\big(F(\gamma,t)\,p\big) + \frac{1}{2}\frac{\partial^2}{\partial \gamma^2}\big(\sigma_\gamma^2(\gamma,t)\,p\big) - \frac{\partial}{\partial u}\big((r(\gamma,t)\,u + l(\gamma,t))\,p\big) + \frac{1}{2}\frac{\partial^2}{\partial u^2}\big(\sigma_u^2(\gamma,t)\,p\big).
\qquad (4.41)
$$

In light of (1.6), the joint PDF p(γ, u, t) can be written as


$$
p(\gamma, u, t) = \pi(\gamma, t)\, p(u|\gamma, t),
\qquad (4.42)
$$

where π(γ, t) = ∫ p(γ, u, t) du is the marginal distribution of γ at time t. Thus, integrating (4.41) with respect to u yields an equation for π(γ, t),

$$
\frac{\partial \pi}{\partial t} = -\frac{\partial}{\partial \gamma}\big(F(\gamma,t)\,\pi\big) + \frac{1}{2}\frac{\partial^2}{\partial \gamma^2}\big(\sigma_\gamma^2(\gamma,t)\,\pi\big) =: \mathcal{L}_{FP}(\pi),
\qquad (4.43)
$$

where the boundary terms at u = ±∞ have been dropped since p(u = ±∞) = 0. The n-th conditional moment of p(γ, u, t), conditioned on γ (and t), is defined by

$$
M_n(\gamma, t) = \int u^n\, p(\gamma, u, t)\, du = \pi(\gamma, t) \int u^n\, p(u|\gamma, t)\, du.
\qquad (4.44)
$$

Multiplying (4.41) by u^n and integrating with respect to u yields

$$
\frac{\partial M_n(\gamma,t)}{\partial t} = -\frac{\partial}{\partial \gamma}\big(F(\gamma,t)\,M_n\big) + \frac{1}{2}\frac{\partial^2}{\partial \gamma^2}\big(\sigma_\gamma^2(\gamma,t)\,M_n\big)
\underbrace{-\, r(\gamma,t)\int u^n \frac{\partial (u p)}{\partial u}\, du - l(\gamma,t)\int u^n \frac{\partial p}{\partial u}\, du + \frac{1}{2}\int u^n \frac{\partial^2}{\partial u^2}\big(\sigma_u^2(\gamma,t)\, p\big)\, du}_{\{A\}}.
\qquad (4.45)
$$

In light of integration by parts, {A} equals

$$
\{A\} = n\, r(\gamma, t)\, M_n(\gamma, t) + n\, l(\gamma, t)\, M_{n-1}(\gamma, t) + \frac{1}{2} n(n-1)\, \sigma_u^2(\gamma, t)\, M_{n-2}(\gamma, t).
$$

The conditional moment M_n(γ, t) then satisfies the following equation [34]:

$$
\frac{\partial M_n(\gamma, t)}{\partial t} = \mathcal{L}_{FP} M_n(\gamma, t) + n\, r(\gamma, t)\, M_n(\gamma, t) + n\, l(\gamma, t)\, M_{n-1}(\gamma, t) + \frac{1}{2} n(n-1)\, \sigma_u^2(\gamma, t)\, M_{n-2}(\gamma, t),
\qquad (4.46)
$$

where M_0 = π(γ, t) according to the definition (4.44), and M_{−1} := 0 and M_{−2} := 0. The operator 𝓛_FP M_n(γ, t) is defined as

$$
\mathcal{L}_{FP} M_n(\gamma, t) := -\frac{\partial}{\partial \gamma}\big(F(\gamma, t)\, M_n(\gamma, t)\big) + \frac{1}{2}\frac{\partial^2}{\partial \gamma^2}\big(\sigma_\gamma^2(\gamma, t)\, M_n(\gamma, t)\big).
$$

The moment equation (4.46) can be solved recursively as a function of the order n. Figure 4.6 shows a schematic comparison of the true joint PDF and the approximate one at a fixed time instant. The latter exploits the conditional moments M_0, M_1, and M_2 to build a conditional Gaussian approximation for each γ via the maximum entropy principle. Although the conditional distribution p(u|γ) for each fixed γ is Gaussian, the joint distribution as well as the two marginal distributions p(γ) and p(u) remain highly non-Gaussian. The marginal distribution of γ is even exact.

Fig. 4.6 A schematic comparison of the actual joint PDF and the approximate one using the conditional moment approximation up to M_2 at a fixed time instant
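To make the construction behind Fig. 4.6 concrete, the sketch below forms the conditional-Gaussian approximation p(u, γ) ≈ π(γ) 𝒩(u; M₁/M₀, M₂/M₀ − (M₁/M₀)²). As an illustrative stand-in for actually solving (4.46), the conditional moments are estimated by binning Monte Carlo samples in γ; the hypothetical joint PDF used here (u = γ² plus Gaussian noise) is an assumption for demonstration only.

```python
import numpy as np

# Conditional-Gaussian approximation of a non-Gaussian joint PDF:
# within each gamma-bin, estimate the conditional mean M1/M0 and the
# conditional variance M2/M0 - (M1/M0)^2 from samples.
rng = np.random.default_rng(1)
gam = rng.normal(0.0, 1.0, size=200_000)
u = gam**2 + 0.5 * rng.normal(size=gam.size)  # strongly non-Gaussian joint PDF

edges = np.linspace(-3.0, 3.0, 31)
idx = np.digitize(gam, edges)
cond_mean = np.full(len(edges) - 1, np.nan)
cond_var = np.full(len(edges) - 1, np.nan)
for b in range(1, len(edges)):
    sel = idx == b
    if sel.sum() > 10:
        cond_mean[b - 1] = u[sel].mean()  # M1 / M0 in this bin
        cond_var[b - 1] = u[sel].var()    # M2 / M0 - (M1 / M0)^2 in this bin

# Each gamma-slice is approximated by N(cond_mean, cond_var); mixing the
# slices with the (exact) marginal of gamma yields a non-Gaussian joint PDF.
centers = 0.5 * (edges[:-1] + edges[1:])
```

Even though every slice is Gaussian, the γ-dependent conditional mean and variance make the resulting mixture, and hence the marginal of u, strongly non-Gaussian, exactly as in the schematic.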

5 Data Assimilation

5.1 Introduction

Data assimilation seeks to optimally integrate different sources of information to improve the state estimation of a complex system [144, 155, 159, 183]. Typically, the output from a numerical model is combined with observations to reduce the bias and the uncertainty in the estimated states, where the observations are often noisy and sparse while the model usually contains model errors. In predicting chaotic or turbulent signals (see Sect. 1.5), such as in numerical weather forecasting, data assimilation plays a vital role in improving the state estimation at the initialization stage, which facilitates extending the range of skillful prediction. As uncertainty quantification is essential in estimating the state of turbulent signals, the result of data assimilation is naturally represented by the PDF of the state variables. Thus, data assimilation can be formulated in the Bayesian context. Denote by u ∈ ℂ^n a collection of the state variables and by v ∈ ℂ^l the noisy observation, given by a function of a subset of u. The general formula of data assimilation reads

$$
p(u|v) \sim p(u)\, p(v|u),
\qquad (5.1)
$$

where p(u) comes solely from the model output and is known as the prior distribution, while p(v|u) is the likelihood function representing the probability of the specific observational value given the model state. Once the information from the model is combined with that from the observation, the so-called posterior distribution p(u|v) provides the optimal estimate of the model state conditioned on the given observations. Note that the notation '∼' in (5.1) stands for 'proportional to', since the exact Bayesian formula for the conditional distribution (1.6) requires solving for the normalization factor p(v) = ∫ p(u) p(v|u) du on the right-hand side of (5.1).

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_5


The mathematical formula of data assimilation in (5.1) is straightforward. However, there are several fundamental difficulties in directly applying (5.1) to practical problems, where the dimension of the state variables is often quite large. This leads to a computational issue in solving for p(u), which is given by the solution of the Fokker-Planck equation. Another challenge is to compute the normalization factor p(v). Except in special cases where an analytic formula is available, computing such a high-dimensional integral numerically suffers from the curse of dimensionality. Thus, suitable approximations must be adopted to find the solution of data assimilation, as computational efficiency is one of the most important issues in real-time forecasting. Before presenting various practical strategies, the classical Kalman filter is introduced. Analytic formulae are available for the exact posterior distribution of the Kalman filter, which provides an intuitive understanding of data assimilation.

5.2 Kalman Filter

The Kalman filter is a simple yet powerful data assimilation algorithm [143]. It requires both the underlying dynamics and the observation operator to be linear. In addition, both the system and the observational noises are assumed to be Gaussian. These conditions imply that the prior distribution, the likelihood function, and the posterior distribution are all Gaussian, allowing closed analytic formulae for state estimation and uncertainty quantification. The Kalman filter is the optimal linear estimator in the minimum mean-square-error sense. The Kalman filter has two steps. The first step involves a statistical prediction using the given dynamical model. The resulting prior distribution is then corrected based on the statistical input of the noisy observation in the second step, leading to the posterior distribution. These two steps are known as 'forecast' (or prediction) and 'analysis' (filtering or correction).

5.2.1 One-Dimensional Kalman Filter: Basic Idea of Data Assimilation

Denote by u_m ∈ ℂ a complex scalar random variable. Consider a simple situation with the following linear scalar difference equation:

$$
u_{m+1} = A u_m + F_m + \sigma_m,
\qquad (5.2)
$$

where the subscript m indicates that the state variable u_m, the deterministic forcing F_m, and the stochastic noise σ_m are evaluated at the time instant t_m. Here, the forward operator A is a constant throughout time, F_m is a known constant for each fixed m, and σ_m is a complex Gaussian noise with σ_m = (σ_m^R + iσ_m^I)/√2, where both σ_m^R and σ_m^I are real. The random variable σ_m has zero mean and variance r = ⟨σ_m σ_m^*⟩ = ½(⟨σ_m^R (σ_m^R)^*⟩ + ⟨σ_m^I (σ_m^I)^*⟩), where ⟨·⟩ is the ensemble average. On the other hand, the noisy observation at t_{m+1} is given by

$$
v_{m+1} = g u_{m+1} + \sigma^o_{m+1},
\qquad (5.3)
$$

where g is a constant representing the linear observation operator and σ_m^o ∈ ℂ is another Gaussian noise with variance r^o = ⟨σ_m^o (σ_m^o)^*⟩. Now assume a perfect model scenario, which means the model (5.2) is utilized to generate the single realization of the true signal. It also serves as the model to compute the statistical output in the procedure of the Kalman filter, known as the forecast model. The Kalman filter can be regarded as a recursive update procedure in time. Starting from the posterior distribution at time t_m, which is denoted by u_{m|m}, the model output is given by running the forecast model from t_m to t_{m+1} using (5.2). The result is the prior distribution

$$
p(u_{m+1}) \sim \mathcal{N}(\bar u_{m+1|m},\, r_{m+1|m}).
\qquad (5.4)
$$

In (5.4), the prior mean ū_{m+1|m} and the prior variance r_{m+1|m} are obtained by applying the Reynolds decomposition (see Sect. 4.1.1) to (5.2),

$$
\bar u_{m+1|m} = A \bar u_{m|m} + F_m,
\qquad (5.5)
$$

$$
r_{m+1|m} = A\, r_{m|m}\, A^* + r,
\qquad (5.6)
$$

where ·^* is the conjugate transpose. To combine the prior information p(u_{m+1|m}) with the observation v_{m+1} at t_{m+1}, the Bayesian update (5.1) is utilized,

$$
p(u_{m+1}|v_{m+1}) \sim p(u_{m+1})\, p(v_{m+1}|u_{m+1}) = e^{-\frac{1}{2}J(u_{m+1})},
\qquad (5.7)
$$

where

$$
J(u) = \frac{(u - \bar u_{m+1|m})^*(u - \bar u_{m+1|m})}{r_{m+1|m}} + \frac{(v_{m+1} - g u)^*(v_{m+1} - g u)}{r^o}.
\qquad (5.8)
$$

Since the posterior distribution p(u_{m+1}|v_{m+1}) is Gaussian, the posterior mean ū_{m+1|m+1} equals the value that minimizes J(u) in (5.8), which is given by

$$
\bar u_{m+1|m+1} = (1 - K_{m+1}\, g)\,\bar u_{m+1|m} + K_{m+1}\, v_{m+1},
\qquad (5.9)
$$

where

$$
K_{m+1} = \frac{g\, r_{m+1|m}}{r^o + g^2\, r_{m+1|m}}
\qquad (5.10)
$$

is the so-called Kalman gain, with 0 ≤ K_{m+1}g ≤ 1 in this scalar case. Therefore, the posterior mean ū_{m+1|m+1} in (5.9) is a weighted sum of the prior mean ū_{m+1|m} and the scaled observation v_{m+1}/g. The latter, according to (5.3), is the noisy version of the truth, whose amplitude is neither amplified nor compressed. The two weights are completely determined by the ratio of the observational noise r^o and the model forecast variance r_{m+1|m}. In particular, the posterior mean weighs almost fully towards the prior forecast when K_{m+1}g ≈ 0 (i.e., r^o ≫ r_{m+1|m}) and towards the observation when K_{m+1}g ≈ 1 (i.e., r^o ≪ r_{m+1|m}). Based on the argument from information theory in Sect. 2.1.3, r^o and r_{m+1|m} represent the level of uncertainty in the observation and the model, respectively. Since the perfect model is utilized and the Gaussian observational noise is symmetric, the model forecast (starting from a perfect initial condition) and the observation both converge to the truth as the associated uncertainty decreases to zero. This means both the prior mean and the observation are unbiased estimators in the limit of vanishing uncertainty. Therefore, in the presence of uncertainty, the posterior mean should lean more towards the one with less uncertainty. In such a linear Gaussian situation, (5.9) implies that the weights in front of the prior mean and the observation should be proportional to the inverse of the uncertainty of the forecast and the observation, respectively. As a quick remark, (5.9) is sometimes written in the following form:

$$
\bar u_{m+1|m+1} = \bar u_{m+1|m} + K_{m+1}\,(v_{m+1} - g\,\bar u_{m+1|m}),
\qquad (5.11)
$$

where the first term on the right-hand side, ū_{m+1|m}, is the forecast mean, while the second term is the Kalman gain multiplied by the so-called innovation, or measurement pre-fit residual, v_{m+1} − g ū_{m+1|m}. The second term serves as the correction of the model forecast, which gives rise to the name of the two-stage procedure: 'prediction-correction'. Finally, the posterior variance r_{m+1|m+1} is given by

$$
r_{m+1|m+1} = (1 - K_{m+1}\, g)\, r_{m+1|m},
\qquad (5.12)
$$

which indicates that the uncertainty in the posterior distribution is always smaller than that in the prior one in the scalar case. Importantly, it can be shown that both the prior and the posterior variance asymptotically converge to constants. This also implies that the Kalman gain becomes a constant after a few assimilation cycles. Finally, it is helpful to notice the direct connection between the Kalman filter (5.10)–(5.12) and the formula of the general conditional distribution for Gaussian random variables (1.6). Figure 5.1 includes a schematic illustration of the filtering procedure, including the above Kalman filter.
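The scalar recursion (5.5)–(5.6) and (5.9)–(5.12) fits in a few lines of code. The following sketch, with hypothetical parameter values chosen for illustration, also demonstrates the remark above: the Kalman gain settles to a constant after a few assimilation cycles.

```python
import numpy as np

def scalar_kalman_step(u_post, r_post, v, A, F, r, g, r_o):
    """One prediction-correction cycle of the scalar Kalman filter,
    following Eqs. (5.5)-(5.6) and (5.9)-(5.12)."""
    # Forecast step: prior mean (5.5) and prior variance (5.6)
    u_prior = A * u_post + F
    r_prior = A * np.conj(A) * r_post + r
    # Analysis step: Kalman gain (5.10), posterior mean (5.9) / (5.11),
    # and posterior variance (5.12)
    K = g * r_prior / (r_o + g**2 * r_prior)
    u_new = u_prior + K * (v - g * u_prior)
    r_new = (1 - K * g) * r_prior
    return u_new, r_new, K

# Perfect-model experiment with hypothetical parameters.
rng = np.random.default_rng(2)
A, F, r, g, r_o = 0.8, 0.5, 0.2, 1.0, 0.3
truth, u_est, r_est = 0.0, 0.0, 1.0
gains = []
for m in range(50):
    truth = A * truth + F + rng.normal(0.0, np.sqrt(r))
    v = g * truth + rng.normal(0.0, np.sqrt(r_o))
    u_est, r_est, K = scalar_kalman_step(u_est, r_est, v, A, F, r, g, r_o)
    gains.append(K)
```

After the run, consecutive gains agree to numerical precision, and the asymptotic posterior variance sits below the observational noise level, reflecting the uncertainty reduction in (5.12).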

5.2.2 A Simple Example

The Kalman filter discussed above uses a scalar difference equation (5.2). Nevertheless, applying the above framework to the situation where the forecast model is an SDE is straightforward. Consider the following complex-valued linear Gaussian SDE:


Fig. 5.1 Schematic illustration of the two-step filtering procedure: forecast and analysis

$$
du = \big[(-\gamma + i\omega_0)\,u + f_0 + f_1 e^{i\omega_1 t}\big]\,dt + \sigma\, dW,
\qquad (5.13)
$$

where the two real-valued constants γ > 0 and ω_0 are the damping and the oscillation frequency, respectively. The model has a constant and a time-periodic deterministic forcing, f_0 and f_1 e^{iω_1 t}, as well as a stochastic forcing σ dW in the form of Gaussian white noise. The exact path-wise solution from t_m to t_{m+1} can be expressed explicitly,

$$
\begin{aligned}
u(t_{m+1}) = {} & u(t_m)\, e^{(-\gamma + i\omega_0)(t_{m+1} - t_m)}\\
& + \frac{f_0}{\gamma - i\omega_0}\Big(1 - e^{(-\gamma + i\omega_0)(t_{m+1} - t_m)}\Big) + \frac{f_1 e^{i\omega_1 t_{m+1}}}{\gamma + i(\omega_1 - \omega_0)}\Big(1 - e^{-(\gamma + i\omega_1 - i\omega_0)(t_{m+1} - t_m)}\Big)\\
& + \sigma \int_{t_m}^{t_{m+1}} e^{(-\gamma + i\omega_0)(t_{m+1} - s)}\, dW(s).
\end{aligned}
\qquad (5.14)
$$

Despite the complicated forms, the terms in the three rows on the right-hand side of (5.14) correspond to A_m u_m, F_m, and σ_m in (5.2), respectively, where A_m and F_m are both constants and σ_m is a Gaussian random noise. The formula in (5.14) leads to the forecast mean ū(t_{m+1}) and the forecast variance r(t_{m+1}) at t_{m+1} starting from t_m,

$$
\begin{aligned}
\bar u(t_{m+1}) &= \bar u(t_m)\, e^{(-\gamma + i\omega_0)(t_{m+1} - t_m)} + \frac{f_0}{\gamma - i\omega_0}\Big(1 - e^{(-\gamma + i\omega_0)(t_{m+1} - t_m)}\Big)\\
&\quad + \frac{f_1 e^{i\omega_1 t_{m+1}}}{\gamma + i(\omega_1 - \omega_0)}\Big(1 - e^{-(\gamma + i\omega_1 - i\omega_0)(t_{m+1} - t_m)}\Big),\\
r(t_{m+1}) &= r(t_m)\, e^{-2\gamma(t_{m+1} - t_m)} + \frac{\sigma^2}{2\gamma}\Big(1 - e^{-2\gamma(t_{m+1} - t_m)}\Big).
\end{aligned}
$$

As a simple illustration, consider the following parameters in (5.13):

$$
\gamma = 0.4, \quad \omega_0 = 1, \quad f_0 = 2, \quad f_1 = 0, \quad \omega_1 = 0, \quad \text{and} \quad \sigma = 1.
\qquad (5.15)
$$

The model with these parameters will be utilized to generate a true signal and will serve as the forecast model in the Kalman filter as well. The observation operator is g = 1. The observational time step and observational noise vary, and the associated results are shown in Fig. 5.2. In this scalar model case, the posterior estimate is always more accurate than the prior one regarding both the path-wise error in the mean estimate and the uncertainty reflected in the variance. Comparing Panel (a) with Panels (b) and (c), it is seen that the error increases with the observational noise or the observational time step. When the observational noise increases, the Kalman gain decreases, which means the filter solution leans more towards the forecast model. On the other hand, if the observational time step becomes large, then the uncertainty in the forecast grows. Consequently, the Kalman gain increases and the filter solution puts more weight on the observations.

Fig. 5.2 Filtering (5.13) with parameters (5.15). Different panels show the results with different observational time steps and observational noise levels. In each panel, the blue curve is the truth, the black dots are the observations, and the green and red curves are the prior and posterior mean time series computed from (5.5) and (5.9), respectively. The green and red shaded areas show the confidence interval represented by two standard deviations of the prior and posterior distributions (i.e., the square root of the prior and posterior variances (5.6) and (5.12)). The constant Kalman gain shown here is its long-term asymptotic value, which is reached after only a few assimilation cycles in this test example
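This experiment can be reproduced in spirit with a short script. The sketch below filters (5.13) with the parameters (5.15), using the exact discrete-time statistics implied by (5.14) for both the true signal and the forecast model; the observational noise level and time step here are hypothetical choices for illustration.

```python
import numpy as np

# Kalman filtering of the linear SDE (5.13) with parameters (5.15).
# The exact discretization (5.14) gives A, F, and the system noise
# variance r over one observational time step.
rng = np.random.default_rng(3)
gamma, omega0, f0, sigma = 0.4, 1.0, 2.0, 1.0
g, r_o, dt_obs, n_steps = 1.0, 0.25, 0.5, 2000

A = np.exp((-gamma + 1j * omega0) * dt_obs)
F = f0 / (gamma - 1j * omega0) * (1.0 - A)
r = sigma**2 / (2.0 * gamma) * (1.0 - np.exp(-2.0 * gamma * dt_obs))

u_true = 0.0 + 0.0j
u_post, r_post = 0.0 + 0.0j, 1.0
mse = 0.0
for m in range(n_steps):
    # Exact discrete-time evolution of the truth (complex Gaussian noise)
    u_true = A * u_true + F + np.sqrt(r / 2.0) * (rng.normal() + 1j * rng.normal())
    v = g * u_true + np.sqrt(r_o / 2.0) * (rng.normal() + 1j * rng.normal())
    # Forecast step (5.5)-(5.6)
    u_prior = A * u_post + F
    r_prior = abs(A)**2 * r_post + r
    # Analysis step (5.9)-(5.12)
    K = g * r_prior / (r_o + g**2 * r_prior)
    u_post = u_prior + K * (v - g * u_prior)
    r_post = (1.0 - K * g) * r_prior
    mse += abs(u_post - u_true)**2 / n_steps
```

The time-averaged squared error of the posterior mean stays close to the asymptotic posterior variance and below the observational noise level, consistent with the behavior shown in Fig. 5.2.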

5.2.3 Multi-Dimensional Case

The derivation for the multi-dimensional case is similar to the scalar model discussed above. However, the Kalman filter in the multi-dimensional case exhibits more features and has much broader applications. The goal here is to obtain the posterior distribution p(u_{m+1}|v_{m+1}) through the Bayesian formula that combines the prior distribution p(u_{m+1}) of a true signal u_{m+1} ∈ ℂ^N and the observation v_{m+1} ∈ ℂ^M. Consider a vector-valued linear model:

$$
u_{m+1} = A u_m + F_m + \sigma_m,
$$

where σ_m is an N-dimensional Gaussian white noise with zero mean and covariance R. The observation process is

$$
v_{m+1} = G u_{m+1} + \sigma^o_{m+1},
$$

where G ∈ ℝ^{M×N} and σ^o_{m+1} is an M-dimensional Gaussian white noise with zero mean and covariance R^o. In analogy with (5.5)–(5.6), the prior mean and prior covariance are then given by

$$
\bar u_{m+1|m} = A \bar u_{m|m} + F_m, \qquad R_{m+1|m} = A R_{m|m} A^* + R,
\qquad (5.16)
$$

while the posterior mean and posterior covariance can be computed as

$$
\bar u_{m+1|m+1} = \bar u_{m+1|m} + K_{m+1}\,(v_{m+1} - G \bar u_{m+1|m}), \qquad R_{m+1|m+1} = (I - K_{m+1} G)\, R_{m+1|m},
\qquad (5.17)
$$

where the Kalman gain is

$$
K_{m+1} = R_{m+1|m}\, G^T \big(G R_{m+1|m} G^T + R^o\big)^{-1}.
\qquad (5.18)
$$

Note that the dimension of the observational variable does not necessarily equal that of the model state variable. In addition, each observation can be a noisy linear combination of several state variables. In many practical situations, the dimension of the observations is much less than the number of state variables, i.e., M ≪ N. This is known as filtering with partial observations. Usually, recovering the unobserved state variables is more challenging than denoising the observed ones, but the former has more practical implications. On the other hand, sometimes M ≫ N. One such example is Lagrangian data assimilation, where the number of Lagrangian tracers may exceed the degrees of freedom of the underlying flow field. See Sect. 8.3 for more details. It is worthwhile to introduce the notion of observability in the multi-dimensional Kalman filter [141, 238]. Consider a two-dimensional linear system with state variables u^1 and u^2:

$$
u^1_{m+1} = A_{11} u^1_m + A_{12} u^2_m + \sigma^1_m, \qquad u^2_{m+1} = A_{21} u^1_m + A_{22} u^2_m + \sigma^2_m,
$$

where only the noisy observation of u^1 is available,

$$
v^1_{m+1} = g_1 u^1_{m+1} + \sigma^{o,1}_m.
$$


Clearly, the observational information would help improve the state estimation of u^2 only when A_{12} ≠ 0 or A_{21} ≠ 0. In such a case, the system is said to be observable. In many practical applications, the two coefficients A_{12} and A_{21} may not be zero but can be intermittently quite small. The system then loses practical observability during these phases and has difficulty recovering the unobserved state variable, since random noise overwhelms the useful information passed from u^1 to u^2. See Sect. 7.5.3 for a concrete example.
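The role of the coupling coefficients can be checked numerically. The sketch below iterates the covariance recursions (5.16)–(5.18) for the two-dimensional system above and compares the asymptotic uncertainty of the unobserved u^2 with and without assimilating u^1; the matrix entries and noise levels are hypothetical values chosen for illustration.

```python
import numpy as np

def asymptotic_u2_variance(A, observe_u1, R=0.1 * np.eye(2),
                           Ro=np.array([[0.05]]), n_cycles=500):
    """Iterate the covariance recursions (5.16)-(5.18); if observe_u1 is
    False, only the forecast step is applied. Returns the asymptotic
    variance of the unobserved component u2."""
    G = np.array([[1.0, 0.0]])           # only u1 is (noisily) observed
    P = np.eye(2)
    for _ in range(n_cycles):
        P = A @ P @ A.T + R              # prior covariance, Eq. (5.16)
        if observe_u1:
            K = P @ G.T @ np.linalg.inv(G @ P @ G.T + Ro)
            P = (np.eye(2) - K @ G) @ P  # posterior covariance, Eq. (5.17)
    return P[1, 1]

A_coupled = np.array([[0.5, 0.4],        # A12, A21 nonzero: observable
                      [0.3, 0.5]])
A_decoupled = np.array([[0.5, 0.0],      # A12 = A21 = 0: u2 unobservable
                        [0.0, 0.5]])

# Observing u1 reduces the uncertainty of u2 only in the coupled case.
gain_coupled = (asymptotic_u2_variance(A_coupled, False)
                - asymptotic_u2_variance(A_coupled, True))
gain_decoupled = (asymptotic_u2_variance(A_decoupled, False)
                  - asymptotic_u2_variance(A_decoupled, True))
```

In the decoupled case, the cross-covariance between u^1 and u^2 stays zero, so the analysis step leaves the u^2 entry of the covariance untouched, exactly as the observability argument predicts.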

5.2.4 Some Remarks

In addition to understanding the Kalman filter as a recursive process, another way to look at the Kalman filter (and other data assimilation methods) is from the following general optimization viewpoint. Recall that the task is to minimize J(u) in (5.8). The minimization problem in the multi-dimensional case can be rewritten as

$$
\min_u J(u) = \|v_{m+1} - G(u)\|^2_{R^o} + \|u - \bar u_{m+1|m}\|^2_{R_{m+1|m}},
\qquad (5.19)
$$

where ‖·‖_A = ‖A^{-1/2}·‖, with A a positive definite matrix and ‖·‖ the standard Euclidean norm. In (5.19), a nonlinear observation operator G(u) has been incorporated to describe more general situations. The formulation in (5.19) can be regarded as a constrained optimization that seeks the maximum likelihood estimator when a prior distribution from the model forecast is incorporated as a regularizer [14]. As the goal is to find arg min_u J(u), the approach is usually named the variational method. The optimization in (5.19) is called 3DVAR [19, 167] since it aims to find the optimal solution at one time instant t_{m+1}; thus, the search domain of the optimization is only physical space ('3D' stands for the general physical space, namely the Cartesian coordinates (x, y, z)). If the right-hand side is replaced by a summation over a time interval, optimizing over multiple time instants simultaneously, then the optimization problem is called 4DVAR [74, 168, 264] ('4D' includes one additional dimension, time t). The expression in (5.19) is a general formula for many data assimilation problems, including the ensemble Kalman filter to be discussed shortly. In the linear Gaussian case, the exact solution of (5.19) can be easily reached. Yet, for nonlinear and non-Gaussian problems, no simple closed analytic formula is available for solving (5.19). In addition, in the presence of non-Gaussian distributions, (5.19) only provides a suboptimal solution, as a Gaussian approximation is utilized in the regularizer. The perspective in (5.19) can further advance many physical applications. In practice, certain known physical laws, such as energy conservation or the positivity of certain state variables, can be incorporated into the data assimilation framework (5.19) such that it becomes a constrained optimization. Note that these additional constraints do not aim at reducing the absolute error of the estimated state from data assimilation.
In fact, with the additional constraints, the search space shrinks; therefore, the absolute error may increase compared with the unconstrained optimization. Nevertheless, the data assimilation solution will retain critical physical properties and facilitate subsequent forecasts. See [108, 137, 274] for the detailed formulation and applications of such constrained data assimilation approaches.
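The variational viewpoint (5.19) can be made concrete with a minimal numerical sketch: the cost function is minimized by plain gradient descent. The cubic observation operator G(u) = u + 0.1u³ and all parameter values below are hypothetical choices for illustration, not quantities from the text.

```python
import numpy as np

# 3DVAR-type cost (5.19): J(u) = ||v - G(u)||^2_{Ro} + ||u - u_prior||^2_{B},
# with a hypothetical nonlinear observation operator G(u) = u + 0.1 u^3.
def G_obs(u):
    return u + 0.1 * u**3

def grad_J(u, v, u_prior, Ro_inv, B_inv):
    dG = 1.0 + 0.3 * u**2                      # (diagonal) Jacobian of G_obs
    return (-2.0 * dG * (Ro_inv @ (v - G_obs(u)))
            + 2.0 * (B_inv @ (u - u_prior)))

u_prior = np.zeros(2)                          # background (prior mean)
v = np.array([1.0, -0.5])                      # observation
Ro_inv = np.diag([10.0, 10.0])                 # inverse observational covariance
B_inv = np.diag([2.0, 2.0])                    # inverse background covariance

u = u_prior.copy()
for _ in range(500):                           # plain gradient descent
    u = u - 0.02 * grad_J(u, v, u_prior, Ro_inv, B_inv)
```

The resulting analysis lies between the background state and the pull-back of the observation, with the two terms of (5.19) weighted by their respective precisions, mirroring the Kalman-gain balance of Sect. 5.2.1.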

5.3 Nonlinear Filters

The Kalman filter is the optimal filter in the linear Gaussian situation with the perfect model as the forecast model. However, most systems in practice are nonlinear. Therefore, developing nonlinear filters and understanding their strengths and weaknesses is crucial for dealing with real-world problems. This section will briefly introduce three commonly used filtering strategies for nonlinear systems. The presentation here focuses on the following discrete system:

$$
u_{m+1} = A_m(u_m) + \sigma_m, \qquad v_{m+1} = G_{m+1}(u_{m+1}) + \sigma^o_{m+1},
\qquad (5.20)
$$

where A_m and G_{m+1} are both nonlinear operators. In (5.20), the signal u_{m+1} ∈ ℝ^N and the observation v_{m+1} ∈ ℝ^M, while σ_m is an N-dimensional Gaussian white noise with zero mean and covariance R, and σ^o_{m+1} is an M-dimensional Gaussian white noise with zero mean and covariance R^o. For notational simplicity, all the variables and operators are assumed to be real-valued in this section.

5.3.1 Extended Kalman Filter

The idea of the extended Kalman filter is to linearize the nonlinear operators A_m and G_{m+1} at each observational time instant, forming a local linear system, and then to apply the Kalman filter from t_m to t_{m+1} [139, 214]. The linearized operators read:

$$
A_m \approx A_{0,m} + A_{1,m}\,(u - \bar u_{m|m}), \qquad G_{m+1} \approx G_{0,m+1} + G_{1,m+1}\,(u - \bar u_{m+1|m}),
\qquad (5.21)
$$

where A_{0,m} = A_m(ū_{m|m}) and G_{0,m+1} = G_{m+1}(ū_{m+1|m}) are the two constant terms from the zeroth-order Taylor expansion, while A_{1,m} = ∇A_m(u)|_{u=ū_{m|m}} and G_{1,m+1} = ∇G_{m+1}(u)|_{u=ū_{m+1|m}} are the coefficients of the linear terms resulting from the first-order Taylor expansion. The linearizations of the model and the observation are taken at the posterior mean state of the previous step and the prior mean state of the current step, respectively. With these linearizations, the extended Kalman filter solves the following linear filtering problem from t_m to t_{m+1}:


$$
u_{m+1} = A_{0,m} + A_{1,m}\,(u_m - \bar u_{m|m}) + \sigma_m, \qquad v_{m+1} = G_{0,m+1} + G_{1,m+1}\,(u_{m+1} - \bar u_{m+1|m}) + \sigma^o_{m+1}.
$$

Then the Kalman update formulae in Sect. 5.2.3 can naturally be applied. The mathematical framework of the extended Kalman filter is straightforward, which has allowed the method to be successful in certain applications. However, there are several major issues in applying the extended Kalman filter to more general nonlinear filtering problems [132, 183]. First, the gradients of A_m and G_{m+1} need to be computed at each assimilation cycle, which introduces a large computational burden. Second, recall that the prior covariance asymptotically converges to a constant matrix in the Kalman filter. This important feature allows an off-line calculation of the prior covariance, the posterior covariance, and the Kalman gain, significantly saving computational cost. However, since the forward operator A_{1,m} = ∇A_m(u)|_{u=ū_{m|m}} in the nonlinear system varies in time, computing the covariance matrices becomes a severe challenge for systems with large dimensions. In fact, for many real applications, such as numerical weather forecasting, the dimension of the system is vast, and numerically solving for the entire covariance matrix R_{m+1|m} is almost impossible, let alone updating it recursively in time. Thus, many practical applications use a constant covariance matrix to replace R_{m+1|m}. Such a constant matrix is called the background covariance matrix, which is determined in advance in a certain empirical way, for example, using the equilibrium covariance of the forecast model. This introduces additional errors but helps accelerate the computations. In addition to the computational issue, the linearization may also cause instability in the system and lead to significant errors in filtering strongly nonlinear systems. For example, if the real part of one or a few eigenvalues of A_{1,m} is positive and the observational time step is relatively long, the forecast solution can quickly become very large and may even blow up within a short period.
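A scalar illustration of the extended Kalman filter is sketched below. The contractive nonlinear map f(u) = 0.9 sin(u) and the parameter values are hypothetical choices for demonstration: the forecast mean propagates through the full nonlinearity, while the variance uses the tangent-linear coefficient A₁ = f′(ū_{m|m}) from (5.21).

```python
import numpy as np

def ekf_step(u_post, r_post, v, f, df, q, g, r_o):
    """One extended-Kalman-filter cycle for a scalar nonlinear model
    u_{m+1} = f(u_m) + noise with linear observation v = g u + noise."""
    u_prior = f(u_post)             # mean through the full nonlinearity
    A1 = df(u_post)                 # tangent-linear coefficient, Eq. (5.21)
    r_prior = A1**2 * r_post + q
    K = g * r_prior / (r_o + g**2 * r_prior)
    u_new = u_prior + K * (v - g * u_prior)
    r_new = (1.0 - K * g) * r_prior
    return u_new, r_new

f = lambda u: 0.9 * np.sin(u)       # hypothetical nonlinear dynamics
df = lambda u: 0.9 * np.cos(u)      # its derivative
q, g, r_o = 0.05, 1.0, 0.2

rng = np.random.default_rng(7)
truth, u_est, r_est = 1.0, 0.0, 1.0
mse, n_steps = 0.0, 2000
for m in range(n_steps):
    truth = f(truth) + rng.normal(0.0, np.sqrt(q))
    v = g * truth + rng.normal(0.0, np.sqrt(r_o))
    u_est, r_est = ekf_step(u_est, r_est, v, f, df, q, g, r_o)
    mse += (u_est - truth)**2 / n_steps
```

Because this map is contractive, the linearization stays stable here; for strongly unstable dynamics the same recursion can diverge, as discussed above.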

5.3.2 Ensemble Kalman Filter

Different from the extended Kalman filter, which requires linearization of the forecast model, the ensemble Kalman filter (EnKF) [42, 89, 129] exploits the original nonlinear system and aims to approximate the forecast and analysis states by a finite number of samples, called an ensemble. Denote by u^k_{m+1|m}, k = 1, …, K, an ensemble of the forecast state of u at t_{m+1} obtained by running a Monte Carlo simulation of the underlying nonlinear system. Then the forecast mean can be approximated by ū_{m+1|m} = K^{-1} Σ_{k=1}^K u^k_{m+1|m}, where for notational simplicity the approximation '≈' has been replaced by the equality '='. A similar manipulation is applied to the posterior mean, ū_{m+1|m+1} = K^{-1} Σ_{k=1}^K u^k_{m+1|m+1}. On the other hand, the prior and the posterior covariance matrices R_{m+1|m} and R_{m+1|m+1} are given by

$$
R_{m+1|m} = \frac{1}{K-1}\, U_{m+1|m} U^T_{m+1|m} \qquad \text{and} \qquad R_{m+1|m+1} = \frac{1}{K-1}\, U_{m+1|m+1} U^T_{m+1|m+1},
$$

respectively, where the columns of the deviation matrices are

$$
\begin{aligned}
U_{m+1|m} &= \big[u^1_{m+1|m} - \bar u_{m+1|m},\; u^2_{m+1|m} - \bar u_{m+1|m},\; \ldots,\; u^K_{m+1|m} - \bar u_{m+1|m}\big],\\
U_{m+1|m+1} &= \big[u^1_{m+1|m+1} - \bar u_{m+1|m+1},\; u^2_{m+1|m+1} - \bar u_{m+1|m+1},\; \ldots,\; u^K_{m+1|m+1} - \bar u_{m+1|m+1}\big].
\end{aligned}
$$

With the approximation by samples, the Kalman gain in (5.18) can be rewritten as

$$
\begin{aligned}
K_{m+1} &= R_{m+1|m}\, G^T \big(G R_{m+1|m} G^T + R^o\big)^{-1}\\
&= (K-1)^{-1}\, U_{m+1|m} (G U_{m+1|m})^T \Big[(K-1)^{-1} (G U_{m+1|m})(G U_{m+1|m})^T + R^o\Big]^{-1}.
\end{aligned}
\qquad (5.22)
$$

In the above equation, the observation operator G is assumed to be linear. In the situation with a nonlinear observation operator, the G U_{m+1|m} term can be replaced by the following approximation:

$$
V = \big[G(u^1) - \bar v,\; G(u^2) - \bar v,\; \ldots,\; G(u^K) - \bar v\big],
\qquad (5.23)
$$

where v̄ = G(ū) or the ensemble mean of G(u^k) can be adopted. As in the forecast step, this manipulation of the observations does not involve taking the gradient of the observation operator, which is required in the extended Kalman filter. With these in hand, the update of the posterior samples reads

$$
u^k_{m+1|m+1} = u^k_{m+1|m} + K_{m+1}\big(v^k_{m+1} - G(u^k_{m+1|m})\big),
\qquad (5.24)
$$

where the observation is perturbed for each ensemble member, v^k_{m+1} = v_{m+1} + ε^k_{m+1}, by a Gaussian random noise ε^k_{m+1} with zero mean and covariance R^o. Such a random perturbation ensures an asymptotically correct analysis error covariance estimate for large ensembles. Note that when the ensemble size is small, the stochastic perturbation in (5.24) is an additional source of sampling errors and is responsible for suboptimal filtering. A simple remedy for this suboptimality is to inflate the covariance matrix [11, 268] a priori. There are also many practical strategies for circumventing such stochastic perturbations, such as the ensemble transform Kalman filter (ETKF) [28, 134] and the ensemble adjustment Kalman filter (EAKF) [8], or in general the family of ensemble square root filters (EnSRF) [246, 268]. On the other hand, localization is a technique widely incorporated into ensemble-based data assimilation approaches for high-dimensional systems to mitigate the impact of sampling errors, as only a small number of samples is affordable in practice. It can effectively ameliorate the spurious long-range correlations between the background and the observations. Some practical localization approaches can be found in [9, 10, 43, 93, 130, 134]. The EnKF is one of the most widely used data assimilation approaches in practice due to its simplicity and robustness.
With covariance inflation and localization, the EnKF has been applied to many real-world problems with large dimensions. When the statistics of the system are not strongly non-Gaussian, the EnKF often leads to reasonably accurate results with relatively low computational cost. For systems with highly non-Gaussian distributions, such as bimodal ones, the particle filter (see Sect. 5.3.4) can outperform the EnKF if the system's dimension is moderate.
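A minimal sketch of one perturbed-observation EnKF analysis step, using the sample Kalman gain (5.22) and the perturbed-observation update (5.24), is given below; the two-dimensional setup and all numerical values are illustrative assumptions.

```python
import numpy as np

def enkf_analysis(U_f, v, G, Ro, rng):
    """Perturbed-observation EnKF analysis, Eqs. (5.22) and (5.24).
    U_f: N x K forecast ensemble; G: M x N linear observation operator."""
    K_ens = U_f.shape[1]
    Uf = U_f - U_f.mean(axis=1, keepdims=True)      # deviation matrix
    GU = G @ Uf
    # Sample Kalman gain (5.22); the 1/(K-1) factors are absorbed into Ro
    Kgain = (Uf @ GU.T) @ np.linalg.inv(GU @ GU.T + (K_ens - 1) * Ro)
    # Perturb the observation for each member (Eq. (5.24))
    V = v[:, None] + rng.multivariate_normal(np.zeros(len(v)), Ro,
                                             size=K_ens).T
    return U_f + Kgain @ (V - G @ U_f)

rng = np.random.default_rng(4)
G = np.array([[1.0, 0.0]])                          # observe u1 only
Ro = np.array([[0.1]])
truth = np.array([1.0, -1.0])
# Forecast ensemble scattered around the truth
U_f = truth[:, None] + rng.normal(0.0, 1.0, size=(2, 200))
v = G @ truth + rng.multivariate_normal([0.0], Ro)
U_a = enkf_analysis(U_f, v, G, Ro, rng)
```

After the analysis, the ensemble spread of the observed component shrinks markedly, while the unobserved component is adjusted only through its sample correlation with the observed one.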

5.3.3 A Numerical Example

As a numerical example, consider the noisy version of the L63 model (1.39),

$$
\begin{aligned}
dx &= \sigma(y - x)\,dt + \sigma_x\, dW_x,\\
dy &= \big(x(\rho - z) - y\big)\,dt + \sigma_y\, dW_y,\\
dz &= (xy - \beta z)\,dt + \sigma_z\, dW_z.
\end{aligned}
\qquad (5.25)
$$
Such a small number of samples is usually insufficient to reach an accurate forecast PDF using the Monte Carlo simulation. Nevertheless, the observation in filtering weakens the role of the Monte Carlo forecast, and the transform technique inside the ETKF significantly mitigates the sampling error [134]. In addition, only the mean and the covariance are explicitly involved in the entire EnKF procedure (for example, in computing the Kalman gain), which are more robust than the higher-order moments using a small number of samples. Next, as the observational noise r o or the observational time step t obs increases, the error in the posterior mean estimate becomes larger. One notable result is that,


Fig. 5.3 Filtering the noisy L63 model (5.25) in a perfect model scenario, where σ = 10, ρ = 28, β = 8/3, σ_x = √2, σ_y = √12 and σ_z = √12. Panel (a): filtering skill with K = 10, r^o = 3 and Δt^{obs} = 0.5. The blue curve shows the true signal, and the black dots show the observations. The magenta and the cyan curves show the filtering results obtained by observing all three variables and only x, respectively. Panel (b): the normalized RMSE as a function of K, r^o, and Δt^{obs}, respectively, where in each case the other two are fixed at the values used in Panel (a)

among the unobserved state variables, the filtering skill of y is higher than that of z when x is the only observed variable. This is related to practical observability. The variable y explicitly appears in the governing equation of the observed variable x; therefore, the observations directly affect the filtering skill of y. In contrast, the observational information has to pass through y to recover z, which results in a weaker role of the observed x in filtering z.
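For readers who wish to reproduce experiments of this type, the noisy L63 model (5.25) can be integrated with a simple Euler-Maruyama scheme. The sketch below is our own minimal implementation (the function name, initial condition, and step size are assumptions, not from the text):

```python
import numpy as np

def simulate_noisy_l63(T=40.0, dt=1e-3, seed=0,
                       sigma=10.0, rho=28.0, beta=8.0 / 3.0,
                       sx=np.sqrt(2.0), sy=np.sqrt(12.0), sz=np.sqrt(12.0)):
    """Euler-Maruyama integration of the noisy Lorenz 63 model (5.25)."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    u = np.empty((n + 1, 3))
    u[0] = [1.0, 1.0, 1.0]          # arbitrary initial condition
    for i in range(n):
        x, y, z = u[i]
        drift = np.array([sigma * (y - x),
                          x * (rho - z) - y,
                          x * y - beta * z])
        noise = np.array([sx, sy, sz]) * rng.standard_normal(3)
        u[i + 1] = u[i] + drift * dt + np.sqrt(dt) * noise
    return u
```

The resulting trajectory can then serve as the truth signal from which synthetic observations with noise level r^o are drawn for the filtering test.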

5.3.4 Particle Filter

Particle filter [80, 165] is another nonlinear filtering technique widely used for problems involving strongly non-Gaussian features, such as bimodal distributions. The main idea of the particle filter is to directly apply the Bayes theorem (5.1) to update an ensemble of solutions (or particles) without assuming any Gaussianity of the prior or posterior distributions. Therefore, no explicit formulation such as the Kalman filter equations is utilized. Consider an ensemble of posterior states u^1_{m|m}, u^2_{m|m}, …, u^K_{m|m} at time t_m with the distribution function

p_{m|m}(u) = Σ_{k=1}^{K} w_m^k δ(u − u^k_{m|m}),  (5.26)

where the delta function satisfies δ(x − x_0) = 1 if x = x_0 and zero otherwise. In (5.26), the term w_m^k determines the weight (or importance) of each ensemble member u^k_{m|m}. These weights satisfy Σ_k w_m^k = 1 and w_m^k ≥ 0, which ensures that p_{m|m} is a PDF.


5 Data Assimilation

To illustrate the procedure of the particle filter at each assimilation cycle, consider the initial distribution p_{m|m} being represented through the particles {u^k_{m|m}}_{k=1,…,K}, which are equally weighted. Then propagate these particles forward in time by solving the underlying nonlinear model to reach an ensemble {u^k_{m+1|m}}_{k=1,…,K}, which leads to the prior distribution

p_{m+1|m}(u) = (1/K) Σ_{k=1}^{K} δ(u − u^k_{m+1|m}).  (5.27)

The posterior distribution is obtained by applying the Bayesian formula (5.1),

p_{m+1|m+1}(u) = p(u|v_{m+1}) = p(v_{m+1}|u) p_{m+1|m}(u) / ∫ p(v_{m+1}|u) p_{m+1|m}(u) du
              = Σ_{k=1}^{K} w^k_{m+1} δ(u − u^k_{m+1|m}),  (5.28)

where the new weight is given by

w^k_{m+1} = p(v_{m+1}|u^k_{m+1|m}) / Σ_{k'=1}^{K} p(v_{m+1}|u^{k'}_{m+1|m}).  (5.29)

Comparing the expressions of the prior and the posterior distributions (5.27)–(5.29), it is clear that the Bayes formula simply re-weights the particles. One of the main issues in practice is that many particles tend to have low weights in high-dimensional systems or systems with a moderate-dimensional chaotic attractor. If this two-step process is repeated for several assimilation cycles, then typically only one ensemble member will retain a significant weight, and the remaining ones become negligible [21, 27, 237, 259]. One way to avoid this ensemble collapse is to resample the particles by duplicating those particles with large weights to replace those with small weights [69, 70, 83, 218].
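The full assimilation cycle described above (forecast, Bayesian re-weighting, resampling) can be sketched in a few lines. This is a generic bootstrap particle filter, not the book's code; the function names and the multinomial resampling choice are our own:

```python
import numpy as np

def particle_filter_step(particles, obs, forward, likelihood, rng):
    """One assimilation cycle of a bootstrap particle filter with resampling.

    particles : (K, d) equally weighted posterior ensemble at time t_m
    obs       : observation v_{m+1}
    forward   : model propagator giving u_{m+1|m} from u_{m|m}   (Eq. 5.27)
    likelihood: p(v_{m+1} | u), evaluated per particle           (Eq. 5.29)
    """
    # Forecast: push every particle through the nonlinear model.
    prior = np.array([forward(p) for p in particles])
    # Bayes re-weighting (Eqs. 5.28-5.29).
    w = np.array([likelihood(obs, p) for p in prior])
    w /= w.sum()
    # Resample to avoid weight collapse: duplicate heavy particles.
    K = len(prior)
    idx = rng.choice(K, size=K, p=w)
    return prior[idx]
```

After the resampling step the returned ensemble is again equally weighted, so the same routine can be applied at the next cycle.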

5.4 Continuous-In-Time Version: Kalman-Bucy Filter

The Kalman-Bucy filter is a continuous-time version of the Kalman filter [90, 143], and it deals with the following linear coupled system,

dv = [G_1(t)u + G_0(t)] dt + Σ_v(t) dW_v(t),  (5.30a)
du = [A(t)u + F(t)] dt + Σ_u(t) dW_u(t).  (5.30b)

In (5.30), all the vectors and matrices G_0, G_1, F, A, Σ_u and Σ_v are functions of only t; they have no dependence on u or v, which guarantees the linearity of the coupled system. W_v and W_u are independent Wiener processes. In the Kalman-Bucy filter, (5.30b) is the


underlying dynamics, and (5.30a) is the observation process. The observation is a continuous time series v(s) with s ≤ t. Similar to the Kalman filter, the Kalman-Bucy filter aims at finding the optimal state estimation via solving the conditional distribution p(u(t)|v(s ≤ t)), which is also known as the posterior distribution. Due to the linear nature and the Gaussian noises in (5.30), the posterior distribution is Gaussian,

p(u(t)|v(s ≤ t)) ∼ N(ū, R),  (5.31)

where the posterior mean ū and the posterior covariance R can be solved via the following explicit formulae:

dū(t) = (F(t) + A(t)ū) dt + (R G_1*(t)) (Σ_v Σ_v*)^{−1}(t) (dv − (G_0(t) + G_1(t)ū) dt),  (5.32a)

dR(t) = [A(t)R + R A*(t) + (Σ_u Σ_u*)(t) − (R G_1*(t)) (Σ_v Σ_v*)^{−1}(t) (R G_1*(t))*] dt.  (5.32b)

The posterior mean in (5.32a) has similar features to the Kalman filter in (5.11). The first term on the right-hand side of (5.32a) is the forecast mean obtained by running the forecast model (5.30b), while the second term is the correction in light of the observations. The pre-factor (R G_1*(t))(Σ_v Σ_v*)^{−1}(t) is an analog of the Kalman gain, which stands for the ratio between the prior covariance and the observational noise level. On the other hand, the posterior covariance (5.32b) satisfies a Riccati equation. A formal derivation of (5.32) follows the two-step prediction-correction procedure by first adopting a time discretization with a short but finite time step Δt and then taking the limit Δt → 0. Rigorous derivations can be found in [22]. As an analog of the EnKF, the ensemble Kalman-Bucy filter (EnKBF) (and its improved versions) has been developed and shown to be a useful numerical tool for finding approximate solutions of continuous-in-time filtering problems with nonlinear dynamics [7, 23, 79, 244]. The Kalman-Bucy filter provides a path for data assimilation in which the observation process is given by a dynamical equation. The Kalman-Bucy filter, or in particular its nonlinear extension (see Chap. 8), has many applications. One such example is the Lagrangian data assimilation of the ocean velocity field, where the observations are the trajectories of the Lagrangian drifters linked with the ocean velocity via Newton's law. In addition, the Kalman-Bucy filter or its nonlinear version is also quite useful for parameter estimation, model identification, and the development of stochastic parameterizations exploiting continuous observations of a subset of the state variables (see Chap. 9).
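As an illustration of (5.32), the scalar case can be discretized with a simple Euler step. The sketch below is our own (names and discretization are assumptions, not from the text); a realistic implementation would use matrix-valued quantities:

```python
import numpy as np

def kalman_bucy_scalar(v_obs, dt, a, f, su, g1, g0, sv, m0, r0):
    """Euler discretization of the scalar Kalman-Bucy equations (5.32).

    Signal:      du = (a u + f) dt + su dWu    (scalar case of 5.30b)
    Observation: dv = (g1 u + g0) dt + sv dWv  (scalar case of 5.30a)
    v_obs is a sampled observation path; returns posterior mean and variance.
    """
    n = len(v_obs)
    m = np.empty(n)
    r = np.empty(n)
    m[0], r[0] = m0, r0
    for i in range(n - 1):
        gain = r[i] * g1 / sv**2                  # analog of the Kalman gain
        innov = (v_obs[i + 1] - v_obs[i]) - (g0 + g1 * m[i]) * dt
        m[i + 1] = m[i] + (f + a * m[i]) * dt + gain * innov       # (5.32a)
        r[i + 1] = r[i] + (2 * a * r[i] + su**2
                           - (r[i] * g1 / sv) ** 2) * dt           # (5.32b)
    return m, r
```

Note that the posterior variance obeys the deterministic Riccati equation and converges to a steady state independently of the observed path.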

5.5 Other Data Assimilation Methods and Applications

Filtering is only one of the many data assimilation methods. Another data assimilation framework that has wide applications in practice is called smoothing [84, 227]. The main difference between filtering and smoothing is that filtering only utilizes the observational information in the past, while smoothing exploits both the historical and future data in the observations. It is thus expected that the state estimation from smoothing is, in general, more accurate than filtering because it involves additional observational information. Nevertheless, filtering outweighs smoothing in the sense that it can be used for the online initialization of the real-time forecast. In contrast, smoothing is more applicable for the post-processing of data in an offline fashion. Similar to the EnKF, the ensemble Kalman smoother (EnKS) is its analog in the context of nonlinear smoothing [31, 91]. Section 8.2 includes some comparisons between the results utilizing filtering and smoothing for a rich class of nonlinear and non-Gaussian systems, for which closed analytic formulae of both the filter and smoother are available. One application of smoothing is data interpolation. For example, satellite images are sometimes blurred by cloud cover. Therefore, a certain time series obtained from these satellite images may have missing values. Simple linear or cubic interpolations can be applied to fill in the missing data. However, the Bayesian framework (5.1) is often a more appropriate choice for bridging such gaps since the dynamical information is incorporated. See [77] for a real-world application of filling in the missing observations of sea ice floe trajectories based on the ensemble Kalman smoother. Another application of the smoothing technique appears in many parameter estimation or model identification problems, where only partial observations are available.
In such a situation, recovering the unobserved time series and estimating the parameters must be carried out simultaneously; otherwise, the likelihood of the parameters cannot be easily computed. Here, smoothing is naturally applied to recover the unobserved time series and quantify the associated uncertainty. See Sect. 9.2 for more details. In addition, sampling from the posterior distribution of the smoother estimates can mitigate the observational noise and help recover the unobserved time series, which provides more accurate training data for machine learning forecasts.
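To make the filtering-versus-smoothing comparison concrete, the sketch below implements a scalar Kalman filter followed by a Rauch-Tung-Striebel (RTS) backward pass, a standard fixed-interval smoother. It is a generic textbook construction, not the EnKS of [31, 91], and all names are our own:

```python
import numpy as np

def kalman_filter_smoother(y, phi, q, r, m0, p0):
    """Scalar Kalman filter and RTS smoother for x_{k+1} = phi x_k + N(0, q),
    y_k = x_k + N(0, r). Filtering uses past data only; smoothing also uses
    future data, so its estimate is generally more accurate."""
    n = len(y)
    mf = np.empty(n); pf = np.empty(n)   # filter mean / variance
    mp = np.empty(n); pp = np.empty(n)   # one-step predictions
    mp[0], pp[0] = m0, p0
    for k in range(n):
        if k > 0:
            mp[k] = phi * mf[k - 1]
            pp[k] = phi**2 * pf[k - 1] + q
        gain = pp[k] / (pp[k] + r)                 # Kalman gain
        mf[k] = mp[k] + gain * (y[k] - mp[k])
        pf[k] = (1 - gain) * pp[k]
    ms = mf.copy(); ps = pf.copy()                 # RTS backward pass
    for k in range(n - 2, -1, -1):
        g = pf[k] * phi / (phi**2 * pf[k] + q)
        ms[k] = mf[k] + g * (ms[k + 1] - phi * mf[k])
        ps[k] = pf[k] + g**2 * (ps[k + 1] - (phi**2 * pf[k] + q))
    return mf, ms
```

On a synthetic trajectory the smoother mean typically attains a lower mean-square error than the filter mean, illustrating the value of the future observations.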

6 Prediction

6.1 Ensemble Forecast

Predicting the future states of complex dynamical systems is a central topic in contemporary science with significant societal impacts. The general procedure of forecasting is straightforward: starting from an initial condition, run the forecast model forward in time to obtain the forecast values. The two factors that determine the forecast results are (1) the initial condition and (2) the forecast model. They highlight the importance of data assimilation and of appropriate modeling of complex systems, respectively.

6.1.1 Trajectory Forecast Versus Ensemble Forecast

Due to the chaotic or turbulent nature of many complex systems, the trajectory forecast based on a single realization of the model state quickly loses track of the truth (see also Sect. 1.5). Alternatively, the ensemble forecast, which adopts a probabilistic characterization of the model states utilizing a Monte Carlo type approach, is the predominant strategy in predicting complex turbulent systems in practice [161, 201, 252]. In the ensemble forecast, different ensemble members are sampled from an initial distribution and are subject to additional random forcing. The ensemble forecast aims at providing an indication of the PDF of possible future states by tracking the evolution of the group of ensemble members. In many practical applications, such as the numerical weather forecast, ensemble forecast is the primary approach for predicting the future state and uncertainty.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_6


6.1.2 Lead Time and Ensemble Mean Forecast Time Series

Figure 6.1 includes a schematic illustration of the ensemble forecast. Starting from a given initial condition with a certain uncertainty, an ensemble of the model simulation is carried forward in time. The uncertainty typically grows until it arrives at the statistical equilibrium (or statistical attractor if the system has a periodic forcing). Panel (a) shows such a process. The time evolution of the PDF from the ensemble forecast is illustrated in the red shaded area, and the equilibrium distribution is in the gray area. The actual distribution is often unbounded, but only the domain covered by the mean plus and minus two standard deviations is shown for illustration purposes. The x-axis is the lead time, which is the time between the initiation and completion of the forecast. Panel (b) shows the procedure of forecasting the state and uncertainty at a fixed lead time of τ = 0.7 (red) starting from different initial time instants (green). Such a procedure is utilized to assess the forecast skill. In particular, as is shown in Panel (c), the mean values of these forecast PDFs at the given lead time are often adopted to form a time series, known as the ensemble mean time series. The skill scores are computed based on the difference between the ensemble mean time series and the true discrete time series taking values at the same instants. The two widely used skill scores are the RMSE and the pattern correlation (see Sect. 2.4.1 for their definitions). The typical profile of these skill scores of the ensemble mean forecast time series as a function of lead time τ is shown in Panel (d), where the RMSE increases to the standard deviation of the time series, and the pattern correlation decreases to zero. Despite the simplicity of assessing the forecast skill based on the ensemble mean, the full PDF should also be considered for obtaining the uncertainty and other information about the forecast.
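The two skill scores can be computed directly from the ensemble mean time series; a minimal sketch with our own helper names:

```python
import numpy as np

def rmse(forecast, truth):
    """Root-mean-square error between the ensemble mean forecast and the truth."""
    return np.sqrt(np.mean((forecast - truth) ** 2))

def pattern_correlation(forecast, truth):
    """Pattern correlation between the two time series (Sect. 2.4.1)."""
    fa = forecast - forecast.mean()
    ta = truth - truth.mean()
    return np.sum(fa * ta) / np.sqrt(np.sum(fa ** 2) * np.sum(ta ** 2))
```

Note that the pattern correlation is invariant under shifting and rescaling of the forecast, so it measures phase agreement rather than amplitude accuracy, which is why the two scores are used together.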

Fig. 6.1 Schematic illustration of ensemble forecast. Panel a time evolution of the predicted PDF starting from a specific initial condition (red), which eventually converges to the equilibrium distribution (gray). Panel b starting from different initial time instants (green), forecast the state and uncertainty at a fixed lead time of τ = 0.7 units (red) and compare with the truth (blue). The red circles represent the forecast PDF constructed by the ensembles. Panel c the ensemble mean forecast from Panel b, where the red dots in Panel c are the ensemble mean of the red circle in Panel b. Panel d the typical profile of the skill scores of the ensemble mean forecast as a function of lead time τ , where the RMSE increases to the standard deviation of the time series and the pattern correlation (Corr) decreases to zero

6.2 Model Error, Internal Predictability and Prediction Skill

Skillful predictions depend on several factors. In addition to the correctness of the model adopted for the forecast, the useful information provided by the model beyond the prior knowledge at hand is also crucial for the practical forecast.

6.2.1 Important Factors for Useful Prediction

Since the perfect model is seldom known or too complicated to be utilized, model error is often inevitable in the development of forecast models in practice. Model error in the forecast systems may come from the lack of certain crucial physics, the parameterization of complicated processes, or the coarse-graining of the dynamics and statistics. Hereafter, the forecast model with model errors utilized for the actual forecast tasks in practice is referred to as the imperfect model while the true system is called the perfect model. Since the PDF is adopted as the indicator in the ensemble forecast, the model error can be quantified by measuring the difference between the forecast PDF from the imperfect model and that from the perfect one. On the other hand, even if the perfect model is known, it is not always guaranteed to give practically useful predictions. An ensemble forecast is said to be informative if it provides additional information beyond the equilibrium statistics. Otherwise, there is no need to run the model as the known equilibrium PDF, which is often easily constructed from historical data, can simply be adopted as the forecast. Therefore, data assimilation plays a vital role in facilitating useful predictions. The initial uncertainty is expected to be as small as possible, allowing the model forecast to stay away from the equilibrium PDF for a long time.

6.2.2 Quantifying the Predictability and Model Error via Information Criteria

Given the importance of both the model development and the initialization of the forecast, it is crucial to build systematic measurements for quantifying the ability of the model forecast [106, 177]. Denote by p_t(u|u_0) the forecast PDF of the state variable u at time t starting from a specific initial condition u_0 := u(0) using the perfect model. Denote by p_eq the equilibrium distribution of the model forecast, which is given by p_eq = lim_{t→∞} p_t(u|u_0). Note that the initial condition does not affect the equilibrium distribution, as a chaotic or turbulent system loses its memory in a finite time. Similarly, denote by p_t^M(u|u_0) and p_eq^M the forecast PDF at time t and the equilibrium distribution using an imperfect model.

Definition 6.1 (Internal predictability) The internal predictability measures the information provided by the forecast state beyond the prior knowledge available through the equilibrium


statistics. It can be quantified by the relative entropy. For the perfect model and the imperfect model, the internal predictability is defined as

D_t = P(p_t, p_eq)  and  D_t^M = P(p_t^M, p_eq^M),  (6.1)

respectively, where P(·, ·) is the relative entropy. The internal predictability is expected to be significant and to last for a long time if the uncertainty in the initial value is much smaller than that of the equilibrium distribution. As time evolves, the forecast PDF converges to the equilibrium distribution. Therefore, the internal predictability curve eventually goes to zero, according to the properties of the relative entropy in Sect. 2.2. Note that internal predictability only tells the possible time range of useful predictions. It does not indicate the accuracy of the prediction using an imperfect model, as the internal predictability of the imperfect model D_t^M does not involve any information about the perfect model.

Definition 6.2 (Model error) The model error measures the lack of information in the imperfect model density compared to a perfect statistical forecast,

E_t = P(p_t, p_t^M).  (6.2)

The model error connects the forecast distributions of the two models. However, it is worthwhile to highlight that neither the model error nor the internal predictability reflects the path-wise prediction skill, since they do not involve the true trajectory. The traditional measurements for assessing the overall prediction skill of the ensemble mean time series are the RMSE and the pattern correlation, as described in Fig. 6.1. Note that the true signal is never known in practical situations for the real-time forecast. Thus, the path-wise prediction skill of a model is often studied in the training data set. In contrast, internal predictability and model error are proper measurements to provide statistical information about the real-time forecast. Overall, an appropriate forecast model satisfies two conditions: 1. its internal predictability has a similar time evolution as that of the perfect system, and 2. the model error remains at a low level at all lead times. To provide a better understanding of these concepts, let us consider the linear Gaussian model (4.1) as the perfect model,

dx_t = (−a x_t + f) dt + σ dW_t,  (6.3)

where a, f and σ are all constant. For the imperfect model, consider another linear Gaussian model,

dx_t = (−a^M x_t + f^M) dt + σ^M dW_t,  (6.4)


Fig. 6.2 Predictability and model error. Top row the time evolution of the forecast PDF using the perfect model (gray shading) and that using the imperfect model (red shading), respectively, starting from a Gaussian initial distribution N(3, 0.1). The true signal realization from the perfect model is also shown for completeness, which does not affect the ensemble forecast PDF. Bottom row the internal predictability of the perfect model (gray) and the imperfect model (red) as a function of the lead time τ [left y-axis] as well as the model error (black) [right y-axis]. The three columns show the imperfect models with different sets of imperfect parameters. In the bottom panel of the third column, the red and gray curves overlap with each other

but with different parameters. Figure 6.2 shows three different cases. Each panel in the top row includes the time evolution of the forecast PDF using the perfect model (gray shading) and that using the imperfect model (red shading), respectively, starting from a Gaussian initial distribution N(3, 0.1). The true realization from the perfect model is also shown for completeness, which, however, does not affect the ensemble forecast. Each panel in the bottom row shows the internal predictability of the perfect model (gray) and the imperfect model (red) as a function of the lead time τ [left y-axis] as well as the model error (black) [right y-axis]. In Case (a), f^M = −0.5 is chosen to be different from f = 0, which makes the initial value farther from the equilibrium in the imperfect model. Thus, the internal predictability associated with the imperfect model starts with a more significant value, indicating the additional information provided by the initial condition beyond the statistical equilibrium. Then the internal predictability decreases for both the perfect and imperfect models as the forecast lead time increases. Meanwhile, since the two forecast models start with the same condition, there is no initial model error. Yet, the role of the model dynamics dominates that of the initial value as time evolves, which makes the model error monotonically increase in such a case. Next, Case (b) shows a scenario where the imperfect


model has the same equilibrium statistics as the perfect model. However, the damping is stronger in the imperfect model, resulting in a shorter memory of the system and a faster relaxation towards the equilibrium. Thus, the curve of the internal predictability converges faster toward zero using the imperfect model than the perfect model. The most significant model error occurs in a short-term transient phase, which is a typical feature in many practical situations. The model error then decays to zero as the imperfect model captures the true equilibrium distribution. Finally, in Case (c), it is shown that, although the internal predictability of the imperfect model is the same as that of the perfect one, the imperfect model is completely biased in predicting the truth. This highlights that internal predictability only describes the information captured by the model beyond the equilibrium distribution but cannot assess the accuracy of the model. Therefore, the two criteria need to be applied simultaneously to quantify the prediction skill.
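For the linear Gaussian pair (6.3)-(6.4), both the internal predictability (6.1) and the model error (6.2) are available in closed form, since the forecast PDFs are Gaussian and the relative entropy between scalar Gaussians has the standard explicit expression (Sect. 2.2.2). The sketch below is our own; the initial condition N(3, 0.1) follows Fig. 6.2:

```python
import numpy as np

def ou_moments(t, a, f, sigma, m0, v0):
    """Mean and variance of dx = (-a x + f) dt + sigma dW at time t."""
    m = f / a + (m0 - f / a) * np.exp(-a * t)
    v = v0 * np.exp(-2 * a * t) + sigma**2 / (2 * a) * (1 - np.exp(-2 * a * t))
    return m, v

def gaussian_relative_entropy(m1, v1, m2, v2):
    """Relative entropy P(N(m1, v1), N(m2, v2)) for scalar Gaussians."""
    return 0.5 * ((m1 - m2) ** 2 / v2 + v1 / v2 - 1.0 + np.log(v2 / v1))

def predictability_and_model_error(t, perfect, imperfect, m0=3.0, v0=0.1):
    """Internal predictability (6.1) and model error (6.2) for two OU models,
    each given as a tuple (a, f, sigma)."""
    a, f, s = perfect
    aM, fM, sM = imperfect
    m, v = ou_moments(t, a, f, s, m0, v0)
    mM, vM = ou_moments(t, aM, fM, sM, m0, v0)
    meq, veq = f / a, s**2 / (2 * a)
    meqM, veqM = fM / aM, sM**2 / (2 * aM)
    D = gaussian_relative_entropy(m, v, meq, veq)        # D_t
    DM = gaussian_relative_entropy(mM, vM, meqM, veqM)   # D_t^M
    E = gaussian_relative_entropy(m, v, mM, vM)          # E_t
    return D, DM, E
```

Evaluating these quantities over a grid of lead times reproduces the qualitative curves in Fig. 6.2: the predictability curves decay to zero, while the model error starts at zero and grows toward its equilibrium value when the imperfect forcing is biased.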

6.3 Procedure of Designing Suitable Forecast Models

The results in Fig. 6.2 reveal several vital aspects facilitating the development of a suitable forecast model, which is expected to have similar behavior to the perfect model in terms of internal predictability and to have a small model error over time. These requirements can be satisfied to a large extent by incorporating the following two necessary conditions in the model development.

(a) Model fidelity: The forecast model must have the skill to reproduce the equilibrium distribution of nature (i.e., the perfect model). Such a condition guarantees the consistency of the model with the truth at long lead times. If the model lacks fidelity, the information distance between p_t^M and p_eq at a long lead time will never become zero. Consequently, the model error will remain for an infinitely long time.

(b) Model memory: The forecast model must be able to capture the overall temporal autocorrelation of nature. Satisfying the model fidelity guarantees that the long-term statistics are captured. However, without additional constraints, model fidelity itself is not sufficient to ensure that the time evolution of the statistics and the associated relaxation tendency towards the equilibrium distribution of the forecast model are consistent with those of nature (see Case (b) in Fig. 6.2). In other words, the time evolution of the internal predictability computed from the forecast model can be biased due to the failure of the model to capture the transient behavior of nature. Since the ACF measures the overall memory of a chaotic system, the difference between the temporal ACFs can be utilized as a simple and effective practical criterion to characterize the similarity of the transient behavior between the two systems. A suitable forecast model should have an ACF that resembles that of nature.

Since the perfect model is never known in practice, the observational data can be utilized to compute the equilibrium PDF and the ACF of nature.
The forecast model is then calibrated based on these statistical quantities. In practice, stochastic parameterizations and statistical


closure approximations can be incorporated into the existing models to improve the statistical accuracy [24, 71, 114, 208]. A certain optimization algorithm can then calibrate the additional components, with the error in the PDFs and ACFs serving as the cost function. This can easily be implemented, at least for simple or conceptual models [61, 225].
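Since the calibration above compares PDFs and ACFs, an empirical ACF estimator is the basic ingredient; a minimal sketch (our own helper, assuming a stationary time series):

```python
import numpy as np

def acf(x, max_lag):
    """Empirical autocorrelation function of a stationary time series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x * x)
    n = len(x)
    # Lagged correlations, normalized so that acf(x, L)[0] == 1
    return np.array([np.mean(x[:n - k] * x[k:]) / var for k in range(max_lag + 1)])
```

The mismatch between the empirical ACF of the observational data and that of a candidate model, together with the mismatch in the equilibrium PDFs, can then serve as the cost function of the calibration.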

6.4 Predicting Model Response via Fluctuation-Dissipation Theorem

In addition to the short- and long-range forecast of the model states by running the given model forward, another important task in practice is to forecast the model response when certain parameters in the system are perturbed. Assessing the model response has significant implications since it facilitates understanding the adjustment of the model states to variations of the external environment. Similar to the ensemble forecast, predicting the statistical response is a more appropriate target for complex turbulent systems. For example, computing the variance of the temperature response to climate change helps infer the increased occurrence of extreme heat wave events, which have significant societal impacts. Admittedly, an ensemble forecast can be implemented by plugging the perturbed parameters into the system to assess the response to a specific perturbation. However, as many nonlinear dynamical systems in practice are rather complicated and high dimensional, such a straightforward approach is far from computationally efficient. In particular, the entire expensive ensemble simulation has to be restarted for each different perturbation. Note that it is often interesting and essential to find the parameters to which the model state is most sensitive. An exhaustive search based on the ensemble forecast is not practical for complex systems with many degrees of freedom. To overcome such a computational challenge, the fluctuation-dissipation theorem (FDT) [160, 190] is introduced, which provides a much more efficient way to find a suitable approximate solution to the model response. The FDT exploits only the statistics of the present states. Therefore, it avoids running the Monte Carlo simulation with a large number of ensemble members as in the direct method.

6.4.1 The FDT Framework

Consider a general nonlinear dynamical system,

du = F(u) dt + σ(u) dW,  (6.5)

where u ∈ R^N is the state variable, σ ∈ R^{N×K} is the noise coefficient matrix and W ∈ R^K is a Wiener process. The evolution of the PDF p(u) associated with u is governed by the Fokker-Planck equation (see Sect. 1.3.3),


∂p/∂t = −div_u [F(u) p] + (1/2) div_u ∇_u (Σ p) := L_FP p,  (6.6)

where Σ = σ σ^T and p|_{t=0} = p_0(u). Let p_eq(u) be the smooth equilibrium PDF that satisfies L_FP p_eq = 0. Recall from Sects. 1.1.2–1.1.3 that the statistics of some functional A(u) are determined by ⟨A(u)⟩ = ∫ A(u) p_eq(u) du. Now consider a perturbed system by adding a small perturbation δF(u, t) to (6.5),

du = F(u) dt + δF(u, t) dt + σ(u) dW,  (6.7)

Further assume an explicit time-separable structure for δF(u, t), which naturally occurs in many applications [103, 175, 215]. That is, each component of δF(u, t), denoted by δF_i(u, t), can be written as δF_i(u, t) = w_i(u) δf_i(t), where w(u) = (w_1, …, w_N)^T and f(t) = (f_1(t), …, f_N(t))^T. Then the Fokker-Planck equation associated with the perturbed system (6.7) is given by

∂p^δ/∂t = L_FP p^δ + δL_ext p^δ,  (6.8)

where δL_ext p^δ = L_ext p^δ · δf(t) with L_ext p^δ = −div_u [w(u) p^δ]. Similar to (6.6), the expected value of the nonlinear functional A(u) associated with the perturbed system (6.8) is given by ⟨A(u)⟩^δ = ∫ A(u) p^δ(u) du. The goal is to calculate the statistical response of such a nonlinear functional A(u), namely δ⟨A(u)⟩ = ⟨A(u)⟩^δ − ⟨A(u)⟩. To this end, take the difference between (6.6) and (6.8),

∂δp/∂t = L_FP δp + δL_ext p_eq + δL_ext δp,  (6.9)

where δp = p^δ − p_eq is the change of the PDF. The initial conditions of both (6.6) and (6.8) in formulating (6.9) are assumed to be the equilibrium of the unperturbed system, p_eq. Since δ is small, after ignoring the higher-order term δL_ext δp, the formula (6.9) reduces to

∂δp/∂t = L_FP δp + δL_ext p_eq,  with  δp|_{t=0} = 0.  (6.10)

Note that L_FP is a linear operator. Thus, with the semigroup notation exp[t L_FP] for its solution operator, the solution of (6.10) can be written concisely as

δp = ∫_0^t exp[(t − t′) L_FP] [δL_ext(t′) p_eq] dt′.  (6.11)

Combining (6.11) with (6.8) leads to the linear response formula:

δ⟨A(u)⟩(t) = ∫_{R^N} A(u) δp(u, t) du = ∫_0^t R(t − t′) · δf(t′) dt′,  (6.12)

where the vector linear response operator is given by

R(t) = ∫_{R^N} A(u) exp[t L_FP] [L_ext p_eq](u) du.  (6.13)

This general calculation is the first step in the FDT. However, for nonlinear systems with many degrees of freedom, direct use of the formula in (6.13) is completely impractical because the exponential exp[t L_FP] cannot be easily calculated. Nevertheless, the FDT provides an efficient way to compute the response operator R(t).

Proposition 6.1 (FDT) Solving the response operator R(t) in (6.13) can be reduced to computing a statistical correlation [175],

R(t) = ⟨A[u(t)] B[u(0)]⟩,  with  B(u) = − div_u (w p_eq) / p_eq.  (6.14)

Calculating the correlation functions in (6.14) is much cheaper and more practical than directly computing the linear response operator (6.13). Importantly, R(t) in (6.14) is computed through a correlation function evaluated only at the unperturbed state. Note from the perturbation function in (6.7) that if w has no dependence on u, then δF(t) naturally represents a forcing perturbation. If w(u) is a linear function of u, then δF(u, t) represents a perturbation in the dissipation. It is also clear that if the functional A(u) is given by A(u) = u, then the computed response is for the mean. Likewise, A(u) = (u − ū)^2 is used for computing the response in the variance. As a final remark, despite imposing the small-perturbation condition to make the evolution equation of the statistics in (6.10) valid, the FDT in (6.12) and (6.14) does not require any linearization of the underlying dynamics in (6.5). Therefore, the result reflects the nonlinear features in the underlying turbulent systems.

6.4.2 Approximate FDT Methods

One major remaining issue in applying the FDT (6.14) is that a simple closed form of the equilibrium distribution p_eq(u) may not be available in many applications. Thus, developing suitable approximate methods is essential for computing the FDT. For a rich class of nonlinear systems, the so-called conditional Gaussian nonlinear systems (CGNS), p_eq(u) can be effectively expressed by a Gaussian mixture, which significantly facilitates the computation of the FDT. See Chap. 8 for more details. Note that, even if the true system does not belong to the CGNS family, a data-driven approach can always be applied to find an approximate system that fits into the CGNS framework and shares almost the same p_eq(u) as the truth. Thus, the Gaussian mixture can help compute the FDT in many cases. In addition to fully approximating the original non-Gaussian PDF, an even simpler but effective approximation is called the quasi-Gaussian (qG) FDT. The qG FDT exploits the


approximate equilibrium measure

p_eq^G = C_N exp[ −(1/2) (u − ū)* Σ^{−1} (u − ū) ],  (6.15)

where the mean u¯ and covariance matrix  match those in the equilibrium peq and C N is a normalized constant. One then calculates B G (u) = −

G) divu (w peq G peq

(6.16)

and replaces B(u) by B^G(u) in the qG FDT. The correlation in (6.14) with this approximation is calculated by integrating the original system in (6.5) over a long trajectory, or over an ensemble of trajectories covering the attractor for shorter times, assuming mixing and ergodicity for (6.14). For the special case of changes in external forcing, w(u)_i = e_i, 1 ≤ i ≤ N, where e_i is the unit vector with the ith entry being 1 and the others being 0, the response operator for the qG FDT is given by the matrix

R^G(t) = \left\langle A(u(t)) \otimes \Sigma^{-1}(u - \bar{u})(0) \right\rangle.   (6.17)
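The ideas above can be illustrated with a minimal sketch that is not taken from the book: for a scalar OU process, the equilibrium PDF is exactly Gaussian, so the qG approximation (6.15) is exact, and with A(u) = u the response operator (6.17) reduces to the lag-correlation R(t) = e^{-γt}, which can be estimated by time averaging a single long unperturbed trajectory. The parameter values below are arbitrary illustrative choices.

```python
import numpy as np

# Scalar OU process du = -gamma*(u - u_bar) dt + sigma dW; its equilibrium
# PDF is Gaussian, so the qG FDT is exact and, with A(u) = u, the mean
# response operator (6.17) is R(t) = exp(-gamma*t).
rng = np.random.default_rng(0)
gamma, u_bar, sigma, dt = 1.0, 2.0, 0.5, 0.01
n_steps = 500_000

u = np.empty(n_steps)
u[0] = u_bar
xi = rng.standard_normal(n_steps - 1) * np.sqrt(dt)
for n in range(n_steps - 1):          # Euler-Maruyama, unperturbed dynamics
    u[n + 1] = u[n] - gamma * (u[n] - u_bar) * dt + sigma * xi[n]

# R^G(t) = < u(t) * C^{-1} (u(0) - u_bar) >, estimated by time averaging;
# the sample mean and variance replace the exact equilibrium statistics.
uc = u - u.mean()
C = uc.var()
lags = np.arange(0, 300, 50)          # lags measured in time steps
R_est = np.array([np.mean(uc[l:] * uc[:n_steps - l]) / C for l in lags])
R_exact = np.exp(-gamma * lags * dt)
```

The key practical point is that only unperturbed statistics enter the estimate; no perturbed runs of the system are required.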

Other FDT techniques that have skillful performance in dealing with complex nonlinear dynamical systems include blended response algorithms [2] and the kicked FDT [35]. FDT has been demonstrated to have high skill for the mean and variance response in the upper troposphere to changes in tropical heating in a prototype atmospheric general circulation model (GCM), and it can be utilized for complex forcing and inverse modeling issues of interest in climate change science [118]. Note that GCMs usually have a vast number of state variables, and applying FDT to the entire phase space is challenging due to the limitations in calculating the covariance matrix. Practical strategies involve computing the response operator on a reduced subspace [180].

6.4.3 A Nonlinear Example

In the following, a two-dimensional nonlinear system is utilized to illustrate the skill of the FDT in computing the response. The study also aims at showing the deficiency of linear reduced models in capturing the response beyond the first-order statistics. The model adopted here is named the stochastic parameterized extended Kalman filter (SPEKF) model [102]. It reads:

du = (-\gamma u + f_u)\,dt + \sigma_u\,dW_u,
d\gamma = -d_\gamma(\gamma - \hat{\gamma})\,dt + \sigma_\gamma\,dW_\gamma,   (6.18)


which belongs to the model family (4.40) and has wide applications in data assimilation and prediction for turbulent signals. The detailed discussions of this model can be found in Sect. 7.5. In (6.18), γ is driven by a linear Gaussian process, and it appears as stochastic damping in the equation of u when γ changes its sign as time evolves. Intermittency and non-Gaussian statistics of u can thus be generated. Despite the nonlinearity, the time evolution of the moments can be written down using closed analytic formulae [102]. Such a feature allows us to have an exact reference solution in subsequent studies. The parameters adopted here are:

d_\gamma = 1.3, \quad \sigma_\gamma = 1, \quad \hat{\gamma} = 1, \quad f_u = 1, \quad \sigma_u = 0.8.   (6.19)

Panels (a)–(d) of Fig. 6.3 show the time series and the PDFs of u and γ, respectively. It is seen that γ alternates between positive and negative values. Once γ becomes negative, such an anti-damping triggers large bursts in the time series of u. With the constant forcing f_u = 1 in (6.19), the time evolutions of the mean ⟨u⟩ and the variance Var(u) of u are shown in Panels (e)–(f). For the simplicity of the discussion below, the initial time here is set to be t_0 = -12.5, at which u(t_0) = 2. It is clear that after t reaches around t = -6, the model (6.18) arrives at the statistical equilibrium. Starting from t = 0, a forcing perturbation δf_u(t) is added to the model in (6.18), which is a ramp-type perturbation of the following form:

\delta f_u(t) = A_0\,\frac{\tanh(a(t - t_c)) + \tanh(a t_c)}{1 + \tanh(a t_c)},   (6.20)

with

A_0 = 0.1, \quad a = 1, \quad t_c = 2.   (6.21)
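A minimal Monte Carlo sketch of the setup above is given below. It simulates (6.18) with the parameters in (6.19) and the ramp (6.20)–(6.21) via Euler–Maruyama; the time step and ensemble size are illustrative choices, not the book's settings, and the exact moment equations of [102] are not reproduced here.

```python
import numpy as np

# Ensemble simulation of the SPEKF model (6.18) with parameters (6.19)
# and the ramp perturbation (6.20)-(6.21) switched on at t = 0.
rng = np.random.default_rng(1)
d_gamma, sigma_gamma, gamma_hat, f_u, sigma_u = 1.3, 1.0, 1.0, 1.0, 0.8
A0, a, tc = 0.1, 1.0, 2.0

def delta_f(t):
    """Ramp perturbation (6.20); zero before the onset time t = 0."""
    if t <= 0.0:
        return 0.0
    return A0 * (np.tanh(a * (t - tc)) + np.tanh(a * tc)) / (1.0 + np.tanh(a * tc))

dt, t0, t1, n_ens = 0.005, -12.5, 10.0, 10_000
n_steps = int(round((t1 - t0) / dt))
u = np.full(n_ens, 2.0)               # u(t0) = 2
g = np.full(n_ens, gamma_hat)
mean_u = np.empty(n_steps)            # ensemble estimate of <u>(t)
for n in range(n_steps):
    t = t0 + n * dt
    u = u + (-g * u + f_u + delta_f(t)) * dt \
          + sigma_u * np.sqrt(dt) * rng.standard_normal(n_ens)
    g = g - d_gamma * (g - gamma_hat) * dt \
          + sigma_gamma * np.sqrt(dt) * rng.standard_normal(n_ens)
    mean_u[n] = u.mean()
```

Negative excursions of g act as anti-damping on u and produce the intermittent bursts discussed above, while the ensemble mean shows the response to the ramp after t = 0.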

The profile of δf_u(t) is shown in Panel (g) of Fig. 6.3. The forcing perturbation δf_u(t) starts from 0 at time t = 0, and it reaches 0.1 at roughly t = 5. After t = 5, δf_u(t) stays at δf_u(t) = 0.1. Due to this forcing perturbation, ⟨u⟩ and Var(u) also have corresponding changes, which are shown in the blue curves in Panels (e) and (f). Note that these responses are computed using the analytic formulae for the time evolutions of the statistics [102], which are exact. They are known as idealized responses. In most realistic scenarios, the true dynamics are unknown or are too expensive to run. Therefore, simplified or reduced models are widely used in computing the responses. One type of simple model that is widely adopted is the linear model,

du^M = (-d_u^M u^M + f_u^M)\,dt + \sigma_u^M\,dW.   (6.22)

Note that adopting such a linear model to compute the responses shares the same philosophy as one of the ad hoc FDT procedures [205], where linear regression models are used for the variables of interest. The three parameters in (6.22) are calibrated by matching the equilibrium mean, equilibrium variance, and decorrelation time with those of u in the perfect model (6.18) (see Sect. 7.2 and consider the special case with the real variables there for details). With such calibrations, the linear model (6.22) automatically fits the unperturbed


Fig. 6.3 Comparison of the model response. Panels a–b: a realization of the SPEKF model (6.18) with the parameters in (6.19). Panels c–d: the associated PDFs. Panels e–f: time evolutions of the mean and variance. The blue curve is the true response computed from solving the exact moment equations of (6.18), the red curve is the qG FDT, and the green curve is the response from the reduced-order linear model (6.22). The black curve is the response computed by applying a direct Monte Carlo simulation to (6.18) with 100,000 samples, which differs from the blue curve only by the random sampling error. Panel g: the forcing and its perturbation. For convenience of the presentation, the time t = 0 is the starting time of imposing the perturbation, while t = -12.5 is the initial time of the simulation

mean and variance at t = 0. Now the same forcing perturbation is added to the linear model. Since the statistics of the linear model are Gaussian, the formulae in (6.15)–(6.17) become rigorous. The green curves in Fig. 6.3 show the responses using such a linear model. It is clear that the response in the mean using the linear model captures the trend of the truth, but the amplitude is severely overestimated. On the other hand, the response in the variance using the linear model is identically zero, since the forcing does not enter the governing equation of the variance (see (4.6) in Sect. 4.1.1). In contrast, the qG FDT based on the perfect model (red curves) captures the response in the mean. The qG FDT also yields a response in the variance, and the recovery of the time evolution of the variance response is reasonably accurate. It is worthwhile to re-emphasize that, although the Gaussian approximation (6.15) is adopted in computing the equilibrium PDF of the unperturbed system, the underlying nonlinear dynamics are utilized in formulating the FDT. Thus, the nonlinear interactions are included in the FDT, and the variance response is captured overall. Finally, the black curves in Panels (e)–(f) show the responses of these moments computed by running a Monte Carlo simulation of the perfect system with 100,000 samples. Such a numerical simulation provides a reasonably accurate approximation to the truth (blue) in terms of the mean response. However, even with such a large ensemble size, the Monte Carlo


simulation still contains a non-negligible error in predicting the variance response. This is due to the insufficient sampling of the rare events, to which the resulting numerical approximation of the statistics is very sensitive. Note that the model here is only two-dimensional. In practice, the dimension of the system is often much higher, and the undersampling issue is inevitable when computing the response in the variance or even higher-order moments via direct Monte Carlo simulation. In comparison, the qG FDT does not suffer from such an issue and is computationally more efficient.
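The behavior of the linear model's mean response can be checked with a short sketch: for (6.22), the mean perturbation obeys the ODE d(δū)/dt = -d δū + δf(t), so it relaxes to δf(∞)/d while the variance response is identically zero. The damping value d = 1 below is a hypothetical stand-in for the calibrated d_u^M, which is not stated in the text.

```python
import numpy as np

# Mean response of the linear model (6.22) to the ramp (6.20):
# d(delta_mean)/dt = -d*delta_mean + delta_f(t), so delta_mean -> A0/d.
d, A0, a, tc = 1.0, 0.1, 1.0, 2.0    # d = 1 is an assumed (hypothetical) damping

def delta_f(t):
    return A0 * (np.tanh(a * (t - tc)) + np.tanh(a * tc)) / (1.0 + np.tanh(a * tc))

dt, T = 0.001, 30.0
t = np.arange(0.0, T, dt)
delta_mean = np.empty_like(t)
delta_mean[0] = 0.0
for n in range(len(t) - 1):          # forward Euler for the mean equation
    delta_mean[n + 1] = delta_mean[n] + dt * (-d * delta_mean[n] + delta_f(t[n]))
```

With d = 1 the new equilibrium offset is exactly A0/d = 0.1; a miscalibrated d directly rescales the amplitude, which is the kind of overestimation seen in the green curve.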

6.5 Finding the Most Sensitive Change Directions via Information Theory

An important topic in studying complex dynamical systems is finding the perturbation direction that leads to the most sensitive change in the system. Again, due to the chaotic or turbulent nature, the change of the state is in the statistical sense.

6.5.1 The Mathematical Framework Using Fisher Information

To quantify the most sensitive directions, consider a family of parameters λ ∈ R^p with π_λ(u) being the PDF of the state variable u as a function of λ. Here λ = 0 corresponds to the unperturbed model π(u). Note that λ can consist of external parameters, such as changes in forcing, or parameters of internal variability, such as a change in dissipation. In light of information theory, the most sensitive perturbation direction is the one with the largest uncertainty relative to the unperturbed state,

\mathcal{P}(\pi_{\lambda^*}, \pi) = \max_{\lambda \in \mathbb{R}^p} \mathcal{P}(\pi_\lambda, \pi),   (6.23)

where \mathcal{P} is the relative entropy. Assume that π_λ is differentiable with respect to the parameter λ. Since π_λ|_{λ=0} = π, for small values of λ, (6.23) yields

\mathcal{P}(\pi_\lambda, \pi) = \lambda \cdot \mathcal{I}(\pi)\lambda + O(|\lambda|^3),   (6.24)

where λ · \mathcal{I}(π)λ is the quadratic form in λ given by the Fisher information [76, 271]

\lambda \cdot \mathcal{I}(\pi)\lambda = \int \frac{(\lambda \cdot \nabla_\lambda \pi)^2}{\pi}\,du,   (6.25)

and the elements of the matrix of this quadratic form are given by

\mathcal{I}_{kj}(\pi) = \int \frac{1}{\pi}\,\frac{\partial \pi}{\partial \lambda_k}\frac{\partial \pi}{\partial \lambda_j}\,du.   (6.26)


Detailed derivations of (6.24)–(6.26) can be found in [177]. Note that the gradients are calculated at the unperturbed state λ = 0. Therefore, if both the unperturbed state π and the gradients λ · ∇_λ π are known, then the most sensitive perturbation occurs along the unit direction e_π^* ∈ R^p associated with the largest eigenvalue λ_π^* of the quadratic form in (6.25).

6.5.2 Examples

Consider the following one-dimensional linear model,

du = (-au + f)\,dt + \sigma\,dW,   (6.27)

the equilibrium PDF of which is the Gaussian distribution \pi(u) = C_N \exp\left(-\frac{(u - \bar{u})^2}{2C}\right), with mean and variance being \bar{u} = f/a and C = \sigma^2/(2a), respectively. The two parameters \lambda = (f, a)^T \in \mathbb{R}^2 for external forcing and dissipation are the natural parameters to be varied in this model. Therefore, the corresponding \mathcal{I}(\pi) in (6.25) is a 2 × 2 matrix with entries \mathcal{I}_{ij}, i, j = 1, 2. It is straightforward to compute the first-order derivatives of π with respect to f and a,

\frac{\partial \pi}{\partial f} = \frac{u - \bar{u}}{aC}\,\pi \quad \text{and} \quad \frac{\partial \pi}{\partial a} = \frac{\sigma^2}{4a^2 C}\,\pi - \frac{f(u - \bar{u})}{a^2 C}\,\pi - \frac{\sigma^2 (u - \bar{u})^2}{4a^2 C^2}\,\pi,   (6.28)

with which the four elements of \mathcal{I} have the following explicit expressions,

\mathcal{I}_{11} = \frac{1}{Ca^2}, \quad \mathcal{I}_{12} = \mathcal{I}_{21} = -\frac{f}{a^3 C}, \quad \mathcal{I}_{22} = \frac{f^2}{Ca^4} + \frac{\sigma^4}{8C^2 a^4}.   (6.29)

The following two groups of parameters are used:

(a): a = 1, f = 1, σ = 0.5;   (b): a = 1, f = 1, σ = 2.5.   (6.30)

Since \mathcal{I} is a 2 × 2 matrix, there are only two eigenmodes. The eigenvector associated with the larger eigenvalue corresponds to the most sensitive direction with respect to the perturbation (δf, δa)^T. Plugging the model parameters (6.30) into the matrix \mathcal{I} in (6.29) leads to the most sensitive directions:

(a): e_\pi^* = \begin{pmatrix} -0.6960 \\ 0.7181 \end{pmatrix}, \quad (b): e_\pi^* = \begin{pmatrix} -0.4384 \\ 0.8988 \end{pmatrix}.   (6.31)

Note that for this simple problem, the solution in (6.31) can also be obtained by directly computing the response from the perturbations. In fact, given a small perturbation (δf, δa)^T to (f, a)^T, the corresponding perturbed mean and variance are:

\bar{u}^\delta = \frac{f + \delta f}{a + \delta a}, \quad C^\delta = \frac{\sigma^2}{2(a + \delta a)}.   (6.32)

Since both the unperturbed and perturbed PDFs are Gaussian, the explicit formula of the relative entropy in (2.15) can be utilized to compute the uncertainty \mathcal{P}(\pi, \pi^\delta) due to the perturbation. Recall that in (2.15) the total uncertainty is decomposed into signal and dispersion parts, both of which can be computed explicitly:

\text{Signal} = \frac{1}{2}\left(\frac{f}{a} - \frac{f + \delta f}{a + \delta a}\right)^2 \left(\frac{\sigma^2}{2a}\right)^{-1} = \frac{a}{\sigma^2}\,\frac{(f\delta a - a\delta f)^2}{a^2(a + \delta a)^2} = \frac{(f\delta a - a\delta f)^2}{a^3\sigma^2} + o(\delta a^3) + o(\delta a^2 \delta f) + o(\delta a\, \delta f^2),

\text{Dispersion} = \frac{1}{2}\left(\frac{a + \delta a}{a} - 1\right) - \frac{1}{2}\ln\frac{a + \delta a}{a} = \frac{1}{2}\frac{\delta a}{a} - \frac{1}{2}\ln\left(1 + \frac{\delta a}{a}\right) = \frac{1}{4}\left(\frac{\delta a}{a}\right)^2 + o\left(\left(\frac{\delta a}{a}\right)^3\right),   (6.33)

where o(·) contains the higher-order terms. Note that the dispersion part depends only on the perturbation in the dissipation δa, since f has no effect on the variance. In addition, δa and δf should have opposite signs to maximize the relative entropy in the signal part. Figure 6.4 shows the total relative entropy and the associated signal and dispersion components as a function of the perturbations in the two-dimensional parameter space (δf, δa)^T using the direct formula (6.33). The numerical simulation here assumes \sqrt{\delta f^2 + \delta a^2} \leq 0.05 to guarantee that the perturbation is small enough. In both cases, the most sensitive direction with respect to only the dispersion part lies in the direction (δa, δf)^T = (1, 0)^T, due to the fact that δf has no effect on the dispersion part. In the signal part, the most sensitive direction satisfies aδf = -fδa. The overall most sensitive direction depends naturally on the

Fig. 6.4 Uncertainty dependence on the perturbation in the two-dimensional parameter space (δa, δf)^T in the linear model (6.27). The total uncertainty is decomposed into signal and dispersion parts according to (2.15). The black dashed line shows the direction of the maximum uncertainty, namely the most sensitive direction of perturbation. Note that the same color bar is utilized for the total uncertainty, the signal part, and the dispersion part


weights of signal and dispersion parts. When σ becomes larger, the weight on the signal part reduces since the signal part is proportional to the inverse of the model variance. The most sensitive directions as indicated by the black dashed lines in Fig. 6.4 are consistent with the theoretical prediction from (6.31) using the Fisher information (6.23)–(6.26).
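The numbers in (6.31) and the leading-order expansions in (6.33) can both be checked with a short sketch. The code below builds the Fisher matrix from (6.29), extracts the eigenvector of its largest eigenvalue, and compares the exact Gaussian signal/dispersion terms with their leading-order approximations for a small perturbation; the perturbation size (δf, δa) = (-0.01, 0.01) is an arbitrary illustrative choice.

```python
import numpy as np

# Most sensitive direction from the Fisher matrix (6.29).
def most_sensitive_direction(a, f, sigma):
    C = sigma**2 / (2 * a)                          # equilibrium variance
    I = np.array([[1 / (C * a**2), -f / (a**3 * C)],
                  [-f / (a**3 * C),
                   f**2 / (C * a**4) + sigma**4 / (8 * C**2 * a**4)]])
    vals, vecs = np.linalg.eigh(I)                  # symmetric eigenproblem
    return vecs[:, np.argmax(vals)]                 # direction of largest eigenvalue

e_a = most_sensitive_direction(1.0, 1.0, 0.5)       # case (a): ~(-0.6960, 0.7181)
e_b = most_sensitive_direction(1.0, 1.0, 2.5)       # case (b): ~(-0.4384, 0.8988)

# Exact vs leading-order signal/dispersion terms of (6.33).
a, f, sigma, df, da = 1.0, 1.0, 0.5, -0.01, 0.01
C = sigma**2 / (2 * a)
signal_exact = 0.5 * (f / a - (f + df) / (a + da))**2 / C
disp_exact = 0.5 * ((a + da) / a - 1) - 0.5 * np.log((a + da) / a)
signal_lead = (f * da - a * df)**2 / (a**3 * sigma**2)
disp_lead = 0.25 * (da / a)**2
```

The eigenvectors agree with (6.31) up to an overall sign (eigenvectors are defined only up to sign), and the leading-order terms match the exact values to within a few percent at this perturbation size.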

6.5.3 Practical Strategies Utilizing FDT

The simple examples shown above assume perfect knowledge of the present model state given by the unperturbed equilibrium PDF π, which can be solved analytically. However, it is often quite difficult in practice to know the exact expression of these PDFs, or it is computationally unaffordable to calculate the gradient in high dimensions. Therefore, many approximations are combined with the information-theoretic framework developed above. One practical strategy is to adopt approximate PDFs based on a few measurements, such as the mean and covariance. It is also common to use reduced models from a practical point of view. Note that, in most practical situations, closed analytic formulae for computing the gradient ∇_λ π in (6.25) are not available. Nevertheless, the FDT can be utilized to calculate the gradient of the unperturbed PDF, which is much more efficient and accurate than direct Monte Carlo simulation. In the situation with an approximate PDF such as a Gaussian, the qG FDT naturally fits into the calculation of the Fisher information. Specifically, the FDT can be utilized to compute the response of the mean and that of the covariance, based on which the change of the PDF can be easily solved.

7 Data-Driven Low-Order Stochastic Models

7.1 Motivations

Data-driven low-order stochastic models are the basis for characterizing many complex systems in reality. There are at least two scenarios in which these models are involved. First, advanced data analysis methods are often applied to high-dimensional massive observational data sets to obtain a certain spectral decomposition of the data. The resulting leading few spectral modes are usually the most energetic ones that represent the key salient features of the underlying complex natural phenomenon. In practice, the associated low-dimensional time series of these modes, known as the indices, are often regarded as simple and effective indicators to characterize the large-scale features of the phenomenon. Low-order stochastic models are thus natural tools to describe and predict these indices. The following sections will present concrete examples of modeling and predicting such indices. Second, even if the governing equations of the spatiotemporal evolution of many complex turbulent systems are known and are given by partial differential equations (PDEs), spectral methods with appropriate decompositions of these PDEs remain one of the main numerical schemes for finding the solutions. Suitable low-order stochastic models are useful in characterizing the time evolution of each spectral mode in a much more effective way than the equation obtained directly from the spectral decomposition, which often involves many complicated nonlinear terms. For the simplicity of the statement, let us consider a scalar field u(x, t) with the Fourier basis functions exp(ikx), where k is the wavenumber and x is the spatial coordinate. Denote by û_k(t) the time series of the spectral mode associated with wavenumber k. Many governing equations in fluids and geophysics involve quadratic nonlinearities, which come from advection and other nonlinear interactions between different scales. In such a case, the resulting equation of û_k(t) typically has the following form,

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_7


d\hat{u}_k(t) = (-d_k + i\omega_k)\,\hat{u}_k(t)\,dt + \hat{f}_k(t)\,dt + \sum_{-K \leq m \leq K} c_{m,k}\,\hat{u}_m \hat{u}_{k-m}\,dt,   (7.1)

where K ≫ 1 represents the resolution of the model, i is the imaginary unit, d_k and ω_k are both real, while \hat{f}_k(t) is complex. The first term on the right-hand side of (7.1) is linear, representing the effect of damping/dissipation -d_k and phase ω_k. Note that d_k and ω_k can both be functions of k. For example, -d_k often has the form -d_k = -c_k - κk², where -c_k is the linear drag while -κk² is the diffusion resulting from the spectral decomposition of the diffusion term Δu in the PDE. For a system that is overall statistically stable, the damping/dissipation leads to -d_k < 0, while d_k can also be a function of time to trigger intermittencies (e.g., the model in (1.27)). The phase ω_k can depend on k as well, which stands for a specific dispersion relationship. The second term on the right-hand side of (7.1) is a deterministic forcing obtained by applying the Fourier decomposition to the forcing of the original PDE. The last term is a summation of all the quadratic nonlinear interactions with a nonzero projection onto the wavenumber k. Typically, a large number of such nonlinear terms appear in the summation, which is the main computational cost in solving (7.1). In practice, a large portion of the system's energy is often explained by only a small fraction of the total Fourier modes. Yet, the entire set of Fourier modes has to be solved together in a direct simulation to guarantee numerical stability. Therefore, if cheap stochastic models can be used to effectively approximate the nonlinear terms in (7.1), or even more complicated nonlinearities in other systems, for the time evolution of the leading few Fourier coefficients, then the total computational cost will be significantly reduced. Note that the reduction here is twofold. First, in the governing equation of each leading Fourier mode, the heavy computational burden of calculating the summation of a large number of nonlinear terms is replaced by computing a single stochastic term. Second, only the leading few Fourier modes need to be retained in such an approximate stochastic system, saving a large amount of computational storage. In addition, a larger numerical integration time step can often be utilized in such a situation as well, since the stiffness of the system usually comes from the governing equations of the small-scale modes. It is worthwhile to highlight that a probabilistic characterization of the system (i.e., via the PDF or other statistics) is the main focus of many practical tasks, such as data assimilation and ensemble prediction of turbulent systems. Therefore, simple and effective stochasticity is an extremely useful surrogate for complicated nonlinearities to reproduce the forecast statistics that account for the uncertainty. As a final remark, simple low-order models also provide cheap and powerful ways to validate various hypotheses in developing more complicated models.

7.2 Complex Ornstein–Uhlenbeck (OU) Process

The complex Ornstein–Uhlenbeck (OU) process is one of the simplest low-order stochastic models with wide applications. It is a one-dimensional complex stochastic process and reads [100]:

du(t) = (-\gamma + i\omega)u(t)\,dt + f(t)\,dt + \sigma\,dW(t),   (7.2)

where the constants γ > 0, ω, and σ > 0 are the damping, the phase, and the noise amplitude, respectively. The deterministic forcing f(t) can either be a constant or a time-dependent function, and W(t) is the Wiener process.

7.2.1 Calibration of the OU Process

One advantage of the OU process is that, given an observed time series, the model parameters can be systematically determined in a simple fashion. Assume f(t) ≡ f is a constant. Then the four parameters γ, ω, f, and σ in (7.2) can be determined by matching four statistics computed from the observational time series. These statistics are the mean m, the variance E, and the real part T and imaginary part θ of the decorrelation time. The latter two can be calculated as follows:

\int_0^\infty R(\tau)\,d\tau = T - i\theta,   (7.3)

where R is the temporal ACF defined in (4.9). Then the four parameters are uniquely determined by these four statistical quantities via the following formulae [183]:

\gamma = \frac{T}{T^2 + \theta^2}, \quad \omega = \frac{\theta}{T^2 + \theta^2}, \quad f = \frac{m(T - i\theta)}{T^2 + \theta^2}, \quad \text{and} \quad \sigma = \sqrt{\frac{2ET}{T^2 + \theta^2}}.   (7.4)

The mean and the variance of a time series are straightforward to compute. On the other hand, the calculation of the decorrelation time in (7.3) requires numerical integration of the ACF. However, due to the finite length of the observed time series, the resulting ACF curve computed numerically never actually decays to zero; instead, the curve contains many small wiggles at the tail. This often leads to certain biases in calculating the decorrelation time if the formula (7.3) is adopted directly. One simple but practically useful trick to compute the decorrelation time numerically is as follows. Note that the cross-correlation function between the real and imaginary parts of the complex OU process, a linear Gaussian SDE, is an exponentially decaying oscillatory function of the form:

XC(t) = \sin(\omega t)\,e^{-\gamma t},   (7.5)

which can be easily derived by plugging the complex OU process (7.2) into the definition of the ACF (4.9). Thus, instead of taking the numerical integration of R(t), a simple curve-fitting method (for example, via least squares) that matches the right-hand side of (7.5) to the numerically computed cross-correlation function can be applied to determine γ and ω, and hence the two parameters T and θ. This largely suppresses the contribution of the wiggles in the tail when numerically calculating the decorrelation time.
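The calibration formulas (7.4) can be verified with a purely analytic roundtrip sketch: starting from assumed OU parameters, compute the equilibrium statistics (m, E, T, θ) in closed form and confirm that (7.4) recovers the parameters. The parameter values are arbitrary, and the ACF convention used here is the one for which the integral of the ACF equals 1/(γ + iω), consistent with (7.4).

```python
import numpy as np

# Roundtrip check of the OU calibration formulas (7.4).
gamma, omega, f, sigma = 0.8, 2.0, 1.0 + 0.5j, 0.4   # assumed "true" parameters

m = f / (gamma - 1j * omega)          # equilibrium mean of (7.2)
E = sigma**2 / (2 * gamma)            # equilibrium variance
Tth = 1.0 / (gamma + 1j * omega)      # integral of the ACF: T - i*theta
T, theta = Tth.real, -Tth.imag

# Calibration formulas (7.4)
den = T**2 + theta**2
gamma_c = T / den
omega_c = theta / den
f_c = m * (T - 1j * theta) / den
sigma_c = np.sqrt(2 * E * T / den)
```

On real data, m, E, T, and θ would instead be estimated from the time series (with T and θ obtained via the curve fit to (7.5)), and the same four formulas would then return the calibrated model.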

7.2.2 Compensating the Effect of Complicated Nonlinearity by Simple Stochastic Noise

The complex OU process for u in (7.2) is one of the simplest surrogate models of the nonlinear governing equation of û_k in (7.1) for carrying out statistical forecasts. Since many nonlinear terms in (7.1) involve high frequencies (e.g., û_m with |m| ≫ |k|), the time series constructed by these terms is fully chaotic and has very short temporal correlations. This provides a justification for replacing the complicated nonlinearity in (7.1) with effective stochastic noise (and additional damping) in (7.2). See examples in [122, 182]. Note that such an underlying mechanism is consistent with the general strategy of stochastic mode reduction [185, 186], in which the equations of motion for the unresolved fast modes are modified by representing the nonlinear self-interaction terms between unresolved modes by damping and stochastic terms.

7.2.3 Application: Reproducing the Statistics of the Two-Layer Quasi-Geostrophic Turbulence

The two-layer quasi-geostrophic (QG) model is a widely used set of PDE systems describing turbulent fields in the atmosphere and ocean [223, 257]. Depending on the application, different versions of the two-layer QG model have slight differences. The one considered here, which has been utilized to study the ocean turbulence underneath sea ice floes, has the following form [77]:

\frac{\partial q_1}{\partial t} + \bar{u}_1 \frac{\partial q_1}{\partial x} + \frac{\partial \bar{q}_1}{\partial y}\frac{\partial \psi_1}{\partial x} + J(\psi_1, q_1) = \mathrm{ssd},
\frac{\partial q_2}{\partial t} + \bar{u}_2 \frac{\partial q_2}{\partial x} + \frac{\partial \bar{q}_2}{\partial y}\frac{\partial \psi_2}{\partial x} + J(\psi_2, q_2) = -R_2 \nabla^2 \psi_2 + \mathrm{ssd},   (7.6)

where q_1 and q_2 are the potential vorticities (PVs) in the upper and lower layers of the ocean, respectively, and ψ_1 and ψ_2 are the associated stream functions. In (7.6), J(A, B) = (∂A/∂x)(∂B/∂y) - (∂A/∂y)(∂B/∂x) is the advection (Jacobian), "ssd" represents small-scale dissipation, \bar{u}_1 and \bar{u}_2 are the given background ocean velocities, \bar{q}_1 and \bar{q}_2 are the given background PVs, and R_2 is the known decay rate of the barotropic mode. Despite the many state variables, they are related to each other. The stream functions and the velocity fields are related by (u_1, v_1) = (-∂ψ_1/∂y, ∂ψ_1/∂x) and (u_2, v_2) = (-∂ψ_2/∂y, ∂ψ_2/∂x), while the stream functions and the PVs are connected via

q_1 = \nabla^2 \psi_1 + \frac{\psi_2 - \psi_1}{(1 + \delta)L_d^2} \quad \text{and} \quad q_2 = \nabla^2 \psi_2 + \frac{\delta(\psi_1 - \psi_2)}{(1 + \delta)L_d^2},

where δ = H_1/H_2, H_i is the depth of each layer, and L_d is the deformation radius. In short, the two stream functions ψ_1 and ψ_2 can be regarded as the primary state variables, while the other variables can be recovered from the stream functions. Doubly periodic boundary conditions are utilized in the two-dimensional physical space. Thus, ψ_1 := ψ_1(x, y, t) and ψ_2 := ψ_2(x, y, t). Denote by k = (k_1, k_2) the Fourier wavenumber of each spectral mode. The goal is to implement a statistical forecast of the ocean's upper layer, which directly interacts with sea ice floes. A spatial resolution of at least 128 × 128 is needed to resolve the so-called mesoscale eddies. Although only ψ_1 is required for the forecast of the upper-layer ocean, the entire coupled system (7.6), consisting of the state variables in both layers, has to be run simultaneously. This means the total number of spectral modes in the simulation is more than 30,000, which is computationally expensive. Panel (a) of Fig. 7.1 shows a snapshot of the model solution. Panel (c) shows the energy spectrum, which indicates that only a small portion of the energy resides in the modes with |k| > 12. Panel (b) shows the truncated solution based on the modes with |k| ≤ 11, which has a very similar pattern

Fig. 7.1 Comparison of the simulations from the two-layer QG model (7.6) and the surrogate model based on the complex OU processes. Panel a: a snapshot of the simulation from the two-layer QG model. Panel b: a truncated solution containing the modes with |k| ≤ 11. Panel c: the energy spectrum of the two-layer QG model as a function of |k| (each point in the curve sums up the energies of all the modes with the same |k|), where the subpanel inside displays the energy of each mode. The black dashed curve indicates the value |k| = 11. Panel d: the snapshot of the surrogate model obtained by multiplying the Fourier basis functions with the time series from a set of OU processes corresponding to the modes with |k| ≤ 11. Panel e: mode (5, 5) from the two-layer QG model. Panel f: an independent simulation of mode (5, 5) from the OU process. Note that there is no point-wise correspondence between the curves in Panels e–f, as the random noise in the OU process does not recover the nonlinearity in the point-wise sense; the same holds for the snapshots in Panels b and d. The goal is to carry out ensemble simulations using the surrogate model. Thus, the skill of the OU processes in reproducing the forecast statistics of the two-layer QG model is the crucial feature to illustrate


as the truth (Panel (a)). Note that, despite most of the energy being concentrated in the leading modes, if a coarser resolution is adopted in solving such a PDE system, the solution becomes wildly inaccurate and may suffer from blowup. Due to the high computational cost, running an ensemble of the two-layer QG model for data assimilation and forecasting is impractical. Therefore, a cheap surrogate model is preferred for the ensemble simulations of such a turbulent system. As the statistics of each Fourier mode of the two-layer QG model in this setup are nearly Gaussian, the complex OU process is a natural surrogate model. Comparing Panels (e) and (f), the OU process captures each Fourier mode's overall dynamical and statistical features. Panel (d) of Fig. 7.1 shows a snapshot of the spatial patterns constructed by multiplying the Fourier basis functions with the time series obtained from a set of OU processes corresponding to the modes with |k| ≤ 11. It is qualitatively quite similar to that of the two-layer QG model. Notably, fewer than 200 modes are needed when using the surrogate model based on the OU processes. Such a number is much smaller than that required for running the original two-layer QG system, which significantly reduces the computational cost.
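The surrogate construction in Panel (d) can be sketched as follows. One complex OU process is evolved per retained Fourier mode and the results are superposed on the Fourier basis; all parameters here (damping, phase, spectral decay, grid size) are illustrative placeholders, not values calibrated to the QG model, so the sketch only demonstrates the mechanics of the construction.

```python
import numpy as np

# Build a snapshot of a 2D field from independent complex OU modes with
# |k| <= kmax; all per-mode parameters below are illustrative assumptions.
rng = np.random.default_rng(2)
N, kmax = 64, 11
dt, n_steps = 0.01, 1000

ks = [(kx, ky) for kx in range(-kmax, kmax + 1)
      for ky in range(-kmax, kmax + 1)
      if 0 < np.hypot(kx, ky) <= kmax]
coeff = np.zeros((N, N), dtype=complex)        # spectral coefficients
for kx, ky in ks:
    kmag = np.hypot(kx, ky)
    gamma = 0.5 + 0.05 * kmag                  # illustrative damping
    omega = 1.0 * kx                           # illustrative phase
    sigma = kmag ** (-1.5)                     # illustrative spectral decay
    dW = (rng.standard_normal(n_steps)
          + 1j * rng.standard_normal(n_steps)) * np.sqrt(dt / 2)
    v = 0.0 + 0.0j
    for d in dW:                               # Euler-Maruyama for (7.2), f = 0
        v = v + (-gamma + 1j * omega) * v * dt + sigma * d
    coeff[kx % N, ky % N] = v

# Enforce conjugate symmetry so that the physical-space field is real.
flip = (-np.arange(N)) % N
sym = 0.5 * (coeff + np.conj(coeff[flip][:, flip]))
field = np.real(np.fft.ifft2(sym))
```

Replacing the illustrative per-mode parameters with values calibrated via (7.4) from the QG time series would give the actual surrogate used in Fig. 7.1.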

7.3 Combining Stochastic Models with Linear Analysis in PDEs to Model Spatially Extended Systems

In many applications, reduced-order models are needed for spatially extended systems, which are often described by a set of (stochastic) PDEs with several state variables. While simple stochastic models, such as the complex OU process, can be utilized to characterize each state variable independently, it is more important to account for the dependence between different state variables in developing reduced-order stochastic models. To present the idea and the methodology, let us consider the linearized shallow water equation as a particular example. The domain here is two-dimensional with coordinates (x, y) and doubly periodic boundary conditions. The linearized shallow water equation in non-dimensional form reads [188, 223, 257]:

\frac{\partial \mathbf{u}}{\partial t} + \varepsilon^{-1}\mathbf{u}^\perp = -\varepsilon^{-1}\nabla\eta, \quad \frac{\partial \eta}{\partial t} + \varepsilon^{-1}\nabla\cdot\mathbf{u} = 0,   (7.7)

where u = (u, v) is the two-dimensional velocity field and η is the geophysical height. In (7.7), ε is a non-dimensional number called the Rossby number, the ratio between the advection and the Coriolis force. A smaller Rossby number means a more rapid rotation of the flow, or faster oscillations in certain associated Fourier coefficients. By defining U = (u, v, η)^T, the model in (7.7) can be written in the concise form

\frac{\partial \mathbf{U}}{\partial t} = A\frac{\partial \mathbf{U}}{\partial x} + B\frac{\partial \mathbf{U}}{\partial y} + C\mathbf{U},   (7.8)

where A, B, and C are all constant matrices. Now assume a plane wave ansatz for the solution,

\mathbf{U}(\mathbf{x}, t) = e^{i(\mathbf{k}\cdot\mathbf{x} - \omega(\mathbf{k})t)}\,\mathbf{r},   (7.9)

where r is an eigenvector that connects the three components in U. Plugging (7.9) into (7.8) leads to a 3 × 3 linear algebraic system. For each fixed k, there are three solutions for the eigenvalue ω(k), each corresponding to a specific eigenvector r. Among the three eigenmodes, one mode has the dispersion relationship ω_{k,B} = 0. It is known as the geophysically balanced (GB) mode and corresponds to the geophysical balance relationship u^⊥ := (-v, u)^T = -∇η. The associated flow field is incompressible, since ∇·u = ∇×u^⊥ = -∇×∇η = 0. The other two modes appear as a pair with the dispersion relationship ω_{k,±} = ±ε^{-1}\sqrt{|k|^2 + 1}; these are the gravity modes, which are compressible. The dispersion relationships imply that the solution of each GB mode in (7.7) is simply a constant (as everything is balanced). In contrast, the solutions of the two gravity modes are linear oscillators. The dispersion relationships from the linearized shallow water equation can be utilized as the basis to formulate stochastic models that mimic the chaotic solutions of the nonlinear shallow water equation. This can be achieved by adopting the exact dispersion relationship ω_k = ω_{k,B} or ω_k = ω_{k,±} and adding extra damping and stochastic noise mimicking the effects of nonlinearity and other unresolved features. Such a procedure leads to the following complex OU processes for the Fourier coefficients of each GB mode \hat{v}_{k,B} and the gravity modes \hat{v}_{k,±} [64],

d\hat{v}_{\mathbf{k},B} = \left[(-d_{\mathbf{k},B} + i\omega_{\mathbf{k},B})\hat{v}_{\mathbf{k},B} + f_{\mathbf{k},B}(t)\right]dt + \sigma_{\mathbf{k},B}\,dW_{\mathbf{k},B}(t),
d\hat{v}_{\mathbf{k},\pm} = \left[(-d_{\mathbf{k},\pm} + i\omega_{\mathbf{k},\pm})\hat{v}_{\mathbf{k},\pm} + f_{\mathbf{k},\pm}(t)\right]dt + \sigma_{\mathbf{k},\pm}\,dW_{\mathbf{k},\pm}(t).   (7.10)
Then the deterministic solution with the plane wave ansatz (7.9) is replaced by the following stochastic result,

U(x, t) = Σ_k e^{ik·x} (v̂_{k,B}(t) r_{k,B} + v̂_{k,+}(t) r_{k,+} + v̂_{k,−}(t) r_{k,−}),   (7.11)

where r_{k,B} and r_{k,±} are the eigenvectors of the GB and gravity modes, respectively. If the statistics of a certain Fourier mode are highly non-Gaussian, then the low-order models to be discussed in the following subsections can be utilized to replace the complex OU process. Note that the GB and gravity modes are characteristic variables of the underlying system, not the actual physical variables. The three physical variables u, v, and η are linear combinations of the three characteristic variables: the GB mode and the two gravity modes.
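As a quick sanity check on the complex OU processes in (7.10), the sketch below simulates a single mode with illustrative parameters d, ω, and σ (not values from the book) using an Euler-Maruyama scheme, and compares the empirical stationary variance and autocorrelation with the analytic values σ²/(2d) and e^{−dτ} cos(ωτ). The complex Wiener increment convention dW = (dW₁ + i dW₂)/√2 is an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
d, om, sig = 0.5, 2.0, 1.0        # damping, frequency, noise level (illustrative values)
dt, n = 0.005, 400_000
v = np.zeros(n, dtype=complex)
for i in range(n - 1):
    # complex Wiener increment, normalized so that E|sig*dW|^2 = sig^2 * dt
    dW = (rng.normal() + 1j * rng.normal()) * np.sqrt(dt / 2)
    v[i + 1] = v[i] + (-d + 1j * om) * v[i] * dt + sig * dW

vs = v[50_000:]                                   # discard the transient
var = np.mean(np.abs(vs - vs.mean()) ** 2)        # theory: sig^2 / (2 d) = 1.0
lag = int(round(np.pi / om / dt))                 # half an oscillation period
x = vs.real - vs.real.mean()
acf = np.mean(x[:-lag] * x[lag:]) / np.mean(x * x)  # theory: -exp(-d*pi/om) ~ -0.46
```

The oscillatory decay of the autocorrelation (negative at half a period) is the signature of the dispersion relationship ω_k encoded in the OU model.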

7 Data-Driven Low-Order Stochastic Models

7.4

Linear Stochastic Model with Multiplicative Noise

The OU process is a suitable data-driven surrogate model if the long-term statistics of the time series are nearly Gaussian and the temporal correlation decays at an exponential rate. However, many time series exhibit strong non-Gaussian characteristics, which are hard to capture with an OU process. According to Chap. 4, either nonlinearity or multiplicative noise (or both) is needed to reproduce features beyond Gaussian statistics. Although a complicated nonlinear model can always be proposed as a starting system, a certain sparse regression method is essential to prevent overfitting and to seek a parsimonious model (see Chap. 9 for more details). Yet, since the model structure relies heavily on the complicated ansatz of the proposed model, it is sometimes difficult to assign a clear physical meaning to the resulting parsimonious system. In addition, since the regression is usually based on the summation of the one-step likelihood from t to the next instant t + Δt, there is no guarantee that the equilibrium non-Gaussian PDF is precisely recovered. Conversely, if a linear model is combined with a suitable multiplicative noise, then such a model remains in a simple format and can still reproduce non-Gaussian features. The general form of such a model (in one dimension) reads:

du(t) = −λ(u(t) − m) dt + σ(u) dW(t),

(7.12)

which will be very useful if a simple and systematic way can be designed to determine the parameters λ, m and the multiplicative noise coefficient σ (u).

7.4.1

Exact Formula for Model Calibration

The damping λ and mean m are determined by the decorrelation time and the mean of the time series (though λ is now only approximately equal to the inverse of the decorrelation time). To determine σ(u), let us start with the Fokker-Planck equation,

∂p(u, t)/∂t = ∂/∂u [λ(u − m) p(u, t)] + (1/2) ∂²/∂u² [σ²(u) p(u, t)].

Because the probability and its derivatives vanish at infinity, the probability density p(u) of the stationary solution satisfies the following equation,

λ(u − m) p(u) + (1/2) d/du [σ²(u) p(u)] = 0.

Integrating once again leads to

σ²(u) = (2/p(u)) {−λ Λ(u)},   Λ(u) = ∫_{−∞}^{u} (y − m) p(y) dy.   (7.13)


Therefore, given a target distribution p(u) and a decorrelation time 1/λ, the multiplicative noise of the corresponding SDE can be easily determined via (7.13). Simple analytic formulae of σ(u) are available for many well-known distributions. See [18] for a set of such SDEs with different target distributions. In particular, if the target distribution is a Gamma distribution, p(u) = u^γ e^{−u/β} / (β^{γ+1} Γ(γ + 1)) with 0 < u < ∞, γ > −1 and β > 0, then the corresponding SDE is

du(t) = −λ(u(t) − βγ − β) dt + √(2λβ u(t)) dW(t).   (7.14)

The Gamma distribution is widely used to characterize variables with extreme events, such as precipitation or convection. See [56] for an application of (7.14) in modeling the convective activity in the equatorial area.
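To illustrate (7.14), the following sketch (with illustrative, not book-specified, values of λ, β and γ) integrates the SDE with an Euler-Maruyama scheme, clipping at zero to keep the square root well defined, and checks that the long-run mean and variance approach the Gamma values β(γ + 1) and β²(γ + 1).

```python
import numpy as np

rng = np.random.default_rng(0)
lam, beta, gam = 2.0, 1.0, 1.0     # illustrative parameters: rate, scale, Gamma exponent
dt, n = 0.005, 400_000
u = np.empty(n)
u[0] = beta * (gam + 1)            # start at the stationary mean
for i in range(n - 1):
    drift = -lam * (u[i] - beta * gam - beta)
    diff = np.sqrt(2 * lam * beta * u[i])   # multiplicative noise of (7.14)
    u[i + 1] = max(u[i] + drift * dt + diff * np.sqrt(dt) * rng.normal(), 0.0)

us = u[50_000:]                    # discard the transient
mean, var = us.mean(), us.var()    # theory: beta*(gam+1) = 2 and beta^2*(gam+1) = 2
```

Note that (7.14) is a square-root (CIR-type) diffusion, so the clipping at zero is only rarely active for γ > 0.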

7.4.2

Approximating Highly Nonlinear Time Series

In the data-driven scenario, the mean m, the decorrelation time 1/λ, and the PDF p(u) can be computed from the observed time series. Then (7.13) is applied to numerically obtain the profile of σ(u). With these quantities, the linear model with multiplicative noise is fully determined. Figure 7.2 shows examples where the true signal is generated from the cubic nonlinear model (4.30) with a CAM noise. The only information available in practice is such a time series rather than the underlying nonlinear model. The above procedure can then be applied to find the corresponding linear model with multiplicative noise. The resulting model reproduces not only the non-Gaussian statistics but also the large-scale dynamical features in the two strongly non-Gaussian regimes, indicating that the underlying nonlinear dynamics are implicitly recovered by the multiplicative noise. The qualitative dynamical features of these two systems are almost indistinguishable by only looking at the time series. One interesting question in climate science is understanding the underlying dynamics of the large-scale El Niño-Southern Oscillation (ENSO). There are two schools of argument. One states that ENSO is best treated with deterministic nonlinear dynamics, while the other argues that ENSO dynamics are linear but affected by multiplicative noise from atmospheric perturbations. The question seems hard to answer using a purely data-driven method. Therefore, certain physical understanding is crucial in answering this unresolved question. Nevertheless, the forecast systems obtained from these two data-driven modeling approaches may lead to similar results for ensemble forecasts.
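The calibration step above can be checked numerically. The sketch below (an illustration, not the book's code) feeds a Gaussian target density with illustrative parameters into formula (7.13); the recovered σ²(u) should then be the constant 2λσ₀², i.e., the procedure correctly returns an OU process with additive noise whenever the target statistics are Gaussian.

```python
import numpy as np

lam, m, s0 = 1.0, 0.0, 1.5                      # illustrative damping, mean, target std
u = np.linspace(m - 8 * s0, m + 8 * s0, 8001)
du = u[1] - u[0]
p = np.exp(-((u - m) ** 2) / (2 * s0 ** 2)) / (s0 * np.sqrt(2 * np.pi))
g = (u - m) * p                                 # integrand of Lambda(u) in (7.13)
# cumulative trapezoid rule for Lambda(u) = int_{-inf}^{u} (y - m) p(y) dy
Lam = np.concatenate(([0.0], np.cumsum(0.5 * (g[1:] + g[:-1]) * du)))
sigma2 = -2 * lam * Lam / p                     # sigma^2(u) from (7.13)
core = np.abs(u - m) < 3 * s0                   # region where p(u) is not vanishingly small
```

For the Gaussian target, Λ(u) = −σ₀² p(u) exactly, so σ²(u) = 2λσ₀² (here 4.5) away from the tails, where the division by a tiny p(u) amplifies quadrature noise.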

7.4.3

Application: Characterizing an El Niño Index

Let us consider modeling a realistic data set, one of the most widely used ENSO indices. In the classical viewpoint, ENSO is often regarded as a phenomenon with semi-cyclical attributes [6, 207, 226], with warmer and colder than average sea surface temperature (SST)


Fig. 7.2 Data-driven modeling using a linear model with multiplicative noise (7.12). Panel a: the true observed time series (blue) and an independent simulation from the calibrated linear model with multiplicative noise (red), which are not expected to have any point-wise correspondence. Panel b: comparison of the PDFs, where the black dashed curve shows the Gaussian fit of the truth. Panel c: comparison of the ACFs, where the black dashed curve shows the exponential fit of the truth. Panel d: the computed multiplicative noise coefficient (7.13). The first row shows a strongly skewed dynamical regime, while the second shows a bimodal regime

alternating in the equatorial eastern Pacific region. The positive and negative phases of the SST anomaly are called El Niño and La Niña, respectively. ENSO has also been shown to have significant features of diversity and complexity [50, 250], which are not the main focus here. The blue curve in the first row of Fig. 7.3 shows the observed Nino 3 index TE, the average SST anomaly in the eastern Pacific. The eastern Pacific SST is known to have strong couplings with the ocean via the thermocline depth in the western Pacific. Together they have been understood to form a so-called recharge-discharge oscillator [138]. The SST is also affected by atmospheric wind bursts. In particular, the westerly wind bursts (blowing from west to east) push the warm water towards the eastern Pacific [127, 157]. The observed thermocline depth in the western Pacific and the wind bursts averaged over the entire domain, denoted by HW and τ, are shown as the blue curves in the second and third rows, respectively. The goal here is to model this coupled system. Column (c) shows that the three variables' PDFs are highly non-Gaussian, with obvious skewness and fat tails. One of the key features of ENSO is the intermittent occurrence of extreme events, known as super El Niño events, which enhance the tail probability of TE. Because TE and HW form an oscillator, it is intuitive to use a two-dimensional linear oscillator model as a starting system to describe these two variables. In addition, since the atmospheric wind τ directly drives the system, it is natural to treat it as an additive forcing to the linear oscillator. The model described so far is a fully linear system. If the model is supplemented only with additive noise, it fails to capture the non-Gaussian statistics. Therefore, a multiplicative noise is added to the process of the atmospheric wind, and the resulting model reads:


dTE = (−dT TE + ωHW + αT τ ) dt + σT dWT ,

(7.15a)

dHW = (−d H HW − ωTE + α H τ ) dt + σ H dW H ,

(7.15b)

dτ = (−dτ τ ) dt + στ (TE ) dWτ ,

(7.15c)

Fig. 7.3 Comparison of the model simulation (red) with observations (blue). The three columns show the time series, ACFs, and PDFs. The shaded area indicates the 95% confidence interval from 30 realizations with the same length as the observations (40 years, from 1980 to 2020). Note that the two time series share the same x-axis for simplicity of illustration, but there is no one-to-one correspondence between them. Reanalysis data from satellite observations are used here as the observations (blue). The SST data are from the Optimum Interpolation Sea Surface Temperature (OISST) reanalysis [213], while the zonal winds are at 850 hPa from the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis [145]. The thermocline depth is computed from the potential temperature as the depth of the 20 °C isotherm using the NCEP Global Ocean Data Assimilation System (GODAS) reanalysis data [20]

where dT, dH and dτ are the dampings, ω is the frequency triggering the oscillation of TE and HW, αT and αH are the coupling coefficients for the wind forcing, and σT and σH are the noise strengths. All the above parameters are constants. The multiplicative noise στ is a function of TE. It increases with TE, which describes the physics that a warmer SST induces more convective activity and thus triggers stronger wind. This multiplicative noise is crucial to reproduce the significant non-Gaussian features, as is shown in Fig. 7.3. In fact, with the following choice of parameters,

dT = dH = 1.5,  dτ = 4,  σT = σH = 0.8,  ω = 1.5,  αT = 1,  αH = −0.4,  στ(TE) = 4.5(tanh(TE) + 1) + 4,   (7.16)


the model can reproduce many desirable features of the observations. Note that the time series from the model (red) is an independent simulation, not a reproduction of the observations (blue). Although they share the same x-axis for simplicity of illustration, there is no one-to-one correspondence between the two time series. Nevertheless, the overall properties of the model simulation are similar to those of the observations. An enhanced TE increases the chance of triggering more winds. If the wind blows towards the eastern Pacific (τ > 0), then the thermocline depth in the western Pacific becomes shallower (HW < 0), and the SST in the eastern Pacific becomes warmer (TE > 0). On average, this loop in the model takes 3–7 years, as in the observations. In addition to the physical mechanism, the ACFs and PDFs of the model highly resemble those of the observations, where the occurrence frequency of the non-Gaussian extreme events is accurately reproduced. These results highlight the importance of incorporating multiplicative noise in the modeling procedure. More details of the model (7.15) and its forecast skill can be found in [52, 107].
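As a rough numerical illustration of how the state-dependent wind noise in (7.15)–(7.16) operates, the following Euler-Maruyama sketch simulates the coupled system and verifies that the simulated wind bursts τ are more energetic during warm phases (TE > 0) than during cold phases, which is the mechanism behind the fat positive tail of TE. The time step, integration length, and conditioning thresholds are illustrative choices, not values from the book.

```python
import numpy as np

rng = np.random.default_rng(7)
dT, dH, dtau = 1.5, 1.5, 4.0                       # dampings from (7.16)
om, aT, aH, sT, sH = 1.5, 1.0, -0.4, 0.8, 0.8
stau = lambda TE: 4.5 * (np.tanh(TE) + 1.0) + 4.0  # multiplicative wind noise (7.16)

dt, n = 0.002, 500_000                             # ~1000 model time units
TE = np.zeros(n); HW = np.zeros(n); tau = np.zeros(n)
sq = np.sqrt(dt)
for i in range(n - 1):
    TE[i+1] = TE[i] + (-dT*TE[i] + om*HW[i] + aT*tau[i]) * dt + sT * sq * rng.normal()
    HW[i+1] = HW[i] + (-dH*HW[i] - om*TE[i] + aH*tau[i]) * dt + sH * sq * rng.normal()
    tau[i+1] = tau[i] + (-dtau * tau[i]) * dt + stau(TE[i]) * sq * rng.normal()

warm = tau[TE > 0.5]; cold = tau[TE < -0.5]        # wind bursts conditioned on the SST phase
```

The larger conditional spread of τ during warm phases is exactly the στ(TE) feedback that breaks the Gaussianity of the otherwise linear oscillator.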

7.5

Stochastically Parameterized Models

Stochastic parameterization is an extremely useful technique for effectively modeling complex systems while reducing the system dimension.

7.5.1

The Necessity of Stochastic Parameterization

Instead of imposing many additional complicated nonlinear terms of the original state variables to improve the existing dynamics, stochastic parameterization augments the starting system with a few simple stochastic processes coupled to the original system. This allows the augmented system to remain relatively simple while enriching the dynamical properties, which enables the original state variables of interest to exhibit various nonlinear and non-Gaussian features. Note that simple forms are preferred for these additional processes; by design, they do not aim to capture the exact physical mechanisms of these variables. Instead, the statistical feedback from these additional processes to the original state variables is one of the most crucial features to be recovered. Stochastic parameterization has wide applications in practice [24, 71]. For example, earth system models usually involve many state variables. Suppose the quantities of interest are only the large-scale variables. In that case, stochastic parameterization can characterize the small-scale variables in a simple fashion and effectively capture the feedback from the small-scale (or so-called sub-grid-scale) variables to the large-scale ones. This often leads to a significant reduction in computational cost compared with running the original complicated dynamics of the small-scale variables.

7.5.2

The Stochastic Parameterized Extended Kalman Filter (SPEKF) Model

Here, a widely used and systematic stochastic parameterization is presented based on simple models. Consider the time series u shown in Fig. 7.4, which is chaotic with at least three noticeable features. First, the time series has intermittency and extreme events. The associated PDF is thus expected to be non-Gaussian with fat tails. Second, the oscillation frequency varies at different time instants, leading to a slow-fast model structure, a common feature in many realistic multiscale systems. Third, although the long-term average of the time series is zero, the average value within a local window can be far from zero. This mimics reality, where non-trivial external forcing constantly modulates the short-term behavior of the time series by pushing it away from the statistical equilibrium. These features are typical properties of many complex turbulent systems and are natural outcomes of strong nonlinear interactions and intrinsic chaotic behavior. Based on these reasonings, a family of stochastically parameterized systems is given by the following so-called stochastic parameterized extended Kalman filter (SPEKF) model [102]:

Fig. 7.4 One simulation from the SPEKF model (7.17). Here u is the resolved state variable, while γ, ω, and b are three stochastically parameterized processes representing the stochastic damping, stochastic phase, and stochastic forcing, respectively. The red color in the time series of γ indicates values below 0, which start to trigger the intermittency in u. The green and cyan colors in the time series of ω and b show the values above one standard deviation or below minus one standard deviation. Only the real parts of u and b are shown. The parameters used here are: σu = 1, F(t) = 0, dγ = 1, γ̂ = 1, σγ = 1.2, dω = 0.5, ω̂ = 5, σω = 1.5, db = 0.25, b̂ = 0, and σb = 15


du = [(−γ + iω)u + F(t) + b] dt + σu dWu,
dγ = −dγ(γ − γ̂) dt + σγ dWγ,
dω = −dω(ω − ω̂) dt + σω dWω,
db = −db(b − b̂) dt + σb dWb,   (7.17)

where u is the resolved complex-valued state variable while γ, ω, and b are three stochastically parameterized processes representing the stochastic damping, stochastic phase, and stochastic forcing, respectively. Here, these three processes are modeled by simple OU processes, where γ and ω are real and b is complex. Although these three processes are linear and Gaussian, γ and ω couple with u in a nonlinear way, and therefore intermittency and non-Gaussian features appear in u. In (7.17), F(t) is a given deterministic forcing and σu is the real-valued noise coefficient of the complex Wiener process Wu. The 9 parameters in the three stochastic parameterization processes are all constants, where dγ, dω, and db are positive to guarantee that the processes are damped. Only b̂ is complex, while the other 8 parameters are real. It can be seen from Fig. 7.4 that when γ becomes negative, the anti-damping −γ triggers intermittency, and therefore large bursts in u are observed. Note that the role γ plays in the SPEKF model (7.17) in triggering intermittency is similar to that of the two-state Markov process (1.27b) in Sect. 1.3. It can also be seen that when ω is well above/below its mean value, the oscillation in u is relatively fast/slow. Finally, when b has a large amplitude, it tends to drive the signal of u far from the long-term mean state. Note that, although b only interacts with u in an additive way, it is still different from the Wiener process Wu, as the noise generated by b has a memory due to the finite value of db in the associated stochastic process. One of the significant advantages of the SPEKF model is that, despite the intrinsic nonlinearity, the time evolution of the moments can be solved via closed analytic formulae [102].
This significantly facilitates the use of the SPEKF model for data assimilation, in which the forecast step is solved completely analytically, avoiding the sampling errors of ensemble methods and significantly improving the computational efficiency and accuracy (see also the example in Sect. 6.4.3). Note that although the distribution of u is generally non-Gaussian, only the nonlinear time evolutions of the mean and the covariance of the SPEKF model are utilized in the forecast step of data assimilation in most applications to save computational cost. As a consequence, the posterior state is characterized by a Gaussian distribution. Despite the Gaussian approximation, the SPEKF differs from the extended Kalman filter, in which the dynamics must be linearized at each assimilation cycle. In addition to filtering and predicting intermittent signals from nature [59, 177, 183], other important applications of using SPEKF to filter complex spatially extended systems include stochastic dynamical superresolution [36] and effective filters for the Navier-Stokes equations [38]. It has been shown that the SPEKF model has much higher skill than classical Kalman filters using the so-called mean stochastic model (MSM), which treats γ, ω and b all as constants.
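To see the intermittency mechanism concretely, here is a minimal Euler-Maruyama simulation of (7.17) using the parameter values listed in the caption of Fig. 7.4. The fraction of time the stochastic damping γ spends below zero should be close to the Gaussian value P(γ < 0) ≈ 0.12 implied by its OU statistics (mean γ̂ = 1, variance σγ²/(2dγ) = 0.72). The complex-noise normalization and the time step are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
su, dg, gh, sg = 1.0, 1.0, 1.0, 1.2       # sigma_u, d_gamma, gamma_hat, sigma_gamma
dw, wh, sw = 0.5, 5.0, 1.5                # d_omega, omega_hat, sigma_omega
db, bh, sb = 0.25, 0.0, 15.0              # d_b, b_hat, sigma_b
dt, n = 0.001, 500_000                    # T = 500 time units

u = np.zeros(n, dtype=complex); b = np.zeros(n, dtype=complex)
g = np.full(n, gh); w = np.full(n, wh)
sq, c = np.sqrt(dt), np.sqrt(dt / 2)      # real and complex noise scalings
for i in range(n - 1):
    u[i+1] = u[i] + ((-g[i] + 1j*w[i]) * u[i] + b[i]) * dt \
             + su * (rng.normal() + 1j * rng.normal()) * c
    g[i+1] = g[i] - dg * (g[i] - gh) * dt + sg * sq * rng.normal()
    w[i+1] = w[i] - dw * (w[i] - wh) * dt + sw * sq * rng.normal()
    b[i+1] = b[i] - db * (b[i] - bh) * dt + sb * (rng.normal() + 1j * rng.normal()) * c

frac_neg = np.mean(g[100_000:] < 0)       # fraction of anti-damped (intermittent) episodes
```

Plotting Re(u) against the excursions of γ below zero reproduces the burst pattern of Fig. 7.4.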

7.5.3

Filtering Intermittent Time Series

In this subsection, a simple example is utilized to illustrate the skill of the SPEKF model in filtering intermittent time series. The true signal is generated from the coupled model (1.27), where u is the observed variable, while γ, which satisfies a two-state Markov jump process, describes the time evolution of the damping coefficient. The parameters used here are as follows:

ωu = 2,  σu = 0.4,  fu = 0,  μ = ν = 0.2,  γ+ = 2.27,  γ− = −0.04.   (7.18)

The blue curve in the top row of Fig. 7.5 shows a realization of the observed time series u, which is intermittent and has large bursts when γ = γ−. Assume the observations arrive every Δt_obs = 0.5 time units, where the observational noise satisfies a Gaussian distribution with zero mean and variance (r^o)². To filter the intermittent time series of u, a simplified version of the SPEKF model (7.17) can be adopted as a forecast model,

du = [(−γ + iωu)u + fu] dt + σu dWu,
dγ = −dγ(γ − γ̂) dt + σγ dWγ,   (7.19)

where only the damping is stochastically parameterized. In (7.19), the parameters ωu, fu and σu take the same values as those in (7.18). Note that model error is introduced in (7.19), as the two-state Markov jump process for characterizing γ is replaced by an OU process in the forecast model. Nevertheless, since the true dynamics are seldom known in practice, suitable stochastic parameterizations are always preferred to effectively filter and predict the resolved variables associated with complex systems. For simplicity, the three parameters in the γ process are calibrated by matching the mean, variance, and decorrelation time of the signal generated from the two-state Markov jump process. In practical situations, certain parameter estimation algorithms can be adopted to infer the parameter values from the partial observations. See Chap. 9 for more details. To demonstrate the necessity of adopting such a stochastically parameterized forecast model, a crude approximation of u, named the mean stochastic model (MSM), is also tested for its filtering skill. The MSM completely drops the stochasticity in γ and uses its mean value γ̂ as a constant damping coefficient, which leads to a complex OU process:

du = [(−γ̂ + iωu)u + fu] dt + σu dWu.   (7.20)
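A minimal way to generate the two-state damping that drives the true signal is to draw exponential waiting times with the switching rates μ and ν from (7.18). The sketch below does this and checks the stationary occupancy of the γ− state, which equals 0.5 here since μ = ν; the empirical mean damping is the value used to calibrate γ̂ in the MSM. The state-rate convention (which rate governs which transition) is an assumption that does not matter when μ = ν.

```python
import numpy as np

rng = np.random.default_rng(5)
mu = nu = 0.2                         # switching rates from (7.18)
gp, gm = 2.27, -0.04                  # gamma_+ and gamma_-
T = 5000.0

t, state = 0.0, 0                     # state 0 <-> gamma_+, state 1 <-> gamma_-
occ = 0.0                             # total time spent in the gamma_- state
while t < T:
    rate = mu if state == 0 else nu
    sojourn = rng.exponential(1.0 / rate)   # exponential waiting time in current state
    occ += min(sojourn, T - t) if state == 1 else 0.0
    t += sojourn
    state = 1 - state

frac = occ / T                        # stationary occupancy of gamma_-; 0.5 when mu = nu
gbar = gp * (1 - frac) + gm * frac    # empirical mean damping for calibrating gamma_hat
```

Feeding this piecewise-constant γ(t) into the u equation of (1.27) produces the intermittent bursts shown in the top row of Fig. 7.5 whenever γ = γ− < 0.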

Panels (a)–(b) in Fig. 7.5 show the filtering skill in two situations with different observational noise levels, r^o = 0.2 and r^o = 0.5. In both cases, the filtered signal using the SPEKF model as the forecast model succeeds in recovering the intermittent time series. In contrast, the filter solution using the MSM has difficulty recovering the large bursts, especially when the observational noise becomes large. This is mainly due to the model error in the MSM,


Fig. 7.5 Comparison of the filtering skill using the (simplified) SPEKF model (7.19) and the MSM (7.20). Panels a–b: standard tests with two different observational noise levels, r^o = 0.2 and r^o = 0.5. Panel c: similar setup as Panel a except that a non-zero forcing fu = 5 is added both to the model that generates the true signal and to the two forecast models. Panel d: similar setup as Panel b except that the coefficient σu in the MSM is inflated from 0.4 to 1.2. The blue, red, and green curves show the true signal, the filtered mean time series using the SPEKF model as the forecast model, and the filtered mean time series using the MSM as the forecast model, respectively. The blue and red curves almost overlap in the panels describing u. The black circles are the observations, with observational time step Δt_obs = 0.5. The red shaded area in the recovery of γ indicates one standard deviation around the posterior mean time series

which lacks a mechanism to create the intermittency. Because of this, the forecast variance using the MSM is underestimated overall. Consequently, the filter solution trusts the forecast result more, leading to larger biases than with the SPEKF model. In practice, covariance inflation (sometimes known as noise inflation) is often utilized in an imperfect forecast model to improve the filtering skill. As shown in panel (d), when the noise coefficient σu in the MSM is increased from 0.4 to 1.2 (note: the coefficient is still 0.4 in generating the true signal), the accuracy of the filter solution is also improved, since the filter then assigns less weight to the forecast model and trusts the observations more. On the other hand, the SPEKF model provides a reasonably accurate recovery of γ, especially in revealing the timing of the intermittent phases. However, in Panel (a), the recovered γ signal from the SPEKF model cannot capture the exact values during the quiescent phases. This is due to the lack of practical observability of γ. The signal-to-noise ratio is small during the quiescent phases; thus, the filter fails to recover the true signal from the 'very' noisy observations. Therefore, the role of the observations in helping recover the state is very weak, and the recovered γ stays around the mean value of the forecast model. In other words, the contribution from γ to u is insignificant in such a situation, where a large error in γ does not


affect the value of u. Usually, deterministic forcing helps gain observability. As shown in Panel (c), by adding a non-zero forcing fu = 5 in generating the true signal and in the forecast models, the value of u is no longer near zero when γ = γ+. In such a situation, the entire trajectory of γ is accurately recovered by the SPEKF.

7.6

Physics-Constrained Nonlinear Regression Models

Physics-constrained nonlinear regression models are a rich family of nonlinear SDEs with a specific condition on the nonlinear terms called the physics constraint.

7.6.1

Motivations

The motivation for such low-order SDEs comes from an important feature of many nonlinear PDEs: the total energy in the nonlinear terms is conserved. If the energy in the nonlinear terms increased as a function of time, then the solution of the PDE would blow up very quickly. In many fluid and geophysical flow systems, one of the major nonlinearities is the advection, which is a quadratic nonlinearity and conserves the energy given a suitable boundary condition, for example, a periodic boundary condition in a finite rectangular domain or a vanishing boundary condition in an infinite domain. Intuitively, the advection only transfers energy from one location to another and keeps the total energy unchanged. Mathematically, consider the following one-dimensional advection model on a line for x ∈ [a, b],

∂u/∂t + u ∂u/∂x = 0   (7.21)

with the periodic boundary condition u(a) = u(b). The energy is often defined as E = ∫_a^b u²/2 dx, which is motivated by the kinetic energy. Multiplying both sides of (7.21) by u and then integrating over the entire domain yields

dE/dt + ∫_a^b u² ∂u/∂x dx = 0.   (7.22)

It can be shown, using integration by parts and the periodic boundary condition u(a) = u(b), that the second term in (7.22) is zero,

∫_a^b u² ∂u/∂x dx = [u³]_{x=a}^{x=b} − ∫_a^b 2u² ∂u/∂x dx,   (7.23)

which implies ∫_a^b u² ∂u/∂x dx = (1/3)[u³]_{x=a}^{x=b} = 0.

Therefore, the energy is conserved if the only nonlinearity of the system is advection.
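The cancellation in (7.22)–(7.23) can also be verified numerically: for a smooth periodic field differentiated spectrally, the advective energy flux ∫ u² ∂u/∂x dx vanishes to machine precision. The specific test field and grid size below are arbitrary choices for illustration.

```python
import numpy as np

N = 128
x = 2 * np.pi * np.arange(N) / N              # periodic grid on [0, 2*pi)
u = np.sin(x) + 0.5 * np.cos(2 * x)           # arbitrary smooth periodic field
k = np.fft.fftfreq(N, d=1.0 / N)              # integer wavenumbers
ux = np.real(np.fft.ifft(1j * k * np.fft.fft(u)))   # spectral derivative du/dx

flux = np.sum(u ** 2 * ux) * (2 * np.pi / N)  # discrete version of the integral in (7.22)
```

Since u contains only low wavenumbers, the equispaced quadrature of u² ∂u/∂x is exact, so the discrete flux inherits the analytic cancellation rather than merely approximating it.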

7.6.2

The General Framework of Physics-Constrained Nonlinear Regression Models

Following the same argument as in Sect. 7.6.1, the quadratic nonlinearity in the low-order stochastic model is designed to be energy conserved. Such a family of models has the following form [123, 184]:

du = [(L + D)u + B(u, u) + F(t)] dt + Σ(u) dW,   with u · B(u, u) = 0,   (7.24)

where u is an n-dimensional state variable. In the model, L + D is a linear operator representing damping and dispersion, where L is skew-symmetric, representing dispersion, and D is a negative definite symmetric operator representing dissipative processes such as surface drag, radiative damping, viscosity, etc. In (7.24), F(t) is a vector standing for the deterministic forcing, Σ is an n × m-dimensional matrix and W is an m-dimensional Wiener process. In (7.24), B(u, u) contains the quadratic nonlinear terms. Mimicking the manipulation in Sect. 7.6.1, the energy conservation of such a quadratic nonlinear term is verified by first multiplying this quadratic term by the state variable u and then taking a certain 'integral'. Here, summing over the different components of u · B(u, u) plays the role of the 'integral' (as the integral in Sect. 7.6.1 can be regarded as a summation over the state variables at all grid points). Therefore, the physics constraint is given by the second equation in (7.24). Note that the linear part of the system (7.24) still has energy injection from both the deterministic and the stochastic forcing. Energy can also be dissipated due to the linear damping (sometimes there can be one additional cubic term representing higher-order dissipation). The damping and forcing eventually reach a statistical balance, which, together with the energy-conserving quadratic nonlinearity, leads to a statistical equilibrium distribution when F is a constant forcing (or a statistical attractor if F(t) is time-periodic). One significant advantage of these physics-constrained nonlinear stochastic models is that they overcome both the finite-time blowup and the lack-of-physical-meaning issues [187] that may exist in various ad hoc regression models.
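A standard minimal example satisfying the constraint in (7.24) is a triad interaction with quadratic terms B(u, u) = (b₁u₂u₃, b₂u₁u₃, b₃u₁u₂)^T and coefficients summing to zero, since then u · B(u, u) = (b₁ + b₂ + b₃) u₁u₂u₃ = 0. The particular coefficients and test state below are illustrative.

```python
import numpy as np

b1, b2, b3 = 1.0, 2.0, -3.0          # illustrative triad coefficients with b1 + b2 + b3 = 0

def B(u):
    """Quadratic triad nonlinearity satisfying the physics constraint."""
    u1, u2, u3 = u
    return np.array([b1 * u2 * u3, b2 * u1 * u3, b3 * u1 * u2])

u = np.array([0.3, -1.2, 0.7])
constraint = float(np.dot(u, B(u)))  # equals (b1 + b2 + b3) * u1 * u2 * u3 = 0
```

The same check, summing the components of u · B(u, u), is the practical test one applies to any candidate regression model before trusting its long-term statistics.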

7.6.3

Comparison of the Models with and Without Physics Constraints

To understand the difference between low-order models with and without physics constraints, let us start with the SPEKF model (7.17). The only nonlinearity appears on the right-hand side of the equation for u. Since u is a complex state variable, it is convenient to write it as u = u₁ + iu₂. A similar manipulation can be applied to b with b = b₁ + ib₂. Then the total energy is given by

E = (1/2)(u₁² + u₂² + γ² + ω² + b₁² + b₂²).


Now considering only the nonlinear part on the right-hand side of (7.17), the corresponding system reads:

du₁ = (−γu₁ − ωu₂) dt,   dγ = 0,   db₁ = 0,
du₂ = (−γu₂ + ωu₁) dt,   dω = 0,   db₂ = 0.   (7.25)

A straightforward calculation shows that dE/dt = −γ(u₁² + u₂²), which implies that the energy in the nonlinear terms is not conserved. A physics-constrained version of the SPEKF model requires adding nonlinear feedback to the process of γ in (7.17),

dγ = −dγ(γ − γ̂) dt + (u₁² + u₂²) dt + σγ dWγ.   (7.26)

The additional feedback term u₁² + u₂² in (7.26), which serves as a forcing of γ, plays a vital role in mitigating the large amplitude of u when it grows too fast. When u₁ and u₂ have large amplitudes, corresponding to intermittency, this feedback term also increases significantly, which drives the value of γ to increase dramatically with a positive sign. Since γ is the damping of u, such strong damping immediately suppresses the amplitude of u and prevents the blowup of the signal. Although the physics-constrained models are more appropriate choices for long-term forecasts, the original SPEKF model without physics constraints (7.17) is still very useful in data assimilation. Data assimilation only requires running the model forward for a short term in the forecast step of each assimilation cycle, after which the observations are utilized to correct part of the forecast error. Therefore, model imperfection is more tolerable in data assimilation than in long-term forecasting. On the other hand, with the additional feedback term accounting for the physics constraint, the closed analytic formulae for the time evolution of the moments are no longer available. Without this desirable feature, ensemble methods have to be used for data assimilation, which introduce sampling errors, reduce the computational efficiency, and increase the computational cost. In practice, different reduced-order models can thus be utilized for data assimilation and for forecasting.
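One can check numerically that the feedback in (7.26) restores the energy conservation of the nonlinear core: combining (7.25) with the u₁² + u₂² feedback in the γ equation, the energy E = (u₁² + u₂² + γ² + ω²)/2 of the deterministic, undamped, unforced part stays constant (b is omitted here since it remains constant in the core). The initial condition and step size below are arbitrary illustrative choices.

```python
import numpy as np

def rhs(s):
    # nonlinear core: (7.25) plus the energy-restoring feedback of (7.26)
    u1, u2, g, w = s
    return np.array([-g * u1 - w * u2, -g * u2 + w * u1, u1**2 + u2**2, 0.0])

def rk4_step(s, dt):
    k1 = rhs(s); k2 = rhs(s + 0.5 * dt * k1)
    k3 = rhs(s + 0.5 * dt * k2); k4 = rhs(s + dt * k3)
    return s + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

s = np.array([1.0, -0.5, 0.2, 2.0])   # arbitrary initial state (u1, u2, gamma, omega)
E0 = 0.5 * np.sum(s ** 2)
dt = 0.001
for _ in range(10_000):               # integrate to T = 10
    s = rk4_step(s, dt)
E1 = 0.5 * np.sum(s ** 2)
```

As the analysis predicts, γ grows monotonically while u decays, with the exchange exactly balancing so that E is conserved up to the small RK4 discretization error.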

7.7

Discussion: Linear and Gaussian Approximations for Nonlinear Systems

Recall from Chap. 4 and Sect. 7.2 that linear models with Gaussian additive noise lead to Gaussian statistics. Therefore, the words 'linear' and 'Gaussian' appear simultaneously in many situations. Yet, 'linear approximation' and 'Gaussian approximation' for nonlinear systems are quite different. In particular, although Gaussian statistics often accompany the linear approximation of a model, the Gaussian approximation is usually imposed directly on the statistical output of the nonlinear equations, which does not require any manipulation of the original dynamics and generally preserves the underlying nonlinearities.


It has been shown in Sect. 6.4.3 that the linear model fails to capture the response in the variance due to the independence between the forcing and the variance in such a system. It has also been discussed in Sect. 5.3.1 that the extended Kalman filter, which involves a linearization of the dynamics at each assimilation circle, may suffer from numerical instability issues and can lead to significant biases in the data assimilation of nonlinear problems. In addition, as was illustrated in Sect. 7.5.3, the MSM is not a suitable forecast model for the state estimation of intermittent time series. The common issue in these failure examples is the use of linear approximate models, which completely change the underlying dynamics. Thus, a direct linear approximation of the original nonlinear dynamics is often not a suitable strategy to effectively model and forecast complex nonlinear turbulent systems. On the other hand, Sect. 6.4.3 showed that the qG FDT, which involves a Gaussian fit of the PDF, successfully predicts the responses for the highly nonlinear systems. It was also demonstrated in Sect. 6.4.3 that the predicated response from the direct Monte Carlo simulation with even a large number of samples, which is much more expensive, suffers from the numerical error resulting from the failure of constructing the tail of the distribution. Such a comparison indicates the necessity of utilizing the Gaussian approximation in computing the response operator. Similarly, by exploiting only the Gaussian statistics, rather than the entire non-Gaussian PDF, to compute the weights (the Kalman gain) of summing up the prior distribution and the observations, the ensemble Kalman filter in Sect. 5.3.2 is one of the most widely used data assimilation methods in practice. Section 9.4.2 will also show that using the Gaussian approximation in computing the relative entropy (2.15) can significantly facilitate the data-driven discovery of underlying chaotic dynamics. 
In contrast, the direct use of the original highly non-Gaussian PDF in the learning process is impractical for high-dimensional systems. These examples all illustrate the advantage of improving the efficiency of the methods by appropriately applying Gaussian approximations to the statistical output of nonlinear dynamics.

8 Conditional Gaussian Nonlinear Systems

8.1 Overview of the Conditional Gaussian Nonlinear System (CGNS)

Simple low-order models described in Chap. 7 have unique mathematical features and computational advantages, which lead to success in handling many linear and nonlinear problems. To describe a much more comprehensive range of nonlinear phenomena with increased complexity, it is beneficial to have a systematic modeling framework that is amenable to mathematical analysis and adaptive to many practical applications. As nature is highly nonlinear and non-Gaussian, such a modeling framework should go far beyond the family of linear models. On the other hand, the framework is not expected to contain arbitrary nonlinearities, which often prevent the development of rigorous mathematical analysis and efficient numerical methods. It is worth highlighting that modeling is only the first step towards understanding nature. A suitable modeling framework provides practical ways to build appropriate nonlinear systems that describe the underlying physics and facilitates the development of efficient numerical algorithms for data assimilation, uncertainty quantification, and prediction. The latter should be considered when determining the state variables and designing the model structures. In this chapter, a nonlinear modeling framework, called the conditional Gaussian nonlinear system (CGNS), is introduced. The CGNS contains a rich class of nonlinear models, where the joint and marginal distributions can both be highly non-Gaussian. Nevertheless, the conditional distributions of certain variables are Gaussian, which advances the development of effective analysis and computational strategies to understand and predict complex systems. The conditional Gaussian structure has explicit physical reasoning behind it, which will be justified in Sect. 8.1.3. The CGNS first appeared in [164] for studying a class of optimal filtering and control problems. Recently, it has been widely applied to model and forecast complex systems in various natural science and engineering disciplines [56, 59].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_8

8.1.1 The Mathematical Framework

The general mathematical framework of the CGNS is as follows,

$$\begin{aligned}
d\mathbf{X} &= \big[\mathbf{A}_0(\mathbf{X},t) + \mathbf{A}_1(\mathbf{X},t)\,\mathbf{Y}\big]\,dt + \mathbf{B}_1(\mathbf{X},t)\,d\mathbf{W}_1(t), \qquad &(8.1a)\\
d\mathbf{Y} &= \big[\mathbf{a}_0(\mathbf{X},t) + \mathbf{a}_1(\mathbf{X},t)\,\mathbf{Y}\big]\,dt + \mathbf{b}_2(\mathbf{X},t)\,d\mathbf{W}_2(t), \qquad &(8.1b)
\end{aligned}$$

where $\mathbf{X}\in\mathbb{C}^{n_1}$ and $\mathbf{Y}\in\mathbb{C}^{n_2}$ are both multi-dimensional state variables. In (8.1), $\mathbf{A}_0$, $\mathbf{a}_0$, $\mathbf{A}_1$, $\mathbf{a}_1$, $\mathbf{B}_1$ and $\mathbf{b}_2$ are vectors or matrices that can depend nonlinearly on the state variable $\mathbf{X}$ and time $t$, while $\mathbf{W}_1$ and $\mathbf{W}_2$ are independent Wiener processes. The coupled system (8.1) is highly nonlinear, and multiplicative noise as a function of $\mathbf{X}$ can appear in the CGNS. Therefore, the marginal distributions $p(\mathbf{X})$ and $p(\mathbf{Y})$ as well as the joint distribution $p(\mathbf{X},\mathbf{Y})$ can be strongly non-Gaussian. Nevertheless, if a trajectory of $\mathbf{X}(s)$ for $s\le t$ is given, then conditioned on this trajectory, $\mathbf{Y}(t)$ becomes a linear Gaussian process. This can be seen by plugging the given trajectory of $\mathbf{X}(s)$ into the coupled system (8.1), which then becomes a linear equation of $\mathbf{Y}(t)$ with time-varying but known coefficients. Thus, the conditional distribution

$$p(\mathbf{Y}(t)\,|\,\mathbf{X}(s), s\le t) \sim \mathcal{N}\big(\boldsymbol{\mu}_{\mathbf f}(t), \mathbf{R}_{\mathbf f}(t)\big) \qquad (8.2)$$

is Gaussian. In Sect. 8.2.1, it will be shown that the conditional mean $\boldsymbol{\mu}_{\mathbf f}(t)$ and the conditional covariance $\mathbf{R}_{\mathbf f}(t)$ in (8.2) can be solved utilizing closed analytic formulae, which facilitate the analysis and simulation of the CGNS.
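To make the conditional structure concrete, the following minimal sketch (an illustration added here, not from the text) simulates a scalar CGNS of the form (8.1) with the Euler-Maruyama scheme. The coefficient choices $A_0 = -X^3$, $A_1 = 1$, $a_0 = 0$, $a_1 = -1$ are hypothetical; note that the coefficients may depend nonlinearly on $X$, but the dynamics remain linear in $Y$:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cgns(T=10.0, dt=1e-3):
    """Euler-Maruyama simulation of a hypothetical scalar CGNS of the form (8.1)."""
    n = int(T / dt)
    X = np.zeros(n + 1)
    Y = np.zeros(n + 1)
    for i in range(n):
        x, y = X[i], Y[i]
        # The coefficients may depend nonlinearly on X (and t) but NOT on Y:
        A0, A1, B1 = -x**3, 1.0, 0.5   # X equation: conditionally affine in Y
        a0, a1, b2 = 0.0, -1.0, 0.5    # Y equation: linear in Y given X
        X[i + 1] = x + (A0 + A1 * y) * dt + B1 * np.sqrt(dt) * rng.standard_normal()
        Y[i + 1] = y + (a0 + a1 * y) * dt + b2 * np.sqrt(dt) * rng.standard_normal()
    return X, Y

X, Y = simulate_cgns()
```

Given the realized path of X, the Y equation above has known time-varying coefficients, which is exactly the mechanism behind the conditional Gaussianity in (8.2).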

8.1.2 Examples of Complex Dynamical Systems Belonging to the CGNS

To provide intuition and facilitate the understanding of the CGNS, let us start by presenting some typical examples of the CGNS. In fact, a large number of classical and widely used systems possess the CGNS structure, including many physics-constrained nonlinear stochastic models, a large number of stochastically coupled reaction-diffusion models in neuroscience and ecology, and quite a few large-scale dynamical models in turbulence, fluids, and geophysical flows. See [59] for a gallery of such systems. The first example of the CGNS is the noisy Lorenz 63 (L63) model (5.25). The noisy L63 model is a CGNS with $X = x$ and $\mathbf{Y} = (y,z)^{\mathrm T}$. It also belongs to the CGNS framework with $\mathbf{X} = (y,z)^{\mathrm T}$ and $Y = x$. Note that the noisy L63 model (5.25) is a physics-constrained nonlinear model. Another example within the family of the Lorenz models is the noisy Lorenz 84 (L84) model. It is a simple analog of the global atmospheric circulation [223, 257], which has the following form [171, 172]:


$$\begin{aligned}
dx &= \big[-(y^2 + z^2) - a(x - f)\big]\,dt + \sigma_x\,dW_x,\\
dy &= \big[-bxz + xy - y + g\big]\,dt + \sigma_y\,dW_y, \qquad &(8.3)\\
dz &= \big[bxy + xz - z\big]\,dt + \sigma_z\,dW_z.
\end{aligned}$$

In (8.3), the zonal flow $x$ represents the intensity of the mid-latitude westerly wind current (or the zonally averaged meridional temperature gradient, according to thermal wind balance), and a wave component exists with $y$ and $z$ representing the cosine and sine phases of a chain of vortices superimposed on the zonal flow. Relative to the zonal flow, the wave variables are scaled so that $x^2 + y^2 + z^2$ is the total scaled energy (kinetic plus potential plus internal). Note that these equations can be derived as a Galerkin truncation of the two-layer quasigeostrophic potential vorticity equations in a channel. Linking (8.3) to the general CGNS (8.1), it is clear that the zonal wind current is $X = x$ and the two phases of the large-scale vortices give $\mathbf{Y} = (y,z)^{\mathrm T}$. The noisy L84 model (8.3) is also a physics-constrained nonlinear model. In addition to the L63 and L84 models, many other noisy versions of the Lorenz models that satisfy physics constraints also belong to the CGNS framework, for example, the Lorenz 1980 model [48, 170] and certain types of the two-layer Lorenz 96 model [58]. Other widely seen physics-constrained nonlinear models belonging to the CGNS framework include a simple stochastic model with critical features of atmospheric low-frequency variability [179], a nonlinear multiscale triad model mimicking structural features of low-frequency variability of GCMs with non-Gaussian features [180], a paradigm model for topographic mean flow interaction [211], a low-order model of the Charney-DeVore flow [78], and many more systems (see [59]). It is worthwhile to point out that the SPEKF model (7.17) is a CGNS, where $X = u$ and $\mathbf{Y} = (\gamma,\omega,b)^{\mathrm T}$, although it does not satisfy the physics constraint. In addition, the three-dimensional linear model with multiplicative noise (7.15) for the El Niño is a CGNS with $\mathbf{X} = (T_E, H_W)^{\mathrm T}$ and $Y = \tau$. The wide appearance of the CGNSs is not simply a coincidence. In Sect.
8.1.3, it will be shown that the CGNS is a natural way of modeling complex systems.

The next example is a stochastically coupled FitzHugh-Nagumo (FHN) model, which belongs to the family of reaction-diffusion models in neuroscience and ecology. The FHN model is a prototype of an excitable system, which describes the activation and deactivation dynamics of a spiking neuron [162]. Stochastic versions of the FHN model with noise-induced limit cycles have been widely studied and applied in the context of stochastic resonance [163, 166, 253, 269]. Its spatially extended version has also attracted much attention as a noisy excitable medium [126, 131, 140, 197]. One such stochastically coupled FHN model in the lattice form reads:

$$\begin{aligned}
\epsilon\,du_i &= \Big[d_u(u_{i+1} + u_{i-1} - 2u_i) + u_i - \frac{1}{3}u_i^3 - v_i\Big]\,dt + \sqrt{\epsilon}\,\delta_1\,dW_{u_i}, \qquad &(8.4)\\
dv_i &= (u_i + a)\,dt + \delta_2\,dW_{v_i}, \qquad i = 1,\dots,N,
\end{aligned}$$


where $\epsilon$ is a small singular perturbation parameter that determines the time-scale separation of the system, and $d_u(u_{i+1} + u_{i-1} - 2u_i)$ can be regarded as a finite difference discretization of the diffusion $\nabla^2 u$. The model is equipped with the periodic boundary conditions $u_0 = u_N$ and $u_1 = u_{N+1}$. Note that the parameter $a > 1$ is required to guarantee that the system has a global attractor in the absence of noise and diffusion. The random noise can drive the system above the threshold level of global stability and trigger limit cycles intermittently. The model behavior of (8.4) in various dynamical regimes has been studied in [58]. It can display both strongly coherent and irregular features as well as scale-invariant structures with different choices of the noise and diffusion coefficients. The FHN model belongs to the CGNS with $\mathbf{X} = (\dots,u_i,\dots)^{\mathrm T}$ and $\mathbf{Y} = (\dots,v_i,\dots)^{\mathrm T}$.

The examples exhibited so far are all systems of stochastic ODEs, while in many applications, the governing equations are complex PDEs. To see the connection between the CGNS and complex PDEs, the last example illustrates that, after certain manipulations, many PDE systems also belong to the CGNS framework. This significantly expands the range of applications for the CGNS. The complex system shown here is the Boussinesq equation. It is derived when the Boussinesq approximation is applied to the full Navier-Stokes equation, which assumes that variations in density do not affect the flow field, except that they give rise to buoyancy forces [223, 236]. The Boussinesq equation reads:

$$\begin{aligned}
\frac{\partial\mathbf{u}}{\partial t} + \mathbf{u}\cdot\nabla\mathbf{u} &= -\frac{1}{\rho_0}\nabla p + \nu\nabla^2\mathbf{u} - g\alpha T + \mathbf{F}_u,\\
\nabla\cdot\mathbf{u} &= 0, \qquad &(8.5)\\
\frac{\partial T}{\partial t} + \mathbf{u}\cdot\nabla T &= \kappa\nabla^2 T + F_T,
\end{aligned}$$

where $T$ is the temperature, $\mathbf{u}$ is the velocity field, $\rho_0$ is the reference density, $p$ is the pressure, $g$ is the acceleration due to gravity, $\mathbf{F}_u$ and $F_T$ are the external forcings, and $\kappa$ is the diffusion coefficient. Note that $\mathbf{F}_u$ and $F_T$ can involve both deterministic and stochastic forcing. The three equations in (8.5) are the momentum equation, the continuity equation, and the thermal equation. The Boussinesq equation has wide applications, including modeling the Rayleigh-Bénard convection [44, 263] and describing strongly stratified geophysical flows [181]. Intuitively, if $\mathbf{u}$ in (8.5) is given, then the process of $T$ becomes conditionally linear. Therefore, it is expected that with additional stochastic Gaussian white noise, the system in (8.5) will have the conditional linear and Gaussian structure. To systematically build a model that belongs to the CGNS, apply a spectral decomposition to (8.5) and keep only the leading few spectral modes. Then add extra damping and noise to the resulting equations. This is similar to the manipulation in Sect. 7.2.2, except that the stochastic approximation is applied only to compensate for the truncation error. In contrast, the quadratic nonlinear terms involving the state variables within the range of interest are retained. The resulting system is a set of high-dimensional nonlinear SDEs. Following the same argument as in Sect. 7.3, the continuity equation with the divergence-free condition provides the eigendirections of each Fourier mode associated with the velocity field. Such a system belongs to the CGNS, where $\mathbf{X}$ consists of the Fourier coefficients of $\mathbf{u}$ while $\mathbf{Y}$ includes those of $T$.
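Among the ODE examples above, the coupled FHN lattice (8.4) is simple enough to simulate directly. The sketch below (an added illustration; all parameter values are hypothetical choices for demonstration) integrates it with Euler-Maruyama and periodic boundary conditions:

```python
import numpy as np

# Hypothetical parameter values for the FHN lattice (8.4); a > 1 as required.
rng = np.random.default_rng(0)
N, eps, d_u, a = 50, 0.01, 1.0, 1.05
delta1, delta2 = 0.2, 0.1
dt, n_steps = 1e-4, 20_000

u = 0.1 * rng.standard_normal(N)
v = 0.1 * rng.standard_normal(N)
for _ in range(n_steps):
    # periodic discrete Laplacian: u_{i+1} + u_{i-1} - 2 u_i
    lap = np.roll(u, -1) + np.roll(u, 1) - 2.0 * u
    drift_u = (d_u * lap + u - u**3 / 3.0 - v) / eps
    u = u + drift_u * dt + (delta1 / np.sqrt(eps)) * np.sqrt(dt) * rng.standard_normal(N)
    v = v + (u + a) * dt + delta2 * np.sqrt(dt) * rng.standard_normal(N)
```

Freezing the realized path of u makes each v_i equation linear with known coefficients, which is the conditional Gaussian structure exploited throughout this chapter.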

8.1.3 The Mathematical and Physical Reasoning Behind the CGNS

Consider the following general framework of a high-dimensional dynamical system with quadratic nonlinearity described in Sect. 7.6.2, which appears in fluids, geophysics and many engineering applications [144, 223, 257],

$$d\mathbf{u} = \big[(L + D)\mathbf{u} + B(\mathbf{u},\mathbf{u}) + \mathbf{F}(t)\big]\,dt + \Sigma\,d\mathbf{W}. \qquad (8.6)$$

The system (8.6) can be regarded as applying a spectral decomposition method to a starting PDE and then retaining a finite truncation of the modes. Stochastic noise is often added to compensate for the truncation error. Decompose the state variable $\mathbf{u}$ into $\mathbf{u} = (\mathbf{X}^{\mathrm T}, \mathbf{Y}^{\mathrm T}, \mathbf{Z}^{\mathrm T})^{\mathrm T}$, where $\mathbf{X}\in\mathbb{C}^{n_1}$, $\mathbf{Y}\in\mathbb{C}^{n_2}$, $\mathbf{Z}\in\mathbb{C}^{n_3}$, and $n_1 + n_2 + n_3 = n$. A typical criterion for such a decomposition is based on the rank of the scales of the state variables. In other words, $\mathbf{X}$ includes the leading few modes corresponding to the large-scale components of the state variables, $\mathbf{Y}$ contains the next few modes describing the medium-scale features, while $\mathbf{Z}$ involves the remaining modes that are treated as the small-scale variables. The three-scale decomposition facilitates a smoother transition from large- to small-scale variables using a set of intermediate-scale variables. It also allows the feedback from medium- to large-scale variables to be explicitly incorporated into the model, which is often crucial to enhance the model's accuracy in characterizing energy transfer and intermittency. The choices of $n_1$, $n_2$, and $n_3$ are case-dependent, but in a typical scenario, $\mathbf{Z}$ accounts for the majority of the modes. A natural way to build a simplified model with a reduced dimension of the coupled system in (8.6) is to project $\mathbf{u}$ onto its large- and medium-scale components $\mathbf{X}$ and $\mathbf{Y}$,

$$\begin{aligned}
d\mathbf{X} &= \big[A^{(X)}\mathbf{X} + B^{(X)}_{XX}(\mathbf{X},\mathbf{X}) + B^{(X)}_{XY}(\mathbf{X},\mathbf{Y}) + B^{(X)}_{YY}(\mathbf{Y},\mathbf{Y})\big]\,dt + \Sigma_X\,d\mathbf{W}_X, \qquad &(8.7a)\\
d\mathbf{Y} &= \big[A^{(Y)}\mathbf{Y} + B^{(Y)}_{XX}(\mathbf{X},\mathbf{X}) + B^{(Y)}_{XY}(\mathbf{X},\mathbf{Y}) + B^{(Y)}_{YY}(\mathbf{Y},\mathbf{Y})\big]\,dt + \Sigma_Y\,d\mathbf{W}_Y, \qquad &(8.7b)
\end{aligned}$$

where $A^{(X)}\mathbf{X}$ and $A^{(Y)}\mathbf{Y}$ are the linear terms. The quadratic nonlinear term $B^{(X)}_{YY}$ means that it appears in the $\mathbf{X}$ equation (the superscript), and the nonlinear interaction (the subscript) is between $\mathbf{Y}$ and itself (similarly for the notations of the other nonlinear terms). The noise coefficients $\Sigma_X$ and $\Sigma_Y$ have been adjusted to compensate for the truncation error from omitting the components involving $\mathbf{Z}$. The system in (8.7) is still not a CGNS due to the appearance of the quadratic nonlinear self-interactions between $\mathbf{Y}$, i.e., $B^{(X)}_{YY}(\mathbf{Y},\mathbf{Y})$ and $B^{(Y)}_{YY}(\mathbf{Y},\mathbf{Y})$. Nevertheless, these two terms mostly involve the highest frequencies among the different nonlinearities in (8.7). They often resemble random noise and contribute the least to the dynamics compared with the others. Thus, a further simplification of (8.7) is to approximate these components.

The simplest way to handle these nonlinear self-interactions is to ignore them completely. Such a bare truncation is the building block of many conceptual models, such as the low-order model of the Charney-DeVore flow [78] and the paradigm model for topographic


mean flow interaction [211] mentioned in Sect. 8.1.2. Another effective approach to deal with these terms is to apply the stochastic mode reduction strategy that approximates the nonlinear self-interaction terms between unresolved modes by damping and stochastic terms [185, 186]. The simple stochastic model with key features of atmospheric low-frequency variability [179] and the nonlinear multiscale triad model mimicking structural features of low-frequency variability of GCMs with non-Gaussian features [180] discussed above are formulated in such a way. Finally, from a much broader perspective, these nonlinear self-interactions can be systematically approximated by the following closures:

$$B^{(X)}_{YY}(\mathbf{Y},\mathbf{Y}) = \boldsymbol{\tau}^{(X)}_1(\mathbf{X})\,\mathbf{Y} + \boldsymbol{\tau}^{(X)}_2(\mathbf{X}), \qquad B^{(Y)}_{YY}(\mathbf{Y},\mathbf{Y}) = \boldsymbol{\tau}^{(Y)}_1(\mathbf{X})\,\mathbf{Y} + \boldsymbol{\tau}^{(Y)}_2(\mathbf{X}),$$

where $\boldsymbol{\tau}^{(X)}_1$, $\boldsymbol{\tau}^{(X)}_2$, $\boldsymbol{\tau}^{(Y)}_1$ and $\boldsymbol{\tau}^{(Y)}_2$ are arbitrary nonlinear functions of $\mathbf{X}$. The closure approximations are particularly useful if the dimensions $n_1$ and $n_2$ are not very large, such that the frequency of $\mathbf{Y}$ is not that fast. Utilizing any of the above approximations, the resulting system naturally belongs to the CGNS framework. Finally, physics constraints can be incorporated into the development of the CGNS by imposing certain conditions on the model coefficients.
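As a toy illustration of this idea (added here, not from the text), consider a scalar slow variable x forced by the quadratic self-interaction y² of a fast OU mode y. The crudest closure replaces y² by a constant τ₂ equal to its equilibrium mean σ²/2, with the fluctuations absorbed into a small white noise; all numerical values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
dt, n = 0.01, 100_000
sigma = 0.5                       # noise level of the fast OU mode y

# "Full" toy system: dx = (-x + y^2) dt,  dy = -y dt + sigma dW
x_full, y = 0.0, 0.0
xs_full = np.empty(n)
for i in range(n):
    x_full += (-x_full + y * y) * dt
    y += -y * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    xs_full[i] = x_full

# Closed model: y^2 -> tau2 = <y^2> = sigma^2 / 2, fluctuations -> small noise
tau2, sigma_cl = sigma**2 / 2.0, 0.05
x_cl = 0.0
xs_cl = np.empty(n)
for i in range(n):
    x_cl += (-x_cl + tau2) * dt + sigma_cl * np.sqrt(dt) * rng.standard_normal()
    xs_cl[i] = x_cl
```

The long-time mean of x in both versions is close to σ²/2, illustrating why replacing an unresolved high-frequency self-interaction by damping/forcing plus noise can preserve the slow statistics.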

8.2 Closed Analytic Formulae for Solving the Conditional Statistics

One of the main advantages of the CGNS is that the conditional distribution (8.2) can be solved using closed analytic formulae. This provides a handy tool for studying these systems using rigorous mathematical analysis and developing efficient algorithms for data assimilation, uncertainty quantification, and prediction.

8.2.1 Nonlinear Optimal Filtering

It has been seen in Sect. 8.1.1 that the conditional distribution $p(\mathbf{Y}(t)\,|\,\mathbf{X}(s), s\le t)$ in (8.2) is Gaussian. In fact, the conditional mean $\boldsymbol{\mu}_{\mathbf f}$ and the conditional covariance $\mathbf{R}_{\mathbf f}$ can be solved via the following closed analytic formulae [164],

$$\begin{aligned}
d\boldsymbol{\mu}_{\mathbf f} &= (\mathbf{a}_0 + \mathbf{a}_1\boldsymbol{\mu}_{\mathbf f})\,dt + (\mathbf{R}_{\mathbf f}\mathbf{A}_1^*)(\mathbf{B}_1\mathbf{B}_1^*)^{-1}\big(d\mathbf{X} - (\mathbf{A}_0 + \mathbf{A}_1\boldsymbol{\mu}_{\mathbf f})\,dt\big), \qquad &(8.8a)\\
d\mathbf{R}_{\mathbf f} &= \big[\mathbf{a}_1\mathbf{R}_{\mathbf f} + \mathbf{R}_{\mathbf f}\mathbf{a}_1^* + \mathbf{b}_2\mathbf{b}_2^* - (\mathbf{R}_{\mathbf f}\mathbf{A}_1^*)(\mathbf{B}_1\mathbf{B}_1^*)^{-1}(\mathbf{A}_1\mathbf{R}_{\mathbf f})\big]\,dt, \qquad &(8.8b)
\end{aligned}$$

with $\cdot^*$ being the complex conjugate transpose. Equation (8.8b) is a random Riccati equation, as its coefficients are functions of $\mathbf{X}$. If $\mathbf{X}$ is regarded as the observation process while $\mathbf{Y}$ is the variable for the state estimation, then the conditional distribution $p(\mathbf{Y}(t)\,|\,\mathbf{X}(s), s\le t)$ is the posterior distribution of such a filtering problem, as was described in Chap. 5. Therefore,


the conditional mean $\boldsymbol{\mu}_{\mathbf f}$ and the conditional covariance $\mathbf{R}_{\mathbf f}$ are also called the posterior mean and the posterior covariance, respectively. The subscript 'f' is an abbreviation for 'filter'. The classical Kalman-Bucy filter [142] in (5.30)-(5.32) is the simplest special case of (8.8).
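A scalar implementation of (8.8) is straightforward. The sketch below (an added illustration with the hypothetical coefficients A0 = -X³, A1 = 1, a0 = 0, a1 = -1) discretizes the formulae with a simple Euler scheme and compares the posterior mean against the trivial equilibrium-mean predictor:

```python
import numpy as np

rng = np.random.default_rng(3)
dt, n = 0.005, 40_000
B1, b2, a1, A1 = 0.25, 0.5, -1.0, 1.0   # hypothetical coefficient values

# Generate a true signal: dX = (-X^3 + Y) dt + B1 dW1,  dY = -Y dt + b2 dW2
X = np.zeros(n + 1); Y = np.zeros(n + 1)
for i in range(n):
    X[i+1] = X[i] + (-X[i]**3 + Y[i]) * dt + B1 * np.sqrt(dt) * rng.standard_normal()
    Y[i+1] = Y[i] - Y[i] * dt + b2 * np.sqrt(dt) * rng.standard_normal()

# Discrete update of (8.8a)-(8.8b), treating X as the observed process
mu = np.zeros(n + 1); R = np.zeros(n + 1); R[0] = b2**2 / 2   # equilibrium prior spread
for i in range(n):
    A0 = -X[i]**3
    gain = R[i] * A1 / B1**2                       # (R_f A1*)(B1 B1*)^{-1}
    innov = (X[i+1] - X[i]) - (A0 + A1 * mu[i]) * dt
    mu[i+1] = mu[i] + (a1 * mu[i]) * dt + gain * innov
    R[i+1] = R[i] + (2 * a1 * R[i] + b2**2 - (R[i] * A1)**2 / B1**2) * dt

rmse_filter = np.sqrt(np.mean((mu - Y)**2))
rmse_prior = np.sqrt(np.mean(Y**2))    # error of always predicting the equilibrium mean
```

With these (assumed) parameters the Riccati equation (8.8b) settles near its steady state, and the posterior mean tracks the hidden Y more accurately than the equilibrium-mean predictor.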

8.2.2 Nonlinear Optimal Smoothing

Filtering exploits the observational information up to the current time instant for online state estimation. On the other hand, given the observational time series within an entire time interval, the state estimation can become more accurate. This is known as smoothing [227], as was briefly mentioned in Sect. 5.5. Given one realization of the observed variable $\mathbf{X}(t)$ for $t\in[0,T]$, the optimal smoother estimate $p(\mathbf{Y}(t)\,|\,\mathbf{X}(s), s\in[0,T])$ of the CGNS (8.1) is also Gaussian [49],

$$p(\mathbf{Y}(t)\,|\,\mathbf{X}(s), s\in[0,T]) \sim \mathcal{N}\big(\boldsymbol{\mu}_{\mathbf s}(t), \mathbf{R}_{\mathbf s}(t)\big), \qquad (8.9)$$

where the conditional mean $\boldsymbol{\mu}_{\mathbf s}(t)$ and the conditional covariance $\mathbf{R}_{\mathbf s}(t)$ of the smoother at time $t$ satisfy the following backward equations

$$\begin{aligned}
\overleftarrow{d}\boldsymbol{\mu}_{\mathbf s} &= \big[-\mathbf{a}_0 - \mathbf{a}_1\boldsymbol{\mu}_{\mathbf s} + (\mathbf{b}_2\mathbf{b}_2^*)\mathbf{R}_{\mathbf f}^{-1}(\boldsymbol{\mu}_{\mathbf f} - \boldsymbol{\mu}_{\mathbf s})\big]\,dt, \qquad &(8.10a)\\
\overleftarrow{d}\mathbf{R}_{\mathbf s} &= \big[-\big(\mathbf{a}_1 + (\mathbf{b}_2\mathbf{b}_2^*)\mathbf{R}_{\mathbf f}^{-1}\big)\mathbf{R}_{\mathbf s} - \mathbf{R}_{\mathbf s}\big(\mathbf{a}_1^* + (\mathbf{b}_2\mathbf{b}_2^*)\mathbf{R}_{\mathbf f}^{-1}\big) + \mathbf{b}_2\mathbf{b}_2^*\big]\,dt, \qquad &(8.10b)
\end{aligned}$$

with $\boldsymbol{\mu}_{\mathbf f}$ and $\mathbf{R}_{\mathbf f}$ being given by (8.8). Here, the subscript 's' in the conditional mean $\boldsymbol{\mu}_{\mathbf s}$ and conditional covariance $\mathbf{R}_{\mathbf s}$ is an abbreviation for 'smoother', which should not be confused with the time variable $s$ in $\mathbf{X}(s)$. The notation $\overleftarrow{d}\cdot$ corresponds to the negative of the usual difference, which means that the system (8.10) is solved backward over $[0,T]$, with the starting value of the nonlinear smoother $(\boldsymbol{\mu}_{\mathbf s}(T), \mathbf{R}_{\mathbf s}(T))$ being the same as the filter estimate $(\boldsymbol{\mu}_{\mathbf f}(T), \mathbf{R}_{\mathbf f}(T))$. The backward equation is used here to take future information into account. The forward run of (8.8) and the backward run of (8.10) collect the past and future observational information, respectively, for the state estimation at time $t$.

As a remark, a procedure similar to (8.10) is utilized in the ensemble Kalman smoother for general nonlinear systems. Yet, unlike the filter, the smoother requires adopting the estimated states and the observations at all time instants for updating the state at a single point, which is computationally very expensive. In practice, the so-called fixed-lag smoother is often incorporated into the ensemble-based smoother algorithms to reduce the computational cost [31, 195, 227]. The fixed-lag smoother exploits the local dependence of the states and uses only the observational information within a local time window $[t-\tau, t+\tau]$ to estimate the state at time $t$. The closed analytic formulae in (8.10) do not suffer from such numerical issues and guarantee the efficiency and accuracy of the nonlinear smoother for the CGNS.

The nonlinear smoother is widely used for state estimation and data postprocessing.
It also plays a vital role in quantifying the uncertainty in the unobserved variables in the parameter


estimation given only partial observations, which will be a topic to be studied in Sect. 9.2. In addition, the nonlinear smoother is the basis for the development of a nonlinear sampling formula, which will be shown below.

8.2.3 Nonlinear Optimal Conditional Sampling

Associated with the nonlinear smoother, a nonlinear conditional sampling formula can be derived. In addition to satisfying the point-wise optimal estimate (8.10), i.e., a distribution formed by the conditional mean and conditional covariance at each time instant, the conditionally sampled trajectories further take into account the path-wise temporal correlation. These sampled trajectories in the CGNS framework can be regarded as the analogs of the ensemble members in the ensemble Kalman smoother [90], but the former can be obtained via a closed analytic formula. Conditioned on one realization of the observed variable $\mathbf{X}(s)$ for $s\in[0,T]$, the optimal strategy of sampling the trajectories associated with the unobserved variable $\mathbf{Y}$ satisfies the following explicit formula [54],

$$\overleftarrow{d}\mathbf{Y} = \overleftarrow{d}\boldsymbol{\mu}_{\mathbf s} - \big(\mathbf{a}_1 + (\mathbf{b}_2\mathbf{b}_2^*)\mathbf{R}_{\mathbf f}^{-1}\big)(\mathbf{Y} - \boldsymbol{\mu}_{\mathbf s})\,dt + \mathbf{b}_2\,d\mathbf{W}_{\mathbf Y}(t), \qquad (8.11)$$

where $\mathbf{W}_{\mathbf Y}(t)$ is a Wiener process that is independent of $\mathbf{W}_2(t)$ in (8.1). Similar to (8.11), a closed analytic formula can also be derived for sampling the trajectories based on the filter solution. Yet, the trajectories resulting from the smoother-based sampling procedure (8.11) capture more dynamical and statistical features of the original system, as additional future information is taken into account.
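Continuing the scalar illustration used for the filter (the same hypothetical coefficients, not from the text), the sketch below runs the forward filter (8.8), the backward smoother (8.10), and draws one conditionally sampled trajectory via (8.11). The backward increment convention follows the text: the smoother starts from the filter estimate at the final time and steps from T down to 0:

```python
import numpy as np

rng = np.random.default_rng(4)
dt, n = 0.005, 40_000
B1, b2, a1, A1, a0 = 0.25, 0.5, -1.0, 1.0, 0.0    # hypothetical values

# True signal: dX = (-X^3 + Y) dt + B1 dW1,  dY = -Y dt + b2 dW2
X = np.zeros(n + 1); Y = np.zeros(n + 1)
for i in range(n):
    X[i+1] = X[i] + (-X[i]**3 + Y[i]) * dt + B1 * np.sqrt(dt) * rng.standard_normal()
    Y[i+1] = Y[i] - Y[i] * dt + b2 * np.sqrt(dt) * rng.standard_normal()

# Forward filter (8.8)
mu_f = np.zeros(n + 1); R_f = np.zeros(n + 1); R_f[0] = b2**2 / 2
for i in range(n):
    gain = R_f[i] * A1 / B1**2
    innov = (X[i+1] - X[i]) - (-X[i]**3 + A1 * mu_f[i]) * dt
    mu_f[i+1] = mu_f[i] + (a0 + a1 * mu_f[i]) * dt + gain * innov
    R_f[i+1] = R_f[i] + (2 * a1 * R_f[i] + b2**2 - (R_f[i] * A1)**2 / B1**2) * dt

# Backward smoother (8.10), started from the filter estimate at time T
mu_s = np.zeros(n + 1); R_s = np.zeros(n + 1)
mu_s[n], R_s[n] = mu_f[n], R_f[n]
for i in range(n, 0, -1):
    C = a1 + b2**2 / R_f[i]
    mu_s[i-1] = mu_s[i] + (-a0 - a1 * mu_s[i] + (b2**2 / R_f[i]) * (mu_f[i] - mu_s[i])) * dt
    R_s[i-1] = R_s[i] + (-2 * C * R_s[i] + b2**2) * dt

# Backward conditional sampling (8.11) of one trajectory
Ys = np.zeros(n + 1)
Ys[n] = mu_s[n] + np.sqrt(R_s[n]) * rng.standard_normal()
for i in range(n, 0, -1):
    C = a1 + b2**2 / R_f[i]
    dmu_back = mu_s[i-1] - mu_s[i]            # backward increment of the smoother mean
    Ys[i-1] = Ys[i] + dmu_back - C * (Ys[i] - mu_s[i]) * dt \
              + b2 * np.sqrt(dt) * rng.standard_normal()
```

In this run the smoother covariance sits below the filter covariance and the smoother mean tracks the hidden signal more closely, consistent with the discussion above; the sampled trajectory fluctuates around the smoother mean with spread set by (8.11).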

8.2.4 Example: Comparison of Filtering, Smoothing and Sampling

Let us end this section by showing the filtering estimate (8.8), the smoothing estimate (8.10), and the (smoother-based) sampled trajectories (8.11) for the CGNS, which facilitates the understanding of these methods. Many findings here are qualitatively similar to those in general nonlinear systems. Consider the following nonlinear dyad model:

$$\begin{aligned}
du &= \big[(-d_u + cv)\,u + F_u\big]\,dt + \sigma_u\,dW_u,\\
dv &= \big[-d_v v - cu^2 + F_v\big]\,dt + \sigma_v\,dW_v, \qquad &(8.12)
\end{aligned}$$

with parameters Fu = 1, Fv = 1, dv = 0.8, du = 0.8, c = 1.2, σv = 2 and σu = 0.5. The nonlinear dyad model in (8.12) is a physics-constrained low-order model. The variable v serves as the stochastic damping of u while u also provides feedback to v. The intermittent


Fig. 8.1 Comparison of the filtering estimate (8.8), the smoothing estimate (8.10), and three sampled trajectories (8.11) of v in the dyad model (8.12), where a time series of u is observed. Note that, although only a 15-unit segment of the time series is shown here for illustration, the PDFs are computed based on a long time series of 500 units. The first row shows the observation u. The second, third, and fourth rows show the filtering estimate, the smoothing estimate, and the sampled trajectories of v conditioned on u. In addition to the filtering and smoothing mean estimates, the associated 95% confidence intervals are also illustrated

change of sign in $v$ can trigger extreme events in the time series of $u$. See also Sect. 7.6.3 for more discussion. Assume one trajectory of $u$ is observed, as shown in the first row of Fig. 8.1. The second to fourth rows of Fig. 8.1 compare the filtering estimate (8.8), the smoothing estimate (8.10), and three sampled trajectories (8.11) of $v$. At least two main conclusions can be drawn from this figure. First, the filtering mean estimate always has a delay in recovering the signal $v$ at the onset phase of an intermittent event in $u$, for example, those at $t = 276.3$, $278.2$, and $287.7$. This raises a fundamental challenge in estimating the unobserved states for systems with observed intermittencies. In contrast, the smoother mean estimate traces the truth of $v$ more accurately. The fundamental difference between the filter and the smoother mean estimates is related to the causal relationship between the two state variables $u$ and $v$. When $v$ grows and exceeds $d_u/c$, it starts to trigger the intermittent events of $u$. Yet, it takes time for $u$ to build up. Note that the peak of $u$ corresponds to the demise phase of $v$, when $v$ decreases and hits $d_u/c$ again. Therefore, before a clear tendency of an increase in the signal $u$ is seen, the filter does not 'realize' that $v$ exceeds the threshold value $d_u/c$. Thus, the recovered state of $v$ meanders around its equilibrium mean state and differs from the truth. The recovered state becomes accurate when the amplitude of the observed signal of $u$ is sufficiently large, and the filter is then also confident of the state estimation. In contrast, since the


smoother can 'see' the future state, it leads to more accurate state estimation. Note that, as was mentioned in Sect. 5.5, the smoothing process is also known as dynamical interpolation. In fact, the filter and smoother are analogs of the one-sided and two-sided moving averages, respectively, which explains the more accurate estimates of the smoother. Second, although the filter or smoother mean is the optimal point-wise estimate and the posterior mean time series is widely utilized as an approximation to the true signal, the posterior mean time series can display entirely different features from the truth. As is shown in Fig. 8.1, the smoother estimate resembles the truth with a small uncertainty when an intermittent event of $u$ occurs. On the other hand, when the observed time series of $u$ stays in a quiescent phase, the recovered $v$ does not trace the truth due to the loss of practical observability. Instead, the posterior mean estimate is close to the equilibrium mean of $v$, and the uncertainty is significant. Thus, due to the quiescent events, the recovered signal of $v$ looks very distinct from the truth, and the posterior mean time series suffers from an underestimation of the uncertainty in its distribution. In contrast, the sampled trajectories account for the uncertainty. Thus, the PDF of the sampled trajectories is essentially the same as that of the truth. The sampled trajectories can also reproduce many dynamical features of the truth, such as the ACFs.
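The intermittent regime described above can be reproduced by directly simulating (8.12) with the parameter values given in the text; the Euler-Maruyama sketch below is an added illustration (the step size and run length are arbitrary choices):

```python
import numpy as np

# Dyad model (8.12) with the parameters from the text:
# F_u = F_v = 1, d_u = d_v = 0.8, c = 1.2, sigma_u = 0.5, sigma_v = 2.
rng = np.random.default_rng(5)
Fu, Fv, d_u, d_v, c = 1.0, 1.0, 0.8, 0.8, 1.2
sigma_u, sigma_v = 0.5, 2.0
dt, n = 2.5e-4, 200_000          # 50 time units (arbitrary choices)

u = np.zeros(n + 1); v = np.zeros(n + 1)
for i in range(n):
    # v acts as stochastic damping of u; -c u^2 is the energy-conserving feedback
    u[i+1] = u[i] + ((-d_u + c * v[i]) * u[i] + Fu) * dt \
             + sigma_u * np.sqrt(dt) * rng.standard_normal()
    v[i+1] = v[i] + (-d_v * v[i] - c * u[i]**2 + Fv) * dt \
             + sigma_v * np.sqrt(dt) * rng.standard_normal()
```

Episodes with v > d_u/c switch the effective damping of u to antidamping, producing the bursts seen in the first row of Fig. 8.1.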

8.3 Lagrangian Data Assimilation

Lagrangian data assimilation is a special but important type of data assimilation [15, 16, 136] with wide applications in geophysics, climate science, engineering and hydrology [29, 45, 128, 222]. Unlike Eulerian observations at fixed locations, Lagrangian data assimilation exploits the trajectories of moving tracers (e.g., drifters or floats) as observations to recover the underlying flow field, which is often hard to observe directly. These Lagrangian tracers are of particular significance for autonomous data collection in the ocean.

8.3.1 The Mathematical Setup

As in the general data assimilation framework, Lagrangian data assimilation contains two components: a set of observations consisting of Lagrangian tracer trajectories and a forecast model of the underlying flow field. For the convenience of discussion, a two-dimensional flow field in the spatial domain $(0,2\pi]^2$ with doubly periodic boundary conditions is used to present the idea. The framework can easily be extended to more complicated systems. Denote by $\mathbf{v}(\mathbf{x},t)$ the two-dimensional flow field, where $\mathbf{x} = (x,y)^{\mathrm T}$ is the coordinate of the velocity field $\mathbf{v} = (u,v)^{\mathrm T}$ and $t$ is time. Because of the periodic boundary conditions, the full velocity field can be expressed in the spectral representation using Fourier bases,

$$\mathbf{v}(\mathbf{x},t) = \sum_{\mathbf{k}\in\mathcal{K}} \hat{v}_{\mathbf{k}}(t)\, e^{i\mathbf{k}\cdot\mathbf{x}}\, \mathbf{r}_{\mathbf{k}}, \qquad (8.13)$$


where $\mathcal{K}$ is a finite set that contains the Fourier wavenumbers $\mathbf{k} = (k_1,k_2)\in\mathbb{Z}^2$, $e^{i\mathbf{k}\cdot\mathbf{x}}$ is the two-dimensional Fourier basis function, and $\hat{v}_{\mathbf{k}}(t)$ is the corresponding scalar Fourier time series. The two-dimensional vector $\mathbf{r}_{\mathbf{k}}$ in (8.13) is the eigenvector associated with the wavenumber $\mathbf{k}$. It encodes the relationship between the two components $u$ and $v$ for a fixed wavenumber. For example, if the flow field is incompressible, then the relationship between the two velocity components is connected via $\mathbf{r}_{\mathbf{k}}$. See also the discussions in Sect. 7.3. Now let each $\hat{v}_{\mathbf{k}}(t)$ follow a complex OU process as was described in (7.2),

$$d\hat{v}_{\mathbf{k}}(t) = (-d_{\mathbf{k}} + i\omega_{\mathbf{k}})\,\hat{v}_{\mathbf{k}}(t)\,dt + f_{\mathbf{k}}(t)\,dt + \sigma_{\mathbf{k}}\,dW^{v}_{\mathbf{k}}(t), \qquad (8.14)$$

where the damping $d_{\mathbf{k}} > 0$, the phase $\omega_{\mathbf{k}}$ and the stochastic forcing coefficient $\sigma_{\mathbf{k}}$ are all constants for each fixed $\mathbf{k}$, while $f_{\mathbf{k}}(t)$ can either be a constant or a time-dependent deterministic function, and $W^{v}_{\mathbf{k}}$ is a Wiener process. Note that in practice, the true underlying dynamics of each $\hat{v}_{\mathbf{k}}$ may come from a different and more complicated system. In contrast, the OU process (8.14) is only used as a forecast model in Lagrangian data assimilation, which significantly facilitates the computation. As discussed in the previous chapters, although the OU process is only skillful for systems with nearly Gaussian statistics in terms of long-range forecasts, it can be a suitable approximation for certain phenomena with moderate non-Gaussian features in the context of data assimilation, since observations help mitigate the model error. This is particularly true for Lagrangian data assimilation, in which the role of the observation process in recovering the underlying flow field becomes more significant as the number of observed tracers increases. On the other hand, the observations are given by the trajectories of $L$ Lagrangian tracers. The governing equation satisfies Newton's law, which says that the time derivative of the displacement equals the velocity. Therefore, the observation process for the $l$th Lagrangian tracer $\mathbf{x}_l$, with $l = 1,\dots,L$, reads,

$$d\mathbf{x}_l(t) = \mathbf{v}(\mathbf{x}_l(t),t)\,dt + \sigma_x\,d\mathbf{W}^{x}_l(t) = \sum_{\mathbf{k}\in\mathcal{K}}\hat{v}_{\mathbf{k}}(t)\,e^{i\mathbf{k}\cdot\mathbf{x}_l(t)}\,\mathbf{r}_{\mathbf{k}}\,dt + \sigma_x\,d\mathbf{W}^{x}_l(t), \qquad (8.15)$$

where (8.13) is utilized to reach the result in the second equality. The additional noise $\sigma_x\,d\mathbf{W}^{x}_l(t)$ accounts for the uncertainty in the observation process. In particular, it is added to compensate for the effect of the unresolved modes, for example, those with wavenumbers outside the set $\mathcal{K}$, that are not explicitly modeled here. One important finding in (8.15) is that the observation process is highly nonlinear, as the state variable $\mathbf{x}_l$ appears in the exponential function. Such a feature introduces an inherent challenge in Lagrangian data assimilation, as the coupled observation-model system is always nonlinear regardless of the forecast model utilized. This is very different from data assimilation with Eulerian observations, where the observation processes in many cases are linear. Now, collect all the Fourier coefficients of the underlying flow field and put them into a vector $\mathbf{U} = (\dots,\hat{v}_{\mathbf{k}},\dots)^{\mathrm T}$ of size $|\mathcal{K}|\times 1$, which contains the state variables to be recovered. Here, $|\mathcal{K}|$ denotes the number of elements in $\mathcal{K}$. Likewise, a $2L\times 1$ dimensional vector $\mathbf{X}$ is used


to include the displacements of all the Lagrangian tracers, with $\mathbf{X} = (x_1,y_1,\dots,x_L,y_L)^{\mathrm T}$. Writing (8.15) and (8.14) in terms of $\mathbf{X}$ and $\mathbf{U}$ yields the following framework of nonlinear Lagrangian data assimilation:

$$\begin{aligned}
\text{Observations:}\qquad & d\mathbf{X} = P_X(\mathbf{X})\,\mathbf{U}\,dt + \Sigma_X\,d\mathbf{W}_{\mathbf X}, \qquad &(8.16a)\\
\text{Underlying flow:}\qquad & d\mathbf{U} = -\Gamma\,\mathbf{U}\,dt + \mathbf{F}(t)\,dt + \Sigma_u\,d\mathbf{W}_{\mathbf u}, \qquad &(8.16b)
\end{aligned}$$

where each entry of the matrix $P_X(\mathbf{X})$ contains an exponential function of the displacement multiplied by the associated eigenvector, and the matrices $\Gamma$, $\Sigma_X$, and $\Sigma_u$ are all diagonal when the OU process is utilized to model the time evolution of each of the Fourier coefficients. If the observational frequency is much faster than the time scale of the underlying flow field, then it is reasonable to assume that the observations are available in continuous form. In such a case, the problem of solving the conditional distribution of the flow field given the observations $p(\mathbf{U}(t)\,|\,\mathbf{X}(s), s\le t)$ from (8.16) becomes a continuous data assimilation problem. Notably, the state variable to be recovered, $\mathbf{U}$, appears only in a conditionally linear way in the coupled system (8.16). Therefore, the Lagrangian data assimilation problem (8.16) is a CGNS, and the closed analytic formulae (8.8) can be exploited to efficiently and accurately solve the posterior distribution $p(\mathbf{U}(t)\,|\,\mathbf{X}(s), s\le t)$.

As a final remark, if the underlying flow field is highly nonlinear or the observations are discrete in time, then the EnKF can be applied to Lagrangian data assimilation [16, 128, 136]. In addition, a hybrid EnKF-particle filter has been developed to account for the strong nonlinearity in the tracer dynamics [234]. For more sophisticated or operational systems, localization has also been incorporated into the EnKF for Lagrangian data assimilation [51, 243].
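To make the structure of (8.16a) concrete, the sketch below (added here; the retained mode set and tracer positions are hypothetical) assembles the observation matrix P_X(X) for an incompressible flow, using the standard divergence-free eigenvectors r_k = (-k2, k1)ᵀ/|k|; each tracer contributes two rows built from exp(i k·x_l) r_k:

```python
import numpy as np

# Hypothetical set of retained wavenumbers (a real flow field would use
# conjugate-symmetric mode pairs so that the velocity is real-valued).
K = [(1, 0), (0, 1), (1, 1), (1, -1)]
# Divergence-free eigenvectors r_k = (-k2, k1)^T / |k|
r = {k: np.array([-k[1], k[0]]) / np.hypot(*k) for k in K}

def P_X(tracers):
    """tracers: (L, 2) array of positions -> (2L, |K|) complex observation matrix."""
    rows = []
    for x in tracers:
        # one 2 x |K| block per tracer: columns exp(i k.x) r_k
        cols = [np.exp(1j * (k[0] * x[0] + k[1] * x[1])) * r[k] for k in K]
        rows.append(np.column_stack(cols))
    return np.vstack(rows)

tracers = np.array([[0.3, 1.2], [4.0, 2.5]])   # L = 2 hypothetical positions
P = P_X(tracers)                               # shape (2L, |K|) = (4, 4)
U = np.array([0.1 + 0.2j, -0.3j, 0.05, 0.1])   # hypothetical Fourier coefficients
drift = P @ U                                  # tracer velocities entering (8.16a)
```

Because X enters P_X(X) only through the exponentials, the right-hand side is nonlinear in X but linear in U, which is exactly the conditional Gaussian structure of (8.16).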

8.3.2 Filtering Compressible and Incompressible Flows

Let us start with filtering the incompressible flow. Incompressibility means the velocity field is divergence-free, namely ∇ · u = 0, or, in the two-dimensional case, ∂u/∂x + ∂v/∂y = 0. Since the velocity field associated with each Fourier wavenumber k is u_k(x, t) = v̂_k(t) e^{ik·x} r_k, the incompressibility condition leads to i k_1 e^{ik·x} r_{1,k} + i k_2 e^{ik·x} r_{2,k} = 0, where r_k = (r_{1,k}, r_{2,k})^T, which means r_k = (−k_2, k_1)^T / √(k_1² + k_2²) for k ≠ 0. The two flow components with k = 0 are the time-dependent background mean flows going uniformly in the x and y directions, respectively; they automatically satisfy the divergence-free condition. Here, for simplicity, the OU process in (8.14) is utilized to generate the true signal of each Fourier mode of the underlying flow and is also used as the forecast model for Lagrangian data assimilation. In more practical situations, the quasi-geostrophic (QG) model can be utilized to generate the true incompressible field while a set of the OU processes in (8.14) is


still adopted as the forecast model for Lagrangian data assimilation in various applications. See also Sect. 7.2.3. Since incompressibility is related to the geophysical balance, the Fourier modes of the incompressible flow are often named the geophysically balanced (GB) modes. On the other hand, many flow fields are compressible. One simple and classical test model is the linear rotating shallow water equation (7.7), which has one GB mode and a pair of gravity modes associated with each Fourier wavenumber. See Sect. 7.3 for details. Exploiting the dispersion relationship from the linear solution (7.8) with ω_{k,B} = 0 and ω_{k,±} = ±ε^{−1}√(|k|² + 1), independent OU processes (7.10) can be utilized to model the time evolution of the GB mode v̂_{k,B} and the gravity modes v̂_{k,±} for each fixed wavenumber k. The parameters utilized here are: d_{k,B} = d_{k,±} = 0.5, σ_{k,B} = 0.15 for k ≠ 0, σ_{0,B} = 0.1 (the two background modes), σ_{k,±} = 0.1 and f_{k,B} = f_{k,±} = 0. Note that although the forecast system is written in terms of the characteristic variables (the GB and gravity modes), the observation processes involve the physical variables, namely the two-dimensional velocity field. This means that although the three characteristic variables have their own governing equations, they are mixed in the observation processes. To understand the Lagrangian data assimilation skill, three numerical experiments are carried out, with the results shown in Fig. 8.2. The first experiment considers the incompressible flow, which only involves the GB modes. In contrast, the other two experiments utilize the linear rotating shallow water equation (7.7) that includes both the GB and gravity modes. The latter two experiments differ in the Rossby number: the second experiment has a relatively fast oscillation with ε = 0.2, while the third experiment has a moderate oscillation with ε = 1.
Here, the underlying flow field is given by summing a finite number of Fourier modes with |k_1| ≤ 2 and |k_2| ≤ 2. Thus, in the first experiment, there are in total 26 Fourier modes (24 modes with nonzero wavenumbers and two modes representing the background flows). On the other hand, the other two experiments have in total 74 Fourier modes (the gravity waves do not have zeroth Fourier modes). Panel (a) of Fig. 8.2 shows the data assimilation skill in the three scenarios. Only the results for mode (1, 1) are illustrated, but the filtering skill for the other modes is similar. Clearly, with an increased number of observed tracers L, the posterior mean (red curve) converges to the truth, and the associated uncertainty (red shaded area) shrinks. Note that despite the fast oscillation of the gravity waves in the second experiment, the filtered solution can still recover the truth with a sufficiently large number of tracers. Panels (c)–(d) of Fig. 8.2 show a snapshot of the underlying flow field and the tracer locations at t = 8 for the second and the third experiments. It indicates that when ε = 0.2 is small, the effect of the fast gravity waves averages out in time, and the incompressible GB modes dominate the underlying flow field. As a result, the tracers are nearly uniformly distributed. In contrast, when ε = 1 is a moderate number, the compressible and the incompressible flows compete with each other. Due to the significant contribution of the gravity modes, the compressibility becomes more dominant, and the tracers may display clustering features rather than being distributed uniformly. See [63, 64] for more theoretical and numerical studies.


8.3.3 Uncertainty Quantification Using Information Theory

The numerical experiments above illustrate that, as the number of tracers increases, the recovered ocean field becomes more and more accurate. From a practical point of view, it is crucial to explore the exact mathematical relationship between the uncertainty reduction in the posterior distribution and the number of tracers L. Such a result predicts the number of tracers to be deployed into the ocean, given a pre-determined threshold for the tolerance of the uncertainty. To establish such a mathematical relationship, the information theory introduced in Chap. 2 is utilized to analyze the posterior estimate from data assimilation. The results presented below are based on the incompressible flow, which allows rigorous mathematical analysis [63]. Numerical simulations can be adopted to explore the scenario in which the flow field is compressible [57, 64]. Let us start with the situation with no observations. Then the state estimation of the ocean relies completely on the model. Denote by p(U_t) ~ N(m_t^att, R_t^att) the long-term statistical solution from the model (8.16b), i.e., the solution at the statistical attractor. The distribution p(U_t) can be a static distribution, known as the equilibrium distribution, if F(t) is a constant. It may also be a time-periodic function if F(t) is time-periodic, reflecting seasonal variations. Since (8.16b) is a linear Gaussian model, p(U_t) at each fixed time instant t is Gaussian. Next, denote by p(U_t|X_{s≤t}) ~ N(m_t, R_t) the posterior distribution from Lagrangian data assimilation, which is also Gaussian and depends on the number of tracers L. To quantify the uncertainty reduction with the help of the observations, the relative entropy (2.14) is adopted, which measures the information gain (i.e., the uncertainty reduction) in the posterior distribution p(U_t|X_{s≤t}) relative to the prior one p(U_t):

P( p(U_t|X_{s≤t}), p(U_t) ) = ∫ p(U_t|X_{s≤t}) ln( p(U_t|X_{s≤t}) / p(U_t) ) dU_t.

Since both the prior and the posterior distributions are Gaussian, the information gain can be written utilizing the following explicit formula, according to (2.15),

P( p(U_t|X_{s≤t}), p(U_t) ) = (1/2) (m_t − m_t^att)* (R_t^att)^{−1} (m_t − m_t^att) + (1/2) [ tr(R_t (R_t^att)^{−1}) − |K| − ln det(R_t (R_t^att)^{−1}) ],   (8.17)

where the first term on the right-hand side is the signal part and the second (bracketed) term is the dispersion part.
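The two parts of (8.17) are straightforward to evaluate numerically. A minimal sketch (the function name and test values below are illustrative):

```python
import numpy as np

def gaussian_relative_entropy(m, R, m_att, R_att):
    """Signal and dispersion parts of the relative entropy (8.17)
    between the posterior N(m, R) and the prior N(m_att, R_att)."""
    d = len(m_att)
    R_att_inv = np.linalg.inv(R_att)
    dm = m - m_att
    signal = 0.5 * dm @ R_att_inv @ dm
    M = R @ R_att_inv
    dispersion = 0.5 * (np.trace(M) - d - np.log(np.linalg.det(M)))
    return signal, dispersion
```

When the posterior equals the prior, both parts vanish; shifting the posterior mean raises the signal part, while shrinking the posterior covariance raises the dispersion part.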

It has been shown in [63] with rigorous mathematical justification that the information gain in the signal part converges to a constant when L → ∞,

Signal → (1/2) (U_t^truth − m_t^att)* (R_t^att)^{−1} (U_t^truth − m_t^att),   (8.18)


while the information gain in the dispersion part never converges; instead, it increases as a function of ln L,

Dispersion / ( ((|K| + 2)/4) ln L ) → 1.   (8.19)

The rigorous proof of the results in (8.18)–(8.19) requires advanced probability theory, which is beyond the scope of this book. Nevertheless, intuition can be provided to explain these results. First, as is seen in Panel (a) of Fig. 8.2, the posterior mean converges to the true signal as L → ∞. Thus, replacing m_t in the signal part of (8.17) by the true signal U_t^truth leads to (8.18). Since both m_t^att and R_t^att are constants for each fixed t, the signal part converges to a constant. Next, as L becomes large, the amplitude of the covariance matrix is expected to shrink as a natural outcome of the reduction of the uncertainty. It has been shown in [63] that the posterior covariance matrix R_t converges to a diagonal matrix, where each diagonal entry is proportional to 1/√L. This means the first two terms in the dispersion part of (8.17) become constants as L → ∞, while the third term gives rise to the ln L factor that appears in (8.19). Panel (b) of Fig. 8.2 shows the information gain as a function of L, computed by averaging (8.17) over 50 time units. The numerical result is consistent with the theoretical conclusions in (8.18)–(8.19).

Fig. 8.2 Lagrangian data assimilation. Panel a: comparison of the truth and the filter posterior estimates. The results show the three experiments described in Sect. 8.3.2. Gravity modes also appear in experiments II and III; they have similar behavior to the GB modes, and the results are not shown. Panel b: the information gain in the signal and dispersion parts (8.18)–(8.19) as a function of the number of tracers L. Panels c–d: the ocean fields and the tracer distributions at time t = 8 for the shallow water equation with ε = 0.2 and ε = 1, respectively. The ocean field of the former is nearly incompressible, and therefore the tracers are nearly uniformly distributed. The latter displays significant clustering behavior of the tracers due to the strong compressibility of the ocean field


8.4 Solving High-Dimensional Fokker-Planck Equations

Recall from Sect. 1.3.3 that the Fokker-Planck equation describes the time evolution of the PDF associated with a given SDE. For many complex dynamical systems, including geophysical and engineering turbulence, neuroscience, and excitable media, the solution of the Fokker-Planck equation involves strong non-Gaussian features with intermittency and extreme events [75, 162]. In addition, the dimension of these complex systems is typically very large, representing a variety of features on different temporal and spatial scales [257]. Therefore, solving the high-dimensional Fokker-Planck equation for both the steady state and the transient phases with non-Gaussian features is an important topic. However, traditional numerical methods such as finite elements and finite differences can only deal with systems of two or three dimensions. Particle methods based on Monte Carlo simulation can approximate the PDFs for systems with higher dimensions, but they may lose accuracy in capturing the non-Gaussian features when the number of state variables becomes much larger. In addition, parametric representations of the PDFs are preferred in many applications, such as the FDT (see Sect. 6.4), but direct particle methods may not be able to achieve such a goal. The CGNS (8.1) allows for the development of a statistically accurate algorithm for solving the Fokker-Planck equation with semi-analytic expressions that facilitate many applications. By further incorporating block decomposition and statistical symmetry, the algorithm can handle systems with relatively large dimensions.

8.4.1 The Basic Algorithm

Assume that the dimension of X is low while the dimension of Y can be arbitrarily high. Different strategies are developed to deal with the subspaces of X and Y [60].

Step 1. Generate L independent trajectories of the variables X, namely X^1(s ≤ t), ..., X^L(s ≤ t), where L is a small number. This can be achieved by running a Monte Carlo simulation of the full system, which is computationally affordable with a small L. Alternatively, a low-dimensional closure model of X can be developed, and the L trajectories can then be simulated from this reduced-order model.

Step 2. The PDF of Y(t) is estimated via a parametric method that exploits a conditional Gaussian mixture to characterize the associated non-Gaussian features,

p(Y(t)) = lim_{L→∞} (1/L) Σ_{l=1}^{L} p(Y(t)|X^l(s ≤ t)),   (8.20)

where the closed formulae of the conditional Gaussian statistics in (8.8) are utilized to build each of the mixture components p(Y(t)|X^l(s ≤ t)). See [60] for the derivation of (8.20).


Note that the limit L → ∞ in (8.20) (and in (8.21)–(8.22)) is taken to illustrate the statistical intuition, while the practical estimator is the non-asymptotic version.

Step 3. Due to the low dimensionality of X, a Gaussian kernel density estimation (see Sect. 3.4) is utilized for solving the PDF of X(t) based on the L samples,

p(X(t)) = lim_{L→∞} (1/L) Σ_{l=1}^{L} K_H( X(t) − X^l(t) ),   (8.21)

where K_H(·) is a Gaussian kernel centered at each sample point X^l(t) with covariance given by the bandwidth matrix H(t). The kernel density estimation algorithm here involves a "solve-the-equation plug-in" approach for optimizing the bandwidth [32] that is more appropriate for non-Gaussian PDFs.

Step 4. Combining (8.20) and (8.21), a hybrid method is applied to solve the joint PDF of (X(t), Y(t)) through a Gaussian mixture,

p(X(t), Y(t)) = lim_{L→∞} (1/L) Σ_{l=1}^{L} [ K_H(X(t) − X^l(t)) · p(Y(t)|X^l(s ≤ t)) ].   (8.22)

Note that the L conditional distributions in (8.20) can be solved in closed form in parallel due to their independence [60], which further reduces the computational cost. Rigorous analysis [65] shows that the hybrid algorithm (8.22) requires far fewer samples than the direct Monte Carlo method, especially when the dimension of Y is large. Intuitively, each Gaussian mixture component can be regarded as one 'ensemble member'. Unlike the direct Monte Carlo simulation, where each ensemble member is a point, each Gaussian mixture component covers a finite volume of the state space. Therefore, a relatively smaller number of samples is needed in (8.22).
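To make the structure of (8.22) concrete, the sketch below evaluates the hybrid mixture density in a toy setting with scalar X and scalar Y. The conditional means and variances, which in the actual algorithm come from the closed filtering formulae (8.8), are replaced here by hypothetical precomputed arrays (mu_f, R_f), and the bandwidth uses a simple rule of thumb instead of the solve-the-equation plug-in of [32]:

```python
import numpy as np

def hybrid_joint_density(x, y, X_samples, mu_f, R_f, h):
    """Evaluate the hybrid mixture estimate (8.22) of p(x, y):
    a Gaussian kernel in x times a conditional Gaussian in y, averaged over L."""
    L = len(X_samples)
    kde = np.exp(-0.5 * (x - X_samples) ** 2 / h**2) / np.sqrt(2 * np.pi * h**2)
    cond = np.exp(-0.5 * (y - mu_f) ** 2 / R_f) / np.sqrt(2 * np.pi * R_f)
    return np.sum(kde * cond) / L

rng = np.random.default_rng(0)
L = 200
X_samples = rng.standard_normal(L)            # samples of the observed variable X
mu_f = 0.5 * X_samples                        # hypothetical conditional means of Y | X^l
R_f = np.full(L, 0.3)                         # hypothetical conditional variances
h = 1.06 * X_samples.std() * L ** (-1 / 5)    # rule-of-thumb bandwidth

# Each component is a product of two normalized Gaussians, so the mixture
# should integrate to (approximately) one over a wide grid.
xs = np.linspace(-6, 6, 81)
dx = xs[1] - xs[0]
total = sum(hybrid_joint_density(xv, yv, X_samples, mu_f, R_f, h)
            for xv in xs for yv in xs) * dx * dx
```

Since every mixture component is normalized, the grid sum `total` is close to one, which is a quick sanity check on the estimator.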

8.4.2 A Simple Example

Consider the noisy L63 model (5.25) with parameters σ = 10, ρ = 28, β = 8/3 and σ_x = σ_y = σ_z = 20. Only the observation of x is utilized. The initial distribution of the system is Gaussian, with a zero mean and a 3 × 3 covariance matrix whose three diagonal entries are 0.25 and whose off-diagonal entries are zero. Figure 8.3 displays the PDFs at both a transient phase t = 0.35 and a nearly statistical equilibrium phase t = 5. Here, the results from the algorithm in (8.22) are compared with those from a standard Monte Carlo simulation; the latter serves as the truth. Only 500 particles are adopted in the algorithm in (8.22), while 150,000 particles are utilized in the standard Monte Carlo simulation. Figure 8.3 shows that the new algorithm in (8.22) succeeds in recovering the PDFs at both the transient phase and the near-equilibrium state. In particular, the strong non-Gaussian

features at the transient phase are all captured accurately by the new algorithm with a much smaller number of samples.

Fig. 8.3 Recovering the non-Gaussian PDFs of the noisy L63 model (5.25) using the new algorithm in (8.22) with 500 particles, at a the transient phase t = 0.35 and b the equilibrium phase t = 5. The results are compared with the standard Monte Carlo simulation with 150,000 particles, serving as the truth. The parameters of the L63 model are σ = 10, ρ = 28, β = 8/3 and σx = σy = σz = 20
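For reference, the "truth" benchmark in such experiments is a direct Euler-Maruyama simulation of the noisy model. A minimal sketch, assuming (5.25) is the standard Lorenz 63 system with additive noise of amplitude 20 on each component:

```python
import numpy as np

def simulate_noisy_l63(T=10.0, dt=1e-3, sigma=10.0, rho=28.0, beta=8/3,
                       noise=20.0, seed=0):
    """Euler-Maruyama integration of the noisy Lorenz 63 model."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    u = np.zeros((n + 1, 3))
    for i in range(n):
        x, y, z = u[i]
        drift = np.array([sigma * (y - x),
                          x * (rho - z) - y,
                          x * y - beta * z])
        u[i + 1] = u[i] + drift * dt + noise * np.sqrt(dt) * rng.standard_normal(3)
    return u

traj = simulate_noisy_l63()
```

Running many such trajectories and histogramming the states is exactly the direct Monte Carlo route whose cost the hybrid algorithm (8.22) is designed to avoid.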

8.4.3 Block Decomposition

Block decomposition is a useful technique that facilitates the calculation of high-dimensional PDFs. It breaks the covariance matrix into several smaller blocks, each of which can be easily handled with a much lower computational cost. Consider the following decomposition of the state variables u_k = (X_k, Y_k) with X_k ∈ C^{n_{1,k}} and Y_k ∈ C^{n_{2,k}}, where 1 ≤ k ≤ K, n_1 = Σ_{k=1}^{K} n_{1,k} and n_2 = Σ_{k=1}^{K} n_{2,k}. Correspondingly, the full dynamics in (8.1) is also decomposed into K groups. For notational simplicity, assume both B_1 and b_2 are diagonal, and thus the noise coefficient matrices associated with the equations of X_k and Y_k are B_{1,k} and b_{2,k}, respectively. The following two conditions are imposed on the coupled system to develop efficient statistically accurate algorithms that beat the curse of dimensionality.

Condition 1: In the dynamics of each u_k in (8.1), the terms A_{0,k} and a_{0,k} can depend on all the components of X, while the terms A_{1,k} and a_{1,k} are only functions of X_k, namely,

A_{0,k} := A_{0,k}(t, X),   a_{0,k} := a_{0,k}(t, X),   A_{1,k} := A_{1,k}(t, X_k),   a_{1,k} := a_{1,k}(t, X_k).   (8.23)

In addition, only Y_k interacts with A_{1,k} and a_{1,k} on the right-hand side of the dynamics of u_k. Therefore, the equation of each u_k = (X_k, Y_k) becomes

dX_k = [A_{0,k}(t, X) + A_{1,k}(t, X_k) Y_k] dt + B_{1,k}(t, X_k) dW_{1,k}(t),   (8.24a)
dY_k = [a_{0,k}(t, X) + a_{1,k}(t, X_k) Y_k] dt + b_{2,k}(t, X_k) dW_{2,k}(t).   (8.24b)

Note that in (8.24) each u_k is fully coupled with the other u_{k'} for all k' ≠ k through A_{0,k}(t, X) and a_{0,k}(t, X). There is no trivial decoupling between different state variables.

Condition 2: The initial values of (X_k, Y_k) and (X_{k'}, Y_{k'}) with k ≠ k' are independent of each other.

These two conditions are common features of many complex systems with multiscale structures, multilevel dynamics, or state-dependent parameterizations [36, 270]. Under these two conditions, the conditional covariance matrix becomes block diagonal, which can be easily verified according to (8.8a). The evolution of the conditional covariance R_{f,k} of Y_k conditioned on X is given by

dR_{f,k}(t) = [ a_{1,k} R_{f,k} + R_{f,k} a_{1,k}^* + b_{2,k} b_{2,k}^* − (R_{f,k} A_{1,k}^*)(B_{1,k} B_{1,k}^*)^{−1} (R_{f,k} A_{1,k}^*)^* ] dt,

which has no interaction with R_{f,k'} for any k' ≠ k, since A_{0,k} and a_{0,k} do not enter the evolution of the conditional covariance. Notably, the evolutions of the different R_{f,k} with k = 1, ..., K can be solved in parallel, and the computation is highly efficient due to the small size of each block. This facilitates solving the Fokker-Planck equation in large dimensions. Next, the structures of A_{0,k} and a_{0,k} in (8.23) allow coupling among all the K groups of variables in the conditional mean according to (8.8b). The evolution of μ_{f,k}, namely the conditional mean of Y_k conditioned on X, is given by

dμ_{f,k}(t) = [a_{0,k} + a_{1,k} μ_{f,k}] dt + R_{f,k} A_{1,k}^* (B_{1,k} B_{1,k}^*)^{−1} [ dX_k − (A_{0,k} + A_{1,k} μ_{f,k}) dt ].
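Because the conditional covariance equations decouple across blocks, they can be integrated independently (and in parallel). The sketch below evolves K scalar blocks of the Riccati equation above with forward Euler and checks each against its analytic steady state; all parameter values are hypothetical:

```python
import numpy as np

# Scalar-per-block version of the conditional covariance equation:
# dR_k = (2*a1[k]*R_k + b2[k]**2 - (A1[k]**2 / B1[k]**2) * R_k**2) dt,
# one independent (hence parallelizable) Riccati equation per block.
def evolve_block_covariances(a1, A1, B1, b2, R0, dt=1e-3, n_steps=20000):
    R = R0.copy()
    for _ in range(n_steps):
        R += dt * (2 * a1 * R + b2**2 - (A1**2 / B1**2) * R**2)
    return R

K = 4
a1 = np.array([-1.0, -0.5, -2.0, -1.5])   # damping per block (illustrative)
A1 = np.ones(K)
B1 = np.array([0.5, 1.0, 0.3, 0.8])
b2 = np.array([0.4, 0.6, 0.5, 0.3])
R = evolve_block_covariances(a1, A1, B1, b2, R0=np.ones(K))

# Steady state: the positive root of c*R^2 - 2*a1*R - b2^2 = 0, c = (A1/B1)^2
c = (A1 / B1) ** 2
R_star = (2 * a1 + np.sqrt(4 * a1**2 + 4 * c * b2**2)) / (2 * c)
```

In the vector-valued case each block would carry a small matrix Riccati equation instead, but the decoupled, per-block structure is identical.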

8.4.4 Statistical Symmetry

As discussed in the previous two subsections, the hybrid strategy and the block decomposition provide an extremely efficient way to solve the high-dimensional Fokker-Planck equation associated with conditional Gaussian systems. In fact, the computational cost of the algorithms developed above can be further reduced if the coupled system (8.1) has statistical symmetry [58],


p( X_k(t), Y_k(t) ) = p( X_{k'}(t), Y_{k'}(t) ),   for all k and k',   (8.25)

namely, the statistical features of the variables with different k are the same. The statistical symmetry is often satisfied when the underlying dynamical system represents a discrete approximation of some PDE in a periodic domain with nonlinear advection, diffusion, and homogeneous external forcing [176, 183]. With the statistical symmetry, collecting the conditional Gaussian ensembles N(μ_{f,k}(t), R_{f,k}(t)) for a specific k in K different simulations is equivalent to collecting those for all k with 1 ≤ k ≤ K in a single simulation. This also applies to the kernels N(X^l(t), H(t)) associated with X. Therefore, the statistical symmetry implies that the effective sample size is L' = KL, where K is the number of group variables that are statistically symmetric and L is the number of independent simulations of the coupled system. If K is large, a much smaller L is needed to reach the same accuracy as in the situation without statistical symmetry, significantly reducing the computational cost.

8.4.5 Application to FDT

One direct application of the Gaussian mixture expression for solving the high-dimensional Fokker-Planck equation in (8.22) (together with the block decomposition and the statistical symmetry) is to compute the FDT. Recall from Sect. 6.4 that one of the difficulties in calculating the FDT is approximating the equilibrium PDF. Now, assume the system is ergodic. Then, given one long trajectory, each X^l in (8.22) is the observation of X at a different time instant t_l. The full joint PDF at the statistical equilibrium state is:

p_eq(X, Y) = lim_{L→∞} (1/L) Σ_{l=1}^{L} p^l(X, Y),   (8.26)

where

p^l = (2π)^{−n_1/2} |H|^{−1/2} exp( −(1/2)(X − X^l)^T H^{−1} (X − X^l) ) × (2π)^{−n_2/2} |R_f^l|^{−1/2} exp( −(1/2)(Y − μ_f^l)^T (R_f^l)^{−1} (Y − μ_f^l) ).   (8.27)

With (8.27) in hand, B(u) = B(X, Y) in (6.14) can be computed utilizing the following explicit formula:

B(u) = − div_u( w(u) p_eq ) / p_eq = − Σ_{i=1}^{N} ∂w_i(X, Y)/∂u_i − Σ_{i=1}^{N} w_i · [ Σ_{l=1}^{L} exp(F^l) · G_{2,i}^l ] / [ Σ_{l=1}^{L} exp(F^l) · G_1^l ],   (8.28)


where

F^l = −(1/2)(X − X^l)^T H^{−1} (X − X^l) − (1/2)(Y − μ_f^l)^T (R_f^l)^{−1} (Y − μ_f^l),

G_1^l = 1 / √( (2π)^{n_1+n_2} |H| |R_f^l| ),   G_{2,i}^l = (∂F^l/∂u_i) · G_1^l.   (8.29)

8.5 Application: Modeling and Predicting Monsoon Intraseasonal Oscillation (MISO)

Monsoon Intraseasonal Oscillation (MISO) [113, 149, 158, 232, 265] is one of the prominent modes of tropical intraseasonal variability. As a slow-moving planetary scale envelope of convection propagating northeastward, it strongly interacts with the boreal summer monsoon rainfall over south Asia. The MISO plays a crucial role in determining the onset and demise of the Indian summer monsoon and affecting the seasonal amount of rainfall over the Indian subcontinent [96, 112, 113]. Therefore, real-time monitoring and accurate extended-range forecast of MISO phases are important, and they have a large socioeconomic impact on the Indian subcontinent [1, 221].

8.5.1 MISO Indices from a Nonlinear Data Decomposition Technique

Several indices have been proposed for real-time monitoring and extended-range forecast of the MISO. Applying a data decomposition method to the high-dimensional spatiotemporal data, the leading few resulting modes are often suitable representations of the large-scale features of the MISO. Here, each mode is given by the product of a spatial basis function and a time series; the latter is the so-called index. Once the index is accurately predicted, it can be multiplied by the known spatial basis to reach the predicted spatiotemporal reconstruction of the MISO. For readers who are not familiar with general data decomposition techniques, consider the Fourier decomposition as one of the simplest examples: for each wavenumber k, the spatial basis function is exp(ikx), while the time evolution of the Fourier coefficient is the index. In practice, the empirical orthogonal function (EOF) analysis, also known as principal component analysis (PCA), is one of the dominant methods for effective data decomposition. Different from the traditional indices, which are primarily based on the EOF analysis or its extended version (EEOF), a new MISO index exploiting a nonlinear data decomposition method, called the Nonlinear Laplacian Spectral Analysis (NLSA) [104, 105], has been developed [220]. These modes are computed utilizing the eigenfunctions of a discrete analog of the Laplace-Beltrami operator, which can be thought of as a local analog of the temporal covariance matrix employed in the EOF techniques, adapted to the nonlinear geometry of data generated by complex dynamical systems. A key advantage of NLSA over classical


Fig. 8.4 Modeling and predicting the MISO indices. Panel a: the true MISO indices (obtained by applying NLSA to observational data) and a simulation from the physics-constrained nonlinear model (8.30)–(8.31). Panels b–d: comparison of the model statistics with the observations. Panel e: 25-day lead time ensemble mean forecast. Panels f–h: ensemble forecasts starting from three different dates, corresponding to the onset (April 1), mature (June 1), and demise (October 1) phases of the MISO. In all the panels, the blue curve is the truth, while the red curve is the model simulation or ensemble mean forecast. The red shading area represents the 95% confidence interval, which is computed from 50 ensemble members that are sufficient to provide robust results. Some of the panels in this figure are reproduced from [62], © American Meteorological Society. Used with permission

covariance-based methods is that NLSA, by design, requires no ad hoc pre-processing of the data, such as detrending or spatiotemporal filtering of the complete data set, and it captures both intermittency and low-frequency variability. Thus, the NLSA-based MISO index objectively identifies MISO patterns from noisy precipitation data. The NLSA MISO modes have higher memory and predictability, stronger amplitude and higher fractional explained variance over the western Pacific, the Western Ghats and the adjoining Arabian Sea regions, and a more realistic representation of the regional heat sources over the Indian and Pacific Oceans, compared with those from EOF analysis. The MISO indices for the period 1998–2013 are extracted from the daily Global Precipitation Climatology Project (GPCP) rainfall data [133] over the Asian summer monsoon region (20°S–30°N, 30°E–140°E) using the NLSA algorithm. The two indices are shown in Panel (a) of Fig. 8.4; they are intermittent, and their PDFs are non-Gaussian with strong fat tails. The large bursts in the indices correspond to the active MISO events occurring in the boreal summer season. See [62, 220] for more details of these indices and the spatiotemporal reconstruction patterns.

8.5.2 Data-Driven Physics-Constrained Nonlinear Model

To describe the temporal intermittency and the randomness in the oscillation frequency of the MISO indices, the following stochastic model is proposed [62],

du_1 = (−d_u(t) u_1 + γ v u_1 − ω u_2) dt + σ_u dW_{u_1},
du_2 = (−d_u(t) u_2 + γ v u_2 + ω u_1) dt + σ_u dW_{u_2},
dv = (−d_v v − γ (u_1² + u_2²)) dt + σ_v dW_v,
dω = (−d_ω ω + ω̂) dt + σ_ω dW_ω,   (8.30)

with

d_u(t) = d_u^0 + d_u^1 sin(ω_f t + φ).   (8.31)

In addition to the two observed MISO variables u_1 and u_2, the other two variables v and ω are unobserved, representing the stochastic damping and the stochastic phase, respectively. In (8.30), the constants d_v and d_ω are damping coefficients, γ is a coupling parameter, σ_u, σ_v and σ_ω are noise coefficients, and W_{u_1}, W_{u_2}, W_v and W_ω are independent Wiener processes. The parameters d_u^0 and d_u^1 in (8.31) define a large-scale, time-periodic damping coefficient, which is utilized to crudely model the active summer and quiescent winter phases of the MISO seasonal cycle. The hidden variables v, ω and the observed MISO variables u_1, u_2 are coupled through energy-conserving nonlinear interactions that satisfy the physics constraints. The hidden variables v, ω and their dynamics can be regarded as phenomenological surrogates for the synoptic-scale activity and the equatorial convective dynamics of temperature, velocity, and moisture that affect the precipitation. Note that similar models to (8.30) have been applied to describe other equatorial climate phenomena, such as the Madden-Julian oscillation [61] and the boreal summer intraseasonal oscillation [55]. With the model calibrated using the data up to 2007 (see [62] for details), it can capture both the qualitative and quantitative features of the observed MISO indices. The third and fourth rows in Panel (a) of Fig. 8.4 show a model simulation that mimics the observational data. Panels (b)–(d) compare the model statistics with observations. The non-Gaussian PDFs, the ACFs, and the power spectra of the model highly resemble the truth. Note that the model fails to capture these critical statistics without the stochastic damping and the stochastic phase.
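As a quick illustration, the model (8.30)–(8.31) can be integrated with an Euler-Maruyama scheme. The parameter values below are placeholders chosen only to give a stable simulation; they are not the calibrated values from [62]:

```python
import numpy as np

def simulate_miso_model(T=200.0, dt=0.005, seed=1):
    """Euler-Maruyama integration of the physics-constrained model (8.30)-(8.31).
    All parameter values are illustrative placeholders, not the calibrated ones."""
    du0, du1, omega_f, phi = 1.0, 0.5, 2 * np.pi / 12, 0.0   # seasonal damping (8.31)
    d_v, d_om, gamma = 1.0, 0.5, 1.0
    omega_hat = 1.0
    su, sv, som = 0.3, 0.3, 0.3
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    u1 = u2 = v = 0.1
    om = omega_hat / d_om
    out = np.zeros((n, 4))
    for i in range(n):
        t = i * dt
        du_t = du0 + du1 * np.sin(omega_f * t + phi)
        dW = np.sqrt(dt) * rng.standard_normal(4)
        # Energy-conserving coupling: gamma*v*u terms exchange energy with v
        u1n = u1 + (-du_t * u1 + gamma * v * u1 - om * u2) * dt + su * dW[0]
        u2n = u2 + (-du_t * u2 + gamma * v * u2 + om * u1) * dt + su * dW[1]
        vn = v + (-d_v * v - gamma * (u1**2 + u2**2)) * dt + sv * dW[2]
        omn = om + (-d_om * om + omega_hat) * dt + som * dW[3]
        u1, u2, v, om = u1n, u2n, vn, omn
        out[i] = (u1, u2, v, om)
    return out

traj = simulate_miso_model()
```

Note the energy-conserving structure: the contributions of the nonlinear terms to d(u_1² + u_2² + v²)/dt cancel exactly, which is the physics constraint that keeps the simulation stable despite the multiplicative coupling.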

8.5.3 Data Assimilation and Prediction

The ensemble forecast algorithm described in Sect. 6.1 is adopted to predict the MISO indices. The initial data of the two MISO variables (u 1 , u 2 ) are obtained directly from observations, i.e., the MISO indices. However, the two hidden variables (v, ω) have no direct observational surrogate. Therefore, data assimilation is essential to obtaining the initial condition of these two hidden variables. The nonlinear model in (8.30)–(8.31) is


a CGNS, where the distribution p(v(t), ω(t)|u_1(s), u_2(s), s ≤ t) is conditionally Gaussian. Therefore, the closed analytic formulae (8.8) can naturally be utilized to recover the initial values of the two unobserved state variables. Panels (e)–(h) include the ensemble forecast results for the year 2009; the forecasts for the other five years can be found in [62]. Panel (e) shows the ensemble mean forecast at a lead time of 25 days, which is the MISO forecast range of particular practical interest. The pattern correlation between the truth and the forecast is about 0.8, indicating the significant skill of the prediction. Panels (f)–(h) display the ensemble forecasts starting from three different dates, corresponding to the onset (April 1), mature (June 1), and demise (October 1) phases of the MISO. Although the ensemble mean predictions for the April 1 starting date do not have any long-range skill, the envelope of the ensemble predictions contains the true signal and forecasts both the summer active and winter quiescent phases. The forecasts from June 1 have skill in both the ensemble mean and the ensemble spread for moderate to long lead times. The predictions starting from October 1 have both an accurate mean and a small ensemble spread for very long times. As a final remark, it is straightforward to perform twin prediction experiments with the perfect nonlinear stochastic model in (8.30)–(8.31), where 10-year training segments of the data generated from the model are utilized to make 6-year forecasts. Significantly, this internal prediction skill of the stochastic model is comparable to its skill in predicting the MISO indices from observations (not shown here). This supports the conclusion that the nonlinear stochastic model in (8.30)–(8.31) can accurately determine the predictability limits of the two MISO indices.

9 Parameter Estimation with Uncertainty Quantification

9.1 Markov Chain Monte Carlo

Markov chain Monte Carlo (MCMC) is a general class of methods aiming to sample from a target probability distribution. The idea of MCMC is to construct a Markov chain whose equilibrium distribution is the target one. After a sufficient number of steps, the state of the chain is then utilized as a sample of the target distribution. In many applications, MCMC is adopted to obtain a sequence of random samples from a target distribution that is high dimensional and from which direct sampling is difficult. The sequence sampled from the MCMC is used to approximate the distribution or to compute a statistical quantity, such as an expected value.

9.1.1 The Metropolis Algorithm

The Metropolis algorithm, or its improved version, the Metropolis-Hastings algorithm, is one of the most widely used MCMC algorithms [68, 125]. Let f(x) be a function that is proportional to the target distribution p(x), and let g be a proposal density function. In the Metropolis algorithm, g is required to be symmetric, namely g(x|y) = g(y|x), while in the more general Metropolis-Hastings algorithm such a symmetry assumption is not needed. A common choice of g in the Metropolis algorithm is a Gaussian distribution. The procedure for applying the Metropolis algorithm is as follows. Starting from an arbitrary point x_0 as the first sample and setting k = 0, the proposal density g(x_{k+1}|x_k) is utilized to suggest a candidate for the next sample value x_{k+1}, given the previous one x_k, so that candidates closer to x_k are more likely to be visited next. This process is repeated for k = 1, 2, 3, ..., which makes the sequence of samples a random walk.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_9


The details of each move from k to k + 1 are the following. First, generate a candidate x′ for the next sample by drawing from the proposal distribution g(x′ | x_k). Then calculate the so-called acceptance ratio α = f(x′)/f(x_k). Since f is proportional to the density p, it follows that α = f(x′)/f(x_k) = p(x′)/p(x_k). Therefore, the acceptance ratio α is a suitable quantity for deciding whether to accept or reject the candidate. To this end, a uniform random number r ∈ [0, 1] is generated. If r ≤ α, then accept the candidate by setting x_{k+1} = x′. On the other hand, if r > α, then reject the candidate and set x_{k+1} = x_k. Note that, even in the situation that f(x′) < f(x_k), there is still a probability α that the candidate is accepted. This differs from the gradient descent method, which is designed to find a single (local) optimal point within the search space. Instead, the MCMC aims to reconstruct the entire PDF. The accept-reject procedure therefore allows the random samples to reach the correct probability of each value, due to the fact that f(x′)/f(x_k) = p(x′)/p(x_k).

9.1.2 A Simple Example

Let us adopt the Metropolis algorithm to sample from the density function p(x),

p(x) ∝ f(x) = exp(−3.2x − 11.2x^2 + 21.33x^3 − 8x^4),   (9.1)

where the normalization constant that makes f(x) a PDF is assumed to be unknown here. The profiles of p(x) and f(x) are shown in Panel (a) of Fig. 9.1. Let g(x′ | x_k) = (2π)^{−1/2} exp(−(x′ − x_k)^2/2) be a Gaussian proposal centered at x_k with unit variance. Start with x_0 = 0. The first 5 sample steps are listed in Table 9.1, and the sample values are shown in Panel (a) of Fig. 9.1. For k = 1, f(x′) > f(x_0), meaning that the probability at x′ is larger than that at x_0, and therefore x′ is naturally accepted. Then for k = 2, the ratio is α = 0.7119, indicating a relatively high chance of accepting this candidate. Indeed, the random number r = 0.4353 < α, which suggests taking this value as x_2. Similar to x_1, the new candidate at k = 3 has a higher probability than x_2, and it is used as x_3. Next, when k = 4, the proposal gives x′ = −1.1722. According to Panel (a) of Fig. 9.1, such a candidate is unlikely to be sampled since it lies in the tail of the target distribution. The acceptance ratio α is less than 0.0001, while the uniformly distributed random number at this step is r = 0.6193. Thus, the candidate is rejected, and x_4 is set to be the same as x_3. A similar argument applies to x_5, which is again set to x_3. Following this procedure, after K = 100 steps, the reconstructed distribution captures the bimodal profile of the target distribution. As the number of samples K increases, the reconstructed PDF converges to the truth; see Panel (b) of Fig. 9.1. Overall, about 45% of the candidates are accepted. The overall acceptance rate depends on the proposal function. If the variance of the proposal function is very small, then the new candidate is expected to be close to the previously sampled point. Consequently, the acceptance ratio stays around 1, and the overall acceptance rate approaches 100%. On the


Fig. 9.1 Using the Metropolis algorithm for sampling the target distribution in (9.1). Panel a the true PDF p(x), the known function f(x) ∝ p(x), and the first 5 samples corresponding to Table 9.1. Panel b the reconstructed PDF with different sample numbers K

Table 9.1 The first 5 sample steps using the Metropolis algorithm for (9.1)

k    x′ (candidate)    α         r         x_k        Accept or reject?
1    −0.1242           1.0952    0.0259    −0.1242    Accept
2    0.1530            0.7119    0.4353    0.1530     Accept
3    −0.0432           1.0595    0.3303    −0.0432    Accept
4    −1.1722           0.0000    0.6193    −0.0432    Reject
5    −0.6503           0.0069    0.2668    −0.0432    Reject

other hand, if the proposal function has a very large variance, then the new candidates can easily land in the tail of the target distribution and get rejected. In practice, adaptive MCMC is often utilized [13, 216, 260], which adjusts the proposal function as the algorithm runs so as to approach a prescribed acceptance rate. It has been shown in [101] that the acceptance rate optimizing the efficiency of the process approaches 0.234 for high-dimensional target distributions when using the Metropolis algorithm with Gaussian proposals. Generally, an acceptance rate between 0.2 and 0.5 is considered reasonable. As a final remark, the normalization constant in the above one-dimensional example is actually easy to compute; the example is only utilized to illustrate the procedure of the Metropolis algorithm. Yet, computing the normalization constant in high-dimensional cases is extremely difficult because it requires solving a high-dimensional integral. In such a situation, MCMC becomes quite useful. In many other applications, it is often difficult to obtain the analytic expression of the function f(x) ∝ p(x). This is the typical case in parameter estimation, which will be discussed shortly. In certain cases, the function f(x) itself may not even be known; only the ratio f(x′)/f(x_k) can be computed.
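As a concrete illustration of the recursion described in Sect. 9.1.1, a minimal sketch of the Metropolis algorithm applied to the unnormalized density f(x) in (9.1) is given below; working with log f avoids floating-point under- and overflow. The proposal standard deviation of 1 matches the Gaussian proposal used in the text, while the number of steps and the random seed are arbitrary illustrative choices.

```python
import numpy as np

def log_f(x):
    # Logarithm of the unnormalized target density in (9.1)
    return -3.2 * x - 11.2 * x**2 + 21.33 * x**3 - 8.0 * x**4

def metropolis(log_target, x0=0.0, n_steps=100_000, proposal_std=1.0, seed=0):
    """Metropolis sampler with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    samples = np.empty(n_steps)
    x, n_accept = x0, 0
    for k in range(n_steps):
        x_prop = x + proposal_std * rng.standard_normal()  # symmetric proposal g
        # Accept with probability min(1, f(x')/f(x)); compare logs for stability
        if np.log(rng.uniform()) <= log_target(x_prop) - log_target(x):
            x, n_accept = x_prop, n_accept + 1
        samples[k] = x
    return samples, n_accept / n_steps

samples, accept_rate = metropolis(log_f)
```

The acceptance rate reported by this sketch should be broadly in line with the roughly 45% quoted above, and a histogram of `samples` reproduces the bimodal shape of p(x).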

9.1.3 Parameter Estimation via MCMC

Let us focus on the following one-dimensional linear SDE to present the idea of parameter estimation using the Metropolis algorithm,

du = (−au + f) dt + σ dW.   (9.2)

The goal is to estimate the constant parameters a, f, and σ. The available information is a time series of u, denoted by u_obs(t) for t ∈ [0, T]. For simplicity, assume the observation is continuous with no additional observational noise; the observation is thus one random realization of (9.2). For the more realistic cases involving observational noise and discrete observations, see for example [26, 85, 110, 111, 239]. To apply the MCMC, a suitable optimization criterion needs to be determined, which serves as the target function, i.e., the 'f(x)' presented in Sect. 9.1.1, where 'x' stands for the values of the parameters. Due to the finite length of the observational time series, randomness and uncertainty affect the parameter estimation. Therefore, the goal is to find the 'optimal range' of the parameter values, represented by the distribution 'p(x)'. Ideally, the variance of this distribution is small, so that the distribution is nearly a delta function providing a point-wise value for each parameter. Here, the likelihood function is utilized as the target function. When prior knowledge of the parameters is available, the maximum a posteriori probability (MAP) estimate, which combines the prior information with the likelihood function, can be adopted as the target function instead; maximum likelihood estimation can be regarded as a special case of MAP estimation with a non-informative prior. For the convenience of presentation, let us rewrite (9.2) in time-discrete form by applying the Euler-Maruyama scheme,

u_{i+1} = u_i + (−a u_i + f) Δt + σ √Δt ε_i,   (9.3)

where Δt is a small time step, u_i is the value of u at iΔt, and ε_i ~ N(0, 1) are i.i.d. standard Gaussian random numbers. Due to the linear Gaussian nature of (9.3), the model forecast distribution at (i + 1)Δt, given the exact value u_i at iΔt from the observation, is Gaussian, p(u_{i+1}) ~ N(μ_{i+1}, R_{i+1}). According to the discussions in Sect. 4.1.1, the mean and variance are given by

μ_{i+1} = u_i + (−a u_i + f) Δt   and   R_{i+1} = (σ^2/(2a)) (1 − e^{−2aΔt}),   (9.4)

which depend on the parameters θ = (a, f, σ)^T. The likelihood at (i + 1)Δt is

L_{i+1} = p(u^obs_{i+1} | θ) = (2π R_{i+1})^{−1/2} exp( −(u^obs_{i+1} − μ_{i+1})^2 / (2 R_{i+1}) ).   (9.5)


Then taking the product of the likelihoods at different time instants yields L = L_1 L_2 ··· L_N, where N = ⌊T/Δt⌋ with ⌊·⌋ the floor function. Note that each likelihood L_i is a Gaussian function, which can easily take a very small value if u^obs_{i+1} is located at the tail of the distribution. Therefore, the product L may suffer from numerical underflow. In practice, the log-likelihood is often utilized as an alternative, which ensures numerical stability and facilitates the optimization. The log-likelihood reads

ℒ = log(L) = log(L_1) + log(L_2) + ··· + log(L_N).   (9.6)
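A minimal sketch of evaluating this log-likelihood for the linear model (9.2), following (9.3)–(9.6); the synthetic observations, parameter values, time step, and series length below are illustrative assumptions, not values from the text.

```python
import numpy as np

def log_likelihood(theta, u_obs, dt):
    """Log-likelihood (9.6) of a continuous record u_obs of model (9.2),
    using the one-step Gaussian forecast mean/variance (9.4) and (9.5)."""
    a, f, sigma = theta
    u_prev, u_next = u_obs[:-1], u_obs[1:]
    mu = u_prev + (-a * u_prev + f) * dt                       # forecast mean
    R = sigma**2 / (2.0 * a) * (1.0 - np.exp(-2.0 * a * dt))   # forecast variance
    return np.sum(-0.5 * np.log(2.0 * np.pi * R)
                  - (u_next - mu) ** 2 / (2.0 * R))

# Synthetic observations from (9.2) via the Euler-Maruyama scheme (9.3)
rng = np.random.default_rng(1)
dt, n, theta_true = 0.01, 50_000, (1.0, 0.5, 0.5)
a, f, sigma = theta_true
u = np.empty(n)
u[0] = f / a  # start at the equilibrium mean
for i in range(n - 1):
    u[i + 1] = u[i] + (-a * u[i] + f) * dt \
               + sigma * np.sqrt(dt) * rng.standard_normal()
```

Inside the Metropolis loop of Sect. 9.1.1, the difference of two such log-likelihood values gives the log acceptance ratio.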

With the target function in hand, the Metropolis algorithm can be utilized to find the distribution of the parameters θ = (a, f, σ)^T. Note that, although the search space here is only three-dimensional, the target function, namely the log-likelihood ℒ, is much more complicated than in the example of Sect. 9.1.2. It is thus very challenging to compute the normalization constant even for such a simple problem. What remains is to follow the general MCMC procedure using the Metropolis algorithm (see Sect. 9.1.1). At each iteration k, first generate a new set of parameters θ′ with the help of a symmetric proposal function, and compute the corresponding log-likelihood ℒ′. Then calculate the log acceptance ratio log α = ℒ′ − ℒ_k (the division becomes a subtraction after taking the logarithm), which is compared with the logarithm of a uniformly distributed random number to decide whether the proposal is accepted or rejected. For linear Gaussian systems, the closed form of the likelihood function can easily be written down. For general nonlinear systems, however, suitable approximations of the likelihood are needed to guarantee the numerical efficiency of the parameter estimation procedure. As a numerical illustration, consider the cubic system (see also Sect. 4.3),

du = (f + au + bu^2 − cu^3) dt + σ dW.   (9.7)

The following two regimes are studied here, with the true parameters given by:

Gaussian regime:       a = −2, b = 0, c = 0, f = 0.5, σ = 0.5,
Highly skewed regime:  a = −2, b = 2.5, c = 1, f = 0, σ = 1.   (9.8)

Again, assume a continuous time series is available. Since (9.7) is a nonlinear model, the following approximate linear system is utilized in computing the one-step Gaussian likelihood function:

u_{i+1} = u_i + (f + ã u_i) Δt + σ √Δt ε_i,   (9.9)

where ã is treated as a constant at each step and is evaluated as ã = a + 2b u_i − 3c u_i^2. The parameter estimation procedure uses an observed time series of 500 time units for each dynamical regime; see the blue curves in Panels (a) and (d) of Fig. 9.2. An adaptive MCMC [260] is utilized with a target acceptance rate of 0.25.

Fig. 9.2 Parameter estimation using the Metropolis algorithm for the cubic model (9.7) in the two dynamical regimes with the true parameter values listed in (9.8). Panel a the true time series used as observations (blue) and a model simulation with the estimated parameters (red); the estimated parameters are the averaged values on the trace plot from k = 2000 to k = 10000. Panel b comparison of the model statistics using the estimated parameters with those from the truth. Panel c trace plots from the MCMC of the estimated parameters, where the black dashed line indicates the truth in each sub-panel. Panels a–c and Panels d–f show the results in the Gaussian and the highly skewed non-Gaussian regime, respectively

The estimated parameter values are shown in Panels (c) and (f), respectively. In the Gaussian regime, the estimated parameters follow the truth very well. The uncertainty in estimating b and c is slightly larger than for the other parameters, but overall the fluctuations in the corresponding trace plots are around the true values. On the other hand, the estimates of b and c in the highly skewed regime differ noticeably from the truth. At first glance, this may be attributed to the approximation used in (9.9). Yet, the more fundamental reason is that the dynamical behavior in this regime is insensitive to variations of these parameters within a certain range. As shown in Panel (e), the model with the estimated parameters (the averaged values on the trace plot from k = 2000 to k = 10000) can perfectly recover the ACF and the non-Gaussian PDF of the truth. In addition, a model simulation with the estimated parameters is qualitatively quite similar to the truth (Panel (d)). In practice, the reference solutions (i.e., the true parameter values) are never known. Comparing key statistics of the calibrated model with the truth, such as the ACFs and the PDFs, is a practically useful way to validate the parameter estimation skill.

9.1.4 MCMC with Data Augmentation

In many applications, only observations of a subset of the state variables are available, known as partial observations. Since there is, in general, no simple explicit formula for the likelihood function, data augmentation is often needed to sample the trajectories that are not directly observed in order to facilitate the parameter estimation via MCMC [217, 245]. Consider a coupled nonlinear system

du = G(u, v) dt + σ_u dW_u,    dv = H(u, v) dt + σ_v dW_v.

Assume the observations only contain a time series of u, denoted by u_obs. Data augmentation leads to solving the following problem,

p(θ, v_mis | u_obs) ∝ p(θ, v_mis, u_obs) = p(θ) p(v_mis, u_obs | θ) = p(θ) p(v_mis | θ) p(u_obs | v_mis, θ),   (9.10)

where p(θ) is the prior distribution and v_mis represents a full trajectory of v. The MCMC first computes the auxiliary conditional distribution p(θ, v_mis | u_obs) and then marginalizes over v_mis to obtain p(θ | u_obs). Data augmentation is generally very expensive because the dimension of v_mis is infinite (or very large after time discretization). Note that, unlike the noise coefficient σ_u in the observed process, which is effectively determined by the quadratic variation, the noise coefficient σ_v in the unobserved process is hard to update in a direct fashion. A change of measure is often needed to determine σ_v [25, 73, 217]. In addition, the observations may be given only at discrete time points. In that case, data augmentation needs to be applied both for v and for u between adjacent observation times. See [12, 110, 111, 202] for more discussions and specific MCMC algorithms with data augmentation.

9.2 Expectation-Maximization

The expectation-maximization (EM) algorithm is a useful approach for parameter estimation with partial observations [82, 194, 231], where the unknown information in the system includes not only the parameters but also the state of the unobserved variables. The EM algorithm is an iterative method that alternates between updating the parameters (M-Step) and recovering the unobserved state variables with uncertainty quantification (E-Step), each step using the updated output of the other from the previous iteration. Unlike random search algorithms such as MCMC, the EM method exploits a gradient-descent-based approach. Thus, the EM algorithm aims to find a local optimum, which is sufficient for many applications.

9.2.1 The Mathematical Framework

The presentation of the EM algorithm here is based on the CGNS (8.1), which is one of the most sophisticated nonlinear systems that allow the use of closed analytic formulae for parameter estimation via the EM algorithm. Recall the CGNS (8.1):

dX = (A0(X, t) + A1(X, t) Y(t)) dt + B1(X, t) dW1(t),   (9.11a)
dY = (a0(X, t) + a1(X, t) Y(t)) dt + b2(X, t) dW2(t).   (9.11b)

Assume a continuous time series is available as the observations for X, while there are no direct observations of the state variable Y. Let θ be a collection of model parameters. Denote by X̃ = {X_0, ..., X_j, ..., X_J} and Ỹ = {Y_0, ..., Y_j, ..., Y_J} discrete approximations of the continuous time series of X and Y, respectively, within the time interval t ∈ [0, T], where T = JΔt, X_j = X(t_j), Y_j = Y(t_j), and t_j = jΔt. The goal is to seek an optimal estimate of the unknown parameters θ by maximizing the log-likelihood function. Since only the time series of X is observed, the log-likelihood must average over all possible values of the state variable Y to account for the uncertainty in Y, namely,

ℒ(θ) = log L(X̃ | θ) = log ∫ p(X̃, Ỹ | θ) dỸ,   (9.12)

where the integral with respect to Ỹ accounts for the uncertainty in this unobserved variable. Using any distribution Q(Ỹ) over the unobserved variables, a lower bound on the log-likelihood ℒ can be obtained in the following way:

log ∫ p(X̃, Ỹ | θ) dỸ = log ∫ Q(Ỹ) [ p(X̃, Ỹ | θ) / Q(Ỹ) ] dỸ
                     ≥ ∫ Q(Ỹ) log [ p(X̃, Ỹ | θ) / Q(Ỹ) ] dỸ
                     = ∫ Q(Ỹ) log p(X̃, Ỹ | θ) dỸ − ∫ Q(Ỹ) log Q(Ỹ) dỸ := F(Q, θ),   (9.13)

where Jensen's inequality has been applied in the second row. In (9.13), the negative of ∫ Q(Ỹ) log p(X̃, Ỹ | θ) dỸ is the so-called free energy, while −∫ Q(Ỹ) log Q(Ỹ) dỸ is the entropy. Therefore, based on the fact F(Q, θ) ≤ ℒ(θ), maximizing the log-likelihood is equivalent to alternately maximizing F with respect to the distribution Q and the parameters θ. This is achieved by the EM algorithm,

E-Step:  Q_{k+1} ← arg max_Q F(Q, θ_k),
M-Step:  θ_{k+1} ← arg max_θ F(Q_{k+1}, θ).   (9.14)


The EM algorithm alternates between performing an expectation (E) step, which estimates the unobserved state variables Ỹ with uncertainty quantification using the current estimate of the parameters, and a maximization (M) step, which updates the parameters by maximizing the expected log-likelihood found in the current E step [81, 152]. Denote by θ_k the updated parameters after the kth iteration. The EM algorithm at step k + 1 is the following.

E-Step. The optimum in the E-Step is reached when Q is the conditional distribution p(Ỹ | X̃, θ_k) using the previously estimated parameters θ_k. This is exactly the solution of the nonlinear smoother, which is computed via the closed analytic formulae (8.10).

M-Step. Plug the solution Q(Ỹ) = p(Ỹ | X̃, θ_k) from the E-Step into the cost function in (9.13) to compute the expectation

Q(θ; θ_k) = ∫ p(Ỹ | X̃, θ_k) log p(Ỹ, X̃ | θ) dỸ.   (9.15)

Note that, in (9.15), p(Ỹ | X̃, θ_k) from the nonlinear smoother is treated as a known distribution, which can be regarded as the weight function for computing the average (i.e., the integral) of log p(Ỹ, X̃ | θ). Therefore, Q(θ; θ_k) is a function of θ. The parameters θ are then updated utilizing the result from the E-Step,

θ_{k+1} = arg max_θ Q(θ; θ_k).   (9.16)

In many situations, the M-Step involves solving a quadratic optimization problem, the analytic formula of which is thus available.

9.2.2 Details of the Quadratic Optimization in the Maximization Step

Consider the discrete version of the CGNS (9.11) using the Euler-Maruyama scheme:

X^{j+1} = X^j + (A0(X^j, t; θ) + A1(X^j, t; θ) Y^j) Δt + B1(X^j, t; θ) √Δt ε_1^j,   (9.17a)
Y^{j+1} = Y^j + (a0(X^j, t; θ) + a1(X^j, t; θ) Y^j) Δt + b2(X^j, t; θ) √Δt ε_2^j,   (9.17b)

where X^j stands for the state variable X evaluated at time t_j (similarly for the other variables). In (9.17), ε_1^j and ε_2^j are multidimensional i.i.d. standard Gaussian random numbers, and Δt is sufficiently small. Assume all the parameters appear as multiplicative prefactors of some functions of X^j and Y^j on the right-hand side of (9.17). Then the log-likelihood function of p(X̃, Ỹ | θ) in the M-Step can be computed explicitly. The one-step likelihood, based on a local linear Gaussian approximation of the right-hand side of (9.17), is

N(μ^j, R^j) = C |R^j|^{−1/2} exp( −(1/2) (u^{j+1} − μ^j)^T (R^j)^{−1} (u^{j+1} − μ^j) ),   (9.18)


where u^{j+1} = (X^{j+1}, Y^{j+1})^T and μ^j = M^j ξ + S^j. Here, the vector ξ contains the parameters in the drift part of (9.17), and the matrix M^j includes the linear or nonlinear functions of u^j. The vector S^j contains the terms that do not involve parameters, such as the first terms X^j and Y^j in (9.17). The covariance R^j is a block diagonal matrix with entries (B1(X^j, t))(B1(X^j, t))^* Δt and (b2(X^j, t))(b2(X^j, t))^* Δt, which have one-to-one correspondences with the parameters in the diffusion terms. The constant C is due to the normalization of a Gaussian distribution. Since the state Y is unobserved and contains uncertainty, the expectation of the log-likelihood function as in (9.15) is adopted, and the overall objective function becomes

L̃ = (1/2) Σ_{j=1}^{J} ⟨ (u^{j+1} − M^j ξ − S^j)^* R^{−1} (u^{j+1} − M^j ξ − S^j) ⟩ − (J/2) log |R^{−1}|,   (9.19)

where the diffusion coefficients are assumed to be constants for simplicity, and therefore R^j = R for all j. In (9.19), ⟨·⟩ denotes the expectation over the uncertain component of u^j, namely Y^j, at fixed j, while the expectation of the observed component X^j is simply itself. Since the hidden variables Y^j appear linearly, only the moments ⟨Y^j⟩, ⟨Y^{j+1}(Y^{j+1})^*⟩, ⟨Y^{j+1}(Y^j)^*⟩, and ⟨Y^j(Y^j)^*⟩ need to be computed under the expectation in (9.19). See [49] for details. The minimum of L̃ corresponds to the solution of ∂L̃/∂ξ = 0 and ∂L̃/∂R = 0, which leads to

R = (1/J) Σ_j ⟨ (u^{j+1} − M^j ξ − S^j)(u^{j+1} − M^j ξ − S^j)^* ⟩,   (9.20a)
ξ = D^{−1} c,   (9.20b)

where

D = Σ_j ⟨ (M^j)^* R^{−1} M^j ⟩   and   c = Σ_j ⟨ (M^j)^* R^{−1} (u^{j+1} − S^j) ⟩.   (9.21)

The solution of (9.20) is obtained as follows. First, set ξ = 0; then R in (9.20a) is essentially given by the quadratic variation. The resulting R is then plugged into (9.20b) to obtain the solution for ξ. It is worthwhile to point out that the direct application of the EM algorithm to a high-dimensional system is often computationally expensive. Nevertheless, many complex dynamical systems have localized structures. Therefore, the block decomposition in Sect. 8.4.3 and other strategies can be utilized to divide the entire problem into several subproblems. Each subproblem contains only a few parameters to be estimated, which the EM algorithm can easily handle.
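To make the two-stage solve concrete, here is a minimal sketch for the fully observed scalar case, where the expectations ⟨·⟩ in (9.19)–(9.21) reduce to the observed values themselves. The model du = (−a u + f) dt + σ dW (the same form as (9.2)) with ξ = (a, f)^T, as well as the parameter values, time step, and series length, are illustrative assumptions, not from the text.

```python
import numpy as np

# Generate a fully observed trajectory of du = (-a*u + f) dt + sigma dW
rng = np.random.default_rng(2)
dt, J, a_true, f_true, sigma_true = 0.01, 100_000, 1.0, 0.5, 0.5
u = np.empty(J + 1)
u[0] = f_true / a_true
for j in range(J):
    u[j + 1] = u[j] + (-a_true * u[j] + f_true) * dt \
               + sigma_true * np.sqrt(dt) * rng.standard_normal()

# One-step structure u^{j+1} = S^j + M^j xi + noise, cf. (9.17)-(9.19),
# with S^j = u^j, M^j = (-u^j dt, dt), xi = (a, f)^T
S = u[:-1]
M = np.column_stack((-u[:-1] * dt, np.full(J, dt)))
resid0 = u[1:] - S                     # residuals with xi = 0

# Stage 1: with xi = 0, R in (9.20a) is essentially the quadratic variation
R = np.mean(resid0**2)

# Stage 2: plug R into (9.20b): xi = D^{-1} c, with D and c as in (9.21)
D = (M.T / R) @ M
c = (M.T / R) @ resid0
a_est, f_est = np.linalg.solve(D, c)
sigma_est = np.sqrt(R / dt)
```

With a long enough trajectory the estimates approach the data-generating values; in the partially observed CGNS case the same formulae apply once the smoother moments replace the observed values of Y.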

9.2.3 Incorporating the Constraints into the EM Algorithm

Taking into account prior knowledge from physics, observations, or experiments facilitates the parameter estimation of many complex dynamical systems. This can be achieved by imposing extra conditions or constraints in the parameter estimation process. In particular, one of the most important constraints in modeling nonlinear turbulent dynamical systems is the physics constraint [123, 184], which requires the energy in the quadratic nonlinear terms to be conserved, as discussed in Sect. 7.6. These constraints require combinations of certain parameters to satisfy equality relationships, which can be written as

Hξ = g,   (9.22)

where H is a constant matrix and g is a constant vector. Such a constrained optimization problem can be solved using the method of Lagrange multipliers,

L̃ = (1/2) Σ_j ⟨ (u^{j+1} − M^j ξ − S^j)^* R^{−1} (u^{j+1} − M^j ξ − S^j) ⟩ − (J/2) log |R^{−1}| + λ^* (Hξ − g),   (9.23)

where the vector λ is the Lagrange multiplier. The minimization problem with the new objective function (9.23) can still be solved via closed analytic formulae, which are given by

R = (1/J) Σ_j ⟨ (u^{j+1} − M^j ξ − S^j)(u^{j+1} − M^j ξ − S^j)^* ⟩,   (9.24a)
λ = (H D^{−1} H^*)^{−1} (H D^{−1} c − g),   (9.24b)
ξ = D^{−1} (c − H^* λ),   (9.24c)

where D and c are defined in (9.21).
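The constrained update (9.24b)–(9.24c) is plain linear algebra once D, c, H, and g are in hand. The sketch below uses randomly generated placeholder matrices (an assumption, purely for illustration) and verifies that the resulting ξ satisfies the constraint (9.22) exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
n_par, n_con = 6, 2                      # numbers of parameters and constraints
A = rng.standard_normal((n_par, n_par))
D = A @ A.T + n_par * np.eye(n_par)      # symmetric positive definite, as in (9.21)
c = rng.standard_normal(n_par)
H = rng.standard_normal((n_con, n_par))  # constraint matrix in (9.22)
g = rng.standard_normal(n_con)

Dinv_c = np.linalg.solve(D, c)           # unconstrained solution D^{-1} c
Dinv_Ht = np.linalg.solve(D, H.T)
# Lagrange multiplier, eq. (9.24b)
lam = np.linalg.solve(H @ Dinv_Ht, H @ Dinv_c - g)
# Constrained parameter update, eq. (9.24c)
xi = np.linalg.solve(D, c - H.T @ lam)
```

In contrast, the unconstrained solution D^{-1}c generally violates Hξ = g; the Lagrange multiplier term is precisely the correction that restores it.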

9.2.4 A Numerical Example

The noisy L84 model in (8.3) is utilized as a simple numerical test to show the skill of the EM algorithm, where the true parameters in the deterministic part are chosen to be the same as in Lorenz's original work [172], g = 1, b = 4, a = 1/4, and f = 2. On the other hand, depending on the level of the stochastic noise, two dynamical regimes are considered in the numerical experiments here:

Small noise regime: σ_x = σ_y = σ_z = 0.1,
Large noise regime: σ_x = σ_y = σ_z = 3.   (9.25)


Fig. 9.3 Parameter estimation of the noisy L84 model (8.3) using the EM algorithm. Only y and z are observed, and the observational time series has a length of 10 units. Panel a compares the trajectories generated using the perfect parameters (blue) and the estimated ones (red). Note that the random numbers are different in these two simulations; thus, there is no one-to-one correspondence in the trajectories. Panels b–c compare the ACFs and PDFs. Panel d shows the trace plot of the estimated parameters. Panel e includes a comparison of the true x and the nonlinear smoother estimate at the kth iteration with k = 2, 5, and 100. Panels a–e and f–j present the results in the small and large noise regimes, respectively

It is seen in Panels (a) and (f) of Fig. 9.3 that the two regimes have distinct dynamical features. As the random noise level increases, the model trajectories become more turbulent and have shorter memories. Here, assume a single realization of y and z with a total length of 10 time units is available as the observations, while there is no direct observation of x. The EM algorithm estimates the model parameters and simultaneously recovers the state of x. For simplicity, the noise parameter σ_x is assumed to be known, since otherwise the convergence of this specific parameter may become very slow in the large noise regime due to the lack of practical observability, and a certain change of measure would be needed to accelerate the convergence. In both regimes, the model with the estimated parameters accurately recovers the statistics of the two observed variables y and z. In addition, the associated time series are also qualitatively similar to the truth. The main difference between the two regimes lies in the estimation of the unobserved variable x. In the small noise regime, the estimated parameters are quite accurate and lead to a nearly perfect recovery of the dynamical and statistical features of x. On the other hand, with an increased noise level, the observability of the system is degraded, and the recovery of x is less accurate. Nevertheless, in the large noise regime, the contribution from the unobserved variable x to the two observed variables


y and z also becomes weaker. Thus, the error in x does not influence the dynamics of y and z too much. As a final remark, since only the time series of the observed variables y and z are utilized in the likelihood function, there is no guarantee that the resulting model fully captures the dynamical and statistical features of the unobserved component x. The role played by the unobserved variable x is closer to a stochastic parameterization that provides suitable feedback to the observed variables. For many complex systems, the unobserved variables lie on small or subgrid scales. The main task is often to develop cheap approximate models with simple structures for these unobserved variables that capture their contribution to the large-scale observed variables. Recovering the exact parameter values, or even the detailed true dynamics of these unobserved variables, is not the primary interest.

9.3 Parameter Estimation via Data Assimilation

As was discussed in Chap. 5, by treating the unknown parameters as augmented state variables, data assimilation provides a natural way to perform parameter estimation. In this section, parameter estimation via data assimilation is presented in the framework of the CGNS (8.1), where the closed analytic formulae (8.8) facilitate the mathematical analysis of the results. Similar algorithms can be developed for general nonlinear and non-Gaussian systems using suitable ensemble-based data assimilation methods, assuming the sampling errors are handled appropriately. Consider the following stochastic system:

du = (A0(t, u) + A1(t, u) γ*) dt + Σ_u dW_u,   (9.26)

where u = (u_1, ..., u_m)^T contains the state variables and γ* = (γ_1^*, ..., γ_n^*)^T contains the parameters to be estimated, which are assumed to be constants. Throughout this section, γ* (with an asterisk) always represents the true value of the parameters; the notation ·^* here should not be confused with the conjugate transpose. In contrast, γ stands for the corresponding variables in the parameter estimation framework. For simplicity, the noise coefficient Σ_u is assumed to be a known constant matrix.

9.3.1 Two Parameter Estimation Algorithms

Since the parameters γ* are constants, it is natural to augment the system (9.26) with n trivial equations for γ* [209, 235, 258, 267]. Such a method is named direct parameter estimation,


du = (A0(t, u) + A1(t, u) γ) dt + Σ_u dW_u,   (9.27a)
dγ = 0.   (9.27b)

On the other hand, in some applications certain prior information about the possible range of the parameters is available. To incorporate such information into the parameter estimation framework, the system (9.26) can be augmented by a group of stochastic equations for γ, where the equilibrium distributions of these stochastic processes represent the prior information about the range of γ. This approach is called parameter estimation with stochastic parameterized equations [56, 59],

du = (A0(t, u) + A1(t, u) γ) dt + Σ_u dW_u,   (9.28a)
dγ = (a0 + a1 γ) dt + Σ_γ dW_γ.   (9.28b)

Given an initial value μi (0) and an initial uncertainty Ri (0) of each component of γ , both the augmented systems (9.27) and (9.28) belong to the conditional Gaussian framework (8.1)–(8.2) and (8.8), where X = u and Y = γ . Therefore, the time evolution of γ can be solved via closed analytic formulae. Below, simple examples will be exploited to study the dependence of the error μi (t) − γi∗ and uncertainty Ri (t) on different factors, such as the system noise, the initial uncertainty, and the model structure. The role of observability in parameter estimation will also be emphasized.

9.3.2 Estimating One Additive Parameter in a Linear Scalar Model

Consider estimating one additive parameter γ* in the following linear scalar model,

du = (A0 u + A1 γ*) dt + σ_u dW_u,   (9.29)

where 'additive' means that γ* is not multiplied by the state variable u. Given an initial guess μ(0) of the parameter γ* with a given initial uncertainty R(0), the simple structure of model (9.29) allows analytic expressions of the error μ(t) − γ* in the posterior mean estimate and of the posterior uncertainty R(t) as functions of time. Let us start with the direct approach (9.27),

du = (A0 u + A1 γ) dt + σ_u dW_u,   (9.30a)
dγ = 0.   (9.30b)
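Before turning to the analytic results, the direct approach can be simulated; the sketch below integrates one realization of the truth and the corresponding scalar conditional Gaussian (Kalman-Bucy type) mean/variance updates for the augmented state (9.30). The specific scalar update formulae written here, and all parameter values, are illustrative assumptions rather than the general formulae (8.8).

```python
import numpy as np

rng = np.random.default_rng(4)
A0, A1, sigma_u = -1.0, 1.0, 0.5
gamma_true, dt, n = 2.0, 1e-3, 50_000   # T = 50 time units

u = 0.0            # observed state, one realization of (9.29)
mu, R = 0.0, 1.0   # initial guess mu(0) and initial uncertainty R(0)
for _ in range(n):
    du = (A0 * u + A1 * gamma_true) * dt \
         + sigma_u * np.sqrt(dt) * rng.standard_normal()
    # Scalar conditional Gaussian (Kalman-Bucy type) update for (9.30):
    # the innovation is the observed increment minus its forecast
    innovation = du - (A0 * u + A1 * mu) * dt
    mu += R * A1 / sigma_u**2 * innovation
    R += -(A1 / sigma_u)**2 * R**2 * dt   # Riccati equation for the variance
    u += du

T = n * dt
# Closed-form posterior variance stated in Proposition 9.1 below, with R(0) = 1
R_closed = 1.0 / (1.0 + (A1 / sigma_u)**2 * 1.0 * T)
```

The simulated posterior mean settles near the true parameter value, and the simulated variance tracks the algebraic decay of the closed form.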

Proposition 9.1 In estimating the additive parameter γ* in (9.29) utilizing the direct approach (9.30), the posterior variance R(t) and the error in the posterior mean μ(t) − γ* have the following closed analytic expressions,

R(t) = R(0) / (1 + A1^2 σ_u^{−2} R(0) t),   (9.31a)

μ(t) − γ* = (μ(0) − γ*) / (1 + A1^2 σ_u^{−2} R(0) t) + [ A1 σ_u^{−1} R(0) / (1 + A1^2 σ_u^{−2} R(0) t) ] ∫_0^t dW_u(s).   (9.31b)

The results in (9.31) can be obtained by plugging (9.30) into the evolution equations of the posterior states (8.8); see [56] for details. According to (9.31), both the posterior uncertainty R(t) and the deterministic part of the error in the posterior mean converge to zero asymptotically at an algebraic rate t^{−1}. The second term on the right-hand side of (9.31b) represents the stochastic fluctuation of the error that comes from the system noise. The variance of this fluctuation at time t is given by

var(μ(t) − γ*) = (A1 σ_u^{−1} R(0))^2 t / (1 + A1^2 σ_u^{−2} R(0) t)^2,   (9.32)

the asymptotic convergence rate of which is t^{−1} as well. Therefore, after a sufficiently long time, the estimated parameter converges to a constant with no uncertainty. It is clear from (9.31) that decreasing the noise σ_u or increasing the prefactor A1 helps accelerate the reduction of the error and the uncertainty. In fact, a nearly zero A1 implies that the system loses practical observability (see also Sect. 5.2.3), which corresponds to a slow convergence rate. On the other hand, increasing the initial uncertainty R(0) accelerates the convergence of the deterministic part of μ(t), as it forces the posterior state to trust the observations more (see the discussion of the Kalman gain in Sect. 5.2.1). Yet, it does not affect the long-term behavior of reducing the uncertainty R(t) and the error in the fluctuation part of μ(t) − γ*. Next, consider the method with the stochastic parameterized equations in (9.28),

du = (A0 u + A1 γ) dt + σ_u dW_u,   (9.33a)
dγ = (a0 − a1 γ) dt + σ_γ dW_γ,   (9.33b)

where the equilibrium distribution of γ in (9.33b) is Gaussian with mean γ̄ = a₀/a₁ and variance var(γ) = σ_γ²/(2a₁).

Proposition 9.2 In estimating the additive parameter γ* in (9.29) utilizing the stochastic parameterized equations (9.33), the posterior variance R(t) and the error in the posterior mean μ(t) − γ* have the following closed analytic expressions,

$$R(t) = r_2 + \frac{r_1 - r_2}{1 - \dfrac{R(0)-r_1}{R(0)-r_2}\exp\!\left(-A_1^2\sigma_u^{-2}(r_1-r_2)t\right)}, \tag{9.34a}$$

$$\mu(t)-\gamma^* \approx (\mu(0)-\gamma^*)\,e^{-(a_1+R_{eq}A_1^2\sigma_u^{-2})t} + \frac{1-e^{-(a_1+R_{eq}A_1^2\sigma_u^{-2})t}}{a_1+R_{eq}A_1^2\sigma_u^{-2}}\,(a_0-a_1\gamma^*) + R_{eq}A_1\sigma_u^{-1}\int_0^t e^{-(a_1+R_{eq}A_1^2\sigma_u^{-2})(t-s)}\,\mathrm{d}W_u(s), \tag{9.34b}$$

where R(0) is assumed to be larger than r₁ in (9.34a), and r₁ > 0 > r₂ are the two roots of the algebraic equation −A₁²σᵤ⁻²R² − 2a₁R + σ_γ² = 0. In (9.34b), the variance R(t) has been replaced by its equilibrium value R_eq for conciseness of the expression, which is justified by its exponentially fast convergence. See again [56] for the detailed derivations. Unlike (9.31b), where the error in the posterior mean estimation eventually converges to zero, the error utilizing the stochastic parameterized equation (9.34b) converges to

$$|\mu(t)-\gamma^*|_{eq} = \frac{|a_0 - a_1\gamma^*|}{a_1 + R_{eq}A_1^2\sigma_u^{-2}},$$

which is nonzero unless the mean of the stochastic parameterized equation (9.33b), γ̄ = a₀/a₁, equals the true parameter value γ*. Similarly, the posterior uncertainty converges to the nonzero value r₁ unless the right-hand side of (9.33b) vanishes. Nevertheless, comparing (9.31) and (9.34), the parameter estimation framework utilizing the stochastic parameterized equations (9.33) leads to an exponential convergence rate for both the reduction of the posterior uncertainty and the error in the posterior mean, which implies that a much shorter observational time series is needed for parameter estimation using (9.33). The convergence rate is controlled by the tuning factors in the stochastic parameterized equations. With a suitable choice of (9.33b), the convergence rate is greatly improved at the cost of introducing only a small error.
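The contrast between the algebraic rate in (9.31)–(9.32) and the exponential rate in (9.34) is easy to check numerically. Below is a minimal sketch (all coefficient values, and the biased prior mean a₀/a₁ ≠ γ*, are illustrative assumptions rather than values from the text) that discretizes one observed trajectory and both filters with the Euler–Maruyama scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
dt, T = 1e-3, 50.0
A0, A1, su = -1.0, 1.0, 0.5      # illustrative model coefficients in (9.29)
g_true = 2.0                     # true additive parameter gamma*
a0, a1, sg = 2.2, 1.0, 0.5       # (9.33b): biased equilibrium mean a0/a1 = 2.2

u = 0.0
mu_d, R_d = 0.0, 10.0            # direct approach (9.30): d gamma = 0
mu_s, R_s = 0.0, 10.0            # stochastic parameterized equations (9.33)
for _ in range(int(T / dt)):
    du = (A0 * u + A1 * g_true) * dt + su * np.sqrt(dt) * rng.standard_normal()
    inn_d = du - (A0 * u + A1 * mu_d) * dt      # innovation of the direct filter
    mu_d += R_d * A1 / su**2 * inn_d
    R_d += -(A1 * R_d / su)**2 * dt             # Riccati equation, no damping
    inn_s = du - (A0 * u + A1 * mu_s) * dt
    mu_s += (a0 - a1 * mu_s) * dt + R_s * A1 / su**2 * inn_s
    R_s += (-2 * a1 * R_s + sg**2 - (A1 * R_s / su)**2) * dt
    u += du

R_alg = 10.0 / (1 + A1**2 / su**2 * 10.0 * T)   # algebraic law (9.31a)
r1 = (-a1 + np.sqrt(a1**2 + A1**2 / su**2 * sg**2)) / (A1**2 / su**2)
print(mu_d, R_d, R_alg)   # direct: error and variance decay like 1/t
print(mu_s, R_s, r1)      # parameterized: fast convergence, but R -> r1 > 0
```

Along one realization, R_d tracks the algebraic law (9.31a), while R_s settles exponentially fast at the nonzero root r₁, illustrating the trade-off discussed above.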

9.3.3 Estimating One Multiplicative Parameter in a Linear Scalar Model

Many applications require estimating parameters that appear as multiplicative factors of the state variables. The focus here is a simple situation where only one multiplicative parameter γ* appears in the dynamics:

$$\mathrm{d}u = (A_0 - \gamma^* u)\,\mathrm{d}t + \sigma_u\,\mathrm{d}W_u. \tag{9.35}$$

In (9.35), the parameter γ* > 0 guarantees the mean stability of the system. Given an initial guess μ(0) of the parameter γ* with an uncertainty R(0), the analytic expressions of the error μ(t) − γ* and the uncertainty R(t) are still available in the framework utilizing the direct approach (9.27),

$$\mathrm{d}u = (A_0 - \gamma u)\,\mathrm{d}t + \sigma_u\,\mathrm{d}W_u, \qquad \mathrm{d}\gamma = 0. \tag{9.36}$$


Proposition 9.3 In estimating the multiplicative parameter γ* in (9.35) utilizing the direct approach (9.36), the posterior variance R(t) and the error in the posterior mean μ(t) − γ* have the following closed analytic expressions,

$$R(t) = \frac{R(0)}{1 + R(0)\sigma_u^{-2}\int_0^t u^2(s)\,\mathrm{d}s}, \tag{9.37a}$$

$$\mu(t)-\gamma^* = \frac{\mu(0)-\gamma^*}{1 + R(0)\sigma_u^{-2}\int_0^t u^2(s)\,\mathrm{d}s} - \frac{R(0)\sigma_u^{-1}}{1 + R(0)\sigma_u^{-2}\int_0^t u^2(s)\,\mathrm{d}s}\int_0^t u(s)\,\mathrm{d}W_u(s). \tag{9.37b}$$
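Since the expressions in (9.37) are path-wise (they involve the realized trajectory through the integral of u²), they can be checked directly along a single simulated path. The sketch below (coefficient values are illustrative assumptions) accumulates the integral and compares the numerically propagated filter variance with (9.37a):

```python
import numpy as np

rng = np.random.default_rng(6)
dt, T = 1e-3, 10.0
A0, su, g_true = 2.0, 0.5, 1.0     # illustrative values for (9.35)
u, mu, R, R0 = 0.0, 0.0, 5.0, 5.0
int_u2 = 0.0                        # running value of int_0^t u^2(s) ds
for _ in range(int(T / dt)):
    du = (A0 - g_true * u) * dt + su * np.sqrt(dt) * rng.standard_normal()
    inn = du - (A0 - mu * u) * dt
    mu += -R * u / su**2 * inn      # observation coefficient of gamma is -u
    R += -(R * u / su)**2 * dt
    int_u2 += u**2 * dt
    u += du

R_formula = R0 / (1 + R0 / su**2 * int_u2)   # (9.37a) along the same path
print(R, R_formula, mu)   # filter variance matches the closed-form expression
```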

The long-term behavior of (9.37) can be further simplified. Apply the Reynolds decomposition to the random variable u (see Sect. 4.1.1),

$$u(t) = \bar{u}(t) + u'(t) \quad\text{with}\quad \langle u'\rangle = 0 \ \text{ and } \ \langle u'\bar{u}\rangle = 0, \tag{9.38}$$

where ū(t) represents the ensemble mean of u at a fixed time t. Thus,

$$\int_0^t u^2(s)\,\mathrm{d}s = \int_0^t \bar{u}^2(s)\,\mathrm{d}s + \int_0^t \left(u'(s)\right)^2\,\mathrm{d}s. \tag{9.39}$$

Utilizing ergodicity, the two integrals on the right-hand side of (9.39) are given by

$$\lim_{t\to\infty}\frac{1}{t}\int_0^t \bar{u}^2(s)\,\mathrm{d}s = \int_{-\infty}^{\infty}\bar{u}^2\,p_{eq}(u)\,\mathrm{d}u = \left(\frac{A_0}{\gamma^*}\right)^2, \qquad \lim_{t\to\infty}\frac{1}{t}\int_0^t \left(u'(s)\right)^2\,\mathrm{d}s = \int_{-\infty}^{\infty}(u')^2\,p_{eq}(u)\,\mathrm{d}u = \frac{\sigma_u^2}{2\gamma^*}, \tag{9.40}$$

respectively, where p_eq(u) is the equilibrium Gaussian distribution associated with the system (9.35). Thus, the long-term behavior of (9.37) simplifies to

$$R(t) \approx \frac{R(0)}{1 + R(0)\sigma_u^{-2}A_0^2(\gamma^*)^{-2}t + R(0)(2\gamma^*)^{-1}t}, \tag{9.41a}$$

$$\mu(t)-\gamma^* \approx \frac{\mu(0)-\gamma^*}{1 + R(0)\sigma_u^{-2}A_0^2(\gamma^*)^{-2}t + R(0)(2\gamma^*)^{-1}t} - \frac{R(0)\sigma_u^{-1}}{1 + R(0)\sigma_u^{-2}A_0^2(\gamma^*)^{-2}t + R(0)(2\gamma^*)^{-1}t}\int_0^t u(s)\,\mathrm{d}W_u(s). \tag{9.41b}$$

Similar to the situation of estimating one additive parameter in (9.31), the convergence of both the error and the uncertainty in (9.41) is at an algebraic rate t⁻¹. However, the convergence in (9.41) strongly depends on the prefactor A₀. When A₀ is zero, the denominator of the terms on the right-hand side of (9.41) becomes (1 + R(0)(2γ*)⁻¹t), which is independent of the noise amplitude σ_u. On the other hand, when A₀ is far from zero, decreasing the noise level σ_u accelerates the convergence. A nearly zero A₀ implies that the mean state of u is nearly zero as well, and the system (9.36) loses practical observability. Simply adding noise does not help regain observability; thus, the convergence of the algorithm does not depend on the noise level. On the other hand, there is no simple closed expression for the error estimation in the framework utilizing the stochastic parameterized equations (9.28), but numerical experiments show that the convergence is exponential, as in the case of estimating an additive parameter.

So far, the focus has been on the parameter estimation skill utilizing one observational trajectory. In some applications, such as Lagrangian data assimilation, repeated experiments are available. Therefore, it is worthwhile to study the parameter estimation skill given a set of independent observations, where 'independent' means the noise in generating each trajectory is independent. Assume the number of independently observed trajectories is L. Corresponding to (9.36), the parameter estimation framework utilizing the direct approach is given by

$$\mathrm{d}\mathbf{u} = (\mathbf{A}_0 - \gamma\mathbf{u})\,\mathrm{d}t + \boldsymbol{\Sigma}_u\,\mathrm{d}\mathbf{W}_u, \qquad \mathrm{d}\gamma = 0, \tag{9.42}$$

where $\mathbf{u}$ is an L × 1 column vector representing the L independent observations. All the entries of the L × 1 column vector $\mathbf{A}_0$ are equal to A₀. In addition, $\mathbf{W}_u$ is also an L × 1 column vector, with each entry being an independent Wiener process. Finally, $\boldsymbol{\Sigma}_u$ is an L × L diagonal matrix, where each diagonal entry is σᵤ.

Proposition 9.4 In estimating the multiplicative parameter γ* in (9.35) within the parameter estimation framework utilizing the direct approach (9.42) with L independently observed trajectories, the posterior variance R(t) and the error in the posterior mean μ(t) − γ* have the following closed analytic expressions,

$$R(t) = \frac{R(0)}{1 + LR(0)\sigma_u^{-2}\int_0^t u^2(s)\,\mathrm{d}s}, \tag{9.43a}$$

$$\mu(t)-\gamma^* = \frac{\mu(0)-\gamma^*}{1 + LR(0)\sigma_u^{-2}\int_0^t u^2(s)\,\mathrm{d}s} - \frac{R(0)\sigma_u^{-1}}{1 + LR(0)\sigma_u^{-2}\int_0^t u^2(s)\,\mathrm{d}s}\int_0^t \mathbf{u}(s)\cdot\mathrm{d}\mathbf{W}_u(s). \tag{9.43b}$$

Comparing (9.37) and (9.43), the asymptotic convergence with L independent trajectories within the direct approach framework is enhanced by a factor L in front of t. Thus, increasing the number of independent observations accelerates the convergence, but the convergence rate remains algebraic.
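The factor-L speed-up in (9.43) can be illustrated with a short simulation. In the sketch below (coefficient values are illustrative assumptions, not from the text), the same scalar filter is driven by L = 1 and L = 8 independent trajectories:

```python
import numpy as np

rng = np.random.default_rng(2)
dt, T = 1e-3, 20.0
A0, su, g_true = 2.0, 0.5, 1.0   # illustrative values for (9.35)

def direct_estimate(L):
    """Direct-approach filter (9.42) driven by L independent trajectories."""
    u = np.zeros(L)
    mu, R = 0.0, 5.0
    for _ in range(int(T / dt)):
        du = (A0 - g_true * u) * dt + su * np.sqrt(dt) * rng.standard_normal(L)
        inn = du - (A0 - mu * u) * dt          # innovation vector
        mu += -R / su**2 * np.dot(u, inn)      # each entry observes gamma via -u_l
        R += -(R / su)**2 * np.dot(u, u) * dt
        u += du
    return mu, R

mu1, R1 = direct_estimate(1)
mu8, R8 = direct_estimate(8)
print(R1, R8)   # posterior variance shrinks roughly 8 times faster with L = 8
```

The ratio R1/R8 lands near L = 8, consistent with the extra factor L in front of t in (9.43a); the rate, however, remains algebraic.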

9.3.4 Estimating Parameters in a Cubic Nonlinear Scalar Model

To understand the difference between estimating parameters in nonlinear and linear dynamics, the last simple example here concerns a cubic nonlinear scalar model,

$$\mathrm{d}u = (A_0 - \gamma^* u^3)\,\mathrm{d}t + \sigma_u\,\mathrm{d}W_u, \tag{9.44}$$

where γ* > 0 ensures the mean stability. See [56] for estimating parameters in other nonlinear models. The analytic expressions of the posterior uncertainty and the error in the posterior mean are available for (9.44) utilizing the direct approach (9.27),

$$\mathrm{d}u = (A_0 - \gamma u^3)\,\mathrm{d}t + \sigma_u\,\mathrm{d}W_u, \tag{9.45a}$$
$$\mathrm{d}\gamma = 0. \tag{9.45b}$$

Proposition 9.5 For any odd number k, the framework utilizing the direct approach (9.27) to estimate the parameter γ* in du = (A₀ − γ*uᵏ)dt + σᵤdWᵤ is given by

$$\mathrm{d}u = (A_0 - \gamma u^k)\,\mathrm{d}t + \sigma_u\,\mathrm{d}W_u, \tag{9.46a}$$
$$\mathrm{d}\gamma = 0. \tag{9.46b}$$

The posterior variance R(t) and the error in the posterior mean μ(t) − γ* associated with system (9.46) have the following closed analytic expressions,

$$R(t) = \frac{R(0)}{1 + R(0)\sigma_u^{-2}\int_0^t u^{2k}(s)\,\mathrm{d}s}, \qquad \mu(t)-\gamma^* = \frac{\mu(0)-\gamma^*}{1 + R(0)\sigma_u^{-2}\int_0^t u^{2k}(s)\,\mathrm{d}s} - \frac{R(0)\sigma_u^{-1}}{1 + R(0)\sigma_u^{-2}\int_0^t u^{2k}(s)\,\mathrm{d}s}\int_0^t u(s)\,\mathrm{d}W_u(s). \tag{9.47}$$

Applying the Reynolds decomposition (9.38), $\int_0^t u^{2k}(s)\,\mathrm{d}s$ can be rewritten as

$$\int_0^t u^{2k}(s)\,\mathrm{d}s = \int_0^t \left(\bar{u}(s)+u'(s)\right)^{2k}\,\mathrm{d}s = \int_0^t \sum_{m=0}^{2k}\binom{2k}{m}\,\bar{u}^m\cdot\left(u'(s)\right)^{2k-m}\,\mathrm{d}s. \tag{9.48}$$

Regarding the cubic model in (9.45), the index k in (9.46) and (9.48) is set to be k = 3. Further consider the situation with A₀ = 0, which implies that the system loses practical observability, with ū = 0 at the equilibrium. Clearly, the only non-zero term on the right-hand side of (9.48) in the long run is $\int_0^t (u'(s))^6\,\mathrm{d}s$. Since ū = 0, u is utilized to replace u' for notational simplicity. Applying the ergodicity of u,

$$\lim_{t\to\infty}\frac{1}{t}\int_0^t u^6(s)\,\mathrm{d}s = \int_{-\infty}^{\infty}u^6\,p_{eq}(u)\,\mathrm{d}u,$$

where the analytic expression of the equilibrium PDF p_eq(u) is given by [178],

$$p_{eq}(u) = N_0\exp\left(-\frac{2}{\sigma_u^2}\cdot\frac{\gamma^* u^4}{4}\right).$$

Then direct calculation shows that

$$\int_{-\infty}^{\infty}u^6\,p_{eq}(u)\,\mathrm{d}u = \left(\frac{2}{\gamma^*}\right)^{3/2}\Gamma\!\left(\frac{7}{4}\right)\Gamma\!\left(\frac{1}{4}\right)^{-1}\sigma_u^3, \tag{9.49}$$

where Γ is the Gamma function [3]. Therefore, the posterior variance R(t) and the error in the posterior mean μ(t) − γ* for the long-term behavior utilizing the direct approach (9.45) with A₀ = 0 have the following closed analytic forms,

$$R(t) \approx \frac{R(0)}{1 + \tilde{c}R(0)\sigma_u t}, \tag{9.50a}$$
$$\mu(t)-\gamma^* \approx \frac{\mu(0)-\gamma^*}{1 + \tilde{c}R(0)\sigma_u t} - \frac{R(0)\sigma_u^{-1}}{1 + \tilde{c}R(0)\sigma_u t}\int_0^t u(s)\,\mathrm{d}W_u(s), \tag{9.50b}$$

where the constant $\tilde{c} = (2/\gamma^*)^{3/2}\,\Gamma(7/4)/\Gamma(1/4)$. Comparing the results in (9.50) for the cubic nonlinear system (9.44) with those in (9.41) for the linear system (9.35), the most significant difference is the role of the noise σᵤ. In the linear model without practical observability, i.e., A₀ = 0, the convergence rate has no dependence on σᵤ. In contrast, in the cubic nonlinear model, increasing the noise σᵤ accelerates the convergence! This may seem counterintuitive. However, the cubic nonlinearity, serving as the damping in (9.44), traps the state variable u in the region around its attractor u = 0 more severely than in the linear model. Since the system has no practical observability around u = 0, an enhanced σᵤ is preferred to increase the amplitude of u and thus improve the parameter estimation skill.
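The counterintuitive role of σᵤ in (9.50) can be checked numerically. The sketch below (γ* = 1, A₀ = 0, and all other values are illustrative assumptions) runs the direct-approach filter (9.45) for a small and a large noise level:

```python
import numpy as np

rng = np.random.default_rng(3)
dt, T = 1e-3, 20.0
g_true = 1.0                      # true damping coefficient gamma*

def posterior_var(su):
    """Direct approach (9.45) for du = -gamma* u^3 dt + su dWu (A0 = 0)."""
    u, mu, R = 0.0, 0.0, 5.0
    for _ in range(int(T / dt)):
        du = -g_true * u**3 * dt + su * np.sqrt(dt) * rng.standard_normal()
        inn = du + mu * u**3 * dt          # innovation; obs. coefficient is -u^3
        mu += -R * u**3 / su**2 * inn
        R += -(R * u**3 / su)**2 * dt
        u += du
    return R

R_small = posterior_var(0.5)
R_large = posterior_var(2.0)
print(R_small, R_large)   # larger noise gives the smaller posterior variance
```

Consistent with (9.50a), the larger noise level produces the smaller posterior uncertainty, because it pushes u out of the unobservable neighborhood of the attractor u = 0.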

9.3.5 A Numerical Example

Finally, a numerical experiment is utilized to provide more intuition about the two parameter estimation algorithms. Consider estimating the three parameters σ, ρ and β in the noisy L63 model (5.25), where the true values are ρ* = 28, σ* = 10 and β* = 8/3. The direct approach (9.27b) leads to:


Fig. 9.4 Comparison of the two parameter estimation approaches for the noisy L63 model. Panels (a)–(c): the direct algorithm (9.51). Panel (d): the approach with the stochastic parameterized equations (9.52). Only the estimation of ρ is shown.

$$\begin{aligned} \mathrm{d}x &= \sigma(y-x)\,\mathrm{d}t + \sigma_x\,\mathrm{d}W_x, & \mathrm{d}\sigma &= 0,\\ \mathrm{d}y &= \big(x(\rho-z)-y\big)\,\mathrm{d}t + \sigma_y\,\mathrm{d}W_y, & \mathrm{d}\rho &= 0,\\ \mathrm{d}z &= (xy-\beta z)\,\mathrm{d}t + \sigma_z\,\mathrm{d}W_z, & \mathrm{d}\beta &= 0, \end{aligned} \tag{9.51}$$

while the approach with the stochastic parameterized equations (9.28b) gives the following system:

$$\begin{aligned} \mathrm{d}x &= \sigma(y-x)\,\mathrm{d}t + \sigma_x\,\mathrm{d}W_x, & \mathrm{d}\sigma &= -d_\sigma(\sigma-\hat{\sigma})\,\mathrm{d}t + \sigma_\sigma\,\mathrm{d}W_\sigma,\\ \mathrm{d}y &= \big(x(\rho-z)-y\big)\,\mathrm{d}t + \sigma_y\,\mathrm{d}W_y, & \mathrm{d}\rho &= -d_\rho(\rho-\hat{\rho})\,\mathrm{d}t + \sigma_\rho\,\mathrm{d}W_\rho,\\ \mathrm{d}z &= (xy-\beta z)\,\mathrm{d}t + \sigma_z\,\mathrm{d}W_z, & \mathrm{d}\beta &= -d_\beta(\beta-\hat{\beta})\,\mathrm{d}t + \sigma_\beta\,\mathrm{d}W_\beta. \end{aligned} \tag{9.52}$$

In (9.52), σ̂ = 1.5σ*, ρ̂ = 1.5ρ* and β̂ = 1.5β*, so that the center of the prior distribution of each parameter does not equal the truth. In addition, d_σ = d_ρ = d_β = 0.5, σ_σ = σ*/3, σ_ρ = ρ*/3 and σ_β = β*/3 are chosen to allow a moderate uncertainty in the prior information. In both approaches, the initial guess of each of the three parameters follows a Gaussian distribution N(0, 2). Panels (a)–(c) of Fig. 9.4 show the estimated parameter ρ using the direct algorithm (9.51) (the results are similar for the other parameters). It is seen that the convergence becomes slower when the system noise increases. Panel (d) shows the result using the approach with the stochastic parameterized equations (9.52), where the model setup and the observed trajectories are the same as those in Panel (c). With the prior information incorporated, the convergence towards the statistical equilibrium becomes much faster. The time average of the curve after it reaches the statistical equilibrium is very close to the truth and rectifies a large amount of the error in the prior mean (50%). These results also indicate the possibility of combining the two methods: (9.28b) can be utilized in the short term to reach a rapid convergence, and then (9.27b) is adopted to arrive at a more accurate solution.
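Because σ, ρ and β enter the drift of (9.51) linearly, the direct approach reduces to a conditionally Gaussian filter for the three-dimensional parameter vector, with an explicit Riccati equation for the posterior covariance. The following sketch implements it; the time step, initial covariance, and integration window are illustrative choices rather than the exact setup behind Fig. 9.4:

```python
import numpy as np

rng = np.random.default_rng(7)
dt, T = 1e-3, 20.0
sig_t, rho_t, beta_t = 10.0, 28.0, 8.0 / 3.0   # true L63 parameters
noise = np.array([1.0, 1.0, 1.0])               # sigma_x, sigma_y, sigma_z
S2inv = np.diag(1.0 / noise**2)

x, y, z = 1.0, 1.0, 25.0
mu = np.zeros(3)                 # posterior mean of (sigma, rho, beta)
R = 0.5 * np.eye(3)              # posterior covariance (illustrative prior)
for _ in range(int(T / dt)):
    drift = np.array([sig_t * (y - x), x * (rho_t - z) - y, x * y - beta_t * z])
    dU = drift * dt + noise * np.sqrt(dt) * rng.standard_normal(3)
    # the drift splits as a0 + A1 @ (sigma, rho, beta): parameters enter linearly
    a0 = np.array([0.0, -x * z - y, x * y])
    A1 = np.array([[y - x, 0.0, 0.0], [0.0, x, 0.0], [0.0, 0.0, -z]])
    K = R @ A1.T @ S2inv
    mu = mu + K @ (dU - (a0 + A1 @ mu) * dt)
    R = R - K @ A1 @ R * dt
    x, y, z = x + dU[0], y + dU[1], z + dU[2]

print(mu)   # approaches (10, 28, 8/3)
```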

9.4 Learning Complex Dynamical Systems with Sparse Identification

Learning the underlying dynamical systems from data is a crucial topic in practice. Detecting the model structure and estimating the parameters are often combined in a data-driven learning process. One widely used approach is to build an extensive library of candidate functions and assume a complicated starting model that includes all of these candidate functions. The critical step is then to apply a certain sparse identification technique to exclude most candidates with little or no contribution to the dynamics. The remaining few candidate functions lead to a parsimonious model [39, 67, 189, 198, 240]. Consider a starting model,

$$\mathrm{d}\mathbf{u} = \mathbf{F}(\mathbf{u};\boldsymbol{\theta})\,\mathrm{d}t + \boldsymbol{\Sigma}\,\mathrm{d}\mathbf{W} = \boldsymbol{\Xi}\cdot\mathbf{f}(\mathbf{u})\,\mathrm{d}t + \boldsymbol{\Sigma}\,\mathrm{d}\mathbf{W}, \tag{9.53}$$

where $\mathbf{u}\in\mathbb{R}^n$ is a collection of the state variables, $\mathbf{F}(\mathbf{u};\boldsymbol{\theta})$ is the nonlinear deterministic part of the dynamics, $\boldsymbol{\Sigma}\in\mathbb{R}^{n\times m}$ is the noise coefficient, and $\mathbf{W}\in\mathbb{R}^m$ is a Wiener process. The deterministic part of the right-hand side of the model is further written as $\boldsymbol{\Xi}\cdot\mathbf{f}(\mathbf{u})$, where $\boldsymbol{\Xi}\in\mathbb{R}^{n\times r}$ is the coefficient matrix while $\mathbf{f}(\mathbf{u})\in\mathbb{R}^r$ is a column vector containing the candidate functions. Note that f is only a function of the state variable u, as it comes from the library of candidate functions, while F depends on both the state variables and the parameters. The parameters in F are represented by a vector θ of size nr, which forms the entries of Ξ. Usually, r is quite large in the starting model. The goal is to find a subset of θ, denoted by $\tilde{\boldsymbol{\theta}}\in\mathbb{R}^p$ with p ≪ nr, such that the associated candidate functions have significant contributions to the underlying dynamics. The resulting model then has the form

$$\mathrm{d}\mathbf{u} = \tilde{\mathbf{F}}(\mathbf{u};\tilde{\boldsymbol{\theta}})\,\mathrm{d}t + \boldsymbol{\Sigma}\,\mathrm{d}\mathbf{W}. \tag{9.54}$$

Note that if continuous-in-time data are available, which is the focus here, then Σ is effectively determined by the quadratic variation. Therefore, the main target of the sparse identification is the deterministic part.

9.4.1 Constrained Optimization for Sparse Identification

Constrained optimization can be utilized for sparse identification. One natural idea is to incorporate an L¹ regularization into the regression scheme for parameter estimation, which is known as the LASSO regression (Least Absolute Shrinkage and Selection Operator) [224, 249],

$$\boldsymbol{\theta} = \arg\min_{\boldsymbol{\theta}'}\; \left\|\mathbf{F}(\mathbf{u};\boldsymbol{\theta}') - \mathrm{d}\mathbf{u}/\mathrm{d}t\right\|^2 + \lambda\left\|\boldsymbol{\theta}'\right\|_1, \tag{9.55}$$


where the first term on the right-hand side is the original quadratic optimization criterion based on least-squares regression, while the second term is the L¹ regularization. From (9.55), it is seen that the L¹ regularization adds a penalty equal to the absolute value of the magnitude of each coefficient. This type of regularization can result in sparse models with few nonzero coefficients: the remaining coefficients are driven exactly to zero, and the associated terms are eliminated from the model. Larger penalties result in coefficient values closer to zero, which is ideal for producing models with simpler structures. With the L¹ norm added, the optimization problem in (9.55) is no longer quadratic. Nevertheless, computational methods have been developed to solve the optimization problem in (9.55) [39, 40, 228, 262]. Note that if the L² regularization is utilized instead, known as ridge regression, then the amplitudes of the coefficients θ are reduced, but the coefficients do not become sparse. The regularization parameter λ in (9.55) needs to be determined in advance, before applying the constrained optimization.
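The L¹ problem (9.55) can be solved, for instance, with proximal-gradient (soft-thresholding) iterations. The sketch below is a generic least-squares library regression with synthetic data; ISTA is just one of many possible solvers, and all numerical values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

def lasso_ista(Phi, b, lam, n_iter=5000):
    """Minimize ||Phi @ theta - b||_2^2 + lam * ||theta||_1 by ISTA:
    a gradient step on the quadratic part, then soft thresholding."""
    step = 1.0 / (2.0 * np.linalg.norm(Phi, 2) ** 2)   # 1/Lipschitz of gradient
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * Phi.T @ (Phi @ theta - b)
        z = theta - step * grad
        theta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return theta

# synthetic library regression: the "du/dt" data uses only columns 0 and 3
Phi = rng.standard_normal((500, 8))                    # 8 candidate functions
b = 3.0 * Phi[:, 0] - 2.0 * Phi[:, 3] + 0.1 * rng.standard_normal(500)
theta = lasso_ista(Phi, b, lam=20.0)
print(np.round(theta, 2))   # only entries 0 and 3 survive the L1 penalty
```

Note the small shrinkage bias of the surviving coefficients towards zero, which is the price paid for the sparsity induced by the L¹ penalty.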

9.4.2 Using Information Theory for Model Identification with Sparsity

Another useful way to achieve model identification with sparsity is via information theory. Unlike the LASSO regression, the information-theory-based model identification exploits causation to pre-determine the terms that significantly contribute to the dynamics and then carries out the parameter estimation. One unique feature of such a method is that the causation computed from information theory reflects certain underlying physical relationships between different state variables. Let us rewrite the deterministic part of (9.53) in the following component-wise form:

$$\begin{pmatrix} \mathrm{d}u_1/\mathrm{d}t\\ \mathrm{d}u_2/\mathrm{d}t\\ \vdots\\ \mathrm{d}u_n/\mathrm{d}t \end{pmatrix} = \begin{pmatrix} \xi_{1,1} & \cdots & \xi_{1,r}\\ \xi_{2,1} & \cdots & \xi_{2,r}\\ \vdots & \ddots & \vdots\\ \xi_{n,1} & \cdots & \xi_{n,r} \end{pmatrix} \begin{pmatrix} f_1(u_1(t),\ldots,u_n(t),t)\\ f_2(u_1(t),\ldots,u_n(t),t)\\ \vdots\\ f_r(u_1(t),\ldots,u_n(t),t) \end{pmatrix} = \boldsymbol{\Xi}\times\mathbf{f}(\mathbf{u}(t),t). \tag{9.56}$$

To incorporate physical evidence into this identification process, the following causal inference is utilized [86, 87]. Denote by $C_{f_i\to\dot{u}_j|[\mathbf{f}\backslash f_i]}$ the causation entropy of f_i(t) on du_j/dt conditioned on all the candidate functions f except f_i, which allows one to explore the contribution to du_j/dt that comes solely from f_i(t). If such a causation entropy is zero (or practically nearly zero), then f_i(t) does not contribute any information to du_j/dt, and the associated parameter ξ_{j,i} is set to zero. By computing such a causation entropy for each i = 1, ..., r and j = 1, ..., n, a sparse matrix Ξ is reached. Then a simple maximum likelihood estimation based on a quadratic optimization (similar to those in Sect. 9.2.2) can easily be applied to determine the actual values of the non-zero entries of Ξ. The causation entropy $C_{f_i\to\dot{u}_j|[\mathbf{f}\backslash f_i]}$ is defined as follows,

$$C_{f_i\to\dot{u}_j|[\mathbf{f}\backslash f_i]} = H\!\left(\dot{u}_j \,\middle|\, [\mathbf{f}\backslash f_i]\right) - H\!\left(\dot{u}_j \,\middle|\, \mathbf{f}\right). \tag{9.57}$$


In (9.57), H(·|·) is the conditional entropy, which is related to Shannon's entropy H(·) and the joint entropy H(·,·); see Sect. 2.3. The difference on the right-hand side of (9.57) naturally represents the contribution of f_i to u̇_j. The causation entropy has unique advantages in identifying model structure in the presence of indirect coupling between features and stochastic noise [212], which are crucial features of complex turbulent systems. In addition, with the model structure pre-determined from the causal relationship, the parameter estimation remains a quadratic optimization problem, which is much easier to solve than a constrained optimization involving an L¹ regularization. However, the direct calculation of the causation entropy in (9.57) is nontrivial, as reconstructing the exact PDFs from a given time series is very challenging due to the curse of dimensionality. Nevertheless, since determining the model structure only depends on whether the causation entropy is zero or not, rather than on its exact value, a Gaussian approximation can be utilized to compute the entropies. This allows the causation entropy (9.57) to be expressed by closed analytic formulae [177] and calculated in high dimensions. Note that the Gaussian approximation does not change the nonlinear nature of the system (see also Sect. 7.7). In fact, in many cases, if a significant causal relationship is detected in the higher-order moments, it is very likely present in the Gaussian approximation as well. This allows us to efficiently determine the structure of the sparse matrix Ξ.

Proposition 9.6 (Practical calculation of the causation entropy) By approximating all the joint and marginal distributions as Gaussians, the causation entropy can be computed in the following way:

$$\begin{aligned} C_{Z\to X|Y} &= H(X|Y) - H(X|Y,Z) = H(X,Y) - H(Y) - H(X,Y,Z) + H(Y,Z)\\ &= \frac{1}{2}\ln(\det(\mathbf{R}_{XY})) - \frac{1}{2}\ln(\det(\mathbf{R}_{Y})) - \frac{1}{2}\ln(\det(\mathbf{R}_{XYZ})) + \frac{1}{2}\ln(\det(\mathbf{R}_{YZ})), \end{aligned} \tag{9.58}$$

where $\mathbf{R}_{XYZ}$ denotes the covariance matrix of the state variables (X, Y, Z)ᵀ, and similarly for the other covariances. As an illustration, assume a three-dimensional time series (x, y, z)ᵀ generated from the noisy L63 model (5.25) is available, with the true parameters being σ = 10, ρ = 28, β = 8/3 and σₓ = σ_y = σ_z = 1, but the model itself is completely unknown. The goal is to learn the underlying dynamics from the observational data. The observational time series is assumed to have a total length of 100 time units, and the numerical integration time step is Δt = 0.001. Since it is a three-dimensional system, it is natural to assume that the model has the following ansatz as a start,


$$\begin{aligned} \mathrm{d}x &= (\theta^x_x x + \theta^x_y y + \theta^x_z z + \theta^x_{xy}xy + \theta^x_{yz}yz + \theta^x_{zx}zx + \theta^x_{xx}x^2 + \theta^x_{yy}y^2 + \theta^x_{zz}z^2)\,\mathrm{d}t + \sigma_x\,\mathrm{d}W_x,\\ \mathrm{d}y &= (\theta^y_x x + \theta^y_y y + \theta^y_z z + \theta^y_{xy}xy + \theta^y_{yz}yz + \theta^y_{zx}zx + \theta^y_{xx}x^2 + \theta^y_{yy}y^2 + \theta^y_{zz}z^2)\,\mathrm{d}t + \sigma_y\,\mathrm{d}W_y,\\ \mathrm{d}z &= (\theta^z_x x + \theta^z_y y + \theta^z_z z + \theta^z_{xy}xy + \theta^z_{yz}yz + \theta^z_{zx}zx + \theta^z_{xx}x^2 + \theta^z_{yy}y^2 + \theta^z_{zz}z^2)\,\mathrm{d}t + \sigma_z\,\mathrm{d}W_z, \end{aligned} \tag{9.59}$$

+ σz dWz . (9.59) where θba means the parameter appears in the equation of ‘a’ and is the coefficient of the ‘b’ term. The right-hand side of (9.59) includes all the linear and quadratic terms, mimicking the general form of the geophysical flows, the nonlinearity of which is dominated by the quadratic ones. If needed, higher-order nonlinear terms can be easily included in the starting model (9.59). Since the time series of x, y, and z are available, it is straightforward to compute the time series of x y, yz, and other nonlinear terms on the right-hand side of (9.59). For each dimension, there are nine causation entropies. Applying the formula (9.58), the values of the causation entropy are shown in Table 9.2. It is seen that the causation entropies lead to the result that has a perfect match with the true system in identifying the terms that should be maintained on the right-hand side of (9.59). Then, applying the likelihood-based parameter estimation algorithm, the resulting parameter values are summarized in the second row of Table 9.3. The parameters in the deterministic part are very close to the truth. Yet, the three noise coefficients are estimated with certain errors. The inaccuracy in calculating the noise coefficients is due to the use of a finite t. If t is smaller, then the estimation of these noise coefficients from the quadratic variation would be more accurate. Yet, the discretization error always exists in practice, and such an error is commonly seen in chaotic systems. Despite this, the model simulation using the identified model with the estimated parameters is quite similar to the truth and reproduces almost the exact statistics. y One interesting finding from Table 9.2 is that the term θ y y has a very small causation entropy to dy/ dt. Thus, a natural follow-up test is to remove this term from the identified model and study the resulting system. 
The third row of Table 9.3 lists the associated parameters, which are slightly different from those in the second row to compensate for the y elimination of θ y y. The model simulation still highly resembles the truth. This means the role played by y in the second equation of (9.59) is very weak, and a more parsimonious model without such a term can be used. It is also helpful to understand if the method is robust. To this end, consider another case with σx = σ y = σz = 10 in generating the true signal. The causation entropy leads to a similar model structure as that identified in Table 9.2. The estimated parameters are listed in the last two rows of Table 9.3. The noise coefficients are easier to estimate with the increase in the noise level. In addition, the parameters in the deterministic part remain accurate. These results ensure that the model trajectories generated from the identified model


Table 9.2 The causation entropy of (9.59). Entries in bold indicate the terms appearing in the noisy L63 model (5.25). The sparse identification from the causation entropy perfectly matches the truth, where the causation entropy for those terms that do not appear in the noisy L63 model is essentially zero.

          x        y        z        xy       yz       zx       x²       y²       z²
  dx/dt   0.0160   0.0365   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000   0.0000
  dy/dt   0.1130   0.0004   0.0000   0.0000   0.0000   0.1724   0.0000   0.0000   0.0000
  dz/dt   0.0000   0.0000   0.0031   0.0231   0.0000   0.0000   0.0000   0.0000   0.0000

Table 9.3 Likelihood-based parameter estimation after identifying the model structure. The top three rows show the case with small noise, where σₓ = σ_y = σ_z = 1. The bottom three rows show the case with large noise, where σₓ = σ_y = σ_z = 10. 'Est 1' and 'Est 2' represent the results with and without the term θ^y_y y in the governing equation of y of the identified model, respectively.

          θ^x_x     θ^x_y    θ^y_x    θ^y_y     θ^y_zx    θ^z_z     θ^z_xy   σx       σy       σz
  Truth   −10.000   10.000   28.000   −1.0000   −1.0000   −2.6667   1.0000   1.0000   1.0000   1.0000
  Est 1   −9.9749   9.9671   28.052   −1.0135   −1.0017   −2.6669   1.0017   1.6093   2.1796   2.5171
  Est 2   −9.9749   9.9671   26.362   /         −0.9787   −2.6057   0.9787   1.6093   2.1796   2.5171
  Truth   −10.000   10.000   28.000   −1.0000   −1.0000   −2.6667   1.0000   10.000   10.000   10.000
  Est 1   −9.9496   10.135   27.998   −0.9248   −0.9994   −2.6447   0.9994   10.144   10.304   10.378
  Est 2   −9.9496   10.135   26.688   /         −0.9842   −2.6055   0.9842   10.144   10.304   10.378

are qualitatively similar to the observations and that the statistics of the identified model are almost the same as the truth.
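Under the Gaussian approximation, (9.58) requires only log-determinants of sample covariance matrices. Below is a minimal sketch of that computation (the function name is hypothetical, and the data is a synthetic linear example rather than the L63 experiment above):

```python
import numpy as np

def causation_entropy(x, y, z):
    """Gaussian approximation (9.58) of C_{z -> x | y} for scalar time series:
    0.5*(ln det R_xy - ln det R_y - ln det R_xyz + ln det R_yz)."""
    def logdet_cov(*series):
        # sample covariance of the stacked series; atleast_2d handles 1-D case
        C = np.atleast_2d(np.cov(np.column_stack(series).T))
        return np.linalg.slogdet(C)[1]
    return 0.5 * (logdet_cov(x, y) - logdet_cov(y)
                  - logdet_cov(x, y, z) + logdet_cov(y, z))

rng = np.random.default_rng(5)
n = 20000
z = rng.standard_normal(n)
y = rng.standard_normal(n)
w = rng.standard_normal(n)                  # independent "decoy" candidate
x = 2.0 * z + y + 0.5 * rng.standard_normal(n)
print(causation_entropy(x, y, z))           # clearly positive: z drives x
print(causation_entropy(x, y, w))           # essentially zero: w is irrelevant
```

In the sparse identification workflow, this scalar test would be applied to every (candidate function, equation) pair, and the entries whose causation entropy is practically zero are removed before the quadratic parameter estimation.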

9.4.3 Partial Observations and Stochastic Parameterizations

In the situation with partial observations, learning the exact dynamics of the unobserved state variables may not always be feasible, as the noise may affect the observability. Nevertheless, if the primary goal is to recover the dynamics of the observed variables (which often represent the large-scale or resolved states), then a suitable stochastic parameterization of the unobserved variables can be incorporated into the learning process. Usually, a certain form of stochastic parameterization with tractable features is utilized to facilitate the learning process. In particular, the CGNS (8.1) can be an appropriate choice of stochastic parameterization, where X represents the observed state variables, the dynamics of which are fully nonlinear, while the stochastic parameterization is given by Y, which satisfies conditionally linear processes. Such a family of stochastic parameterizations has been widely used in geophysics and engineering, including stochastic superparameterization, dynamical super-resolution, and various stochastic forecast models [38, 54, 62, 102, 120].


Now, given an initial guess of the model structure of the observed state variables, the stochastic parameterizations of the unobserved variables, and the model parameters, the above learning algorithm can be modified to include an iterative procedure that alternates between three steps until the solution converges.

1. Conditioned on the observed time series of X, sample a time series of the unobserved state variables Y. The conditional sampling of Y in the CGNS is achieved utilizing the closed analytic formula (8.11) and is computationally inexpensive.
2. Treating the sampled trajectory of Y as artificial "observations", compute the causality-based information transfer from each candidate function in the library to the time evolution of the given state variable. Determine the model structure and the form of the stochastic parameterization based on this causal inference.
3. Utilize a simple maximum likelihood estimation to compute the coefficients of the selected functions.

10 Combining Stochastic Models with Machine Learning

10.1 Machine Learning

Machine learning has become a prevalent and powerful tool to advance the modeling and forecast of many complex dynamical systems. One of the main advantages of machine learning is its computational efficiency. In fact, once a machine learning model is trained, forecasting with it is often much cheaper than numerically integrating a traditional high-dimensional nonlinear dynamical model made up of explicit parametric terms. Another merit of machine learning is that it typically involves sophisticated nonlinear structures, which, after suitable calibration, allow its forecasts to capture more detailed features than traditional knowledge-based dynamical models. The latter are often represented by certain parametric forms determined by human knowledge and may suffer from model error due to the lack of a complete understanding of nature. Therefore, machine learning has great potential to forecast extremely complicated, high-dimensional, and nonlinear systems.

Despite these merits, several issues may exist in machine learning forecasts. First, the forecast skill of a machine learning model relies heavily on the quality of the training data. Unfortunately, there is a lack of adequate training data in many practical situations, and the training data may only be available for a subset of the state variables. For example, high-quality satellite data has been available for less than half a century, which is far from sufficient for studying many decadal or even interannual variabilities in geophysics, climate, atmosphere, and ocean science. Second, due to the intrinsically complicated structures of many machine learning architectures, such as neural networks, machine learning may not be the most appropriate tool for understanding the underlying physics. But it is an essential tool for outcome-driven tasks such as forecasts and parameterizations.

One natural way to utilize stochastic models to assist machine learning in the presence of limited data is to apply these stochastic models to generate long time series that calibrate the parameters in the machine learning model. Then the available small amount of observational information is adopted to further tune the parameters via transfer learning [251, 266]. Bayesian neural networks are another way to combine the outputs of a certain knowledge-based stochastic model with observations, and they also include uncertainty quantification [98, 156, 173, 255]. The remainder of this chapter presents a few additional ideas for combining stochastic models with machine learning to improve the understanding and forecast of nature that differ from transfer learning and traditional Bayesian neural networks. There are many other novel but more advanced methods, for example, [17, 233], which are, however, not the main focus of this chapter.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023. N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8_10

10.2 Data Assimilation with Machine Learning Forecast Models

Data assimilation is one of the scientific areas in which stochastic models and machine learning can benefit from each other. On the one hand, machine learning can serve as the forecast model in data assimilation to improve efficiency and accuracy at the forecast step. On the other hand, data assimilation not only facilitates the improvement of training data quality but also provides additional sample trajectories to train a machine learning model.

10.2.1 Using Machine Learning as Surrogate Models for Data Assimilation

Recall the two-step procedure of data assimilation from Chap. 5: the forecast and the analysis. A suitable reduced-order model is always preferred in the forecast step to reduce the computational cost of solving many practical problems, especially when ensemble-based methods are applied to high-dimensional complex systems. The reduced-order models are not limited to parametric forms. Neural networks and other machine learning models can serve as cheap surrogates that significantly improve computational efficiency, where the more complicated and expensive full-order dynamical model is utilized to generate a large amount of training data for the machine learning model. Some recent progress in developing such cheap and accurate machine learning surrogate models, including situations with sparse and noisy observations, can be found in [30, 33, 41, 47, 92, 115, 116, 191, 203, 206, 254]. Other studies focus on exploiting machine learning to create specific ensemble members. For example, a new method is proposed in [219] that trains neural networks to generate ensembles satisfying mass conservation (and other constraints), while the work in [119, 273] utilizes analog ensemble data assimilation to create artificial ensemble members with variational autoencoders.
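As a minimal sketch of this surrogate idea, the Python snippet below (all models and parameters are hypothetical, and a simple polynomial regression stands in for a neural network) fits a cheap surrogate to data generated by a "full" model and then uses the surrogate in the forecast step of a scalar ensemble Kalman filter:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Expensive" full model: one step of a damped, weakly nonlinear scalar map.
def full_model(x):
    return 0.9 * x + 0.1 * np.sin(x)

# Step 1: generate training data with the full model, then fit a cheap
# surrogate (a tiny polynomial regression stands in for a neural network).
X_train = rng.uniform(-3, 3, 500)
Y_train = full_model(X_train)
coeffs = np.polyfit(X_train, Y_train, deg=5)
surrogate = lambda x: np.polyval(coeffs, x)

# Step 2: ensemble Kalman filter whose forecast step calls the surrogate
# (with a small additive noise to maintain ensemble spread).
def enkf_step(ensemble, obs, obs_noise_std):
    forecast = surrogate(ensemble) + 0.05 * rng.standard_normal(ensemble.size)
    P = np.var(forecast, ddof=1)                  # forecast variance (H = 1)
    K = P / (P + obs_noise_std**2)                # Kalman gain
    perturbed_obs = obs + obs_noise_std * rng.standard_normal(forecast.size)
    return forecast + K * (perturbed_obs - forecast)

# A short assimilation experiment against a noisy true trajectory.
x_truth, ensemble = 1.0, rng.standard_normal(50)
for _ in range(20):
    x_truth = full_model(x_truth)
    obs = x_truth + 0.1 * rng.standard_normal()
    ensemble = enkf_step(ensemble, obs, obs_noise_std=0.1)

error = abs(ensemble.mean() - x_truth)
```

Once trained, every forecast step only evaluates the cheap surrogate, which is the source of the computational savings described above.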


10.2.2 Using Data Assimilation to Provide High-Quality Training Data for Machine Learning Models

In many situations, the direct observations are polluted by large observational noise and may only contain a subset of the state variables. Therefore, data assimilation with a suitable knowledge-based parametric model serves as a preprocessing method that improves the quality of the machine learning training data [5, 193]. In fact, data assimilation for preprocessing has already been widely used in geophysics and climate science, the result of which is known as reanalysis data [145, 256]. Special caution is needed when recovering the unobserved state variables from data assimilation. Recall from Fig. 8.1 that the ensemble mean time series may not always be the best choice to feed into the machine learning model as training data. In fact, the ensemble mean averages out many important dynamical properties and may fail to capture the variabilities when it suffers from the observability issue (see the discussions in Sect. 8.2.4). In contrast, the ensemble members, i.e., the sampled trajectories, are more dynamically consistent with nature. Notably, compared with the ensemble mean time series, the trajectories from the multiple ensemble members also increase the effective amount of training data, though these trajectories may correlate with each other to a certain extent.
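The point about the ensemble mean versus the sampled trajectories can be seen in a toy Python experiment (the OU "analysis ensemble" below is purely illustrative, standing in for posterior samples from a data assimilation run):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "analysis ensemble": K sampled trajectories of an OU process,
# standing in for posterior samples from data assimilation.
K, T, dt = 30, 2000, 0.01
traj = np.zeros((K, T))
for n in range(1, T):
    traj[:, n] = traj[:, n-1] - 1.0 * traj[:, n-1] * dt \
                 + np.sqrt(2 * dt) * rng.standard_normal(K)

mean_series = traj.mean(axis=0)          # ensemble mean time series

# The mean averages out fluctuations: its variance is far below that of a
# typical member, so it misrepresents the variability of the process.
var_member = traj.var(axis=1).mean()
var_mean = mean_series.var()

# Training pairs (x_t -> x_{t+1}) built from all members: K times more
# samples than a single time series of the same length.
X_in = traj[:, :-1].reshape(-1)
X_out = traj[:, 1:].reshape(-1)
```

The damped variability of `mean_series` is the numerical counterpart of the observability issue discussed above, while stacking all members multiplies the number of training pairs.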

10.3 Machine Learning for Stochastic Closure and Stochastic Parameterization

Closures and parameterizations are widely used tools for building simplified or reduced-order models. Since the main purpose of implementing these approximations is to improve computational efficiency, machine learning becomes a natural tool to substitute for many of the existing parametric forms. There is a large number of new research studies on such topics. For example, ensemble Kalman inversion has been applied to learn stochastic closures in dynamical systems [229]. A machine learning-based statistical closure model from short-time transient statistics has been developed for dynamical systems [210]. Machine learning has also been utilized to create energy-preserving nonlocal closures for turbulent fluid flows [46]. On the other hand, a systematic approach for data-driven subgrid-scale modeling and parameterization using deep learning was recently developed, and its generalization to higher Reynolds numbers can be carried out by transfer learning [242]. Machine learning for stochastic parameterization using generative adversarial networks has also been developed [97].

10.4 Statistical Forecast Using Machine Learning

Traditional machine learning methods aim at constructing complicated nonlinear maps, where both the input and the output are deterministic numbers or arrays. However, the chaotic and turbulent nature of many complex dynamical systems makes it very challenging for machine learning to build an accurate point-to-point map. In fact, the exact trajectories of the random noise in stochastic models are intrinsically unpredictable. Although these noise processes are idealized mathematical concepts, the intrinsic turbulent features due to the complicated multiscale nonlinear interactions still pose a great challenge for an accurate machine learning prediction. Such a difficulty is similar to that of utilizing stochastic models to predict trajectories, as was discussed in Chap. 6. Therefore, different from the direct machine learning predictions in many other disciplines, additional manipulations should be incorporated into machine learning techniques to forecast turbulent or stochastic systems.

10.4.1 Forecasting the Moment Equations with Neural Networks

Recall the discussions in Chaps. 1 and 4. Despite the randomness in the underlying systems, the associated governing equations for the time evolution of the statistics, namely the moment equations (or, more generally, the Fokker-Planck equation), are deterministic. Since the statistics average over different random realizations, the dynamical behavior of the moment equations is much less chaotic than that of the underlying SDEs. In addition, as discussed in Chap. 6, the ensemble forecast is more appropriate for turbulent systems. These facts indicate that machine learning can be appropriately incorporated into the prediction of the moment equations, which is then utilized to advance an ensemble forecast of the original state variables. While a machine learning forecast model can fully replace the parametric form of the moment equations, the following approach demonstrates a hybrid strategy that exploits machine learning only to predict certain complicated components of the moment equations, while the main components of the exact parametric structure of the moment equations are preserved [66]. The complicated parametric components of the moment equations are often computationally expensive to evaluate via direct numerical integration, and their expressions may contain certain approximations. Thus, data-driven approaches via machine learning are appropriate substitutes to improve computational efficiency and accuracy.

Consider the CGNS (8.1) again as an illustrative example and a general framework for developing such a hybrid strategy. Recall the algorithm in Sect. 8.4.1, which utilizes a small number of particles to advance the ensemble forecast via a Gaussian mixture. Now, machine learning is further incorporated into this framework. First, let us apply a decomposition of the state variable $\mathbf{Y} = (\mathbf{Y}_1, \mathbf{Y}_2)$, where $\mathbf{Y}_1$ contains the resolved sub-scale processes and $\mathbf{Y}_2$ represents the remaining unresolved ones. Correspondingly, the conditional mean of $\mathbf{Y}$ can be decomposed as $\boldsymbol{\mu}_f = (\boldsymbol{\mu}_{f,1}, \boldsymbol{\mu}_{f,2})$. The goal is to develop a statistical reduced-order model that forecasts a relatively lower-dimensional system containing only the state variables $\mathbf{X}$ and $\mathbf{Y}_1$. Note that


the dimension of $(\mathbf{X}, \mathbf{Y}_1)$ is still way beyond the range for which traditional numerical methods are capable of solving the associated Fokker-Planck equation.

One cheap way to obtain a small number of ensemble simulations of the state variable $\mathbf{X}$ is to build a closure model of the original governing equation of $\mathbf{X}$, which depends on the high-dimensional state variable $\mathbf{Y}$. The closure model retains only the explicit dependence on the state variable $\mathbf{X}$ and the conditional mean of $\mathbf{Y}_1$, while machine learning takes care of the residual part. The closure model reads:
$$
\begin{aligned}
\mathrm{d}\mathbf{X} &= (\mathbf{A}_0 + \mathbf{A}_1 \mathbf{Y})\,\mathrm{d}t + \mathbf{B}_1\,\mathrm{d}\mathbf{W}_1 \\
&= (\mathbf{A}_0 + \mathbf{A}_{1,1}\mathbf{Y}_1 + \mathbf{A}_{1,2}\mathbf{Y}_2)\,\mathrm{d}t + \mathbf{B}_1\,\mathrm{d}\mathbf{W}_1 \\
&= (\mathbf{A}_0 + \mathbf{A}_{1,1}\boldsymbol{\mu}_{f,1})\,\mathrm{d}t + \mathbf{B}_1\,\mathrm{d}\mathbf{W}_1 + \left[\mathbf{A}_{1,1}(\mathbf{Y}_1 - \boldsymbol{\mu}_{f,1}) + \mathbf{A}_{1,2}\mathbf{Y}_2\right]\mathrm{d}t \\
&=: (\mathbf{A}_0 + \mathbf{A}_{1,1}\boldsymbol{\mu}_{f,1})\,\mathrm{d}t + \mathbf{B}_1\,\mathrm{d}\mathbf{W}_1 + \mathbf{F}_X\,\mathrm{d}t.
\end{aligned}
\tag{10.1}
$$
The residual part $\mathbf{F}_X$ contains the fluctuation of the resolved small-scale modes and the unresolved modes, and it is effectively approximated by a recurrent neural network (RNN). The input of such an RNN contains a segment of its own past trajectory as well as those of $\mathbf{X}$ and $\boldsymbol{\mu}_{f,1}$, which makes the system in (10.1) closed. It reads:
$$
\mathbf{F}_X(t+1) = \mathrm{RNN}\left(\mathbf{X}(t-m:t),\ \boldsymbol{\mu}_{f,1}(t-m:t),\ \mathbf{F}_X(t-m:t)\right).
\tag{10.2}
$$

On the other hand, despite the closed analytic formulae, the evolution equations of $(\boldsymbol{\mu}_{f,1}, \mathbf{R}_{f,1})$ involve complicated nonlinear terms and are fully coupled with $(\boldsymbol{\mu}_{f,2}, \mathbf{R}_{f,2})$. To reduce the computational cost, the evolution equations of $(\boldsymbol{\mu}_{f,1}, \mathbf{R}_{f,1})$ in (8.8) are rewritten as
$$
\frac{\mathrm{d}\boldsymbol{\mu}_{f,1}}{\mathrm{d}t} = (\mathbf{a}_0 + \mathbf{a}_1 \boldsymbol{\mu}_{f,1}) + \mathbf{G}_Y (\mathbf{B}_1\mathbf{B}_1^*)^{-1} \mathbf{F}_Y,
\tag{10.3a}
$$
$$
\frac{\mathrm{d}\mathbf{R}_{f,1}}{\mathrm{d}t} = \mathbf{a}_1 \mathbf{R}_{f,1} + \mathbf{R}_{f,1}\mathbf{a}_1^* + \mathbf{b}_2\mathbf{b}_2^* - \mathbf{G}_Y (\mathbf{B}_1\mathbf{B}_1^*)^{-1} \mathbf{G}_Y^*.
\tag{10.3b}
$$
In (10.3), $\mathbf{F}_Y$ and $\mathbf{G}_Y$ are also approximated by RNNs to remove the explicit dependence on $(\boldsymbol{\mu}_{f,2}, \mathbf{R}_{f,2})$:
$$
\begin{aligned}
\mathbf{F}_Y(t+1) &= \mathrm{RNN}\left(\mathbf{X}(t-m:t),\ \boldsymbol{\mu}_{f,1}(t-m:t),\ \mathbf{F}_Y(t-m:t)\right), \\
\mathbf{G}_Y(t+1) &= \mathrm{RNN}\left(\mathbf{X}(t-m:t),\ \mathbf{R}_{f,1}(t-m:t),\ \mathbf{G}_Y(t-m:t)\right).
\end{aligned}
\tag{10.4}
$$

The above system (10.1)-(10.4) retains the explicit dynamics involving $\mathbf{X}$, $\boldsymbol{\mu}_{f,1}$, and $\mathbf{R}_{f,1}$, which naturally provides a statistical reduced-order model for $(\mathbf{X}, \mathbf{Y}_1)$ that can be efficiently computed with the help of the RNNs. If the interest lies in the entire system, then equations similar to (10.3) can be written down for $\boldsymbol{\mu}_{f,2}$ and $\mathbf{R}_{f,2}$, with the complicated nonlinear interactions again approximated by RNNs.

Since the RNNs in (10.4) are used to assist the calculation of the moment equations, it is not a good idea to use the standard path-wise mean square error as the loss function.


Instead, the relative entropy (2.14) is adopted as an information loss function that explicitly emphasizes minimizing the forecast error in terms of the PDF. Since the conditional distribution is Gaussian, the explicit formula in (2.15) can be utilized as the loss function for training these RNNs. It is also worth highlighting that training on and forecasting Gaussian PDFs are often much more tractable than directly predicting strongly non-Gaussian ones. The advantage of the method is that the conditional Gaussian mixture provides a unique way to forecast non-Gaussian statistics by accurately predicting each Gaussian component.

The final step is the initialization of the hybrid model, which contains three components. First, the initial ensembles of $\mathbf{Y}$ are drawn from the conditional Gaussian posterior distribution $\mathcal{N}(\boldsymbol{\mu}_1, \mathbf{R}_1)$ computed from the direct data assimilation (8.8). This also makes the entire initial distribution consistent with the traditional ensemble forecast. Second, the functions $\mathbf{F}_Y$ and $\mathbf{G}_Y$ depend on the past information of $\boldsymbol{\mu}_{f,1}$, $\mathbf{R}_{f,1}$, and $\mathbf{X}$. Therefore, the data assimilation scheme (8.8) starts from a certain time instant in the past, and the resulting time series of $\boldsymbol{\mu}_{f,1}$ and $\mathbf{R}_{f,1}$ from $T-m$ to the current time instant $T$ are utilized as the input of the RNNs for $\mathbf{F}_Y$ and $\mathbf{G}_Y$. On the other hand, the time series of $\mathbf{X}$ from $T-m$ to $T$ is available from observations. Third, the input of the neural network in (10.2) requires additional path-wise information of the unobserved trajectory $\mathbf{Y}$, which can be sampled utilizing the closed analytic formula in (8.11). The details of the entire procedure and its applications to geophysical flows can be found in [66].
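For a Gaussian conditional distribution, the information loss can be evaluated in closed form. The Python sketch below implements the standard relative entropy between two multivariate Gaussians (the closed formula that a Gaussian version of (2.15) evaluates; the test inputs are illustrative):

```python
import numpy as np

def gaussian_relative_entropy(mu_p, cov_p, mu_q, cov_q):
    """Relative entropy (KL divergence) P(mu_p, cov_p) || Q(mu_q, cov_q)
    between two multivariate Gaussians: a 'signal' term from the mean
    difference plus a 'dispersion' term from the covariance mismatch."""
    d = mu_p.size
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    signal = 0.5 * diff @ cov_q_inv @ diff
    ratio = cov_p @ cov_q_inv
    dispersion = 0.5 * (np.trace(ratio) - d - np.log(np.linalg.det(ratio)))
    return signal + dispersion

# Sanity checks: zero for identical Gaussians, positive otherwise.
mu = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
kl_same = gaussian_relative_entropy(mu, cov, mu, cov)
kl_diff = gaussian_relative_entropy(mu, cov, mu + 1.0, cov)
```

Because the formula is differentiable in the forecast mean and covariance, it can be used directly as a training loss for networks that output Gaussian statistics.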

10.4.2 Incorporating Additional Forecast Uncertainty into the Machine Learning Path-Wise Forecast Results

Another simple way of exploiting machine learning to predict complex turbulent systems is to add the forecast uncertainty on top of the point-wise forecast results. The machine learning forecast here focuses directly on the state variables instead of the associated moments. The role of the machine learning forecast with uncertainty quantification is to replace the stochastic or dynamical models in the ensemble forecast. For simplicity of discussion, a neural network (NN) is utilized here as the machine learning forecast model.

Denote by $x_t$ the forecast of the state variable $x$ at lead time $t$. The total forecast uncertainty contains two parts. The first part is the ensemble spread of the point-wise forecasts from the NN, denoted by $x_t^k$ with $k = 1, \ldots, K$ being the index of the ensemble members. The second and indispensable component is the intrinsic uncertainty associated with each ensemble member. The ensemble spread of the point-wise forecasts is the dominant source of uncertainty at short lead times, where the spread comes from the initialization via data assimilation. As the lead time increases, the point-wise forecast becomes less accurate due to the unpredictable nature of turbulent systems. The associated error contributes to the total uncertainty of the forecast. Because of this, the validation error obtained in the machine learning training period is utilized as the measurement of the forecast uncertainty associated with each ensemble member. Such a simple criterion is a natural choice for quantifying the


forecast uncertainty since it represents the residual part of the dynamics that cannot be well characterized and forecasted by the NN. Note that the validation error can be replaced by the training error, assuming the machine learning model is appropriate, i.e., neither overfitted nor underfitted. The total forecast is then represented by a non-Gaussian PDF. This non-Gaussian PDF is constructed as a mixture distribution, where each component is another non-Gaussian distribution associated with one forecast ensemble member. Each mixture component is given by adding a non-Gaussian distribution $\epsilon$ to the point-wise forecast value $x_t^k$:
$$
p(x_t^k) = x_t^k + \epsilon,
\tag{10.5}
$$
where $\epsilon$ is the distribution of the validation error of the machine learning forecast at lead time $t$. It is computed from the NN forecast at the same lead time,
$$
\epsilon \sim \mathrm{PDF\ of}\ \left\{ x_{n+t} - \mathrm{NN}(x_{n-m:n}; \theta^*),\ \text{for}\ m+1 \le n \le M \right\},
\tag{10.6}
$$
where $M$ is the total length of the validation period and $\theta^*$ contains the optimized parameters of the NN. See [53] for more details.

A connection can be built between utilizing the above machine learning forecast and the knowledge-based stochastic model in characterizing the forecast uncertainty. In fact, the above machine learning ensemble forecast algorithm can be regarded as a surrogate of the ensemble forecast using stochastic models but via a different approach to computing the forecast uncertainty. To see such a connection, consider the following mean-fluctuation decomposition (see Sect. 4.1.1) of the forecast value for the $k$th ensemble member,
$$
x_t^k = \bar{x}_t^k + (x_t^k - \bar{x}_t^k),
\tag{10.7}
$$

where the first term on the right-hand side is the ensemble mean and the second term is the fluctuation. The uncertainty is essentially given by the statistics, namely the PDF, of the fluctuation part. In the traditional forecast using physics-informed parametric models, each ensemble member $x_t^k$ is given directly by running the parametric models forward. The results of such an ensemble forecast can be directly used as $x_t^k$. However, in most machine learning forecasts, the machine learning model, i.e., the mapping, is deterministic. Thus, the forecast output is typically a deterministic value provided by a certain averaging process (i.e., the optimal mapping) inside the complicated machine learning architecture. The residual from such an averaging process is characterized by the validation error. The machine learning forecast outcome and the validation error for each sample are essentially the first and the second terms on the right-hand side of (10.7), respectively. In other words, the statistics of the validation error mimic those of the fluctuation of each ensemble member in the traditional ensemble forecast using physics-informed parametric models. If the training and testing data have the same features, then the residual in the testing period is expected to be similar to that in the validation period.
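The construction in (10.5)-(10.6) can be sketched numerically as follows (Python; the point forecasts and the validation-error samples are made up for illustration, not produced by an actual NN):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical ingredients: point forecasts x_t^k from K ensemble members of
# an ML model, plus validation errors collected at the same lead time.
point_forecasts = np.array([1.8, 2.1, 2.0, 2.3, 1.9])      # x_t^k
validation_errors = 0.3 * rng.standard_normal(400)          # samples of eps

# Total forecast PDF as in (10.5): each mixture component is a point
# forecast shifted by draws from the validation-error distribution.
samples = (point_forecasts[:, None]
           + rng.choice(validation_errors,
                        size=(point_forecasts.size, 1000)))
samples = samples.reshape(-1)

forecast_mean = samples.mean()
forecast_std = samples.std()
```

The spread of `samples` combines the ensemble spread of the point forecasts (the first term of (10.7)) with the validation-error statistics (the fluctuation term), which is exactly the two-part uncertainty budget described above.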

11 Instruction Manual for the MATLAB Codes
Chapter 1

Code: Two_State_Markov_Jump_Process.m. The code simulates the two-state Markov jump process discussed in Sects. 1.4.2-1.4.4 and computes the associated statistics, including the equilibrium distribution (1.31), the expectation (1.32), and the switching time (1.35).

Code: Lorenz_63_Model.m. The code simulates the Lorenz 63 model (1.39) with two initial conditions that differ slightly from each other. The results illustrate the chaotic behavior of such a system, as was discussed in Sect. 1.5.
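A Python analogue of the two-state simulation (not the book's MATLAB code; the switching rates below are illustrative) draws exponential holding times and checks the empirical occupation fractions against the equilibrium distribution of the master equation:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative two-state Markov jump process:
# nu_01 = rate of switching 0 -> 1, nu_10 = rate of switching 1 -> 0.
nu_01, nu_10 = 0.5, 1.5

# Gillespie-style simulation: the holding time in each state is exponential
# with the rate of leaving that state.
T, t, state = 20000.0, 0.0, 0
time_in_state = np.zeros(2)
while t < T:
    rate = nu_01 if state == 0 else nu_10
    tau = rng.exponential(1.0 / rate)        # holding time before the switch
    time_in_state[state] += min(tau, T - t)  # truncate the last interval at T
    t += tau
    state = 1 - state

empirical = time_in_state / time_in_state.sum()
# Equilibrium distribution: pi = (nu_10, nu_01) / (nu_01 + nu_10).
theoretical = np.array([nu_10, nu_01]) / (nu_01 + nu_10)
```

The mean holding times `1/nu_01` and `1/nu_10` are the switching times that the MATLAB code computes via (1.35).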

Chapter 2

Code: Max_Entropy_Principle_withMoments.m. Given the first two moments ([mean, variance]) or the first four moments ([mean, variance, skewness, kurtosis]) as the input, the code utilizes numerical optimization to compute the least biased PDF based on the maximum entropy principle. See (2.8) and (2.9) for the theoretical basis. The associated auxiliary m-files are objfun1.m, constraint1.m, objfun2.m, and constraint2.m.
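The two-moment case has a known answer: the least biased PDF with fixed mean and variance is Gaussian. A quick grid check in Python (not the book's optimization code) compares the differential entropy of a Gaussian with that of a uniform density with the same first two moments:

```python
import numpy as np

# With only mean and variance constrained, the maximum entropy principle
# yields a Gaussian; any other density with the same first two moments has
# lower differential entropy. Grid check on [-6, 6].
x = np.linspace(-6, 6, 4001)
dx = x[1] - x[0]

def entropy(p):
    p = np.maximum(p, 1e-300)       # avoid log(0) on the zero tail
    return -np.sum(p * np.log(p)) * dx

gaussian = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)   # mean 0, variance 1
# Uniform on [-sqrt(3), sqrt(3)] also has mean 0 and variance 1.
uniform = np.where(np.abs(x) <= np.sqrt(3), 1 / (2 * np.sqrt(3)), 0.0)

h_gauss, h_unif = entropy(gaussian), entropy(uniform)
```

For the standard Gaussian, the entropy equals $\tfrac{1}{2}\ln(2\pi e) \approx 1.419$, which the grid value reproduces closely.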

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-22249-8_11. The videos can be accessed individually by clicking the DOI link in the accompanying figure caption or by scanning this link with the SN More Media App.

11 Instruction Manual for the MATLAB Codes

Chapter 3

Code: Numerical_Integration.m. The code includes the three methods described in Sect. 3.1.2 for utilizing Monte Carlo simulations to compute numerical integrations. The example here is the same as that in Sect. 3.1.2: $\int_0^1 x^2\,\mathrm{d}x$.

Code: SDE_Solvers.m. The code includes a simple example of solving an SDE (the geometric Brownian motion) utilizing the Euler-Maruyama and Milstein schemes presented in Sect. 3.2.

Code: KDE_Comparison.m. The code includes the kernel density estimation for three distributions: (a) a Gaussian distribution, (b) a Gamma distribution with a one-sided fat tail, and (c) a bimodal distribution. The associated auxiliary m-file is kde.m. The code follows the results presented in Fig. 3.3.
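The first two codes have simple Python analogues (illustrative parameters; the book's MATLAB versions also include the Milstein scheme and two further integration methods): a Monte Carlo estimate of $\int_0^1 x^2\,\mathrm{d}x = 1/3$, and an Euler-Maruyama simulation of geometric Brownian motion checked against the exact mean $\mathbb{E}[X_T] = x_0 e^{\mu T}$:

```python
import numpy as np

rng = np.random.default_rng(4)

# Monte Carlo estimate of the integral of x^2 over [0, 1] (exact value 1/3).
mc_integral = (rng.uniform(0, 1, 100000) ** 2).mean()

# Euler-Maruyama for geometric Brownian motion dX = mu X dt + sigma X dW.
mu, sigma, x0 = 0.1, 0.2, 1.0
T, N, M = 1.0, 1000, 10000        # horizon, time steps, Monte Carlo samples
dt = T / N

X = np.full(M, x0)
for _ in range(N):
    dW = np.sqrt(dt) * rng.standard_normal(M)
    X = X + mu * X * dt + sigma * X * dW

mc_mean, exact_mean = X.mean(), x0 * np.exp(mu * T)
```

The sampling errors shrink like $1/\sqrt{M}$, which is the convergence behavior the MATLAB experiments illustrate.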

Chapter 4

Code: Moments_Linear_Gaussian_Models.m. The code shows the analytic solutions (4.4) and (4.7) for the time evolution of the moments associated with linear Gaussian systems. These analytic solutions are compared with a direct Monte Carlo simulation.

Codes: PDFs_Cubic_Model_1.m and PDFs_Cubic_Model_2.m. The codes show the PDFs of the cubic model (4.30). The two codes correspond to the expressions in (4.31) and (4.32), respectively, for the cases $A \neq 0$, $B \neq 0$ and $A = B = 0$. These codes reproduce the results in Figs. 4.3 and 4.4.

Code: qG_Closure_Cubic_Model.m. The code displays the time evolution of the first two moments associated with the cubic model (4.30). In addition to the Monte Carlo simulation, the quasi-Gaussian closure (4.36) and (4.39) and the bare truncation results are shown. The code reproduces the results in Fig. 4.5.

Chapter 5

Code: Kalman_Filter_1D.m. The code deals with a one-dimensional Kalman filtering problem for the linear Gaussian model (5.13)-(5.15). The code reproduces the results in Fig. 5.2.
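A compact Python analogue of the one-dimensional Kalman filter (the model coefficients below are illustrative, not those of (5.13)-(5.15)) shows the forecast-analysis cycle and verifies that the filter mean beats the raw observations:

```python
import numpy as np

rng = np.random.default_rng(5)

# Linear Gaussian model: x_{n+1} = a x_n + w_n, w_n ~ N(0, q);
# observation: y_n = x_n + v_n, v_n ~ N(0, r).
a, q, r = 0.95, 0.1, 0.5
x_true, m, P = 0.0, 0.0, 1.0
errors_filter, errors_obs = [], []

for _ in range(500):
    # Truth and noisy observation.
    x_true = a * x_true + np.sqrt(q) * rng.standard_normal()
    y = x_true + np.sqrt(r) * rng.standard_normal()
    # Forecast step: propagate mean and variance.
    m, P = a * m, a * a * P + q
    # Analysis step: Kalman gain and posterior update.
    K = P / (P + r)
    m, P = m + K * (y - m), (1 - K) * P
    errors_filter.append((m - x_true) ** 2)
    errors_obs.append((y - x_true) ** 2)

rmse_filter = np.sqrt(np.mean(errors_filter))
rmse_obs = np.sqrt(np.mean(errors_obs))
```

The posterior variance `P` converges to a steady state, so the gain `K` quickly becomes constant, which is the behavior the figure in the book illustrates for its own parameters.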


Code: ETKF_L63.m. The code utilizes the ETKF for the noisy Lorenz 63 model and includes both the partial observation and full observation cases. The associated auxiliary m-file is L63_TrueSignal.m, which needs to be run before running the ETKF_L63.m file.

Chapter 6

Code: Information_Predictability.m. The code shows the internal predictability and the model error in the ensemble forecast using an imperfect model, compared with the truth. Both the perfect and imperfect models are linear Gaussian models. The code reproduces the results in Fig. 6.2.

Codes: Response_Idealized.m and Response_Approximations.m. These codes compare the response computed from various methods following the SPEKF example in Sect. 6.4.3. Response_Idealized.m provides the idealized response based on the closed analytic formulae of the SPEKF model in solving the statistics. Response_Approximations.m shows the results of the qG FDT using the SPEKF model and the FDT based on the MSM. It also includes the 'idealized response' but uses a Monte Carlo simulation of the SPEKF model rather than the closed analytic formulae. The Monte Carlo simulation-based calculation suffers from the sampling error even for this two-dimensional system due to the appearance of the strongly non-Gaussian PDF with a one-sided fat tail. These codes reproduce the results in Fig. 6.3.

Code: Most_Sensitive_Direction.m. The code uses both information theory (Fisher information) and the direct searching method to find the most sensitive direction of the parameters based on the linear Gaussian model in Sect. 6.5.2. The code reproduces the results in Fig. 6.4.

Chapter 7

Code: Calibration_OU_Process.m. The code follows the procedure described in Sect. 7.2.1 to calibrate a complex OU process by matching the four statistics.

Code: Linear_Model_with_Multiplicative_Noise.m. Following the procedure described in Sect. 7.4, the code uses a one-dimensional observed time series to calibrate the linear model with multiplicative noise, where the observed time series comes from a cubic nonlinear model. The code reproduces the results in Fig. 7.2.
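A simplified real-valued analogue of the OU calibration (Python; the true parameters are made up, and only two of the matched statistics appear here) recovers the damping and noise coefficients of $\mathrm{d}x = -a\,x\,\mathrm{d}t + \sigma\,\mathrm{d}W$ from the equilibrium variance $\sigma^2/(2a)$ and the decorrelation time $1/a$:

```python
import numpy as np

rng = np.random.default_rng(6)

# Simulate a long OU trajectory with known parameters (Euler-Maruyama).
a_true, sigma_true, dt, N = 2.0, 1.0, 0.01, 200000
x = np.zeros(N)
for n in range(1, N):
    x[n] = x[n-1] - a_true * x[n-1] * dt \
           + sigma_true * np.sqrt(dt) * rng.standard_normal()

# Calibration by matching statistics:
# lag-1 autocorrelation ~ exp(-a dt) gives the damping a,
# equilibrium variance sigma^2 / (2a) then gives sigma.
var = x.var()
rho1 = np.corrcoef(x[:-1], x[1:])[0, 1]
a_est = -np.log(rho1) / dt
sigma_est = np.sqrt(2 * a_est * var)
```

The complex OU calibration of Sect. 7.2.1 works in the same spirit but additionally matches the mean and the cross-correlation, which fix the oscillation frequency and the forcing.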


Code: ENSO_Model.m. The code simulates the three-dimensional ENSO model with multiplicative noise described in Sect. 7.4.3. The code reproduces the model results in Fig. 7.3.

Code: Simulation_SPEKF.m. The code simulates the SPEKF model (7.17) and reproduces the results in Fig. 7.4.

Code: Filter_SPEKF.m. The code uses the SPEKF model for filtering intermittent time series, as was described in Sect. 7.5.3. The system that generates the true signal involves a two-state Markov jump process for triggering the intermittency. The SPEKF model is utilized as a forecast model for filtering. For comparison, the filtering solution using the MSM is also included. The code reproduces the results in Fig. 7.5.

Chapter 8

Code: CGNS_Filtering_Smoothing_Sampling.m. The code exploits the closed analytic formulae of the CGNS for filtering, smoothing, and conditional sampling of the unobserved state variable in a nonlinear dyad model. See Sect. 8.2.4 for the details. The code reproduces the results in Fig. 8.1.

Code: Lagrangian_DA_Different_L.m. The code shows the Lagrangian data assimilation results with different numbers of tracers L. It reproduces the results in Fig. 8.2. There are two auxiliary codes: Shallow_Water_Equation.m and Lagrangian_DA.m. The code Shallow_Water_Equation.m needs to be run first to generate the underlying ocean flow field from the stochastic version of the linear rotating shallow water equation. The code Lagrangian_DA.m deals with the Lagrangian data assimilation for a given L. See Sect. 8.3 for the details of Lagrangian data assimilation.

Code: Fokker_Planck_L63.m. The code utilizes the hybrid algorithm described in Sect. 8.4.1, based on the conditional Gaussian mixture, to solve the Fokker-Planck equation of the noisy L63 model. The code can show both the transient and the statistical equilibrium solutions. Two auxiliary m-files are kde.m and kde2d.m. The code reproduces the results in Fig. 8.3.

Chapter 9

Code: MCMC_Sampling_Distribution.m. The code utilizes the Metropolis algorithm to sample a one-dimensional target distribution. It reproduces the results in Fig. 9.1.
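A Python sketch of the Metropolis algorithm (the bimodal target below is illustrative, not the distribution in Fig. 9.1) uses a symmetric random-walk proposal, so the acceptance probability reduces to the ratio of target densities:

```python
import numpy as np

rng = np.random.default_rng(7)

# Unnormalized one-dimensional bimodal target density.
def target(x):
    return np.exp(-(x - 2) ** 2 / 2) + np.exp(-(x + 2) ** 2 / 2)

x, samples = 0.0, []
for _ in range(50000):
    proposal = x + rng.normal(0, 1.0)            # symmetric random walk
    # Metropolis acceptance: accept with probability min(1, ratio).
    if rng.uniform() < target(proposal) / target(x):
        x = proposal
    samples.append(x)

samples = np.array(samples[5000:])               # discard burn-in
sample_mean, sample_var = samples.mean(), samples.var()
```

For this equal-weight mixture of $\mathcal{N}(\pm 2, 1)$, the mean is 0 and the variance is 5, which the chain reproduces up to sampling error.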


Code: Adaptive_MCMC_Parameter_Estimation.m. The code applies an adaptive MCMC to estimate parameters in the cubic nonlinear model (9.2). See Sect. 9.1.3. The code reproduces the results in Fig. 9.2.

Code: EM_Parameter_Estimation.m. The code utilizes the expectation-maximization (EM) algorithm for estimating parameters in turbulent systems with partial observations. The test example is the noisy Lorenz 84 model, where a short time series of y and z is adopted as the observations while x is unobserved. Two dynamical regimes with small and large noises are considered. See Sect. 9.2.4. The code reproduces the results in Fig. 9.3.

Codes: DA_Parameter_Estimation_1.m and DA_Parameter_Estimation_2.m. These codes exploit the data assimilation methods for parameter estimation. DA_Parameter_Estimation_1.m corresponds to the results utilizing the direct method, while DA_Parameter_Estimation_2.m includes the approach with the stochastic parameterized equations. See Sect. 9.3.5 for the details. The codes reproduce the results in Fig. 9.4 and contain the results for other parameters as well.

Code: Sparse_Identification_Causation_Entropy.m. The code utilizes causation entropy to determine the model structure and then adopts a maximum likelihood estimator for parameter estimation. It reproduces the results in Sect. 9.4.2.

References

1. Abhilash S, Sahai AK, Pattnaik S, Goswami BN, Kumar A (2014) Extended range prediction of active-break spells of Indian summer monsoon rainfall using an ensemble prediction system in NCEP climate forecast system. Int J Climatol 34(1):98–113
2. Abramov RV, Majda AJ (2007) Blended response algorithms for linear fluctuation-dissipation for complex nonlinear dynamical systems. Nonlinearity 20(12):2793
3. Abramowitz M (1965) Handbook of mathematical functions with formulas, graphs, and mathematical tables
4. Adrian RJ, Christensen KT, Liu Z-C (2000) Analysis and interpretation of instantaneous turbulent velocity fields. Exp Fluids 29(3):275–290
5. Albers DJ, Levine ME, Stuart A, Mamykina L, Gluckman B, Hripcsak G (2018) Mechanistic machine learning: how data assimilation leverages physiologic knowledge using Bayesian inference to forecast the future, infer the present, and phenotype. J Am Med Inf Assoc 25(10):1392–1401
6. Allan R, Lindesay J, Parker D, et al. El Niño southern oscillation and climatic variability. CSIRO Publishing
7. Amezcua J, Ide K, Kalnay E, Reich S (2014) Ensemble transform Kalman-Bucy filters. Q J Royal Meteorol Soc 140(680):995–1004
8. Anderson JL (2001) An ensemble adjustment Kalman filter for data assimilation. Mon Weather Rev 129(12):2884–2903
9. Anderson JL (2007) Exploring the need for localization in ensemble data assimilation using a hierarchical ensemble filter. Phys D: Nonlinear Phenomena 230(1–2):99–111
10. Anderson JL (2012) Localization and sampling error correction in ensemble Kalman filter data assimilation. Mon Weather Rev 140(7):2359–2371
11. Anderson JL, Anderson SL (1999) A Monte Carlo implementation of the nonlinear filtering problem to produce ensemble assimilations and forecasts. Mon Weather Rev 127(12):2741–2758
12. Andrieu C, De Freitas N, Doucet A, Jordan MI (2003) An introduction to MCMC for machine learning. Mach Learn 50(1):5–43
13. Andrieu C, Thoms J (2008) A tutorial on adaptive MCMC. Stat Comput 18(4):343–373
14. Apte A, Jones CKRT, Stuart AM, Voss J (2008) Data assimilation: Mathematical and statistical perspectives. Int J Numer Methods Fluids 56(8):1033–1046



15. Apte A, Jones CKRT (2013) The impact of nonlinearity in Lagrangian data assimilation. Nonlinear Proc Geophys 20(3):329–341
16. Apte A, Jones CKRT, Stuart AM (2008) A Bayesian approach to Lagrangian data assimilation. Tellus A: Dyn Meteorol Oceanogr 60(2):336–347
17. Arbabi H, Sapsis T (2022) Generative stochastic modeling of strongly nonlinear flows with non-Gaussian statistics. SIAM/ASA J Uncertain Quan 10(2):555–583
18. Averina TA, Artemiev SS (1988) Numerical solution of systems of stochastic differential equations. Russian J Numerical Anal Math Model 3(4):267–286
19. Barker DM, Huang W, Guo Y-R, Bourgeois AJ, Xiao QN (2004) A three-dimensional variational data assimilation system for MM5: Implementation and initial results. Mon Weather Rev 132(4):897–914
20. Behringer DW, Ji M, Leetmaa A (1998) An improved coupled model for ENSO prediction and implications for ocean initialization. Part I: The ocean data assimilation system. Mon Weather Rev 126(4):1013–1021
21. Bengtsson T, Bickel P, Li B et al (2008) Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. Probab Stat: Essays Honor David A. Freedman 2:316–334
22. Bensoussan A (1992) Stochastic control of partially observable systems. Cambridge University Press
23. Bergemann K, Reich S (2012) An ensemble Kalman-Bucy filter for continuous data assimilation. Meteorol Zeitschrift 21(3):213
24. Berner J, Achatz U, Batte L, Bengtsson L, De La Camara A, Christensen HM, Colangeli M, Coleman DRB, Crommelin D, Dolaptchiev SI et al (2017) Stochastic parameterization: Toward a new view of weather and climate models. Bull Am Meteorol Soc 98(3):565–588
25. Beskos A, Roberts G, Stuart A, Voss J (2008) MCMC methods for diffusion bridges. Stochast Dyn 8(03):319–350
26. Beskos A, Roberts GO (2005) Exact simulation of diffusions. Ann Appl Probab 15(4):2422–2444
27. Bickel P, Li B, Bengtsson T et al (2008) Sharp failure rates for the bootstrap particle filter in high dimensions. Push Limits Contemp Stat: Contrib Honor Jayanta K. Ghosh 3:318–329
28. Bishop CH, Etherton BJ, Majumdar SJ (2001) Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects. Mon Weather Rev 129(3):420–436
29. Blunden J, Arndt DS (2019) A look at 2018: Takeaway points from the state of the climate supplement. Bull Am Meteorol Soc 100(9):1625–1636
30. Bocquet M, Brajard J, Carrassi A, Bertino L (2020) Bayesian inference of chaotic dynamics by merging data assimilation, machine learning and expectation-maximization. arXiv preprint arXiv:2001.06270
31. Bocquet M, Sakov P (2014) An iterative ensemble Kalman smoother. Q J Royal Meteorol Soc 140(682):1521–1535
32. Botev ZI, Grotowski JF, Kroese DP (2010) Kernel density estimation via diffusion. Ann Stat 38(5):2916–2957
33. Brajard J, Carrassi A, Bocquet M, Bertino L (2020) Combining data assimilation and machine learning to emulate a dynamical model from sparse and noisy observations: A case study with the Lorenz 96 model. J Computat Sci 44:101171
34. Branicki M, Chen N, Majda AJ (2013) Non-Gaussian test models for prediction and state estimation with model errors. Chin Ann Math Ser B 34(1):29–64
35. Branicki M, Majda AJ (2012) Quantifying uncertainty for predictions with model error in non-Gaussian systems with intermittency. Nonlinearity 25(9):2543


36. Branicki M, Majda AJ (2013) Dynamic stochastic superresolution of sparsely observed turbulent systems. J Comput Phys 241:333–363 37. Branicki M, Majda AJ (2014) Quantifying Bayesian filter performance for turbulent dynamical systems through information theory. Commun Math Sci 12(5):901–978 38. Branicki M, Majda AJ, Law KJH (2018) Accuracy of some approximate Gaussian filters for the Navier-Stokes equation in the presence of model error. Multiscale Model Simul 16(4):1756– 1794 39. Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc Nat Acad Sci 113(15):3932–3937 40. Bühlmann P, Van De Geer, S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer Science & Business Media 41. Buizza C, Casas CQ, Nadler P, Mack J, Marrone S, Titus Z, Le Cornec C, Heylen E, Dur T, Ruiz LB et al (2022) Data learning: Integrating data assimilation and machine learning. J Comput Sci 58:101525 42. Burgers G, Leeuwen PJV, Evensen G (1998) Analysis scheme in the ensemble Kalman filter. Mon Weather Rev 126(6):1719–1724 43. Campbell WF, Bishop CH, Hodyss D (2010) Vertical covariance localization for satellite radiances in ensemble Kalman filters. Mon Weather Rev 138(1):282–290 44. Castaing B, Gunaratne G, Heslot F, Kadanoff L, Libchaber A, Thomae S, Xiao-Zhong W, Zaleski S, Zanetti G (1989) Scaling of hard thermal turbulence in Rayleigh-Bénard convection. J Fluid Mech 204:1–30 45. Castellari S, Griffa A, Özgökmen M, T, Poulain P-M, (2001) Prediction of particle trajectories in the Adriatic sea using Lagrangian data assimilation. J Marine Syst 29(1–4):33–50 46. Charalampopoulos A-TG, Sapsis TP (2022) Machine-learning energy-preserving nonlocal closures for turbulent fluid flows and inertial tracers. Phys Rev Fluids 7(2):024305 47. 
Chattopadhyay A, Nabizadeh E, Bach E, Hassanzadeh P (2022) Deep learning-enhanced ensemble-based data assimilation for high-dimensional nonlinear dynamical systems. arXiv preprint arXiv:2206.04811 48. Chekroun MD, Liu H, McWilliams JC (2021) Stochastic rectification of fast oscillations on slow manifold closures. Proc Nat Acad Sci 118(48):e2113650118 49. Chen N (2020) Learning nonlinear turbulent dynamics from partial observations via analytically solvable conditional statistics. J Comput Phys 418:109635 50. Chen N, Fang X, Yu J-Y (2022) A multiscale model for El Niño complexity. npj Clim Atmos Sci 5(1):1–13 51. Chen N, Fu S, Manucharyan GE (2022) An efficient and statistically accurate Lagrangian data assimilation algorithm with applications to discrete element sea ice models. J Comput Phys 455:111000 52. Chen N, Gilani F, Harlim J (2021) A Bayesian machine learning algorithm for predicting ENSO using short observational time series. Geophys Res Lett 48(17):e2021GL093704 53. Chen N, Li Y (2021) BAMCAFE: A Bayesian machine learning advanced forecast ensemble method for complex turbulent systems with partial observations. Chaos: An Interdiscip J Nonlinear Sci 31(11):113114 54. Chen N, Li Y, Liu H (2022) Conditional Gaussian nonlinear system: A fast preconditioner and a cheap surrogate model for complex nonlinear systems. Chaos: An Interdiscip J Nonlinear Sci 32(5):053122 55. Chen N, Majda AJ (2015) Predicting the cloud patterns for the boreal summer intraseasonal oscillation through a low-order stochastic model. Math Clim Weather Forecast 1(1) 56. Chen N, Majda AJ (2016) Filtering the stochastic skeleton model for the Madden-Julian oscillation. Mon Weather Rev 144(2):501–527


57. Chen N, Majda AJ (2016) Model error in filtering random compressible flows utilizing noisy Lagrangian tracers. Mon Weather Rev 144(11):4037–4061 58. Chen N, Majda AJ (2017) Beating the curse of dimension with accurate statistics for the Fokker-Planck equation in complex turbulent systems. Proc Nat Acad Sci 114(49):12864–12869 59. Chen N, Majda AJ (2018) Conditional Gaussian systems for multiscale nonlinear stochastic systems: Prediction, state estimation and uncertainty quantification. Entropy 20(7):509 60. Chen N, Majda AJ (2018) Efficient statistically accurate algorithms for the Fokker-Planck equation in large dimensions. J Comput Phys 354:242–268 61. Chen N, Majda AJ, Giannakis D (2014) Predicting the cloud patterns of the Madden-Julian oscillation through a low-order nonlinear stochastic model. Geophys Res Lett 41(15):5612–5619 62. Chen N, Majda AJ, Sabeerali CT, Ajayamohan RS (2018) Predicting monsoon intraseasonal precipitation using a low-order nonlinear stochastic model. J Clim 31(11):4403–4427 63. Chen N, Majda AJ, Tong XT (2014) Information barriers for noisy Lagrangian tracers in filtering random incompressible flows. Nonlinearity 27(9):2133 64. Chen N, Majda AJ, Tong XT (2015) Noisy Lagrangian tracers for filtering random rotating compressible flows. J Nonlinear Sci 25(3):451–488 65. Chen N, Majda AJ, Tong XT (2018) Rigorous analysis for efficient statistically accurate algorithms for solving Fokker-Planck equations in large dimensions. SIAM/ASA J Uncertainty Quan 6(3):1198–1223 66. Chen N, Qi D (2022) A physics-informed data-driven algorithm for ensemble forecast of complex turbulent systems. arXiv preprint arXiv:2204.08547 67. Chen Y, Gu Y, Hero AO (2009) Sparse LMS for system identification. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp 3125–3128 68. Chib S, Greenberg E (1995) Understanding the Metropolis-Hastings algorithm. Am Stat 49(4):327–335 69.
Chorin AJ, Krause P (2004) Dimensional reduction for a Bayesian filter. Proc Nat Acad Sci 101(42):15013–15017 70. Chorin AJ, Tu X (2009) Implicit sampling for particle filters. Proc Nat Acad Sci 106(41):17249–17254 71. Christensen HM, Berner J, Coleman DRB, Palmer TN (2017) Stochastic parameterization and El Niño-Southern Oscillation. J Clim 30(1):17–38 72. Çınlar E (2011) Probability and stochastics, vol 261. Springer 73. Cotter SL, Roberts GO, Stuart AM, White D (2013) MCMC methods for functions: modifying old algorithms to make them faster. Stat Sci 28(3):424–446 74. Courtier P, Thépaut J-N, Hollingsworth A (1994) A strategy for operational implementation of 4D-Var, using an incremental approach. Q J Royal Meteorol Soc 120(519):1367–1387 75. Cousins W, Sapsis TP (2016) Reduced-order precursors of rare events in unidirectional nonlinear water waves. J Fluid Mech 790:368–388 76. Cover TM (1999) Elements of information theory. John Wiley & Sons 77. Covington J, Chen N, Wilhelmus MM (2022) Bridging gaps in the climate observation network: A physics-based nonlinear dynamical interpolation of Lagrangian ice floe measurements via data-driven stochastic models. J Adv Model Earth Syst 14(9):e2022MS003218 78. Crommelin DT, Majda AJ (2004) Strategies for model reduction: Comparing different optimal bases. J Atmos Sci 61(17):2206–2217 79. De Wiljes J, Reich S, Stannat W (2018) Long-time stability and accuracy of the ensemble Kalman-Bucy filter for fully observed processes and small measurement noise. SIAM J Appl Dyn Syst 17(2):1152–1181


80. Del Moral P (1997) Nonlinear filtering: Interacting particle resolution. Comptes Rendus de l’Académie des Sciences-Series I-Math 325(6):653–658 81. Dembo A, Zeitouni O (1986) Parameter estimation of partially observed continuous time stochastic processes via the EM algorithm. Stoch Proces Appl 23(1):91–113 82. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc: Ser B (Methodol) 39(1):1–22 83. Doucet A, Freitas ND, Gordon NJ et al (2001) Sequential Monte Carlo methods in practice, vol 1. Springer 84. Doucet A, Johansen AM et al (2009) A tutorial on particle filtering and smoothing: Fifteen years later. Handbook Nonlinear Filter 12(656–704):3 85. Elerian O, Chib S, Shephard N (2001) Likelihood inference for discretely observed nonlinear diffusions. Econometrica 69(4):959–993 86. Elinger J (2020) Information theoretic causality measures for parameter estimation and system identification. PhD Thesis, Georgia Institute of Technology 87. Elinger J, Rogers J (2021) Causation entropy method for covariate selection in dynamic models. In: 2021 American Control Conference (ACC). IEEE, pp 2842–2847 88. Epanechnikov VA (1969) Non-parametric estimation of a multivariate probability density. Theory Prob Appl 14(1):153–158 89. Evensen G (2003) The ensemble Kalman filter: Theoretical formulation and practical implementation. Ocean Dyn 53(4):343–367 90. Evensen G et al (2009) Data assimilation: the ensemble Kalman filter, vol 2. Springer 91. Evensen G, Leeuwen PJV (2000) An ensemble Kalman smoother for nonlinear dynamics. Mon Weather Rev 128(6):1852–1867 92. Farchi A, Laloyaux P, Bonavita M, Bocquet M (2021) Using machine learning to correct model error in data assimilation and forecast applications. Q J Royal Meteorol Soc 147(739):3067–3084 93. Fertig EJ, Hunt BR, Ott E, Szunyogh I (2007) Assimilating non-local observations with a local ensemble Kalman filter.
Tellus A: Dyn Meteorol Oceanogr 59(5):719–730 94. Franzke C, Majda AJ, Vanden-Eijnden E (2005) Low-order stochastic mode reduction for a realistic barotropic model climate. J Atmos Sci 62(6):1722–1745 95. Friedman JH (1997) On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Min Knowl Disc 1(1):55–77 96. Gadgil S et al (2003) The Indian monsoon and its variability. Ann Rev Earth Planet Sci 31(1):429–467 97. Gagne DJ, Christensen HM, Subramanian AC, Monahan AH (2020) Machine learning for stochastic parameterization: Generative adversarial networks in the Lorenz ’96 model. J Adv Model Earth Syst 12(3):e2019MS001896 98. Gal Y, Ghahramani Z (2016) A theoretically grounded application of dropout in recurrent neural networks. Adv Neural Inf Process Syst 29 99. Gardiner C (2009) Stochastic methods, vol 4. Springer, Berlin 100. Gardiner CW et al (1985) Handbook of stochastic methods, vol 3. Springer, Berlin 101. Gelman A, Gilks WR, Roberts GO (1997) Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann Appl Probab 7(1):110–120 102. Gershgorin B, Harlim J, Majda AJ (2010) Test models for improving filtering with model errors through stochastic parameter estimation. J Comput Phys 229(1):1–31 103. Gershgorin B, Majda AJ (2010) A test model for fluctuation-dissipation theorems with time-periodic statistics. Phys D: Nonlinear Phenom 239(17):1741–1757


104. Giannakis D, Majda AJ (2012) Comparing low-frequency and intermittent variability in comprehensive climate models through nonlinear Laplacian spectral analysis. Geophys Res Lett 39(10) 105. Giannakis D, Majda AJ (2012) Nonlinear Laplacian spectral analysis for time series with intermittency and low-frequency variability. Proc Nat Acad Sci 109(7):2222–2227 106. Giannakis D, Majda AJ, Horenko I (2012) Information theory, model error, and predictive skill of stochastic models for complex nonlinear systems. Phys D: Nonlinear Phenom 241(20):1735–1752 107. Giorgini LT, Moon W, Chen N, Wettlaufer JS (2022) Non-Gaussian stochastic dynamical model for the El Niño Southern Oscillation. Phys Rev Res 4(2):L022065 108. Gleiter T, Janjić T, Chen N (2022) Ensemble Kalman filter based data assimilation for tropical waves in the MJO skeleton model. Q J Royal Meteorol Soc 148(743):1035–1056 109. Gnedenko BV, Ushakov IA (2018) Theory of probability. Routledge 110. Golightly A, Wilkinson DJ (2005) Bayesian inference for stochastic kinetic models using a diffusion approximation. Biometrics 61(3):781–788 111. Golightly A, Wilkinson DJ (2008) Bayesian inference for nonlinear multivariate diffusion models observed with error. Comput Stat Data Anal 52(3):1674–1693 112. Goswami BN, Ajayamohan RS, Xavier PK, Sengupta D (2003) Clustering of synoptic activity by Indian summer monsoon intraseasonal oscillations. Geophys Res Lett 30(8) 113. Goswami BN, Ajayamohan RS (2001) Intraseasonal oscillations and interannual variability of the Indian summer monsoon. J Clim 14(6):1180–1198 114. Gottwald GA, Harlim J (2013) The role of additive and multiplicative noise in filtering complex dynamical systems. Proc Royal Soc A: Math Phys Eng Sci 469(2155):20130096 115. Gottwald GA, Reich S (2021) Combining machine learning and data assimilation to forecast dynamical systems from noisy partial observations. Chaos: An Interdiscip J Nonlinear Sci 31(10):101103 116.
Gottwald GA, Reich S (2021) Supervised learning from noisy observations: Combining machine-learning techniques with data assimilation. Phys D: Nonlinear Phenom 423:132911 117. Grinstead CM, Snell JL (1997) Introduction to probability. American Mathematical Soc 118. Gritsun A, Branstator G (2007) Climate response using a three-dimensional operator based on the fluctuation-dissipation theorem. J Atmos Sci 64(7):2558–2575 119. Grooms I (2021) Analog ensemble data assimilation and a method for constructing analogs with variational autoencoders. Q J Royal Meteorol Soc 147(734):139–149 120. Grooms IG, Majda AJ (2014) Stochastic superparameterization in a one-dimensional model for wave turbulence. Commun Math Sci 12(3):509–525 121. Gut A, Gut A (2005) Probability: a graduate course, vol 200. Springer 122. Harlim J, Majda AJ (2008) Filtering nonlinear dynamical systems with linear stochastic models. Nonlinearity 21(6):1281 123. Harlim J, Mahdi A, Majda AJ (2014) An ensemble Kalman filter for statistical estimation of physics constrained nonlinear regression models. J Comput Phys 257:782–812 124. Hastie T, Tibshirani R, Friedman JH, Friedman JH (2009) The elements of statistical learning: data mining, inference, and prediction, vol 2. Springer 125. Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97 126. Hempel H, Schimansky-Geier L, García-Ojalvo J (1999) Noise-sustained pulsating patterns and global oscillations in subexcitable media. Phys Rev Lett 82(18):3713 127. Hendon HH, Wheeler MC, Zhang C (2007) Seasonal dependence of the MJO-ENSO relationship. J Clim 20(3):531–543


128. Honnorat M, Monnier J, Le Dimet F-X (2009) Lagrangian data assimilation for river hydraulics simulations. Comput Vis Sci 12(5):235–246 129. Houtekamer PL, Mitchell HL (1998) Data assimilation using an ensemble Kalman filter technique. Mon Weather Rev 126(3):796–811 130. Houtekamer PL, Zhang F (2016) Review of the ensemble Kalman filter for atmospheric data assimilation. Mon Weather Rev 144(12):4489–4532 131. Hu B, Zhou C (2000) Phase synchronization in coupled nonidentical excitable systems and array-enhanced coherence resonance. Phys Rev E 61(2):R1001 132. Huang GP, Mourikis AI, Roumeliotis SI (2008) Analysis and improvement of the consistency of extended Kalman filter based SLAM. In: 2008 IEEE International Conference on Robotics and Automation. IEEE, pp 473–479 133. Huffman GJ, Adler RF, Morrissey MM, Bolvin DT, Curtis S, Joyce R, McGavock B, Susskind J (2001) Global precipitation at one-degree daily resolution from multisatellite observations. J Hydrometeorol 2(1):36–50 134. Hunt BR, Kostelich EJ, Szunyogh I (2007) Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Phys D: Nonlinear Phenom 230(1–2):112–126 135. Hyndman RJ, Koehler AB (2006) Another look at measures of forecast accuracy. Int J Forecast 22(4):679–688 136. Ide K, Kuznetsov L, Jones CKRT (2002) Lagrangian data assimilation for point vortex systems. J Turbul 3(1):053 137. Janjić T, McLaughlin D, Cohn SE, Verlaan M (2014) Conservation of mass and preservation of positivity with ensemble-type Kalman filter algorithms. Mon Weather Rev 142(2):755–773 138. Jin F-F (1997) An equatorial ocean recharge paradigm for ENSO. Part I: Conceptual model. J Atmos Sci 54(7):811–829 139. Julier SJ, Uhlmann JK (2004) Unscented filtering and nonlinear estimation. Proc IEEE 92(3):401–422 140. Jung P, Cornell-Bell A, Madden KS, Moss F (1998) Noise-induced spiral waves in astrocyte syncytia show evidence of self-organized criticality.
J Neurophysiol 79(2):1098–1101 141. Kalman RE (1963) Mathematical description of linear dynamical systems. J So Indus Appl Math Ser A: Control 1(2):152–192 142. Kalman RE, Bucy RS (1961) New results in linear filtering and prediction theory. J Fluids Eng 83:95–108 143. Kalman RE (1960) A new approach to linear filtering and prediction problems. J Basic Eng 82:35–45 144. Kalnay E (2003) Atmospheric modeling, data assimilation and predictability. Cambridge University Press 145. Kalnay E, Kanamitsu M, Kistler R, Collins W, Deaven D, Gandin L, Iredell M, Saha S, White G, Woollen J et al (1996) The NCEP/NCAR 40-year reanalysis project. Bull Am Meteorol Soc 77(3):437–472 146. Karatzas I, Shreve S (2012) Brownian motion and stochastic calculus, vol 113. Springer Science & Business Media 147. Katsoulakis MA, Majda AJ, Vlachos DG (2003) Coarse-grained stochastic processes and Monte Carlo simulations in lattice systems. J Comput Phys 186(1):250–278 148. Khouider B, Biello J, Majda AJ (2010) A stochastic multicloud model for tropical convection. Commun Math Sci 8(1):187–216 149. Kikuchi K, Wang B, Kajikawa Y (2012) Bimodal representation of the tropical intraseasonal oscillation. Clim Dyn 38(9):1989–2000 150. Kleeman R (2011) Information theory and dynamical system predictability. Entropy 13(3):612–649


151. Kloeden PE, Platen E (1992) Stochastic differential equations. In: Numerical solution of stochastic differential equations. Springer, pp 103–160 152. Kokkala J, Solin A, Särkkä S (2014) Expectation maximization based parameter estimation by sigma-point and particle smoothing. In: 17th International Conference on Information Fusion (FUSION). IEEE, pp 1–8 153. Kullback S (1997) Information theory and statistics. Courier Corporation 154. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86 155. Khattatov B, Lahoz W, Menard R (2010) Data assimilation. Springer 156. Lampinen J, Vehtari A (2001) Bayesian approach for neural networks – review and case studies. Neural Networks 14(3):257–274 157. Lau K-M, Li P, Nakazawa T (1989) Dynamics of super cloud clusters, westerly wind bursts, 30–60 day oscillations and ENSO: A unified view. J Meteorol Soc Japan Ser II 67(2):205–219 158. Lau WK-M, Waliser DE (2011) Intraseasonal variability in the atmosphere-ocean climate system. Springer Science & Business Media 159. Law K, Stuart A, Zygalakis K (2015) Data assimilation. Springer, Cham, p 214 160. Leith CE (1975) Climate response and fluctuation dissipation. J Atmos Sci 32(10):2022–2026 161. Leutbecher M, Palmer TN (2008) Ensemble forecasting. J Comput Phys 227(7):3515–3539 162. Lindner B, García-Ojalvo J, Neiman A, Schimansky-Geier L (2004) Effects of noise in excitable systems. Phys Rep 392(6):321–424 163. Lindner B, Schimansky-Geier L (2000) Coherence and stochastic resonance in a two-state system. Phys Rev E 61(6):6103 164. Liptser RS, Shiryaev AN (2013) Statistics of random processes II: Applications, vol 6. Springer Science & Business Media 165. Liu JS, Chen R (1998) Sequential Monte Carlo methods for dynamic systems. J Am Stat Assoc 93(443):1032–1044 166. Longtin A (1993) Stochastic resonance in neuron models. J Stat Phys 70(1):309–327 167. Lorenc AC (1986) Analysis methods for numerical weather prediction.
Q J Royal Meteorol Soc 112(474):1177–1194 168. Lorenc AC, Rawlins F (2005) Why does 4D-Var beat 3D-Var? Q J Royal Meteorol Soc: A J Atmos Sci Appl Meteorol Phys Oceanogr 131(613):3247–3257 169. Lorenz EN (1963) Deterministic nonperiodic flow. J Atmos Sci 20(2):130–141 170. Lorenz EN (1980) Attractor sets and quasi-geostrophic equilibrium. J Atmos Sci 37(8):1685–1699 171. Lorenz EN (1984) Formulation of a low-order model of a moist general circulation. J Atmos Sci 41(12):1933–1945 172. Lorenz EN (1984) Irregularity: A fundamental property of the atmosphere. Tellus A 36(2):98–110 173. MacKay DJC (1995) Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. Network: Comput Neural Syst 6(3):469 174. Majda A (2003) Introduction to PDEs and waves for the atmosphere and ocean, vol 9. American Mathematical Soc 175. Majda A, Abramov RV, Grote MJ (2005) Information theory and stochastics for multiscale nonlinear systems, vol 25. American Mathematical Soc 176. Majda A, Wang X (2006) Nonlinear dynamics and statistical theories for basic geophysical flows. Cambridge University Press 177. Majda AJ, Chen N (2018) Model error, information barriers, state estimation and prediction in complex multiscale systems. Entropy 20(9):644 178. Majda AJ, Franzke C, Crommelin D (2009) Normal forms for reduced stochastic climate models. Proc Nat Acad Sci 106(10):3649–3653


179. Majda AJ, Franzke C, Khouider B (2008) An applied mathematics perspective on stochastic modelling for climate. Philos Trans Royal Soc A: Math Phys Eng Sci 366(1875):2427–2453 180. Majda AJ, Gershgorin B, Yuan Y (2010) Low-frequency climate response and fluctuation-dissipation theorems: Theory and practice. J Atmos Sci 67(4):1186–1201 181. Majda AJ, Grote MJ (1997) Model dynamics and vertical collapse in decaying strongly stratified flows. Phys Fluids 9(10):2932–2940 182. Majda AJ, Grote MJ (2007) Explicit off-line criteria for stable accurate time filtering of strongly unstable spatially extended systems. Proc Nat Acad Sci 104(4):1124–1129 183. Majda AJ, Harlim J (2012) Filtering complex turbulent systems. Cambridge University Press 184. Majda AJ, Harlim J (2012) Physics constrained nonlinear regression models for time series. Nonlinearity 26(1):201 185. Majda AJ, Timofeyev I, Vanden-Eijnden E (1999) Models for stochastic climate prediction. Proc Nat Acad Sci 96(26):14687–14691 186. Majda AJ, Timofeyev I, Vanden-Eijnden E (2001) A mathematical framework for stochastic climate models. Commun Pure Appl Math: J Issued Courant Inst Math Sci 54(8):891–974 187. Majda AJ, Yuan Y (2012) Fundamental limitations of ad hoc linear and quadratic multi-level regression models for physical systems. Discrete Continuous Dyn Syst-B 17(4):1333 188. Malek-Madani R (2012) Physical oceanography: a mathematical introduction with MATLAB. CRC Press 189. Mangan NM, Brunton SL, Proctor JL, Kutz JN (2016) Inferring biological networks by sparse identification of nonlinear dynamics. IEEE Trans Mol Biol Multi-Scale Commun 2(1):52–63 190. Marconi UMB, Puglisi A, Rondoni L, Vulpiani A (2008) Fluctuation-dissipation: response theory in statistical physics. Phys Rep 461(4–6):111–195 191.
Maulik R, Rao V, Wang J, Mengaldo G, Constantinescu E, Lusch B, Balaprakash P, Foster I, Kotamarthi R (2022) AIEADA 1.0: Efficient high-dimensional variational data assimilation with machine-learned reduced-order models. Geosci Model Develop Discuss 2022:1–20 192. Mil’shtejn GN (1975) Approximate integration of stochastic differential equations. Theory Prob Appl 19(3):557–562 193. Mojgani R, Chattopadhyay A, Hassanzadeh P (2022) Discovery of interpretable structural model errors by combining Bayesian sparse regression and data assimilation: A chaotic Kuramoto–Sivashinsky test case. Chaos: An Interdiscip J Nonlinear Sci 32(6):061105 194. Moon TK (1996) The expectation-maximization algorithm. IEEE Sig Process Mag 13(6):47–60 195. Moore JB (1973) Discrete-time fixed-lag smoothing algorithms. Automatica 9(2):163–173 196. Müller P (2006) The equations of oceanic motions. Cambridge University Press 197. Neiman A, Schimansky-Geier L, Cornell-Bell A, Moss F (1999) Noise-enhanced phase synchronization in excitable media. Phys Rev Lett 83(23):4896 198. Novara C (2012) Sparse identification of nonlinear functions and parametric set membership optimality analysis. IEEE Trans Automat Control 57(12):3236–3241 199. Øksendal B (2003) Stochastic differential equations. In: Stochastic differential equations. Springer, pp 65–84 200. Øksendal B (2013) Stochastic differential equations: an introduction with applications. Springer Science & Business Media 201. Palmer T (2019) The ECMWF ensemble prediction system: Looking back (more than) 25 years and projecting forward 25 years. Q J Royal Meteorol Soc 145:12–24 202. Papaspiliopoulos O, Roberts GO, Stramer O (2013) Data augmentation for diffusions. J Comput Graph Stat 22(3):665–688 203. Pawar S, Ahmed SE, San O, Rasheed A, Navon IM (2020) Long short-term memory embedded nudging schemes for nonlinear data assimilation of geophysical flows. Phys Fluids 32(7):076606


204. Pedlosky J et al (1987) Geophysical fluid dynamics, vol 710. Springer 205. Penland C, Sardeshmukh PD (1995) The optimal growth of tropical sea surface temperature anomalies. J Clim 8(8):1999–2024 206. Penny SG, Smith TA, Chen T-C, Platt JA, Lin H-Y, Goodliff M, Abarbanel HDI (2022) Integrating recurrent neural networks with data assimilation for scalable data-driven state estimation. J Adv Model Earth Syst 14(3):e2021MS002843 207. Philander SGH (1983) El Niño Southern Oscillation phenomena. Nature 302(5906):295–301 208. Plant RS, Craig GC (2008) A stochastic parameterization for deep convection based on equilibrium statistics. J Atmos Sci 65(1):87–105 209. Plett GL (2004) Extended Kalman filtering for battery management systems of LiPB-based HEV battery packs: Part 3. State and parameter estimation. J Power Sour 134(2):277–292 210. Qi D, Harlim J (2022) Machine learning-based statistical closure models for turbulent dynamical systems. Philos Trans Royal Soc A 380(2229):20210205 211. Qi D, Majda AJ (2017) Barotropic turbulence with topography. Low-dimensional reduced-order models for statistical response and uncertainty quantification. Phys D: Nonlinear Phenom 343:7–27 212. Quinn CJ, Kiyavash N, Coleman TP (2015) Directed information graphs. IEEE Trans Inf Theory 61(12):6887–6909 213. Reynolds RW, Smith TM, Liu C, Chelton DB, Casey KS, Schlax MG (2007) Daily high-resolution blended analyses for sea surface temperature. J Clim 20(22):5473–5496 214. Ribeiro MI (2004) Kalman and extended Kalman filters: Concept, derivation and properties. Inst Syst Robot 43:46 215. Risken H (1996) Fokker-Planck equation. In: The Fokker-Planck equation. Springer, pp 63–95 216. Roberts GO, Rosenthal JS (2009) Examples of adaptive MCMC. J Comput Graph Stat 18(2):349–367 217. Roberts GO, Stramer O (2001) On inference for partially observed nonlinear diffusion models using the Metropolis-Hastings algorithm. Biometrika 88(3):603–621 218.
Rossi V, Vila J-P (2006) Nonlinear filtering in discrete time: A particle convolution approach. Ann de l’ISUP 50:71–102 219. Ruckstuhl Y, Janjić T, Rasp S (2021) Training a convolutional neural network to conserve mass in data assimilation. Nonlinear Process Geophys 28(1):111–119 220. Sabeerali CT, Ajayamohan RS, Giannakis D, Majda AJ (2017) Extraction and prediction of indices for monsoon intraseasonal oscillations: An approach based on nonlinear Laplacian spectral analysis. Clim Dyn 49(9):3031–3050 221. Sahai AK, Sharmila S, Abhilash S, Chattopadhyay R, Borah N, Krishna RPM, Joseph S, Roxy M, De S, Pattnaik S, et al (2013) Simulation and extended range prediction of monsoon intraseasonal oscillations in NCEP CFS/GFS version 2 framework. Current Sci 1394–1408 222. Salman H, Ide K, Jones CKRT (2008) Using flow geometry for drifter deployment in Lagrangian data assimilation. Tellus A: Dyn Meteorol Oceanogr 60(2):321–335 223. Salmon R (1998) Lectures on geophysical fluid dynamics. Oxford University Press 224. Santosa F, Symes WW (1986) Linear inversion of band-limited reflection seismograms. SIAM J Sci Stat Comput 7(4):1307–1330 225. Sapsis TP, Majda AJ (2013) A statistically accurate modified quasilinear Gaussian closure for uncertainty quantification in turbulent dynamical systems. Phys D: Nonlinear Phenom 252:34–45 226. Sarachik ES, Cane MA (2010) The El Niño-Southern Oscillation phenomenon. Cambridge University Press 227. Särkkä S (2013) Bayesian filtering and smoothing. Cambridge University Press


228. Schmidt M, Lipson H (2009) Distilling free-form natural laws from experimental data. Science 324(5923):81–85 229. Schneider T, Stuart AM, Wu J-L (2021) Learning stochastic closures using ensemble Kalman inversion. Trans Math Appl 5(1):tnab003 230. Scott DW (1979) On optimal and data-based histograms. Biometrika 66(3):605–610 231. Shumway RH, Stoffer DS (1982) An approach to time series smoothing and forecasting using the EM algorithm. J Time Ser Anal 3(4):253–264 232. Sikka DR, Gadgil S (1980) On the maximum cloud zone and the ITCZ over Indian longitudes during the southwest monsoon. Mon Weather Rev 108(11):1840–1853 233. Slawinska J, Ourmazd A, Giannakis D (2019) A quantum mechanical approach for data assimilation in climate dynamics. In: Proceedings of the 36th International Conference on Machine Learning 234. Slivinski L, Spiller E, Apte A, Sandstede B (2015) A hybrid particle-ensemble Kalman filter for Lagrangian data assimilation. Mon Weather Rev 143(1):195–211 235. Smedstad OM, O’Brien JJ (1991) Variational data assimilation and parameter estimation in an equatorial Pacific ocean model. Prog Oceanogr 26(2):179–241 236. Smith LM, Waleffe F (2002) Generation of slow large scales in forced rotating stratified turbulence. J Fluid Mech 451:145–168 237. Snyder C, Bengtsson T, Bickel P, Anderson J (2008) Obstacles to high-dimensional particle filtering. Mon Weather Rev 136(12):4629–4640 238. Sontag ED (2013) Mathematical control theory: deterministic finite dimensional systems, vol 6. Springer Science & Business Media 239. Sørensen H (2004) Parametric inference for diffusion processes observed at discrete points in time: a survey. Int Stat Rev 72(3):337–354 240. Sorokina M, Sygletos S, Turitsyn S (2016) Sparse identification for nonlinear optical communication systems: SINO method. Opt Exp 24(26):30433–30443 241. Sparrow C (2012) The Lorenz equations: bifurcations, chaos, and strange attractors, vol 41. Springer Science & Business Media 242.
Subel A, Chattopadhyay A, Guan Y, Hassanzadeh P (2021) Data-driven subgrid-scale modeling of forced Burgers turbulence using deep learning with generalization to higher Reynolds numbers via transfer learning. Phys Fluids 33(3):031702 243. Sun L, Penny SG (2019) Lagrangian data assimilation of surface drifters in a double-gyre ocean model using the local ensemble transform Kalman filter. Mon Weather Rev 147(12):4533–4551 244. Taghvaei A, Wiljes JD, Mehta PG, Reich S (2018) Kalman filter and its modern extensions for the continuous-time nonlinear filtering problem. Dyn Syst Meas Control 140(3) 245. Tanner MA, Wong WH (1987) The calculation of posterior distributions by data augmentation. J Am Stat Assoc 82(398):528–540 246. Thomas SJ, Hacker JP, Anderson JL (2009) A robust formulation of the ensemble Kalman filter. Q J Royal Meteorol Soc: J Atmos Sci Appl Meteorol Phys Oceanogr 135(639):507–521 247. Thual S, Majda AJ (2015) A suite of skeleton models for the MJO with refined vertical structure. Math Clim Weather Forecast 1(1) 248. Thual S, Majda AJ, Chen N, Stechmann SN (2016) Simple stochastic model for El Niño with westerly wind bursts. Proc Nat Acad Sci 113(37):10245–10250 249. Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J Royal Stat Soc: Ser B (Methodol) 58(1):267–288 250. Timmermann A, An S-I, Kug J-S, Jin F-F, Cai W, Capotondi A, Cobb KM, Lengaigne M, McPhaden MJ, Stuecker MF et al (2018) El Niño-Southern Oscillation complexity. Nature 559(7715):535–545 251. Torrey L, Shavlik J (2010) Transfer learning. In: Handbook of research on machine learning applications and trends: algorithms, methods, and techniques. IGI global, pp 242–264


252. Toth Z, Kalnay E (1997) Ensemble forecasting at NCEP and the breeding method. Mon Weather Rev 125(12):3297–3319 253. Treutlein H, Schulten K (1985) Noise induced limit cycles of the Bonhoeffer-Van der Pol model of neural pulses. Berichte der Bunsengesellschaft für physikalische Chemie 89(6):710–718 254. Tsuyuki T, Tamura R (2022) Nonlinear data assimilation by deep learning embedded in an ensemble Kalman filter. J Meteorol Soc Japan Ser II 255. Turchetti C (2004) Stochastic models of neural networks, vol 102. IOS Press 256. Uppala SM, Kållberg PW, Simmons AJ, Andrae U, Da Costa Bechtold V, Fiorino M, Gibson JK, Haseler J, Hernandez A, Kelly GA et al (2005) The ERA-40 re-analysis. Q J Royal Meteorol Soc: J Atmos Sci Appl Meteorol Phys Oceanogr 131(612):2961–3012 257. Vallis GK (2017) Atmospheric and oceanic fluid dynamics. Cambridge University Press 258. Van Der Merwe R, Wan EA (2001) The square-root unscented Kalman filter for state and parameter-estimation. In: 2001 IEEE international conference on acoustics, speech, and signal processing. Proceedings (Cat. No. 01CH37221). IEEE, vol 6, pp 3461–3464 259. Van Leeuwen PJ (2012) Particle filters for the geosciences. In: Advanced Data Assimilation for Geosciences: Lecture Notes of the Les Houches School of Physics: Special Issue, pp 291 260. Vihola M (2012) Robust adaptive Metropolis algorithm with coerced acceptance rate. Stat Comput 22(5):997–1008 261. Wand MP, Jones MC (1994) Multivariate plug-in bandwidth selection. Comput Stat 9(2):97–116 262. Wang W-X, Yang R, Lai Y-C, Kovanis V, Grebogi C (2011) Predicting catastrophes in nonlinear dynamical systems by compressive sensing. Phys Rev Lett 106(15):154101 263. Wang X (2004) Infinite Prandtl number limit of Rayleigh-Bénard convection. Commun Pure Appl Math: J Issued Courant Inst Math Sci 57(10):1265–1282 264.
Weaver AT, Vialard J, Anderson DLT (2003) Three- and four-dimensional variational assimilation with a general circulation model of the tropical Pacific ocean. Part I: Formulation, internal diagnostics, and consistency checks. Mon Weather Rev 131(7):1360–1378 265. Webster PJ, Magaña VO, Palmer TN, Shukla J, Tomas RA, Yanai M, Yasunari T (1998) Monsoons: Processes, predictability, and the prospects for prediction. J Geophys Res: Oceans 103(C7):14451–14510 266. Weiss K, Khoshgoftaar TM, Wang D (2016) A survey of transfer learning. J Big Data 3(1):1–40 267. Wenzel TA, Burnham KJ, Blundell MV, Williams RA (2006) Dual extended Kalman filter for vehicle state and parameter estimation. Vehicle Syst Dyn 44(2):153–171 268. Whitaker JS, Hamill TM (2002) Ensemble data assimilation without perturbed observations. Mon Weather Rev 130(7):1913–1924 269. Wiesenfeld K, Pierson D, Pantazelou E, Dames C, Moss F (1994) Stochastic resonance on a circle. Phys Rev Lett 72(14):2125 270. Wilks DS (2005) Effects of stochastic parametrizations in the Lorenz ’96 system. Q J Royal Meteorol Soc: J Atmos Sci Appl Meteorol Phys Oceanogr 131(606):389–407 271. Williams D et al (2001) Weighing the odds: a course in probability and statistics. Cambridge University Press 272. Xu Q (2007) Measuring information content from observations for data assimilation: Relative entropy versus Shannon entropy difference. Tellus A: Dyn Meteorol Oceanogr 59(2):198–209 273. Yang LM, Grooms I (2021) Machine learning techniques to construct patched analog ensembles for data assimilation. J Comput Phys 443:110532 274. Zeng Y, Janjić T, Ruckstuhl Y, Verlaan M (2017) Ensemble-type Kalman filter algorithm conserving mass, total energy and enstrophy. Q J Royal Meteorol Soc 143(708):2902–2914

Index

A
Adaptive MCMC, 147, 148
Autocorrelation function (ACF), 52, 88, 101

B
Block decomposition, 136
Boussinesq equation, 122

C
Causation entropy, 166
Coarse graining, 29
Complex OU process, 100, 101, 129
Conditional distribution, 2
Conditional distribution of Gaussian, 3
Conditional Gaussian nonlinear system (CGNS), 119, 155
Conditional moment, 64
Conditional sampling, 126
Correlated additive and multiplicative (CAM) noise, 57
Cross correlation function, 101

D
Damping, 51
Data assimilation, 67, 155
Decorrelation time, 52, 101

E
EM algorithm, 149
Ensemble forecast, 20, 83, 174
Ensemble Kalman filter (EnKF), 76
Ensemble mean, 84
Ensemble transform Kalman filter (ETKF), 77, 78
ENSO, 6, 107
Ergodicity, 46
Euler-Maruyama scheme, 43
Extended Kalman filter, 75

F
Filtering, 68, 82, 112, 113, 125
Fisher information, 95
Fluctuation-dissipation theorem (FDT), 91, 138
Fokker-Planck equation, 11, 52, 89, 106
Forecast model, 69

G
Gamma distribution, 106, 107
Gaussian (or normal) distribution, 2
Gaussian mixture, 135

I
Indices, 99, 107, 140
Intermittent time series, 14, 112, 113, 140
Internal predictability, 85

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 N. Chen, Stochastic Methods for Modeling and Predicting Complex Dynamical Systems, Synthesis Lectures on Mathematics & Statistics, https://doi.org/10.1007/978-3-031-22249-8


Itô’s formula, 10, 50, 54

K
Kalman filter, 68
Kalman gain, 69
Kalman-Bucy filter, 80
Kernel density estimation (KDE), 47
Kurtosis, 5

L
Lagrangian data assimilation, 128
LASSO regression, 165
Lead time, 84
Linear Gaussian system, 49

M
Machine learning, 171
Marginal density, 2
Marginal density of Gaussian, 3
Markov jump process, 14
Markov property, 8
Master equation, 16
Maximum entropy principle, 27
Maximum likelihood, 74, 151, 166
MCMC, 143, 146
Mean stochastic model (MSM), 112
Milstein scheme, 44
Model error, 85, 86
Model fidelity, 88
Model response, 89
Moment equation, 51, 55, 56, 59, 174
Moments, 4
Monsoon intraseasonal oscillation (MISO), 139
Monte Carlo method, 39
Multiplicative noise, 53, 106, 108
Mutual information, 33

N
Non-Gaussian distribution, 6, 55

P
Partial observations, 73, 149, 168
Particle filter, 79
Pattern correlation, 35

Physics constraint, 115, 153
Posterior distribution, 32, 67
Prior distribution, 32, 67
Probability density function (PDF), 2

Q
Quasi-Gaussian (qG) FDT, 91
Quasi-Gaussian closure, 60
Quasi-geostrophic (QG) model, 102, 130, 131

R
Random variable, 1
Recurrent neural network, 175
Relative entropy, 31, 86, 132
Reynolds decomposition, 49
Riccati equation, 81, 124
Root-mean-square error (RMSE), 35
Rossby number, 104, 131

S
Sea ice, 102
Shallow water equation, 103, 104, 131
Shannon’s entropy, 24
Skewness, 5
Smoothing, 82, 125, 151
Solve-the-equation plug-in approach, 48, 135
Sparse identification, 164
SPEKF, 92, 111
Standard deviation, 4, 7
Statistical average, 3, 4, 50
Statistical equilibrium state, 51
Statistical symmetry, 137
Stochastic differential equation (SDE), 10
Stochastic parameterization, 14, 110
Stochastic process, 7
Switching time, 18

T
Two-state Markov jump process, 14, 113

W
White noise, 10
Wiener process, 8