Patterns Identification and Data Mining in Weather and Climate [1 ed.] 3030670724, 9783030670726

Advances in computer power and observing systems have led to the generation and accumulation of large-scale weather and climate data, which beg for exploration and analysis.


English Pages 624 [625] Year 2021



Table of contents :
Preface
Acknowledgements
Contents
1 Introduction
2 General Setting and Basic Terminology
3 Empirical Orthogonal Functions
4 Rotated and Simplified EOFs
5 Complex/Hilbert EOFs
6 Principal Oscillation Patterns and Their Extension
7 Extended EOFs and SSA
8 Persistent, Predictive and Interpolated Patterns
9 Principal Coordinates or Multidimensional Scaling
10 Factor Analysis
11 Projection Pursuit
12 Independent Component Analysis
13 Kernel EOFs
14 Functional and Regularised EOFs
15 Methods for Coupled Patterns
16 Further Topics
17 Machine Learning
A Smoothing Techniques
B Introduction to Probability and Random Variables
C Stationary Time Series Analysis
D Matrix Algebra and Matrix Function
E Optimisation Algorithms
F Hilbert Space
G Systems of Linear Ordinary Differential Equations
H Links for Software Resource Material
References
Index


Springer Atmospheric Sciences

Abdelwaheb Hannachi

Patterns Identification and Data Mining in Weather and Climate

Springer Atmospheric Sciences

The Springer Atmospheric Sciences series seeks to publish a broad portfolio of scientific books, aiming at researchers, students, and everyone interested in this interdisciplinary field. The series includes peer-reviewed monographs, edited volumes, textbooks, and conference proceedings. It covers the entire area of atmospheric sciences including, but not limited to, Meteorology, Climatology, Atmospheric Chemistry and Physics, Aeronomy, Planetary Science, and related subjects.

More information about this series at http://www.springer.com/series/10176

Abdelwaheb Hannachi

Patterns Identification and Data Mining in Weather and Climate

Abdelwaheb Hannachi Department of Meteorology, MISU Stockholm University Stockholm, Sweden

ISSN 2194-5217    ISSN 2194-5225 (electronic)
Springer Atmospheric Sciences
ISBN 978-3-030-67072-6    ISBN 978-3-030-67073-3 (eBook)
https://doi.org/10.1007/978-3-030-67073-3

© Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland.

To the memory of my father and mother who taught me big principles, and to my little family Houda, Badr, Zeid and Ahmed for their patience.

Preface

Weather and climate is a fascinating system, which affects our daily lives and is closely interlinked with the environment, society and infrastructure. It has a large impact on our lives and activities, climate change being a typical example. It is a high-dimensional, highly complex system involving nonlinear interactions between very many modes or degrees of freedom. This made weather and climate look mysterious in ancient societies. Complex high-dimensional systems are difficult to comprehend with our three-dimensional concept of the physical world. Humans have sought out patterns in order to describe the workings of the world around us. This task, however, proved difficult and challenging. In the climate context, the quest to identify patterns is driven by the desire to find structures embedded in state space, which can lead to a better understanding of the system dynamics, and eventually to learning its behaviour and predicting its future state. With the advent of computers and observing systems, massive amounts of data from the atmosphere and ocean are obtained, which beg for exploration and analysis.

Pattern identification in atmospheric science has a long history. It began in the 1920s with Gilbert Walker, who identified the southern oscillation and the atmospheric component of the ENSO teleconnection, although the latter concept seems to have been mentioned for the first time by Ångström in the mid-1930s. The correlation analysis Gilbert Walker used to identify the southern oscillation is akin to the iterative algorithm used to compute empirical orthogonal functions. However, the earliest known eigenanalysis in atmospheric science goes back to the former USSR school of Obukhov and Bagrov around the late 1940s and early 1950s, respectively. But it was Ed Lorenz who coined the term 'empirical orthogonal functions' (EOFs) in the mid-1950s. Since then, research on the topic has been expanding, and a number of textbooks have been written, notably by Preisendorfer in the late 1980s, followed by texts by Thiebaux, and von Storch and Zwiers about a decade later, and another one by Jolliffe in the early 2000s. These texts did an excellent job in presenting the theory and methods, particularly those related to eigenvalue problems in meteorology and oceanography.

Weather and climate data analysis has witnessed fast growth in the last few decades, both in terms of methods and applications. This growth was driven by the need to analyse and interpret the fast-growing volume of climate data using both linear and nonlinear methods. In this book, I attempt to give an up-to-date text by presenting linear and nonlinear methods that have been developed in the last two decades, in addition to including conventional ones.

The text is composed of 17 chapters. Apart from the first two introductory and setting-up chapters, the remaining 15 chapters present the different methods used to analyse spatio-temporal data from atmospheric science and oceanography. The EOF method, a cornerstone of eigenvalue problems in meteorology and oceanography, is presented in Chap. 3. The next four chapters present derivatives of EOFs, including eigenvalue problems involved in the identification of propagating features. A whole chapter is devoted to predictability and predictable patterns, and another one to multidimensional scaling, which discusses various dissimilarity measures used in pattern identification, followed by a chapter on factor analysis. Nonlinear methods of space-time pattern identification, with different perspectives, are presented in the next three chapters. The previous chapters deal essentially with discrete gridded data, as is usually the case, with no explicit discussion of the continuous case, such as curves or surfaces. This topic is presented and discussed in the next chapter. Another whole chapter is devoted to presenting and discussing the topic of coupled patterns using conventional and newly developed approaches. A number of other methods are not presented in the previous chapters. Those methods are collected and presented in the penultimate chapter. Finally, and to take into account the recent interest in automatic methods, the last chapter presents and discusses a few commonly used methods in machine learning. To make it a stand-alone text, a number of technical appendices are given at the end of the book.

This book can be used in teaching data analysis in atmospheric science, or other topics such as advanced statistical methods in climate research. Apart from Chap. 15, in the context of coupled patterns and regression, and Appendix C, I did not explicitly discuss statistical modelling/inference. This topic of statistical inference in climate science is covered in a number of other books reported in the reference list. To help students and young researchers in the field explore the topics, I have included a number of small exercises, with hints, embedded within the different chapters, in addition to some basic skeleton Matlab codes for some basic methods. Full Matlab codes can be obtained from the author upon request. A list of software links is also given at the end of the book.

Stockholm, Sweden

Abdelwaheb Hannachi

Pattern Identification and Data Mining in Weather and Climate

Complexity, nonlinearity and high dimensionality constitute the main characteristic features of the weather and climate dynamical system. Advances in computer power and observing systems have led to the generation and accumulation of large-scale weather and climate data, which beg for exploration and analysis. Pattern Identification and Data Mining in Weather and Climate presents, from different perspectives, most available approaches, novel and conventional, used to analyse multivariate time series in atmospheric and oceanographic science in order to identify patterns of variability and teleconnections, and to reduce dimensionality. The book discusses in detail linear and nonlinear methods to identify stationary and propagating patterns of spatio-temporal, single and combined fields. The book also presents machine learning with a particular focus on the main methods used in climate science. Applications to real atmospheric and oceanographic examples are also presented and discussed in most chapters. To help guide students and beginners in the field of weather and climate data analysis, basic Matlab skeleton codes are given in some chapters, complemented with a list of software links towards the end of the textbook. A number of technical appendices are also provided, making the text particularly suitable for didactic purposes.

Abdelwaheb Hannachi is an associate professor in the Department of Meteorology at Stockholm University (MISU). He currently serves as editor-in-chief of Tellus A: Dynamic Meteorology and Oceanography. Abdel teaches a number of undergraduate and postgraduate courses, including dynamic meteorology, statistical climatology, numerical weather prediction and data assimilation, and boundary layer turbulence. His main research interests are large-scale dynamics, teleconnections and nonlinearity in weather and climate, in addition to extremes and forecasting.


Over the last few decades, we have amassed an enormous amount of weather and climate data of which we have to make sense now. Pattern identification methods and modern data mining approaches are essential in better understanding how the atmosphere and the climate system work. These topics are not traditionally taught in meteorology programmes. This book will prove a valuable source for students as well as active researchers interested in these topics. The book provides a broad overview of modern pattern identification methods and an introduction to machine learning.
– Christian Franzke, ICCP, Pusan National University

The topic of EOFs and associated pattern identification in space-time data sets has gone through an extraordinarily fast development, both in terms of new insights and the breadth of applications. For this reason, we need a text approximately every 10 years to summarize the field. Older texts by, for instance, Jolliffe and Preisendorfer need to be succeeded by an up-to-date new text. We welcome this new text by Abdel Hannachi, who not only has a deep insight in the field but has himself made several contributions to new developments in the last 15 years.
– Huug van den Dool, Climate Prediction Center, NCEP, College Park, MD

Now that weather and climate science is producing ever larger and richer data sets, the topic of pattern extraction and interpretation has become an essential part. This book provides an up-to-date overview of the latest techniques and developments in this area.
– Maarten Ambaum, Department of Meteorology, University of Reading, UK

The text is very ambitious. It makes considerable effort to collect together a number of classical methods for data analysis, as well as newly emerging ones addressing the challenges of the modern huge data sets. There are not many books covering such a wide spectrum of techniques. In this respect, the book is a valuable companion for many researchers working in the field of climate/weather data analysis and mining. The author deserves congratulations and encouragement for his enormous work.
– Nickolay Trendafilov, Open University, Milton Keynes

This nicely and expertly written book covers a lot of ground, ranging from classical linear pattern identification techniques to more modern machine learning methodologies, all illustrated with examples from weather and climate science. It will be very valuable both as a tutorial for graduate and postgraduate students and as a reference text for researchers and practitioners in the field.
– Frank Kwasniok, College of Engineering, Mathematics and Physical Sciences, University of Exeter


We will show them Our signs in the horizons and within themselves until it becomes clear to them that it is the truth Holy Quran Ch. 51, V. 53

Acknowledgements

This text is a collection of work I have been conducting over the years on weather and climate data analysis, in collaboration with colleagues, complemented with other methods from the literature. I am especially grateful to all my teachers, colleagues and students, who contributed directly or indirectly to this work. I would like to thank, in particular, Zoubeida Bargaoui, Bernard Legras, Keith Haines, Ian Jolliffe, David B. Stephenson, Nickolay Trendafilov, Christian Franzke, Thomas Önskog, Carlos Pires, Tim Woollings, Klaus Fraedrich, Toshihiko Hirooka, Grant Branstator, Lesley Gray, Alan O’Neill, Waheed Iqbal, Andrew Turner, Andy Heaps, Amro Elfeki, Ahmed El-Hames, Huug van den Dool, Léon Chafik and all my MISU colleagues, and many other colleagues I did not mention by name. I acknowledge the support of Stockholm University and the Springer team, in particular Robert Doe, executive editor, and Neelofar Yasmeen, production coordinator, for their support and encouragement.


Contents

1 Introduction
  1.1 Complexity of the Climate System
  1.2 Data Exploration, Data Mining and Feature Extraction
  1.3 Major Concern in Climate Data Analysis
    1.3.1 Characteristics of High-Dimensional Space Geometry
    1.3.2 Curse of Dimensionality and Empty Space Phenomena
    1.3.3 Dimension Reduction and Latent Variable Models
    1.3.4 Some Problems and Remedies in Dimension Reduction
  1.4 Examples of the Most Familiar Techniques

2 General Setting and Basic Terminology
  2.1 Introduction
  2.2 Simple Visualisation Techniques
  2.3 Data Processing and Smoothing
    2.3.1 Preliminary Checking
    2.3.2 Smoothing
    2.3.3 Simple Descriptive Statistics
  2.4 Data Set-Up
  2.5 Basic Notation/Terminology
    2.5.1 Centring
    2.5.2 Covariance Matrix
    2.5.3 Scaling
    2.5.4 Sphering
    2.5.5 Singular Value Decomposition
  2.6 Stationary Time Series, Filtering and Spectra
    2.6.1 Univariate Case
    2.6.2 Multivariate Case

3 Empirical Orthogonal Functions
  3.1 Introduction
  3.2 Eigenvalue Problems in Meteorology: Historical Perspective
    3.2.1 The Quest for Climate Patterns: Teleconnections
    3.2.2 Eigenvalue Problems in Meteorology
  3.3 Computing Principal Components
    3.3.1 Basis of Principal Component Analysis
    3.3.2 Karhunen–Loève Expansion
    3.3.3 Derivation of PCs/EOFs
    3.3.4 Computing EOFs and PCs
  3.4 Sampling, Properties and Interpretation of EOFs
    3.4.1 Sampling Variability and Uncertainty
    3.4.2 Independent and Effective Sample Sizes
    3.4.3 Dimension Reduction
    3.4.4 Properties and Interpretation
  3.5 Covariance Versus Correlation
  3.6 Scaling Problems in EOFs
  3.7 EOFs for Multivariate Normal Data
  3.8 Other Procedures for Obtaining EOFs
  3.9 Other Related Methods
    3.9.1 Teleconnectivity
    3.9.2 Regression Matrix
    3.9.3 Empirical Orthogonal Teleconnection
    3.9.4 Climate Network-Based Methods

4 Rotated and Simplified EOFs
  4.1 Introduction
  4.2 Rotation of EOFs
    4.2.1 Background on Rotation
    4.2.2 Derivation of REOFs
    4.2.3 Computing REOFs
  4.3 Simplified EOFs: SCoTLASS
    4.3.1 Background
    4.3.2 LASSO-Based Simplified EOFs
    4.3.3 Computing the Simplified EOFs

5 Complex/Hilbert EOFs
  5.1 Background
  5.2 Conventional Complex EOFs
    5.2.1 Pairs of Scalar Fields
    5.2.2 Single Field
  5.3 Frequency Domain EOFs
    5.3.1 Background
    5.3.2 Derivation of FDEOFs
  5.4 Complex Hilbert EOFs
    5.4.1 Hilbert Transform: Continuous Signals
    5.4.2 Hilbert Transform: Discrete Signals
    5.4.3 Application to Time Series
    5.4.4 Complex Hilbert EOFs
  5.5 Rotation of HEOFs

6 Principal Oscillation Patterns and Their Extension
  6.1 Introduction
  6.2 POP Derivation and Estimation
    6.2.1 Spatial Patterns
    6.2.2 Time Coefficients
    6.2.3 Example
  6.3 Relation to Continuous POPs
    6.3.1 Basic Relationships
    6.3.2 Finite Time POPs
  6.4 Cyclo-Stationary POPs
  6.5 Other Extensions/Interpretations of POPs
    6.5.1 POPs and Normal Modes
    6.5.2 Complex POPs
    6.5.3 Hilbert Oscillation Patterns
    6.5.4 Dynamic Mode Decomposition
  6.6 High-Order POPs
  6.7 Principal Interaction Patterns

7 Extended EOFs and SSA
  7.1 Introduction
  7.2 Dynamical Reconstruction and SSA
    7.2.1 Background
    7.2.2 Dynamical Reconstruction and SSA
  7.3 Examples
    7.3.1 White Noise
    7.3.2 Red Noise
  7.4 SSA and Periodic Signals
  7.5 Extended EOFs or Multivariate SSA
    7.5.1 Background
    7.5.2 Definition and Computation of EEOFs
    7.5.3 Data Filtering and Oscillation Reconstruction
  7.6 Potential Interpretation Pitfalls
  7.7 Alternatives to SSA and EEOFs
    7.7.1 Recurrence Networks
    7.7.2 Data-Adaptive Harmonic Decomposition

8 Persistent, Predictive and Interpolated Patterns
  8.1 Introduction
  8.2 Background on Persistence and Prediction of Stationary Time Series
    8.2.1 Decorrelation Time
    8.2.2 The Prediction Problem and Kolmogorov Formula
  8.3 Optimal Persistence and Average Predictability
    8.3.1 Derivation of Optimally Persistent Patterns
    8.3.2 Estimation from Finite Samples
    8.3.3 Average Predictability Patterns
  8.4 Predictive Patterns
    8.4.1 Introduction
    8.4.2 Optimally Predictable Patterns
    8.4.3 Computational Aspects
  8.5 Optimally Interpolated Patterns
    8.5.1 Background
    8.5.2 Interpolation and Pattern Derivation
    8.5.3 Numerical Aspects
    8.5.4 Application
  8.6 Forecastable Component Analysis

9 Principal Coordinates or Multidimensional Scaling
  9.1 Introduction
  9.2 Dissimilarity Measures
  9.3 Metric Multidimensional Scaling
    9.3.1 The Problem of Classical Scaling
    9.3.2 Principal Coordinate Analysis
    9.3.3 Case of Non-Euclidean Dissimilarity Matrix
  9.4 Non-metric Scaling
  9.5 Further Extensions
    9.5.1 Replicated and Weighted MDS
    9.5.2 Nonlinear Structure
    9.5.3 Application to the Asian Monsoon
    9.5.4 Scaling and the Matrix Nearness Problem

10 Factor Analysis
  10.1 Introduction
  10.2 The Factor Model
    10.2.1 Background
    10.2.2 Model Definition and Terminology
    10.2.3 Model Identification
    10.2.4 Non-unicity of Loadings
  10.3 Parameter Estimation
    10.3.1 Maximum Likelihood Estimates
    10.3.2 Expectation Maximisation Algorithm
  10.4 Factor Rotation
    10.4.1 Oblique and Orthogonal Rotations
    10.4.2 Examples of Rotation Criteria
  10.5 Exploratory FA and Application to SLP Anomalies
    10.5.1 Factor Analysis as a Matrix Decomposition Problem
    10.5.2 A Factor Rotation
  10.6 Basic Difference Between EOF and Factor Analyses
    10.6.1 Comparison Based on the Standard Factor Model
    10.6.2 Comparison Based on the Exploratory Factor Analysis Model

11 Projection Pursuit
  11.1 Introduction
  11.2 Definition and Purpose of Projection Pursuit
    11.2.1 What Is Projection Pursuit?
    11.2.2 Why Projection Pursuit?
  11.3 Entropy and Structure of Random Variables
    11.3.1 Shannon Entropy
    11.3.2 Differential Entropy
  11.4 Types of Projection Indexes
    11.4.1 Quality of a Projection Index
    11.4.2 Various PP Indexes
    11.4.3 Practical Implementation
  11.5 PP Regression and Density Estimation
    11.5.1 PP Regression
    11.5.2 PP Density Estimation
  11.6 Skewness Modes and Climate Application of PP

12 Independent Component Analysis
  12.1 Introduction
  12.2 Background and Definition
    12.2.1 Blind Deconvolution
    12.2.2 Blind Source Separation
    12.2.3 Definition of ICA
  12.3 Independence and Non-normality
    12.3.1 Statistical Independence
    12.3.2 Non-normality
  12.4 Information-Theoretic Measures
    12.4.1 Entropy
    12.4.2 Kullback–Leibler Divergence
    12.4.3 Mutual Information
    12.4.4 Negentropy
    12.4.5 Useful Approximations
  12.5 Independent Component Estimation
    12.5.1 Choice of Objective Function for ICA
    12.5.2 Numerical Implementation
  12.6 ICA via EOF Rotation and Weather and Climate Application
    12.6.1 The Standard Two-Way Problem
    12.6.2 Extension to the Three-Way Data
  12.7 ICA Generalisation: Independent Subspace Analysis

13 Kernel EOFs
  13.1 Background
  13.2 Kernel EOFs
    13.2.1 Formulation of Kernel EOFs
    13.2.2 Practical Details of Kernel EOF Computation
    13.2.3 Illustration with Concentric Clusters
  13.3 Relation to Other Approaches
    13.3.1 Spectral Clustering
    13.3.2 Modularity Clustering
  13.4 Pre-images in Kernel PCA
  13.5 Application to an Atmospheric Model and Reanalyses
    13.5.1 Application to a Simplified Atmospheric Model
    13.5.2 Application to Reanalyses
  13.6 Other Extensions of Kernel EOFs
    13.6.1 Extended Kernel EOFs
    13.6.2 Kernel POPs

14 Functional and Regularised EOFs
  14.1 Functional EOFs
  14.2 Functional PCs and Discrete Sampling
  14.3 An Example of Functional PCs from Oceanography
  14.4 Regularised EOFs
    14.4.1 General Setting
    14.4.2 Case of Spatial Fields
  14.5 Numerical Solution of the Full Regularised EOF Problem
  14.6 Application of Regularised EOFs to SLP Anomalies

15 Methods for Coupled Patterns
  15.1 Introduction
  15.2 Canonical Correlation Analysis
    15.2.1 Background
    15.2.2 Formulation of CCA
    15.2.3 Computational Aspect
    15.2.4 Regularised CCA
    15.2.5 Use of Correlation Matrices
  15.3 Canonical Covariance Analysis
  15.4 Redundancy Analysis
    15.4.1 Redundancy Index
    15.4.2 Redundancy Analysis
  15.5 Application: Optimal Lag Between Two Fields and Other Extensions
    15.5.1 Application of CCA
    15.5.2 Application of Redundancy
  15.6 Principal Predictors
  15.7 Extension: Functional Smooth CCA
    15.7.1 Introduction
    15.7.2 Functional Non-smooth CCA and Indeterminacy
    15.7.3 Smooth CCA/MCA
    15.7.4 Application of SMCA to Space–Time Fields
  15.8 Some Points on Coupled Patterns and Multivariate Regression

16 Further Topics
  16.1 Introduction
  16.2 EOFs and Random Projection
  16.3 Cyclo-stationary EOFs
    16.3.1 Background
    16.3.2 Theory of Cyclo-stationary EOFs
    16.3.3 Application of CSEOFs
  16.4 Trend EOFs
    16.4.1 Motivation
    16.4.2 Trend EOFs
    16.4.3 Application of Trend EOFs
  16.5 Common EOF Analysis
    16.5.1 Background
    16.5.2 Formulation of Common EOFs
  16.6 Continuum Power CCA
    16.6.1 Background
    16.6.2 Continuum Power CCA
    16.6.3 Determination of the Degree Parameter
  16.7 Kernel MCA
    16.7.1 Background
    16.7.2 Kernel MCA
    16.7.3 An Alternative Way
  16.8 Kernel CCA and Its Regularisation
    16.8.1 Primal and Dual CCA Formulation
    16.8.2 Regularised KCCA
    16.8.3 Some Computational Issues
  16.9 Archetypal Analysis
    16.9.1 Background
    16.9.2 Derivation of Archetypes
    16.9.3 Numerical Solution of Archetypes
    16.9.4 Archetypes and Simplex Visualisation
    16.9.5 An Application of AA to Climate
  16.10 Other Nonlinear PC Methods
    16.10.1 Principal Nonlinear Dynamical Modes
    16.10.2 Nonlinear PCs via Neural Networks

17 Machine Learning
  17.1 Background
  17.2 Neural Networks
    17.2.1 Background and Rationale
    17.2.2 General Structure of Neural Networks
    17.2.3 Examples of Architectures
    17.2.4 Learning Procedure in NNs
    17.2.5 Cost Functions for Multiclass Classification
  17.3 Self-organising Maps
    17.3.1 Background
    17.3.2 SOM Algorithm
  17.4 Random Forest
    17.4.1 Decision Trees
    17.4.2 Random Forest: Definition and Algorithm
    17.4.3 Out-of-Bag Data, Generalisation Error and Tuning
  17.5 Application
    17.5.1 Neural Network Application
    17.5.2 SOM Application
    17.5.3 Random Forest Application

A Smoothing Techniques
  A.1 Smoothing Splines
    A.1.1 More on Smoothing Splines
    A.1.2 Choice of the Smoothing Parameter
  A.2 Radial Basis Functions
    A.2.1 Exact Interpolation
    A.2.2 RBF and Noisy Data
    A.2.3 Relation to PDEs and Other Techniques
  A.3 Kernel Smoother

B Introduction to Probability and Random Variables
  B.1 Background
  B.2 Sets Theory and Probability
    B.2.1 Elements of Sets Theory
    B.2.2 Definition of Probability
  B.3 Random Variables and Probability Distributions
    B.3.1 Discrete Probability Distributions
    B.3.2 Continuous Probability Distributions
    B.3.3 Joint Probability Distributions
    B.3.4 Expectation and Covariance Matrix of Random Vectors
    B.3.5 Conditional Distributions
  B.4 Examples of Probability Distributions
    B.4.1 Discrete Case
    B.4.2 Continuous Distributions
  B.5 Stationary Processes

C Stationary Time Series Analysis
  C.1 Autocorrelation Structure: One-Dimensional Case
    C.1.1 Autocovariance/Correlation Function
    C.1.2 Time Series Models
  C.2 Power Spectrum
  C.3 The Multivariate Case
    C.3.1 Autocovariance Structure
    C.3.2 Cross-Spectrum
  C.4 Autocorrelation Structure in the Sample Space
    C.4.1 Autocovariance/Autocorrelation Estimates
    C.4.2 The Periodogram

D Matrix Algebra and Matrix Function
  D.1 Background
    D.1.1 Matrices and Linear Operators
    D.1.2 Operation on Matrices
  D.2 Most Useful Matrix Transformations
  D.3 Matrix Derivative
    D.3.1 Vector Derivative
    D.3.2 Matrix Derivative
    D.3.3 Examples
  D.4 Application
    D.4.1 MLE of the Parameters of a Multinormal Distribution
    D.4.2 Estimation of the Factor Model Parameters
    D.4.3 Application to Results from PCA
  D.5 Common Algorithms for Linear Systems and Eigenvalue Problems
    D.5.1 Direct Methods
    D.5.2 Iterative Methods

E Optimisation Algorithms
  E.1 Background
  E.2 Single Variable
    E.2.1 Direct Search
    E.2.2 Derivative Methods
  E.3 Direct Multivariate Search
    E.3.1 Downhill Simplex Method
    E.3.2 Conjugate Direction/Powell's Method
    E.3.3 Simulated Annealing
  E.4 Multivariate Gradient-Based Methods
    E.4.1 Steepest Descent
    E.4.2 Newton–Raphson Method
    E.4.3 Conjugate Gradient
    E.4.4 Quasi-Newton Method
    E.4.5 Ordinary Differential Equations-Based Methods
  E.5 Constrained Minimisation
    E.5.1 Background
    E.5.2 Approaches for Constrained Minimisation

F Hilbert Space
  F.1 Linear Vector and Metric Spaces
    F.1.1 Linear Vector Space
    F.1.2 Metric Space
  F.2 Norm and Inner Products
    F.2.1 Norm
    F.2.2 Inner Product
    F.2.3 Consequences
    F.2.4 Properties
  F.3 Hilbert Space
    F.3.1 Completeness
    F.3.2 Hilbert Space
    F.3.3 Application to Prediction

G Systems of Linear Ordinary Differential Equations
  G.1 Case of a Constant Matrix A
    G.1.1 Homogeneous System
    G.1.2 Non-homogeneous System
  G.2 Case of a Time-Dependent Matrix A
    G.2.1 General Case
    G.2.2 Particular Case of Periodic Matrix A: Floquet Theory

H Links for Software Resource Material

References
Index

Chapter 1

Introduction

Abstract This chapter describes the characteristic features of high dimensionality and introduces the problem of dimensionality reduction in high-dimensional systems with a particular focus on the importance of its application to the highly complex climate system.

Keywords Complexity of the climate system · High dimensionality · Curse of dimensionality · Dimension reduction · Data exploration

1.1 Complexity of the Climate System

Our atmosphere is composed of an innumerable collection of interacting molecules. For instance, the number of "particles" composing the Earth atmosphere is astronomic and is estimated to be of the order O(10^45) molecules.1 This astronomical number of interacting molecules does not move randomly, but moves coherently to some extent, giving rise to the atmospheric motion and weather systems. The climate system is the aggregation of daily weather. Put mathematically, the climate, as opposed to weather, may be defined as the collection of all long-term statistics of the state of the atmosphere. Heinlein (1973) writes2 "climate is what we expect but weather is what we get." Figure 1.1 shows a simple illustration of the weather/climate system. Small marbles are dropped from the top, which flow through the space between the pins and are collected in small containers. Each trajectory of one marble describes the daily weather and the collective behaviour

1 The total mass m of the atmosphere is of the order of O(10^22) g. The total number of molecules (m/m_a) N_a is of the order O(10^45). The constants N_a and m_a are respectively the Avogadro number 6.023 × 10^23 and the molar air mass, i.e. the mass of N_a molecules (29 g).

Heinlein’s novel. Some sources, however, attribute the quotation to the American writer/lecturer Samuel L. Clemens known by pen name Mark Twain (1835–1910), although this seems perhaps unlikely, since this concept of climate as average of weather was not readily available around 1900. © Springer Nature Switzerland AG 2021 A. Hannachi, Patterns Identification and Data Mining in Weather and Climate, Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_1


Fig. 1.1 Illustration of a highly simplified paradigm for the weather/climate system

represented by the shape of the marbles in the containers describes the probability density function of the system. The weather/climate system of our rotating Earth consists of the evolution of the coupled Atmosphere–Land–Ocean–Ice system driven by radiative forcing from the Sun as well as from the earth’s interior, e.g. volcanoes. The climate, as a complex nonlinear dynamical system varies on a myriad of interacting space/time scales. It is characterised by its high number of degrees of freedom (dof) and their complex nonlinear interactions. It also displays nontrivial sensitivity to initial as well as boundary conditions. Weather and climate at one location can be related to those at another distant location, that is, weather and climate are not local but have a global character. This is known as teleconnections in atmospheric science, and represents links between climate anomalies occurring at one specific location and at large distances. They can be seen as patterns connecting widely separated regions, such the El-Niño Southern Oscillation (ENSO), the North Atlantic Oscillation (NAO) and the Pacific North America (PNA) pattern. More discussions on teleconnections are given in the beginning of Chap. 3. In practice, various climate variables, such as sea level pressure, wind field and ozone concentrations are measured at different time intervals and at various spatial locations. In general, however, these measurements are sparse both is space and time. Climate models are usually used, via data assimilation techniques, to produce regular data in space and time, known as the “reanalyses”. The analysis of climate data is not solely restricted to reanalyses but also includes other observed records, e.g. balloon measurements, satellite irradiances, in situ recordings such as rain gauges, ice cores for carbon dating, etc. Model simulations are also extensively used for research purposes, e.g. for investigating/understanding physical mechanisms,


studying anthropogenic climate change and climate prediction, and also for climate model validation, etc. The recent explosive growth in the amount of observed and simulated atmospheric data that are becoming available to the climate scientist has created an ever increasing need for new mathematical data analysis tools that enable researchers to extract the most out of these data in order to address key questions relating to major concerns in climate dynamics. Understanding weather and climate involves a genuine investigation of the nature of nonlinearities and causality in the system. Predicting the climate is also another important reason that drives climate research. Because of the high dimensionality involved in weather/climate system advanced tools are required to analyse and hence understand various aspects of the system. A major objective in climate research is the identification of major climate modes, patterns, regimes, etc. This is precisely one of the most challenging problems in data mining/feature extraction.

1.2 Data Exploration, Data Mining and Feature Extraction

In climate research and other scientific fields we are faced with large datasets, typically multivariate time series with large dimensionality, where the objective is to identify or find out interesting or more prominent patterns of variability. A basic step in the analysis of multivariate data is exploration (Tukey 1977). Exploratory data analysis (EDA) provides tools for hypothesis formulation and feature selection (Izenman 2008). For example, according to Seltman (2018) one reads: "Loosely speaking, any method of looking at data that does not include formal statistical modeling and inference falls under the term exploratory data analysis." For moderately large to large multivariate data, simple EDA, such as scatter plots and boxplots, may not be possible, and "advanced" EDA is needed. This includes for instance the identification (or extraction) of structures, patterns, trends, dimension reduction, etc. For some authors (e.g. Izenman 2008) this may be categorised as "descriptive data mining", to distinguish it from "predictive data mining", which is based on model building including classification, regression and machine learning. The following quotes show examples of excerpts of data mining from various sources. For instance, according to the glossary of data mining3 we read:
• "data mining is an information extraction activity whose goal is to discover hidden facts contained in data bases. Using a combination of machine learning, statistical analysis, modelling techniques and database technology, data mining finds patterns and subtle relationships in data and infers rules that allow the prediction of future state".

3 http://twocrows.com/data-mining/dm-glossary/.


The importance of patterns is reflected through their interpretation into knowledge which can be used to help understanding, interpretation and decision-making. Technically speaking, the term “data mining” has been stretched beyond its limits to apply to any form of data analysis. Here are some more definitions of data mining/knowledge discovery from various sources:4 • “Data mining, or knowledge discovery in databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data. This encompasses a number of different technical approaches, such as clustering, data summarisation, learning classification rules, . . . ” (Frawley et al. 1992). • “Data mining is the search for relationships and global patterns that exist in large databases but are ‘hidden’ among the vast amount of data, such as relationship between patient data and their medical diagnosis. These relationships represent valuable knowledge about the database and the objects in the database and, if the database is a faithful mirror, of the real world registered by the database” (Holsheimer and Siebes 1994). • Also, according to the Clementine 11.1 User’s Guide,5 (Chap. 4), one reads: “Data mining refers to using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful” It is clear therefore that data mining is basically a process concerned with the use of a variety of data analysis tools and software techniques to identify/discover/find the main/hidden/latent patterns of variability and relationships in the data that may enhance our knowledge of the system. It is based on computing power and aims at explaining a large part of large-dimensional dataset through dimension reduction. In this context climate data constitute a fertile field for data analysis and data mining. The literature in climate research clearly contains various classical techniques that attempt to address some of the previous issues. Some of these tools have been adopted from other fields and others have been developed within a climate context. Examples of these techniques include principal component analysis/empirical orthogonal function (PCA/EOF), Rotated EOFs (REOF), complex EOFs (CEOFs), canonical correlation analysis (CCA), factor analysis (FA), blind separation/deconvolution techniques, regression models, clustering techniques, multidimensional scaling (MDS), . . . etc.

4 http://www.pcc.qub.ac.uk/tec/courses/datamining. 5 https://home.kku.ac.th/wichuda/DMining/ClementineUsersGuide_11.1.pdf.

1.3 Major Concern in Climate Data Analysis

1.3.1 Characteristics of High-Dimensional Space Geometry

Volume Paradox of Hyperspheres

A d-dimensional Euclidean space is a mathematical continuum that can be described by d independent coordinates. In this space a point A is characterised by a set of d numbers, or coordinates (a_1, a_2, . . . , a_d), and the distance between any two points A and B is given by the square root of $\sum_{k=1}^{d}(a_k - b_k)^2$. An important concept in this space is the definition of hyperspheres. A d-dimensional hypersphere of radius r, centred at the origin, is the (continuum) set of all points X with coordinates x_1, x_2, . . . , x_d satisfying

$$\sum_{k=1}^{d} x_k^2 = r^2. \qquad (1.1)$$

Note that in the Euclidean three-dimensional space the hypersphere is simply the usual sphere. The definition of hyperspheres allows another set of coordinates, namely spherical coordinates, in which a point (x_1, x_2, . . . , x_d) can be defined using an equivalent set of coordinates (r, θ_1, . . . , θ_{d−1}), where r ≥ 0, −π/2 ≤ θ_k ≤ π/2 for 1 ≤ k ≤ d − 2, and 0 ≤ θ_{d−1} ≤ 2π. The relationship between the two sets of coordinates is given by:

$$
\begin{aligned}
x_1 &= r\cos\theta_1 \cos\theta_2 \ldots \cos\theta_{d-1} \\
x_2 &= r\cos\theta_1 \cos\theta_2 \ldots \cos\theta_{d-2}\sin\theta_{d-1} \\
&\;\;\vdots \\
x_k &= r\cos\theta_1 \cos\theta_2 \ldots \cos\theta_{d-k}\sin\theta_{d-k+1} \\
&\;\;\vdots \\
x_d &= r\sin\theta_1 .
\end{aligned}
\qquad (1.2)
$$

It can be shown that the Jacobian J of this transformation6 satisfies |J| = r^{d−1} cos^{d−2}θ_1 . . . cosθ_{d−2} for d ≥ 3 and |J| = r for d = 2. The volume V_d°(r) of the hypersphere of radius r in d dimensions, Ω_d(r), can be calculated using transformation (1.2):

$$\int_{\Omega_d(r)} 1\, d\mathbf{x} = \int_0^r d\rho \int_{-\pi/2}^{\pi/2} d\theta_1 \ldots \int_{-\pi/2}^{\pi/2} d\theta_{d-2} \times \int_0^{2\pi} \rho^{d-1}\cos^{d-2}\theta_1 \ldots \cos^2\theta_{d-3}\cos\theta_{d-2}\, d\theta_{d-1} \qquad (1.3)$$

6 That is, |(∂x_k/∂θ_l)_{kl}|, for k = 1, . . . , d, and l = 0, . . . , d − 1, where θ_0 = r.

and yields

$$V_d^{\circ}(r) = C_{\circ}(d)\, r^d, \qquad (1.4)$$

where

$$C_{\circ}(d) = \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2}+1\right)} = \frac{2\pi}{d}\, C_{\circ}(d-2). \qquad (1.5)$$

Table 1.1 shows the values of C_○(d) for the first 8 values of d. Comparing the volume V_d(2r) of the hypercube of side 2r, i.e. V_d(2r) = 2^d r^d, to that of the hypersphere V_d°(r), we see that both hypervolumes depend exponentially on the linear scale r but with totally different coefficients. For instance, the coefficient C_○(d) in Eq. (1.5) is not monotonic and it tends to zero rapidly for large dimensions. The volume of a hypersphere of any radius will decrease towards zero when increasing the dimension d from the moment d > π r^2. The decrease is sharper when the space dimension is even. The coefficient 2^d for the hypercube, on the other hand, increases monotonically with d. The previous result seems paradoxical, and one might ask what happens to the content of the hyperspheres in high dimensions, and where does it go? To answer this question, let us first look at the concentration of the hypercube content. Consider the d-dimensional hypercube of side 2r and the inscribed hypersphere of radius r (Fig. 1.2). The fraction of the residual volume V_d(2r) − V_d°(r) to the hypercube volume goes to one, i.e.

$$\frac{V_d(2r) - V_d^{\circ}(r)}{V_d(2r)} = 1 - \frac{\pi^{d/2}}{2^d}\,\frac{1}{\Gamma\!\left(\frac{d}{2}+1\right)} \rightarrow 1 \qquad (1.6)$$

as d increases to infinity. Hence with increasing space dimension most of the hypercube volume concentrates in its 2^d corners whereas the centre becomes less important or empty! This is one aspect of the empty space phenomenon in high-dimensional spaces (Scott and Thompson 1983). Consider now what happens to the hyperspheres. The fraction of the volume of a spherical shell of thickness ε and radius r, V_d°(r) − V_d°(r − ε), to that of the whole hypersphere is

Table 1.1 Values of C_○ as a function of the space dimension d

  d        1      2        3        4        5        6        7        8
  C_○(d)   2      3.1416   4.1888   4.9348   5.2638   5.1677   4.7248   4.0587


Fig. 1.2 Representation of the volume concentration in the hypercube corners in two dimensions

$$\frac{V_d^{\circ}(r) - V_d^{\circ}(r-\varepsilon)}{V_d^{\circ}(r)} = 1 - \left(1 - \frac{\varepsilon}{r}\right)^{d} \rightarrow 1 \quad \text{as } d \rightarrow \infty. \qquad (1.7)$$

Hence the content of the hypersphere becomes more and more concentrated close to its surface. A direct consequence of this is that uniformly distributed data in the hypersphere or hypercube are mostly concentrated on their edges. In other words, to sample uniformly in a hypercube or hypersphere (with large dimensions) we need extremely large sample sizes.

Exercise Using Eq. (1.1) compute the area S_d°(a) of the hypersphere of dimension d and radius a.

Answer Use the fact that $V_d^{\circ}(a) = \int_0^a S_d^{\circ}(\rho)\, d\rho$ and, keeping in mind that S_d°(a) = a^{d−1} S_d°(1), this yields

$$S_d^{\circ}(a) = \frac{d\, \pi^{d/2}\, a^{d-1}}{\Gamma\!\left(\frac{d}{2}+1\right)}.$$
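These volume paradoxes are easy to check numerically. The following Matlab sketch (a hypothetical illustration, not from the text; all variable names are arbitrary) evaluates the coefficient C_○(d) of Eq. (1.5) with the built-in gamma function and the outer-shell fraction of Eq. (1.7).

% Volume coefficient C0(d) = pi^(d/2)/gamma(d/2+1), Eq. (1.5)
d  = 1:20;                           % space dimensions
C0 = pi.^(d/2) ./ gamma(d/2 + 1);    % peaks near d = 5, then decays towards zero

% Fraction of the hypersphere volume lying in a thin outer shell, Eq. (1.7)
eps_r = 0.05;                        % shell thickness as a fraction of the radius
shell = 1 - (1 - eps_r).^d;          % tends to 1 as d grows

disp([d' C0' shell'])                % tabulate d, C0(d) and the shell fraction

The first column reproduces Table 1.1 for d = 1, . . . , 8, while the last column shows how quickly the mass migrates to the edge of the hypersphere.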

Two Further Paradoxical Examples

• Inscribed spheres (Hamming 1980)
We consider the square of side 4 divided into four squares of side 2 each. In each square we consider the inscribed circle of radius one. These circles are tangent to each other and also to the original square. Now we fit in the small space at the centre of the original square a small circle tangent to the four inscribed circles (Fig. 1.3). The radius of the small circle is r_2 = √2 − 1 ≈ 0.41. Doing the same thing with the 4^3-cube and the corresponding eight unit spheres, the radius of the small inner


Fig. 1.3 Inscribed spheres in two dimensions

sphere is r_3 = √3 − 1. Extending the same procedure to the d-dimensional 4^d hypercube and its corresponding 2^d unit hyperspheres yields r_d = √d − 1. Hence for d ≥ 10, the small inner hypersphere reaches outside the hypercube. Note also, as pointed out earlier, that the volume of this inner hypersphere goes to zero as d increases.

• Diagonals in hypercubes (Scott 1992)
Consider the d-dimensional hypercube [−1, 1]^d. Any diagonal vector v joining the origin to one of the corners is of the form (±1, ±1, . . . , ±1)^T. Now the angle θ between v and any vector basis, i.e. coordinate axis i, is given by:

$$\cos\theta = \frac{\mathbf{v}\cdot\mathbf{i}}{\|\mathbf{v}\|} = \frac{\pm 1}{\sqrt{d}}, \qquad (1.8)$$

which goes to zero as d goes to infinity. Hence the diagonals are nearly orthogonal to the coordinate axes in high-dimensional spaces.7 An important consequence is that any data that tend to cluster near the diagonals in hyperspaces will be mapped onto the origin in every paired scatter plot. This points to the importance of the choice of the coordinate system in multivariate data analysis.

• Consequence for the multivariate Gaussian distribution (Scott 1992; Carreira-Perpiñán 2001)

7 Note that because of the square root, this is true for very large dimensions d, e.g. d ≥ 10^3.


Data scattered uniformly in high-dimensional spaces will always be concentrated on the edge, and this is the emptiness paradox mentioned earlier, i.e. a spherical neighbourhood of uniformly distributed data on hypercubes is empty. Similar behaviour also occurs with the multivariate normal distribution. The standard multinormal probability density function is given by

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}} \exp\!\left(-\frac{1}{2}\mathbf{x}^T\mathbf{x}\right). \qquad (1.9)$$

Equiprobable sets from (1.9) are hyperspheres and the origin is the mode of the distribution. Now consider the set of points y within the hypersphere associated with a pdf of εf(0), i.e. points satisfying f(y) ≥ εf(0) for small ε. The probability of this set is

$$Pr\left(\|\mathbf{x}\|^2 \le -2\log\varepsilon\right) = Pr\left(\chi_d^2 \le -2\log\varepsilon\right). \qquad (1.10)$$

For a given ε this probability falls off rapidly as d increases. This decrease becomes sharp after d = 5. Hence the probability of points not in the tail decreases rapidly as d increases. Consequently, unlike our prediction from low-dimensional intuition, most of the mass of the multivariate normal in high-dimensional spaces is in the tail.
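The probability in Eq. (1.10) is straightforward to evaluate numerically. The Matlab sketch below (illustrative only; it assumes the Statistics Toolbox function chi2cdf is available, and the threshold ε is an arbitrary choice) also computes the Gaussian mass lying within one standard deviation of the mean in d dimensions.

% Pr(chi2_d <= -2*log(eps)) for a small eps, Eq. (1.10)
epsil  = 0.01;                          % pdf threshold relative to the mode
d      = [1 2 5 10 20];                 % space dimensions
p_mode = chi2cdf(-2*log(epsil), d);     % probability of the high-density region

% Mass of the standard multinormal within one standard deviation of the mean
p_1sd  = chi2cdf(1, d);                 % ~0.68 for d = 1, ~2e-4 for d = 10

disp([d' p_mode' p_1sd'])

Both probabilities collapse as d grows, illustrating why the tails dominate in high dimensions.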

1.3.2 Curse of Dimensionality and Empty Space Phenomena

Data is a valuable source of information which provides ultimately the way to knowledge and wisdom. The concept of a flow diagram from data to wisdom through information and knowledge (DIKW) goes back to Zeleny (1987) and Ackoff (1989) and has become known as the DIKW hierarchy. This hierarchy can have various representations such as the one given by the understanding–context independence diagram schematised in Fig. 1.4. The figure clearly reflects the importance of data, which allow, through understanding relationships, access to information. This latter allows for pattern understanding to get knowledge, and ultimately wisdom through understanding principles. However, because of the nature of the problem in climate, and also various other fields, data analysis cannot escape the major challenges of the empty space phenomenon, discussed above, and what has become known as the "curse of dimensionality". To grasp the idea in more depth let us suppose that we can be satisfied that a sample size of 10 data points in a one-dimensional time series, e.g. 10 years of yearly North Atlantic Oscillation (NAO) index, is considered to be enough to analyse and learn something from the one-dimensional process. Now, in the two-dimensional case, the minimum or satisfactory sample size would be 10^2. So we see that if we are to analyse for example two variables, the yearly NAO index and the Northern Hemisphere (NH) surface temperature average, one would require


Fig. 1.4 Data–Information–Knowledge–Wisdom (DIKW) hierarchy

of the order of a century of data record. In various climate analysis studies we use at least 4 to 5 dimensions, which translates into an astronomical sample size record, e.g. a million years. The moral is that if n_1 is the sample size required in one dimension, then in d dimensions we require n_d = n_1^d, i.e. a power law in the linear dimension. The curse of dimensionality, coined by Bellman (1961), refers to this phenomenon, i.e. the sample size needed to have an accurate estimate of a function in a high-dimensional space grows exponentially with the number of variables or the space dimension. The curse of dimensionality also refers to the paradox of neighbourhood in high-dimensional spaces, i.e. the empty space phenomenon (Scott and Thompson 1983). Local neighbourhoods are almost empty whereas nonempty neighbourhoods are certainly nonlocal (Scott 1992). This is to say that high-dimensional spaces are inherently empty or sparse.

Example For the uniform distribution in the unit 10-dimensional hypersphere the probability of a point falling in the hypersphere of radius 0.9 is only 0.35 whereas the remaining 0.65 probability is for points on the outer shell of thickness 0.1.

The above example shows that density estimation in high dimensions can be problematic. This is because regions of relatively very low density can contain a considerable part of the distribution whereas regions of apparently high density can be completely devoid of observations in samples of moderate sizes (Silverman 1986). For example, 70% of the mass in the standard normal distribution is within one standard deviation of the mean whereas the same domain contains only 0.02%


of the mass, in 10 dimensions, and has to take a radius of more than three standard deviations to reach 70%. Consequently, and contrary to our intuition, the tails are much more important in high dimensions than in low dimensions. This has a serious consequence, namely the difficulty in probability density function (pdf) estimation in high-dimensional spaces (see e.g. Silverman 1986). Therefore since most density estimation methods are based on local concepts, e.g. local averages (Silverman 1986), in order to find enough neighbours in high dimensions, the neighbourhood has to extend far beyond local neighbours and hence locality is lost (CarreiraPerpiñán 2001) since local neighbourhoods are mostly empty. The above discussion finds natural application in the climate system. If, for example, we are interested in studying a phenomenon, such as El-Nino Southern Oscillation (ENSO) using say observations from 40 grid points of monthly sea surface temperature (SST) data in the Tropical Pacific region, then theoretically one would necessarily need a sample size of O(1040 ). This metaphoric observational period is far beyond the age of the Universe.8 This is another facet of the emptiness phenomenon related to the inherent sparsity of high-dimensional spaces. As has been pointed out earlier, this has a direct consequence on the probability distribution of high-dimensional data. For example, in the one-dimensional case the probability density function of the uniform distribution over [−1, 1] is a box of height 2−1 , whereas in 10 dimensions the hyperbox height is only 2−10 ≈ 9.8 × 10−4 .

1.3.3 Dimension Reduction and Latent Variable Models Given the complexity, e.g. high dimensionality, involved in the climate system one major challenge in atmospheric data analysis is the reduction of the system size. The basic objective behind dimension reduction is to enable data mapping onto a lower-dimensional space where data analysis, feature extraction, visualisation, interpretation, . . ., etc. become manageable. Figure 1.5 shows a schematic representation of the target/objective of data analysis, namely knowledge or KDD. In probabilistic terminology, the previous concepts have become known as latent variable problems. Latent variable models are probabilistic models that attempt

Fig. 1.5 Knowledge gained via data reduction

8 In practice, of course, data are not totally independent, and the sample size required is far less than the theoretical one.


to explain processes happening in high-dimensional spaces in terms of a small number of variables or degrees of freedom (dof). This is based of course on the assumption that the observed high-dimensional data are the result of underlying lower-dimensional processes.9 The original concept of latent variable modelling appeared in psychometrics with Spearman (1904b). The technique eventually evolved into what is known as factor analysis. See, e.g. Bartholomew (1987) for historical details on the subject. Dimensionality reduction problems are classified into three classes (Carreira-Perpiñán 2001), namely: • Hard dimensionality reduction problems where the dimension d of the data is of the order 102 –105 . • Soft dimensionality reduction problems in which d ≈ (2 − 9) × 10. • Visualisation problems where the dimension d is small but reduction is required for visual purposes. Examples include Asimov’s (1985) grand tour and Chernoff’s (1973) faces, etc.

1.3.4 Some Problems and Remedies in Dimension Reduction Known Difficulties Various problems exist in dimension reduction, some of which are in general unavoidable due to the complex nature in the approaches used. The following list gives some examples: • Difficulty related to large dimensions—the larger the dimension the more difficult the problem. • Non-unicity—No single method exists to deal with the data. In general different approaches (exploratory and probabilistic) can give different results. For example, in probabilistic modelling the choice of the latent variable model is not unique. • Unknown intrinsic, latent or hidden dimension—There is indeed no effective and definitive way to know in general the minimum number of dimensions to represent the data. • Nonlinear association/relationship between the variables—This is a difficult problem since there is no systematic way to find out these associations. • Nature of uncertainties underlying the data, and the loss of information resulting from the reduction procedure.

9 This high dimensionality can arise from various causes, such as uncertainty related for example to nonlinearity and stochasticity, e.g. measurement errors.


Some Remedies

Some of the previous problems can be tackled using some or a combination of the following approaches:
• Occam's razor or model simplicity.
• Parsimony principle.
• Arbitrary choices of e.g. the latent (hidden) dimensions.
• Prior physical knowledge of the process under investigation. For example, when studying tropical climate one can make use of the established ENSO phenomenon linking the tropical Pacific SST to the see-saw in the large scale pressure system.

1.4 Examples of the Most Familiar Techniques Various techniques have been used/adapted/developed in climate analysis studies to find and identify major patterns of variability. It is fair to say that most of these techniques/tools are basically linear. These and various other techniques will be presented in more detail in the next chapters. Below we mention some of the most widely used methods in atmospheric science. • Empirical Orthogonal Functions (EOFs). EOFs are the most widely used methods in atmospheric science. EOF analysis is also known as principal component analysis (PCA), and is based on performing an eigenanalysis of the sample covariance matrix of the data. • Rotated EOFs (REOFs). EOFs are constrained to be orthogonal and as such problems related to physical interpretability may arise. REOF method is one technique that helps find more localised structures by rotating the EOFs in such a manner to get patterns with simple structures. • Complex EOFs (CEOFs). The CEOF method is similar to that of EOFs and aims at detecting propagating structures from the available data. • Principal Oscillation Patterns (POPs). As for CEOFs, POP method aims also at finding propagating patterns without recourse to using the complex space.


• Canonical Correlation Analysis (CCA). Unlike the previous techniques where only one single field e.g. sea level pressure (SLP) is used, in CCA the objective is to find the most (linearly) correlated structures between two fields,10 e.g. SLP and SST. It is used to identify (linearly) “coupled modes”.

10 Although it is also possible to perform an EOF analysis of more than one field, for example SST and surface air temperature combined (Kutzbach 1967). This has been labelled combined principal component analysis by Bretherton et al. (1992).

Chapter 2

General Setting and Basic Terminology

Abstract This chapter introduces some basic terminologies that are used in subsequent chapters. It also presents some basic summary statistics of data sets and reviews basic methods of data filtering and smoothing. Keywords Data processing · Smoothing · Scaling and sphering · Filtering · Stationarity · Spectra · Singular value decomposition

2.1 Introduction By its nature, the climate data analysis is a large multivariate (high-dimensional) problem par excellence. When atmospheric data started to accumulate since the beginning of the twentieth century, the first tools that atmospheric scientists tried were basically exploratory, and they included simple one-dimensional time series plots, then two-dimensional scatter plots and later contour plots. Fisher (1925) quotes, “The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove nothing, but bring outstanding features readily to the eye; they are therefore no substitute for such critical tests as may be applied to data, but are valuable in suggesting such tests, and in explaining conclusions founded upon them”. This same feeling is also shared with other scientists. Hunter (1988) also quotes “The most effective statistical technique for analysing environmental data are graphical methods. They are useful in the initial stage for checking the quality of the data, highlighting interesting features of the data, and generally suggesting what statistical analyses should be done. Interesting enough, graphical methods are useful again after intermediate quantitative analyses have been completed and again in the final stage for providing complete and readily understood summaries of the main findings of investigations”. Finally we quote Tukey’s (1977) declaration, also quoted in Spence and Garrison (1993) and Berthouex and Brown (1994): “The greatest value of a picture is when it forces us to notice what we never expected to see”.


It was not until the early part of the twentieth century that correlation started to be used in meteorology by Gilbert Walker1 (Walker 1909, 1923, 1924; Walker and Bliss 1932). It is fair to say that most of the multivariate climate data analyses are based mainly on the analysis of the covariance between the observed variables of the system. The concept of covariance in atmospheric science has become so important that it is routinely used in climate analysis. Data, however, have to be processed before getting to this slightly advanced stage. Some of the common processing techniques are listed below.

2.2 Simple Visualisation Techniques In their basic form multivariate data are normally composed of many unidimensional time series. A time series is a sequence of values x1 , x2 . . . xn , in which each datum represents a specific value of the variable x. In probabilistic terms x would represent a random variable, and the value xi is the ith realisation of x in some experimental set-up. In everyday language xt represents the observation at time t of the variable x. In order to get a basic idea about the data, one has to be able to see at least some of their aspects. Plotting some aspects of climate data is therefore a recommended first step in the analysis. Certainly this cannot be done for the entire sample; for example, simple plots for certain “key” variables could be very useful. Trends, for instance, are examples of aspects that are best inspected visually before quantifying them. Various plotting techniques exist for the purpose of visualisation. Examples include: • Simple time series plots—this constitutes perhaps the primary step to data exploration. • Single/multiple scatter plots between pairs of variables—these simple plots provide information on the relationships between various pairs of variables. • Histogram plots—they are a very useful first step towards exploring the distributions of individual variables (see e.g. Silverman 1986).

1 The modern concept of correlation can be traced as far back as the late nineteenth century with Galton (1885), see e.g. Stigler (1986). The use of the concept of correlation is actually older than Galton's (1885) paper and goes back to 1823 with the German mathematician Carl Friedrich Gauss who developed the normal surface of N correlated variates. The term "correlation" appears to have been first quoted by Auguste Bravais, a French naval officer and astronomer who worked on bivariate normal distributions. The concept was also used later in 1868 by Charles Darwin, Galton's cousin, and towards the end of the nineteenth century, Pearson (1895) defined the (Galton-) Pearson's product-moment correlation coefficient. See Rodgers and Nicewander (1988) for some details and Pearson (1920) for an account on the history of correlation. Rodgers and Nicewander (1988) list thirteen ways to interpret the correlation coefficients.


• Contour plots of variables in two dimensions—contour plots are also very useful in analysing, for example, individual maps or exploring smoothed histograms between two variables. • Box plots—these are very useful visualisation tools (Tukey 1977) used to display and compare the distributions of an observed sample for a number of variables. Other useful methods, such as sunflower scatter plots (Cleveland and McGill 1984), Chernoff faces (Chernoff 1973), brushing (Venables and Ripley 1994; Cleveland 1993) and colour histograms (Wegman 1990) are also often used in highdimensional data exploration and visualisation. A list of these and other methods with a brief description and further references can be found in Martinez and Martinez (2002), see also Everitt (1978).

2.3 Data Processing and Smoothing 2.3.1 Preliminary Checking Climate data are the direct result of experimentation, which translate via our senses into observations or (in situ) measurements and represent therefore information. By its very nature, data are always subject to uncertainties and are hence deeply rooted in probabilistic concepts. It is in general recommended to process the data before indulging into advanced analyses techniques. The following list provides familiar examples of processing that are normally applied at each grid point. • Checking for missing/outlier values—these constitute simple processing techniques that are routinely applied to data. For example, unexpectedly large values or outliers can either be removed or replaced. Missing values are normally interpolated using observations from the neighbourhood. • Detrending—if the data indicate evidence of a trend, linear or polynomial, then it is in general recommended to detrend the data, by calculating the trend and then removing it. • Deseasonalising—seasonality constitutes one of the major sources of variability in climate data and is ubiquitous in most climate time series. For example, with monthly data a smooth seasonal cycle can be estimated by fitting a sine wave. Alternatively, the seasonal cycle can be obtained by the collection of the 12 monthly averages. A more advanced way is to apply Fourier analysis and considers the few leading sine waves as the seasonal cycle. The deaseasonalised data are then obtained by subtracting the seasonal cycle from the original (and perhaps detrended) data. If the seasonal component is thought to change over time, then techniques based on X11, for example, can be used. This technique is based on a local fit of a seasonal component using a simple moving average. Pezzulli et al. (2005) have investigated the spectral properties of the X11 filter and applied it to sea surface temperature.


The method uses a Henderson filter for the moving average and provides a more flexible alternative to the standard (constant) seasonality.

2.3.2 Smoothing Smoothing is the operation that allows removing “irregular” or more precisely, spiky features and sudden changes from a time series, which otherwise will hamper any possibility from recognising and identifying special features. In the process of smoothing the data are implicitly assumed to be composed2 of a smooth component plus an irregular component. The smoothed data are generally easier to interpret. Below is a list of the most widely used smoothing techniques applied in climate research.

Moving Average

It is a simple local average using a sliding window. If we designate by x_t, t = 1, 2, . . . , n, the sample of the time series and M the length of the window, then the smoothed time series is given by

$$y_k = \frac{1}{M} \sum_{i=k}^{k+M-1} x_i, \qquad (2.1)$$

for k = 1, 2, . . . n − M + 1. Note that to estimate the seasonal cycle, a nonoverlapping 30-day3 moving average is normally applied to the data. Note also that instead of simple average, one can have a weighted moving average.
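As an illustration, a running mean as in Eq. (2.1) can be computed in Matlab with the convolution filter below (a minimal sketch; the synthetic series and the window length are arbitrary choices).

% Synthetic noisy series and an M-point moving average, Eq. (2.1)
n = 365;  t = (1:n)';
x = sin(2*pi*t/90) + 0.5*randn(n,1);      % signal plus noise
M = 30;                                    % window length
y = filter(ones(M,1)/M, 1, x);             % running mean (trailing window)
y = y(M:end);                              % keep the n-M+1 fully overlapping values
plot(t, x, t(M:end), y)                    % compare raw and smoothed series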

Exponential Smoothing

Unlike the simple moving average where the weights are uniform, the exponential smoothing uses an exponentially decaying weighting function of the past observations as

$$y_k = (1-\phi) \sum_{i=0}^{\infty} \phi^i x_{k-i}, \qquad (2.2)$$

2 Similar to fitting a probability model where the data is decomposed as data = fit + residuals.
3 Depending on the calendar month; 28, 29, 30 or 31.


for an infinite time series. The smoothing parameter φ satisfies φ < 1, and the smaller |φ|, the smoother the obtained curve. The coefficient (1 − φ) in Eq. (2.2) is introduced to make the weights sum to one. In practice for finite time series, the previous equation is truncated to yield

$$y_k = \frac{1-\phi}{1-\phi^{m+1}} \sum_{i=0}^{m} \phi^i x_{k-i}, \qquad (2.3)$$

where m depends on k.
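In practice the truncated sum in Eq. (2.3) is rarely evaluated explicitly; an equivalent recursion for Eq. (2.2), up to the choice of the initial value, is sketched below in Matlab (the smoothing parameter and the synthetic series are illustrative only).

% Exponential smoothing, recursive form of Eq. (2.2): y_k = (1-phi)*x_k + phi*y_{k-1}
phi  = 0.8;                      % smoothing parameter
x    = cumsum(randn(200,1));     % synthetic random-walk series
y    = zeros(size(x));
y(1) = x(1);                     % initialisation (affects only the first few values)
for k = 2:numel(x)
    y(k) = (1-phi)*x(k) + phi*y(k-1);
end
plot([x y])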

Spline Smoothing

Unlike moving averages or exponential smoothing, which are locally linear, splines (Appendix A) provide a nonlinear smoothing using polynomial fitting. The most popular spline is the cubic spline based on fitting a twice continuously differentiable piece-wise cubic polynomial. If x_k = x(t_k) is the observed time series at time t_k, k = 1, . . . , n, then the cubic spline f() is defined by
(i) f(t) = f_k(t) = a_k + b_k t + c_k t^2 + d_k t^3 for t in the interval [t_k, t_{k+1}], k = 1, . . . , n − 1.
(ii) at each point t_k, f() and its first two derivatives are continuous, i.e. $\frac{d^{\alpha}}{dt^{\alpha}} f_k(t_k) = \frac{d^{\alpha}}{dt^{\alpha}} f_{k-1}(t_k)$, α = 0, 1, 2.

Remark The cubic spline (Appendix A) can also be obtained from an optimisation problem.
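A cubic spline through coarsely sampled values can be evaluated at intermediate times with the built-in Matlab function spline, as in the minimal sketch below (synthetic monthly data; smoothing splines such as csaps are a toolbox alternative to pure interpolation).

% Cubic spline through a coarsely sampled series (see Appendix A)
tk = 0:12;                               % e.g. monthly sampling times
xk = sin(2*pi*tk/12) + 0.1*randn(size(tk));
tf = 0:0.1:12;                           % finer time grid
xf = spline(tk, xk, tf);                 % piecewise cubic, C2-continuous fit
plot(tk, xk, 'o', tf, xf, '-')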

Kernel Smoothing

The kernel smoothing is a global weighted average and is often used to estimate pdfs, see Appendix A for more advanced smoothing methods. The weights are obtained as the value of a specific kernel function, e.g. exponential or Gaussian, applied to the target point. Designate again by x_k = x(t_k), k = 1, . . . , n the finite sample time series, and the smoothed time series is given by

$$y_l = \sum_{i=1}^{n} \kappa_{li}\, x_i, \qquad (2.4)$$

where $\kappa_{li} = K\!\left(\frac{t_i - t_l}{h}\right)$, and K() is the smoothing kernel. The most widely used kernel is the Gaussian function:

$$K(x) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-x^2/2\right).$$


The parameter h in κli is a smoothing parameter and plays a role equivalent to that of a window width. The list provided here is not exhaustive, and other methods exist, see e.g. Chatfield (1996) or Tukey (1977). Once the gridded data have been processed, advanced techniques can be applied depending on the specific objective of the analysis. In general, the processed data have to be written as an array to facilitate computational procedures, and this is presented in the next section. But before that, let us define first a few characteristics of time series.
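Equation (2.4) with the Gaussian kernel can be coded directly as a weighted average. The sketch below normalises the weights so that they sum to one at each target time, a common practical choice not spelled out in the text; the bandwidth h and the synthetic series are arbitrary.

% Gaussian kernel smoothing of a time series, cf. Eq. (2.4)
n = 200;  t = (1:n)';
x = cos(2*pi*t/60) + 0.4*randn(n,1);
h = 5;                                     % bandwidth (smoothing parameter)
K = exp(-0.5*((t - t').^2)/h^2);           % n-by-n matrix of kernel weights K((ti-tl)/h)
K = K ./ sum(K, 2);                        % normalise each row so the weights sum to one
y = K * x;                                 % smoothed series
plot(t, x, t, y)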

2.3.3 Simple Descriptive Statistics

Given a time series x_1, x_2, . . . , x_n, the sample mean x̄ is given by

$$\overline{x} = \frac{1}{n}\sum_{k=1}^{n} x_k \qquad (2.5)$$

and the sample variance s_x^2 by

$$s_x^2 = \frac{1}{n-1}\sum_{k=1}^{n} \left(x_k - \overline{x}\right)^2. \qquad (2.6)$$

See Appendix B for the properties of these estimators. The sample standard deviation of the time series is s_x. The time series is termed centred when the mean is removed. When the time series is scaled by its standard deviation, it is termed standardised and consequently has unit variance. Sometimes when the time series is centred and standardised, it is termed scaled. Often the time series is supposed to be a realisation of some random variable X with cumulative distribution function (cdf) F_X() with finite mean μ_X and finite variance σ_X^2 (Appendices B and C). In this case the sample mean and variance x̄ and s_x^2 are regarded as estimates of the (population) mean and variance μ_X and σ_X^2, respectively. Now let the time series be sorted into x_(1) ≤ x_(2) ≤ . . . ≤ x_(n), and then the following function

$$\hat{F}_X(u) = \begin{cases} 0 & \text{if } u < x_{(1)} \\ \frac{k}{n} & \text{if } x_{(k)} \le u < x_{(k+1)} \\ 1 & \text{if } u > x_{(n)} \end{cases} \qquad (2.7)$$

provides an estimator of the cdf FX () and is referred to as the empirical distribution function (edf). Note that the edf can be smoothed to yield a smooth approximation of FX ().


Now let y_1, y_2, . . . , y_n be another time series supposed to be also a realisation of another random variable Y with mean μ_Y and variance σ_Y^2. The sample covariance c_xy between the two time series is given by

$$c_{xy} = \frac{1}{n-1}\sum_{k=1}^{n} \left(x_k - \overline{x}\right)\left(y_k - \overline{y}\right). \qquad (2.8)$$

Similarly, the sample correlation coefficient r_xy between the two time series is the covariance between the corresponding scaled time series, i.e.

$$r_{xy} = \frac{c_{xy}}{s_x s_y}. \qquad (2.9)$$

Note that the correlation always satisfies −1 ≤ r_xy ≤ 1. Now if both time series are sorted, then the rank correlation, ρ_xy, also known as Spearman's rank correlation coefficient (Kendall and Stuart 1961, p. 476), is obtained as the ordinary (or product moment) correlation between the ranks of the corresponding time series instead of the actual values. This rank correlation can also be computed using the differences d_t, t = 1, . . . , n, between the ranks of the two sample time series and yields

$$\rho_r = 1 - \frac{6\sum_{t=1}^{n} d_t^2}{n\left(n^2-1\right)}. \qquad (2.10)$$

It can be seen that the transform of the sample x_1, x_2, . . . , x_n using the empirical distribution function (edf) F̂_X() in (2.7) is precisely p_1/n, p_2/n, . . . , p_n/n, where p_1, p_2, . . . , p_n are the ranks of the time series, and similarly for the second time series y_1, y_2, . . . , y_n. Therefore, the rank correlation is an estimator of the correlation corr(F_X(X), F_Y(Y)) between the transformed uniform random variables F_X(X) and F_Y(Y).
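The rank correlation of Eq. (2.10) can be computed by ranking the two series and correlating the ranks, as in the following Matlab sketch (ties are assumed absent; where the Statistics Toolbox is available, corr(x,y,'type','Spearman') gives the same result).

% Spearman rank correlation via Eq. (2.10), assuming no tied values
n = 100;
x = randn(n,1);  y = 0.6*x + 0.8*randn(n,1);
[~, ix] = sort(x);  rx(ix,1) = (1:n)';     % ranks of x
[~, iy] = sort(y);  ry(iy,1) = (1:n)';     % ranks of y
d    = rx - ry;                            % rank differences
rhoS = 1 - 6*sum(d.^2) / (n*(n^2-1));      % Eq. (2.10)
c    = corrcoef(rx, ry);                   % product-moment correlation of the ranks
rhoP = c(1,2);                             % equals rhoS when there are no ties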

2.4 Data Set-Up Most analysis methods in climate are described in a matrix form, which is the essence of multivariate analyses. A given spatio-temporal field, e.g. sea level pressure, is composed of a multivariate time series, where each time series represents the values of the field X at a given spatial location, e.g. grid point s, taken at different times4 t noted by X(s, t). The spatial locations are often represented by grid points that are regularly spaced on the spherical earth at a given vertical level. For example,

4 It

could be daily, monthly, etc.


a continuous time series at the j'th grid point s_j can be noted x_j(t), where t spans a given period. The resulting continuous field represents then a multivariate, or vector, time series:

$$\mathbf{x}(t) = \left(x_1(t), \ldots, x_p(t)\right)^T.$$

When the observations are sampled at discrete times, e.g. t = t_1, t_2, . . . , t_n, one gets a finite sample x_k, k = 1, . . . , n, of our field, where x_k = x(t_k). In our set-up the j'th grid point s_j represents the j'th variable. Now if we assume that we have p such variables, then the sampled field X can be represented by an array X = (x_ij), or data matrix, as

$$\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)^T = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \end{pmatrix}. \qquad (2.11)$$

In (2.11) n is the number of observations or sample size and p is the number of variables. The j'th column (x_{1j}, . . . , x_{nj})^T is the time series at the grid point location s_j, whereas the i'th row (x_{i1}, . . . , x_{ip}) is the observed field x_i^T at time t = t_i, which is also known as a map at t = t_i. One can also write (2.11) alternatively as

$$\mathbf{X} = \left[\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_p\right]. \qquad (2.12)$$

Suppose now that we have another observed field Y = (y_ij), observed at the same times as X but perhaps at different grid points s_k^*, k = 1, . . . , q; think, for example, of sea surface temperature. Then one can form the combined field obtained by combining both data matrices Z = [X, Y] as

$$\mathbf{Z} = (z_{ij}) = \begin{pmatrix} x_{11} & \ldots & x_{1p} & y_{11} & \ldots & y_{1q} \\ x_{21} & \ldots & x_{2p} & y_{21} & \ldots & y_{2q} \\ \vdots & & \vdots & \vdots & & \vdots \\ x_{n1} & \ldots & x_{np} & y_{n1} & \ldots & y_{nq} \end{pmatrix}. \qquad (2.13)$$

This combination is useful when, for example, one is looking for combined patterns such as empirical orthogonal functions (Jolliffe 2002; Hannachi et al. 2007). The ordering or set-up of the data matrix shown in (2.11) or (2.13), where the temporal component is treated as observation and the space component as variable, is usually referred to as the S-mode convention. In the alternative convention, namely the T-mode, the previous roles are swapped (see e.g. Jolliffe 2002).
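In practice gridded fields are often stored as (longitude, latitude, time) arrays. A sketch of how such an array might be flattened into the n × p S-mode data matrix of (2.11) is given below; the array name and grid sizes are purely illustrative.

% From a gridded field F(nlon, nlat, ntime) to the n-by-p data matrix X of (2.11)
nlon = 144;  nlat = 73;  ntime = 120;
F = randn(nlon, nlat, ntime);              % placeholder for, e.g., monthly SLP
p = nlon*nlat;                             % number of grid points (variables)
X = reshape(F, p, ntime)';                 % rows = times, columns = grid points
% The T-mode convention is simply the transpose: each column is then a map (time).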


2.5 Basic Notation/Terminology In the general framework of multivariate analyses, each observation xj k = xk (tj ) is considered as a realisation of a random variable xk and therefore the observed vector xk = (xk1 , . . . xkp )T , for k = 1, . . . n as a realisation of the multivariate random variable x. We denote hereafter by p(x) the probability density function of the variable x (Appendix B). Various operations are often required prior to applying advanced mathematical methods to find patterns from our high-dimensional sampled field. Some of the preprocessing steps have been presented in the previous section for the unidimensional case.

2.5.1 Centring

Since the observed field is a realisation of some multivariate random variable, one can speak of the mean μ of the field, also known as the climatology, which is the expectation of x, i.e.

$$\boldsymbol{\mu} = E(\mathbf{x}) = \int \mathbf{x}\, p(\mathbf{x})\, dx_1 \ldots dx_p. \qquad (2.14)$$

The mean is estimated using the observed sample by

$$\overline{\mathbf{x}} = \left(\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_p\right)^T, \qquad (2.15)$$

where x̄_k is the time mean of the observed k'th time series. The centring operation consists of transforming the data matrix X to have zero mean and yields the centred matrix X_c. The centred field is normally referred to as the anomaly field with respect to the time mean. This is to differentiate it from anomalies with respect to other components such as the mean annual cycle. The centred matrix is then

$$\mathbf{X}_c = \mathbf{X} - \mathbf{1}_n \overline{\mathbf{x}}^T = \left(\mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T\right)\mathbf{X}, \qquad (2.16)$$

where 1_n = (1, 1, . . . , 1)^T is the column vector of length n containing only ones and I_n is the n × n identity matrix. The Matlab commands to compute the mean of X and get the anomalies are

>> [n p] = size(X);
>> Xbar = mean(X,1);
>> Xc = X - ones(n,1)*Xbar;


2.5.2 Covariance Matrix

The covariance (or variance–covariance) matrix is the second-order centred moment of the multivariate random variable x and is given by

$$\boldsymbol{\Sigma} = \mathrm{cov}(\mathbf{x}, \mathbf{x}) = \mathrm{var}(\mathbf{x}) = E\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right] = \int (\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T p(\mathbf{x})\, d\mathbf{x}. \qquad (2.17)$$

The (i,j)'th element of Σ is simply the covariance γ(x_i, x_j) between the i'th and the j'th variables, i.e. the time series at the i'th and the j'th grid points, respectively. Note that the diagonals of Σ are simply the variances σ_1^2, σ_2^2, . . . , σ_p^2 of the individual random variables. The sample estimate of the covariance matrix is given by5

$$\mathbf{S} = \frac{1}{n-1}\sum_{k=1}^{n} (\mathbf{x}_k - \overline{\mathbf{x}})(\mathbf{x}_k - \overline{\mathbf{x}})^T = \frac{1}{n-1}\mathbf{X}_c^T\mathbf{X}_c. \qquad (2.18)$$

The Matlab command is

>> S = cov(X);

As for Σ, the diagonals s_1, . . . , s_p of S are also the sample variances of the individual time series x_1(t), x_2(t), . . . , x_p(t).

Remark The sample covariance matrix is sometimes referred to as the dispersion matrix, although, in general, the latter is taken to mean the matrix of non-centred second-order moments (1/(n−1)) X^T X. The correlation matrix R is the covariance matrix of the standardised (or scaled) random variables; here, each random variable x_k is standardised by its standard deviation σ_k to yield unit variance. Hence the (i,j)'th element ρ_ij of R is the correlation between the i'th and the j'th time series:

$$\rho_{ij} = \rho(x_i, x_j). \qquad (2.19)$$

If we designate

$$\mathbf{D} = \mathrm{Diag}(\boldsymbol{\Sigma}) = \begin{pmatrix} \sigma_{11}^2 & 0 & \ldots & 0 \\ 0 & \sigma_{22}^2 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_{pp}^2 \end{pmatrix}, \qquad (2.20)$$

5 The coefficient 1/(n−1) used in (2.18) instead of 1/n is to make the estimate unbiased, but the difference in practice is in general insignificant.


then we get

$$\mathbf{R} = \mathbf{D}^{-\frac{1}{2}}\, \boldsymbol{\Sigma}\, \mathbf{D}^{-\frac{1}{2}}. \qquad (2.21)$$

The sample correlation matrix is also obtained in a similar way by standardising each variable.
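The relation (2.21) between the covariance and correlation matrices is easily verified in Matlab, as in this sketch (the toy data matrix is arbitrary; corrcoef gives the sample correlation matrix directly).

% Sample correlation matrix from the covariance matrix, cf. Eq. (2.21)
X  = randn(500, 4) * [1 0 0 0; 0.5 2 0 0; 0 0 1 0; 0 0.3 0 3];   % toy correlated data
S  = cov(X);                            % sample covariance matrix
Dh = diag(1 ./ sqrt(diag(S)));          % D^(-1/2), inverse square roots of the variances
R  = Dh * S * Dh;                       % correlation matrix via (2.21)
max(max(abs(R - corrcoef(X))))          % should be at machine precision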

2.5.3 Scaling

This operation consists in dividing the variables x_1, x_2, . . . , x_p by their respective standard deviations σ_1, σ_2, . . . , σ_p. Using the matrix D = Diag(Σ) = Diag(σ_{11}^2, σ_{22}^2, . . . , σ_{pp}^2), the scaled data matrix takes the form:

$$\mathbf{X}_s = \mathbf{X}\,\mathbf{D}^{-\frac{1}{2}}, \qquad (2.22)$$

so each variable in X_s is unit variance, but the correlation structure among the variables has not changed. Note that the centred and scaled data matrix is

$$\mathbf{X}_{cs} = \mathbf{X}_c\,\mathbf{D}^{-\frac{1}{2}}. \qquad (2.23)$$

Note also that the correlation matrix R of X is the covariance of the scaled data matrix X_s.

2.5.4 Sphering

It is an affine transformation by which the covariance matrix of the sample data becomes the identity matrix. Sphering destroys, therefore, all the first- and second-order information of the sample. For our data matrix (2.11), the sphered data matrix X_◦ takes the form:

$$\mathbf{X}_{\circ} = \boldsymbol{\Sigma}^{-\frac{1}{2}}\, \mathbf{X}_c = \boldsymbol{\Sigma}^{-\frac{1}{2}}\left(\mathbf{X} - \mathbf{1}_n \overline{\mathbf{x}}^T\right). \qquad (2.24)$$

In (2.24) Σ^{−1/2} represents the inverse of the square root6 of Σ. The covariance matrix of X_◦ is the identity matrix, i.e. (1/n) X_◦^T X_◦ = I_p.

6 The square root of a symmetric matrix Σ is a matrix R such that RR^T = Σ. The square root of Σ, however, is not unique since for any orthogonal matrix Q, i.e. QQ^T = Q^T Q = I, the matrix RQ is also a square root. The standard square root is obtained via a congruential relationship with respect to orthogonality and is obtained using the singular value decomposition theorem.

Because sphering destroys the first


two moments of the data, it can be useful when the covariance structure in the data is not desired, e.g. when we are interested in higher order moments such as skewness.
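A sketch of the sphering transformation (2.24) using the eigendecomposition of the sample covariance matrix is given below; the toy data are arbitrary, the covariance matrix is assumed full rank, and the transformation is applied on the right for the n × p convention used here. The covariance of the sphered anomalies is then the identity matrix within sampling precision.

% Sphering (whitening) of a data matrix, cf. Eq. (2.24)
X  = randn(300, 5) * randn(5);            % toy data matrix (n = 300 times, p = 5 variables)
[n, p] = size(X);
Xc = X - ones(n,1)*mean(X,1);             % anomalies (centred data)
S  = cov(Xc);                             % sample covariance matrix (assumed full rank)
[E, L]    = eig(S);                       % eigen-decomposition S = E*L*E'
Sroot_inv = E * diag(1./sqrt(diag(L))) * E';   % symmetric S^(-1/2)
X0 = Xc * Sroot_inv;                      % sphered anomalies
disp(cov(X0))                             % close to the identity matrix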

2.5.5 Singular Value Decomposition

The singular value decomposition (see also Appendix D) is a powerful tool that decomposes any n × p matrix X into the product of two orthogonal matrices and a diagonal matrix as X = UΛV^T, which can be written alternatively as

$$\mathbf{X} = \sum_{k=1}^{r} \lambda_k\, \mathbf{u}_k \mathbf{v}_k^T, \qquad (2.25)$$

where u_k and v_k, k = 1, . . . , r, are, respectively, the left and right singular vectors of X and r is the rank of X. The SVD theorem can also be extended to the complex case. If X is a n × p complex matrix, we have a similar decomposition to (2.25), i.e. X = UΛV^{*T}, where now U and V satisfy U^{*T}U = V^{*T}V = I_r and the superscript (*) denotes the complex conjugate.

Application If X = UΛV^T is the SVD decomposition of the n × p data matrix X, then $\sum_k \mathbf{u}_k \mathbf{u}_k^T = \mathbf{I}_n$ and the covariance matrix is $\mathbf{S} = \sum_k \lambda_k^2\, \mathbf{u}_k \mathbf{u}_k^T$.

The Matlab routine is called svd, which provides all the singular values and associated singular vectors:

>> [u s v] = svd(X);

The routine svds is more economical and provides a preselected number of singular values (see Chap. 3).
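The decomposition (2.25) is easy to check numerically, and truncating it gives low-rank approximations of the data matrix, which is the basis of the EOF analysis of Chap. 3. The sketch below is illustrative only (toy data, arbitrary truncation rank).

% SVD of a centred data matrix and a low-rank reconstruction, cf. Eq. (2.25)
X  = randn(100, 8);
Xc = X - ones(100,1)*mean(X,1);           % centred data
[U, S, V] = svd(Xc, 'econ');              % economy-size SVD: Xc = U*S*V'
norm(Xc - U*S*V')                         % ~ 0 (machine precision)
r  = 2;                                   % number of retained singular triplets
X2 = U(:,1:r) * S(1:r,1:r) * V(:,1:r)';   % best rank-2 approximation of Xc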

2.6 Stationary Time Series, Filtering and Spectra

2.6.1 Univariate Case

Let us consider a continuous stationary time series (or signal) x(t) with autocovariance function γ_x() and spectral density function f_x() (see Appendix C). A linear filter L is a linear operator transforming x(t) into a filtered time series y(t) = Lx(t). This linear filter can be written formally as a convolution, i.e.

$$y(t) = Lx(t) = \int h(u)\, x(t-u)\, du, \qquad (2.26)$$

where the function h() is known as the transfer function of the filter or its impulse response function. The reason for this terminology is that if x(t) is an impulse, i.e. a Dirac delta function, then the response is simply h(t). From (2.26), the autocovariance function γ_y() of the filtered time series is

$$\gamma_y(\tau) = \int\!\!\int h(u)\, h(v)\, \gamma_x(\tau + u - v)\, du\, dv. \qquad (2.27)$$

Taking the Fourier transform of (2.27), the spectral density function of the response y(t) is

$$f_y(\omega) = f_x(\omega)\, |\Gamma(\omega)|^2, \qquad (2.28)$$

where

$$\Gamma(\omega) = \int h(u)\, e^{-iu\omega}\, du = |\Gamma(\omega)|\, e^{i\phi(\omega)} \qquad (2.29)$$

is the Fourier transform of the transfer function and is known as the frequency response function. Its amplitude |(ω)| is the gain of the filter, and φ(ω) is its phase. In the discrete case the transfer function is simply a linear combination of Dirac pulses as h(u) =



ak δk

(2.30)

ak xt−k .

(2.31)

k

giving as output yt =

 k

The frequency response function is then the discrete Fourier transform of h() and is given by (ω) =

1  ak e−iωk . 2π k

Exercise 1. Derive the frequency response function of the moving average filter (2.1). 2. Derive the same function for the exponential smoothing filter (2.2).

(2.32)

28

2 General Setting and Basic Terminology

Using (2.26) or (2.31), one can compute the cross-covariance function:  γxy (τ ) =

h(u)γx (τ − u).

(2.33)

The cross-covariance function satisfies γxy (τ ) = γyx (−τ ).

(2.34)

Note that the cross-covariance function is not symmetric in general. The Fourier transform of the cross-covariance function, i.e. the cross-spectrum fxy (ω) = 1  γxy (k)e−iωk , is given by 2π fxy (ω) = (ω)fx (ω).

(2.35)

Note that the cross-covariance function is not limited to time series defined, e.g. via Eq. (2.26), but is defined for any two time series.

2.6.2 Multivariate Case The previous concepts can be extended to the multivariate time series in the same manner (Appendix C). Let us suppose that xt , t = 1, 2, . . ., is a d-dimensional time series with zero mean (for simplicity). The lagged autocovariance matrix is   (τ ) = E xt xTt+τ .

(2.36)

 T (τ ) = (−τ ).

(2.37)

Using (2.34), we get

Exercise Derive (2.37). (Hint. Use stationarity). The cross-spectrum matrix F is given by F(ω) =

1  −iωk e (k). 2π

(2.38)

k

The inverse of this Fourier transform yields  (k) =

π −π

F(ω)eiωk dω.

(2.39)

2.6 Stationary Time Series, Filtering and Spectra

29

The cross-spectrum matrix F(ω) is Hermitian, i.e. F∗T (ω) = F(ω),

(2.40)

where the notation (∗ ) represents the complex conjugate. Note that the diagonal elements of the cross-spectrum matrix, [F]ii (ω), i = 1, . . . d, represent the individual power spectrum of the ith component xti , of xt . The real part FR = Re (F(ω)) is the co-spectrum, and its imaginary part FR = I m (F(ω)) is the quadrature spectrum. The co-spectrum is even and satisfies FR (ω) = (0) +



  cos (k) +  T (k) ,

(2.41)

k≥1

and the covariance matrix can also be written as  π  (0) = F(ω)dω = 2 −π

π

FR (ω)dω.

(2.42)

0

The relations (2.27) and (2.35) can also be extended naturally to the multivariate filtering problem. In fact if the multivariate signal x(t) is passed through a linear  filter L to yield y(t) = Lx(t) = H(u)x(t − u)du, then the covariance matrix of the output is   y (τ ) =

H(u) x (τ + u − v)HT (v)dudv.

Similarly, the cross-covariance between input and output is   xy (τ ) =

H(u) x (τ − u)du.

(2.43)

1  −iωτ , using By expanding the cross-spectrum matrix Fxy (ω) = 2π τ  xy (τ )e (2.43), and similarly for the output spectrum matrix Fy (ω), one gets

Fxy (ω) = (ω)Fx (ω) Fy (ω) = (ω)Fx (ω) ∗T (ω), where (ω) =



(2.44)

H(u)e−iωu du is the multivariate frequency response function.

Chapter 3

Empirical Orthogonal Functions

Abstract This chapter describes the idea behind, and develops the theory of empirical orthogonal functions (EOFs) along with a historical perspective. It also shows different ways to obtain EOFs and provides examples from climate and discusses their physical interpretation. Strength and weaknesses of EOFs are also mentioned. Keywords Empirical orthogonal functions · Teleconnection · Arctic oscillation · Sampling uncertainties · Teleconnectivity · Adjacency matrix

3.1 Introduction The inspection of multivariate data with a few variables can be addressed easily using the techniques listed in Chap. 2. For atmospheric data, however, where one deals with many variables, those techniques become impractical, see Fig. 3.1 for an example of data cube of sea level pressure. In the sequel, we adopt the notation and terminology presented in Chap. 2. In general, and before any advanced analysis, it is recommended to inspect the data using simple exploratory tools such as: • Plotting the mean field x of xk , k = 1, . . . n.

• Plotting the variance of the field, i.e. diag(S) = (s_{11}, ..., s_{pp}), see Eq. (2.18).
• Plotting time slices of the field, or the time evolution of the field at a given latitude or longitude, that is the Hovmöller diagram.
• Computing and plotting one-point correlation maps between the field and a specific time series. This could be a time series from the same field at, say, a specific location s_k, in which case the field to be plotted is simply the k'th column of the correlation matrix (Eq. (2.21)). Alternatively, the time series could be any climate index z_t, t = 1, 2, ..., n, in which case the field to be plotted is simply (ρ_1, ρ_2, ..., ρ_p), where ρ_k = ρ(x_k, z_t) is the correlation between the index and the k'th variable of the field. An example of one-point correlation map for DJF NCEP/NCAR sea level pressure is shown in Fig. 3.1 (bottom), the base


Fig. 3.1 An example of space–time representation of winter monthly (December–January– February) sea level pressure from the National Center for Atmospheric Research/National center for Environmental Prediction (NCAR/NCEP) reanalyses for the period Dec 1991–Dec 1995 (top and middle, unit: hPa), and one-point correlation map shown in percentage (bottom)


point is also shown. This map represents the North Atlantic Oscillation (NAO) teleconnection pattern, discussed below. As stated in Chap. 1, when we have multivariate data the objective is often to find coherent structures or patterns and to examine possible dependencies and relationships among them for various purposes such as exploration, identification of physical and dynamical mechanisms and prediction. This can only be achieved through “simplification” or reduction of the data structure. The words simplification/reduction will become clear later. This chapter deals with one of the most widely used techniques to simplify/reduce and interpret the data structure, namely principal component analysis (PCA). This is an exploratory technique for multivariate data, which is in essence an eigenvalue problem, aiming at explaining and interpreting the variability in the data.

3.2 Eigenvalue Problems in Meteorology: Historical Perspective

3.2.1 The Quest for Climate Patterns: Teleconnections

The climate system is studied using observations as well as model simulations. The weather and climate system is not an isolated phenomenon, but is characterised by high interconnections, namely climate anomalies at one location on the earth can be related to climate anomalies at other distant locations. This is the basic concept of what is known as teleconnection. In simple words, teleconnections represent patterns connecting widely separated regions (e.g. Hannachi et al. 2017). Typical examples of teleconnections include the El-Niño Southern Oscillation, ENSO (e.g. Trenberth et al. 2007), the North Atlantic Oscillation, NAO (Hurrell et al. 2003; Hannachi and Stendel 2016; Franzke and Feldstein 2005) and the Pacific-North American (PNA) patterns (Hannachi et al. 2017; Franzke et al. 2011). ENSO is a recurring ocean-atmosphere coupled pattern of interannual fluctuations characterised by changes in sea surface temperature in the central and eastern tropical Pacific Ocean associated with large scale changes in sea level pressure and also surface wind across the maritime continent. The ocean part of ENSO embeds El-Niño and La-Niña, and the atmospheric part embeds the Southern Oscillation. An example of El-Niño is shown in Chap. 16 (Sect. 16.9). El-Niño and La-Niña represent, respectively, the warming (or above average) and cooling (or below average) phases of the central and eastern Pacific Ocean surface temperature. This process has a period of about three to seven years, during which the sea surface temperature changes by about 1-3 °C. The Southern Oscillation (SO) involves changes in pressure, and other variables such as wind, temperature and rainfall, over the tropical Indo-Pacific region, and is measured by the difference in atmospheric pressure between Australia/Indonesia and the eastern South Pacific. An example of SO is discussed in Chap. 8 (Sect. 8.5). ENSO, as a teleconnection, has an impact


over considerable parts of the globe, especially North and South America and parts of east Asia and the summer monsoon region. Although ENSO teleconnection, precisely the Southern Oscillation, seems to have been discovered by Gilbert Walker (Walker 1923), through correlation between surface pressure, temperature and rainfall, the concept of teleconnection, however, seems to have been mentioned for the first time in Ångström (1935). The connection between the Southern Oscillation and El-Niño was only recognised later by Bjerknes in the 1960s, see, e.g. Bjerknes (1969). The NAO (Fig. 3.1, bottom) is a kind of see-saw in the atmospheric mass between the Azores and the extratropical North Atlantic. It is the dominant mode of near surface pressure variability over the North Atlantic, Europe and North Africa (Hurrell et al. 2003; Hannachi and Stendel 2016), and has an impact on considerable parts of the northern hemisphere (Hurrell et al. 2003; Hannachi and Stendel 2016). The two main centres of action of the NAO are located, respectively, near the Azores and Iceland. For example, in its positive phase the pressure difference between the two main centres of action is enhanced, compared to the climatology, resulting in stronger than normal westerly flow.

3.2.2 Eigenvalue Problems in Meteorology

PCA has been used since the beginning of the twentieth century by statisticians such as Pearson (1901) and later Hotelling (1933, 1935). For statistical and more general applications of PCA, the reader is referred, e.g., to the textbooks by Seal (1967), Morrison (1967), Anderson (1984), Chatfield and Collins (1980), Mardia et al. (1979), Krzanowski (2000), Jackson (2003) and Jolliffe (2002) and more references therein. In the atmospheric sciences, it is difficult to trace the exact origin of eigenvalue problems. According to Craddock (1973), the earliest¹ recognisable use of eigenvalue problems in meteorology seems to have been mentioned in Fukuoka (1951). The earliest known and comprehensive developments of eigenvalue analyses and orthogonal expansion in atmospheric science are the works of Obukhov (1947) and Bagrov (1959) from the former USSR and Lorenz (1956) from the US Weather Service. Fukuoka (1951) also mentioned the usefulness of these methods in weather prediction. Obukhov (1947) used the method for smoothing purposes whereas Lorenz (1956), who coined the name empirical orthogonal functions (EOFs), used it for prediction purposes. Because of the relatively large number of variables involved and the low speed/memory of the computers that were available in the mid-1950s, Gilman (1957), for example, had to partition the atmospheric pressure field into slices of the northern hemisphere, so that the data matrix was reduced substantially and allowed an eigenanalysis. Later developments were conducted by various

¹ Wallace (2000) maintains the view that the way Walker (1923) computed the Southern Oscillation bears resemblance to iterative techniques used to compute empirical orthogonal functions.


researchers: Obukhov (1960), Grimmer (1963), Craddock (1965, 1973), Kutzbach (1967), Wallace and Dickinson (1972). The question of mixed variables, for example, was raised by Kutzbach (1967), which has led to the issue of scaling the different variables. From then onwards, and with the advent of powerful computers and the increase of data leading to big data, the domain of eigenanalysis methods in atmospheric science has grown rapidly. The available large amounts of data in weather and climate need to be analysed in an efficient way. Empirical orthogonal functions provide one such tool to deal with such big data. Other extended forms of EOFs have been introduced later. These include, for example, rotated EOFs (Horel 1981; Richman 1981, 1986), complex EOFs (Horel 1984) and extended EOFs (Weare and Nasstrom 1982). Rotated EOFs for instance have been introduced to obtain more geographically compact patterns with more robustness (see e.g. Barnston and Livezey 1987) whereas complex EOFs, for example, are EOFs of complexified fields.

3.3 Computing Principal Components

3.3.1 Basis of Principal Component Analysis

PCA aims to find a new set of variables that explain most of the variance observed in the data.² Figure 3.2 shows the axes that explain most of the variability in the popular three-variable Lorenz model. It has been extensively used in atmospheric research to analyse particularly large scale and low frequency variability. The first seminal work on PCA in atmospheric science goes back to the mid-1950s with Ed. Lorenz. The method, however, has been used before by Obukhov (1947), see e.g. North (1984), and was mentioned later by Fukuoka (1951), see e.g. Craddock (1973). Here and elsewhere we will use both terminologies, i.e. PCA or EOF analysis, interchangeably. Among the very few earliest textbooks on EOFs in atmospheric science, the reader is referred to Preisendorfer and Mobley (1988), and to later textbooks by Thiebaux (1994), Wilks (2011), von Storch and Zwiers (1999), and Jolliffe (2002). The original aim of EOF analysis (Obukhov 1947; Fukuoka 1951; Lorenz 1956) was to achieve a decomposition of a continuous space-time field X(t, s), where t and s denote respectively time and spatial position, as

X(t, s) = \sum_{k≥0} c_k(t) a_k(s)    (3.1)

using an optimal set of orthogonal basis functions of space a_k(s) and expansion functions of time c_k(t). When the field is discretised in space and/or time a similar

² This is based on the main assumption in data analysis, that is, variability represents information.


Fig. 3.2 Empirical orthogonal functions of the Lorenz model attractor

expansion to (3.1) is also sought. For example, if the field is discretised in both time and space the expansion above is finite, and the obtained field can be represented by a data matrix X as in (2.11). In this case the sum extends to the rank r of X. The basis functions a_k(s) and expansion coefficients c_k(t) are obtained by minimising the residual:

R_1 = \int_T \int_S \left[ X(t, s) - \sum_{k=1}^{M} c_k(t) a_k(s) \right]^2 dt ds,    (3.2)

where the integration is performed over the time T and spatial S domains for the continuous case and for a given expansion order M. A similar residual is minimised for the discrete case except that the integrals are replaced by discrete sums.

3.3.2 Karhunen-Loève Expansion

In probability theory expansion (3.1), for a given s (and as such the parameter s is dropped here for simplicity), is known as the Karhunen-Loève expansion associated with a continuous zero-mean³ stochastic process X(t) defined over an interval [a, b], and which is square integrable, i.e. E|X(t)|² < ∞, for all t in the interval [a, b]. Processes having these properties constitute a Hilbert space (Appendix F) with the inner product ⟨X_1(t), X_2(t)⟩ = E(X_1(t) X_2(t)). The covariance function of X(t):

³ If it is non-zero mean it can be centred by removing the stochastic process mean.


γ(s, t) = E(X(t)X(s)) is symmetric and non-negative for a ≤ s, t ≤ b. This covariance function is at the root of expressing the stochastic process X(t), for t in [a, b], in terms of a sum of uncorrelated random variables. Let us consider the space of square integrable (real) functions defined over the interval [a, b], noted L²([a, b]). This functional space is a Hilbert space with the inner product:

⟨f, g⟩ = \int_a^b f(t) g(t) dt.

The linear transformation defined over L²([a, b]):

A f(t) = \int_a^b γ(t, s) f(s) ds

is self-adjoint (Appendix F) because the covariance function is symmetric. The main consequence of this is that the kernel function γ(s, t) can be expanded into an absolutely and uniformly convergent series, i.e.

γ(t, s) = \sum_{k=1}^{∞} λ_k φ_k(t) φ_k(s),

where λ_1, λ_2, ... and φ_1(), φ_2(), ... are respectively the eigenvalues and associated orthonormal eigenfunctions of the Fredholm eigenproblem:

A φ(t) = \int_a^b γ(t, s) φ(s) ds = λ φ(t)

and satisfying ⟨φ_i, φ_j⟩ = δ_{ij}. This result is due to Mercer (1909), see also Basilevsky and Hum (1979), and the covariance function γ(t, s) is known as a Mercer kernel. Accordingly the stochastic process X(t) is then expanded as:

X(t) = \sum_{k=1}^{∞} X_k φ_k(t),

where X_k, k = 1, 2, ..., are zero-mean uncorrelated random variables given by the stochastic integral X_k = \int_a^b X(t) φ_k(t) dt, and hence E(X_i X_j) = δ_{ij} λ_i. The Karhunen-Loève expansion has the following two important properties (Loève 1963, p. 477; Parzen 1963, see also Basilevsky and Hum 1979), namely: (i) it minimises the Shannon entropy -\sum_k λ_k \ln λ_k, and (ii) it minimises the mean square error

\int_a^b | X(t) - \sum_{i=1}^{k} X_i φ_i(t) |^2 dt = 1 - \sum_{i=1}^{k} λ_i

when the first k terms of the expansion are used. Note that when the stochastic process is stationary, i.e. γ(s, t) = γ(s - t), then the previous expansion becomes

γ(s - t) = \sum_{k=1}^{∞} λ_k φ_k(s) φ_k(t)

and the integral operator A becomes a convolution. Now we come back to stochastic process X(t, s). The basis functions ak (s) in Eq. (3.1) are the empirical orthogonal functions (EOFs) and the expansion coefficients ck (t) are the principal components (PCs). For the discrete case, where the data matrix X has dimensions n × p the k’th EOF is a vector ak of length p, whereas the associated PC is a time series fk (t), t = 1, . . . n. Because the continuous case requires a special treatment, it will be discussed in a later chapter, and we focus here on the discrete case. In the literature, EOFs are also known as principal component loadings, or vector of loadings, and the PCs as EOF time series, EOF amplitudes, expansion coefficients and scores. In this and subsequent chapters we will reserve the term EOFs or EOF patterns for the spatial patterns and PCs for the corresponding time series. EOFs and PCs (and their derivatives) are multipurpose. They are used as an exploratory tool to analyse multivariate time series in climate or any other field and identify the dominant modes of variability (Jolliffe 2002; Hannachi et al. 2007). In weather and climate, in particular, they can be used in forecasting, downscaling, regression analysis, dimension reduction and analysing nonlinear features in state space (Tippett et al. 2008; Franzke et al. 2005; Kim and North 1999; Kim et al. 2015; Hannachi et al. 2017; Önskog et al. 2018, 2020).

3.3.3 Derivation of PCs/EOFs

Given the (centred) data matrix (2.11), the objective of EOF/PC analysis is to find the linear combination of all the variables explaining maximum variance, that is to find a unit-length direction a = (a_1, ..., a_p)^T that captures maximum variability. The projection of the data onto the vector a yields the centred time series Xa, whose variance is simply the average of the squares of its elements, i.e. a^T X^T X a / n. The EOFs are therefore obtained as the solution to the quadratic optimisation problem

max F(a) = a^T S a   subject to   a^T a = 1.    (3.3)


Fig. 3.3 Illustration of a pair of EOFs: a simple monopole EOF1 and dipole EOF2

Eq. (3.3) can be solved using a Lagrange multiplier μ to yield

max_a [ a^T S a - μ (1 - a^T a) ],

which is also equivalent to maximising (a^T S a)/(a^T a). The EOFs are therefore obtained as the solution to the eigenvalue problem:

S a = λ² a.    (3.4)

The EOFs are the eigenvectors of the sample covariance matrix S arranged in decreasing order of the eigenvalues. The first eigenvector a_1 gives the first principal component, i.e. the linear function X a_1, with the largest variance; the second EOF a_2 gives the second principal component with the next largest variance subject to being orthogonal to a_1, as illustrated in Fig. 3.3, etc.

Remark In PCA one usually defines the PCs first as linear combinations of the different variables explaining maximum variability, from which EOFs are then derived. Alternatively, one can similarly define EOFs as linear combinations v^T X, where v is a vector of weights, of the different maps of the field that maximise the norm squared. Applying this definition one obtains a similar equation to (3.3), namely:

max ( v^T P v ) / ( v^T v ),    (3.5)

where P = XXT is the matrix of scalar product between the different maps. Equation (3.5) yields automatically the (standardised) principal components. Note that Eq. (3.5) is formulated using a duality argument to Eq. (3.3), and can be useful for numerical purposes when, for example, the sample size is smaller than the number of variables.
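A minimal Matlab sketch of this duality (synthetic sizes and illustrative names, not from the text) shows how EOFs can be recovered from the small n × n matrix P when n < p:

% Sketch: EOFs via the dual problem (3.5) when n < p.
n = 50; p = 2000;
X = randn(n, p); X = X - mean(X, 1);     % centred data matrix
P = X * X';                              % n x n matrix of scalar products
[V, L2] = eig(P, 'vector');              % eigenvectors ~ (standardised) PCs
[L2, idx] = sort(L2, 'descend'); V = V(:, idx);
EOFs = X' * V;                           % recover the spatial patterns
EOFs = EOFs ./ sqrt(sum(EOFs.^2, 1));    % normalise each EOF to unit length
PCs  = X * EOFs;                         % principal component time series

Diagonalising the n × n matrix P rather than the p × p covariance matrix is the main practical benefit of the dual formulation when the sample size is small.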


3.3.4 Computing EOFs and PCs

Singular Value Decomposition and Similar Algorithms

Since the covariance matrix is symmetric by construction, it is diagonalisable and the set of its eigenvectors forms an orthogonal basis of the p-dimensional Euclidean space, defined with the natural scalar product. This is a classical result in linear algebra, which is summarised by the following decomposition of the covariance matrix:

S = U Λ² U^T,    (3.6)

where U is a p × p orthogonal⁴ matrix, i.e. U^T U = U U^T = I, and Λ² is a diagonal matrix, i.e. Λ² = Diag(λ_1², ..., λ_p²), containing the eigenvalues⁵ of S. The EOFs u_1, u_2, ..., u_p are therefore the columns of U. It is clear that if p < n, then there are at most p positive eigenvalues whereas if n < p there are at most n - 1 positive eigenvalues.⁶ To be more precise, if r is the rank of S, then there are exactly r positive eigenvalues. To be consistent with the previous maximisation problem, the eigenvalues are sorted in decreasing order, i.e. λ_1² ≥ λ_2² ≥ ... ≥ λ_p², so that the first EOF yields the time series with maximum variance, the second one with the next largest variance, etc. The solution of the above eigenvalue problem, Eq. (3.6), can be obtained using either direct methods such as the singular value decomposition (SVD) algorithm or iterative methods based on Krylov subspace solvers using Lanczos or Arnoldi algorithms as detailed in Appendix D, see also Golub and van Loan (1996) for further methods and more details. The Krylov subspace method is particularly efficient for large and/or sparse systems. In the Matlab programming environment, let X(n, p1, p2) designate the two-dimensional (p1 × p2) gridded (e.g. lat-lon) field, where n is the sample size. The field is often assumed to be anomalies (though not required). The following simple code computes the leading 10 EOFs, PCs and the associated covariance matrix eigenvalues:

>> [n p1 p2] = size(X); p12 = p1*p2;
>> X = reshape(X, n, p12);              % n x p12 data matrix
>> if (n > p12) X = X'; end             % work on the smaller dimension first
>> [PCs S EOFs] = svds(X, 10);          % 10 largest singular triplets (default)
>> if (n > p12) A = PCs; PCs = EOFs; EOFs = A; end   % swap roles after transposing
>> S = diag(diag(S).*diag(S))/n;        % eigenvalues of the covariance matrix
>> EOFs = reshape(EOFs, p1, p2, 10);

⁴ This is different from a normal matrix U, which commutes with its transpose, i.e. U^T U = U U^T.
⁵ We use squared values because S is semi-definite positive, and also to be consistent with the SVD of S.
⁶ Why n - 1 and not n?


Note also that Matlab has a routine pca, which does a PCA analysis of the data matrix X(n, p12) (see also Appendix H for resources):

>> [EOFs PCs S] = pca(X);

The proportion of variance explained by the kth principal component is usually given by the ratio:

100 λ_k² / \sum_{j=1}^{r} λ_j² %,    (3.7)

which is often expressed in percentage. An example of spectrum is displayed in Fig. 3.4, which shows the percentage of explained variance of the winter months Dec-Jan-Feb (DJF) sea level pressure. The vertical bars show the approximate 95% confidence limits discussed in the next section. The PCs are obtained by projecting the data onto the EOFs, i.e.:

C = X U,    (3.8)

so the k'th PC c_k = (c_k(1), c_k(2), ..., c_k(n)) is simply X u_k, whose elements are

c_k(t) = \sum_{j=1}^{p} x_{tj} u_{jk}

Fig. 3.4 Percentage of explained variance of the leading 40 EOFs of winter months (DJF) NCAR/NCEP sea level pressure anomalies for the period Jan 1940–Dec 2000. The vertical bars provide approximate 95% confidence interval of the explained variance. Adapted from Hannachi et al. (2007)


for t = 1, ..., n, and where u_{jk} is the j'th element of the kth EOF u_k. It is clear from (3.8) that the PCs are uncorrelated and that

cov(c_k, c_l) = \frac{1}{n} λ_k² δ_{kl}.    (3.9)

Exercise Derive Eq. (3.9).

Note that when using (3.5) instead we get automatically uncorrelated PCs, and then a similar relationship to (3.9) can be derived for the EOFs. There are various algorithms to obtain the eigenvalues and eigenvectors of S, see e.g. Jolliffe (2002). The most efficient and widely used algorithm is based on the SVD theorem (2.25), which, when applied to the data matrix X, yields

\frac{1}{\sqrt{n}} X = V Λ U^T,    (3.10)

where Λ = Diag(λ_1, λ_2, ..., λ_r), and λ_1 ≥ λ_2 ≥ ... ≥ λ_r ≥ 0 are the singular values of X. Note that the term 1/\sqrt{n} in (3.10) is used for consistency with (3.6), but the term is absorbed by the singular values and most of the time it does not appear. The SVD algorithm is a standard computing routine provided in most software, e.g. MATLAB (Linz and Wang 2003), and does not require computing the covariance matrix. To be efficient the SVD is applied to X if n < p, otherwise X^T is used instead. Using (3.8) the PCs are given by

\frac{1}{\sqrt{n}} C = V Λ,    (3.11)

hence the columns of V are the standardised, i.e. unit variance principal components. One concludes therefore that the EOFs and standardised PCs are respectively the right and left singular vectors of X. Figure 3.5 shows the leading two EOFs of DJF SLP anomalies (with respect to the mean seasonal cycle) from NCAR/NCEP. They explain respectively 21% and 13% of the total winter variability of the SLP anomalies, see also Fig. 3.4. Note, in particular, that the leading EOF reflects the Arctic Oscillation mode (Wallace and Thompson 2002), and shows the North Atlantic Oscillation over the North Atlantic sector. This is one of the many features of EOFs, namely mixing and is discussed below. The corresponding two leading PCs are shown in Fig. 3.6. The SVD algorithm is reliable, as pointed out by Toumazou and Cretaux (2001), and the computation of the singular values is governed by the condition number of the data matrix. Another strategy is to apply the QR algorithm (see Appendix D) to the symmetric matrix XT X or XXT , depending on the smallest dimension of X. The algorithm, however, can be unstable as the previous symmetric matrix has a larger condition number compared to that of the data matrix. In this regard, Toumazou and Cretaux (2001) suggest an algorithm based on a Lanczos eigensolver technique.


Fig. 3.5 Leading two empirical orthogonal functions of the winter (DJF) monthly sea level pressure anomalies for the period Jan 1940–Dec 2000 (a) EOF1 (21%). (b) EOF2 (13%). Adapted from Hannachi et al. (2007)

The method is based on using a Krylov subspace (see Appendix D), and reduces to computing some eigen-elements of a small symmetric matrix.

Basic Iterative Approaches

Besides the SVD algorithm, iterative methods have also been proposed to compute EOFs (e.g. van den Dool 2011). The main advantage of these methods is that they avoid computing the covariance matrix, which may be prohibitive for high resolution and large datasets, or even dealing directly with the data matrix as is the case with


Fig. 3.6 Leading two principal components of the winter (DJF) monthly sea level pressure anomalies for the period Jan 1940–Dec 2000 (a) DJF sea level pressure PC1. (b) DJF sea level pressure PC2. Adapted from Hannachi et al. (2007)

SVD. The iterative approach makes use of the identities linking EOFs and PCs. An EOF E_m(s) and corresponding PC c_m(t) of a field X(t, s) satisfy E_m(s) = \sum_t c_m(t) X(t, s), and similarly for c_m(t). The method then starts with an initial guess of a time series, say c^{(0)}(t), scaled to unit variance, and obtains the associated pattern E^{(0)}(s) following the previous identity. This pattern is then used to compute an updated time series c^{(1)}(t), etc. As pointed out by van den Dool (2011), the process normally converges quickly to the leading EOF/PC. The process is then continued with the residuals, after removing the contribution of the leading mode, to get the subsequent modes of variability. The iterative method can be combined with spatial weighting to account for the grid (e.g. Gaussian grid in spherical geometry) and subgrid processes, and to maximise the signal-to-noise ratio of EOFs (Baldwin et al. 2009).
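A minimal Matlab sketch of this iterative procedure (illustrative names; X is assumed to be an n × p centred data matrix) could look as follows:

% Sketch of the iterative approach: alternate between pattern and time series.
c = randn(n, 1); c = c/std(c);        % initial guess time series, unit variance
for iter = 1:200
    E = X' * c;                       % pattern associated with the current c
    c = X * E; c = c/std(c);          % updated time series, rescaled
end
E = E/norm(E);                        % leading EOF (unit length)
pc1 = X * E;                          % corresponding leading PC
Xres = X - pc1 * E';                  % residuals, used to obtain the next mode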


3.4 Sampling, Properties and Interpretation of EOFs

3.4.1 Sampling Variability and Uncertainty

There are various ways to estimate or quantify uncertainties associated with the EOFs and corresponding eigenvalues. These uncertainties can be derived based on asymptotic approximation. Alternatively, the EOFs and associated eigenvalues can be obtained using a probabilistic framework where uncertainties are comprehensively modelled. The Monte-Carlo method is another approach, which can be used, but can be computationally expensive. Cross-validation and bootstrap are examples of Monte-Carlo methods, which are discussed below. Other Monte-Carlo techniques also exist, such as surrogate data, which will be commented on below.

Uncertainty Based on Asymptotic Approximation

Since the data matrix is subject to sampling uncertainty, so are the eigenvalues and corresponding EOFs of the covariance matrix. Because of the existence of correlations among the different variables, the hypothesis of independence is simply not valid. One would expect, for example, that the eigenvalues are not entirely separated since each eigenvalue will have a whole uncertainty interval. So what we estimate is an interval rather than a single point value. Using asymptotic arguments based on the central limit theorem (CLT) in the limit of large samples (see e.g. Anderson 1963), it can be shown (Girshik 1939; Lawley 1956; Mardia et al. 1979; North et al. 1982) that if \hat{λ}_1², ..., \hat{λ}_p² denote the eigenvalues estimated from the sample covariance matrix S, obtained from a sample of size n, then we have the approximation:

\hat{λ}_k² ~ N( λ_k², \frac{2}{n} λ_k⁴ ),    (3.12)

where N(μ, σ²) stands for the normal distribution with mean μ and variance σ² (see Appendix B), and λ_k², k = 1, ..., p, are the eigenvalues of the underlying population covariance matrix. The standard error of the estimated eigenvalue \hat{λ}_k² is then

δ\hat{λ}_k² ≈ λ_k² \sqrt{2/n}.    (3.13)


For a given significance level α, the interval \hat{λ}_k² ± δ\hat{λ}_k² z_{1-α/2}, where the notation z_a refers to the a'th quantile,⁷ provides therefore the asymptotic 100(1 - α)% confidence limits of the population eigenvalue λ_k², k = 1, 2, ..., p. For example, the 95% confidence interval is [\hat{λ}_k² - 1.96 δ\hat{λ}_k², \hat{λ}_k² + 1.96 δ\hat{λ}_k²]. Figure 3.4 displays these limits for the winter sea level pressure anomalies. A similar approximation can also be derived for the eigenvectors u_k, k = 1, ..., p:

δu_k ≈ \frac{δ\hat{λ}_k²}{λ_j² - λ_k²} u_j,    (3.14)

where λ_j² is the closest eigenvalue to λ_k². Note that in practice the number n used in the previous approximation corresponds to the size of independent data, also known as the effective sample size, see below.
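A minimal Matlab sketch of these approximate confidence limits (assumed variable names; the effective sample size is discussed below) is:

% Sketch: approximate 95% confidence limits of the covariance eigenvalues,
% following Eqs. (3.12)-(3.13); X is an n x p (centred) data matrix.
lam2 = sort(eig(cov(X)), 'descend');           % sample eigenvalues
neff = 200;                                    % effective sample size (see below)
dlam = lam2 * sqrt(2/neff);                    % standard error, Eq. (3.13)
ci95 = [lam2 - 1.96*dlam, lam2 + 1.96*dlam];   % approximate 95% limits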

Probabilistic PCA

A more comprehensive method to obtain EOFs and explained variances along with their uncertainties is to use a probability-based method. This has been done by Tipping and Bishop (1999), see also Goodfellow et al. (2016). In this case EOFs are computed via maximum likelihood. The model is based on a latent variable as in the factor analysis discussed in Chap. 10. Note, however, that the method relies on a multinormality assumption. More discussion of the method and its connection to factor analysis is given in Sect. 10.6 of Chap. 10.

Monte-Carlo Resampling Methods

The asymptotic uncertainty method discussed above relies on quite large sample sizes. In practice, however, this assumption may not be satisfied. An attractive and easy to use alternative is the Monte-Carlo resampling method, which has become an invaluable tool in modern statistics (James et al. 2013; Goodfellow et al. 2016). The method involves repeatedly drawing subsamples from the training set at hand, refitting the model to each of these subsamples, and obtaining thereafter an ensemble of realisations of the parameters of interest, enabling the computation of uncertainties on those parameters. Cross-validation and bootstrap are the most commonly used resampling methods. The bootstrap method goes back to the late 1970s with Efron (1979). The textbooks by Efron and Tibshirani (1994), and also Young and Smith (2005), provide a detailed account of Monte-Carlo methods and their application. A summary of cross-validation and bootstrap methods is given below, and for deeper

⁷ z_a = Φ^{-1}(a), where Φ() is the cumulative distribution function of the standard normal distribution (Appendix B).


analysis the reader is referred to the more recent textbooks of James et al. (2013), Goodfellow et al. (2016), Brownlee (2018) and Watt et al. (2020).

Cross-Validation

Cross-validation is a measure of performance, and involves splitting the available (training) data sample (assumed to have a sample size n) into two subsets, one used for training (or model fitting) and the other for validation. That is, the fitted model (on the training set) is used to get responses via validation with the second sample, enabling hence the computation of the test error rate. In this way, cross-validation (CV) can be used to get the test error, and yields a measure of model performance and model selection, see, e.g., Sect. 14.5 of Chap. 14 for an application to parameter identification. There are basically two types of cross-validation methods, namely, the leave-one-out CV and the k-fold CV. The former deals with leaving one observation out (validation set), fitting the statistical model on the remaining n - 1 observations (training set), and computing the test error at the left-out observation. This error is simply measured by the mean square error (test error) between the observation and the corresponding value given by the fitted model. This procedure is then repeated with every observation, and the leave-one-out CV is estimated by the average of the obtained n test errors. In the k-fold CV, the dataset is first divided randomly into k subsamples of similar sizes. The model is then trained on k - 1 subsamples and validated on the remaining subsample, yielding one test error. The procedure is then repeated with each subsample, and the final k-fold CV is obtained as the average of the obtained k test errors. The leave-one-out approach is obviously a special case of the k-fold CV (with k = n), and the k-fold CV is advantageous over the leave-one-out approach. For example, k-fold CV gives more accurate estimates of the test errors, a result that is related to the bias-variance trade-off. James et al. (2013) suggest the empirical values k = 5 or k = 10.

The Bootstrap

The bootstrap method is a powerful statistical tool used to estimate uncertainties on a given statistical estimator from a given dataset. The most common use of bootstrap is to provide a measure of accuracy of the parameter estimate of interest. In this context, the method is used to estimate summary statistics of the parameter of interest, but can also yield an approximate distribution of the parameter. The bootstrap involves constructing a random subsample from the dataset, which is used to construct an estimate of the parameter of interest. This procedure is then repeated a large number of times, yielding hence an ensemble of estimates of the parameter of interest. Each sample used in the bootstrap is constructed from the dataset by drawing observations, one at a time, and returning the drawn sample to the dataset, until the required size is reached. This procedure is known as sampling with replacement and enables observations to appear possibly more than once in a bootstrap sample. In the end, the obtained ensemble of estimates of the parameter of interest is used to compute the statistics, e.g. mean and variance, etc., and quantify the uncertainty on the parameter (James et al. 2013). In summary, a bootstrap sample, with a chosen size, is obtained by drawing observations, one at a time, from


the pool of observations of the training dataset. In practice, the number of bootstrap samples should be large enough, of the order of O(1000). Also, for reasonably large data, the bootstrap sample size can be of the order of 50-80% of the size of the dataset. The algorithm of resampling with replacement goes as follows:

(1) Select the number of bootstrap samples, and a sample size for these samples.
(2) Draw the bootstrap sample.
(3) Compute the parameter of interest.
(4) Go to (2) until the number of bootstrap samples is reached.

The application of the above algorithm yields a distribution, e.g. histogram, of the parameter.

Remarks on Surrogate Data Method

The class of Monte-Carlo methods is quite wide and includes other methods than CV and bootstrap resampling. One particularly powerful method used in time series is that of surrogate data. The method of surrogate data (Theiler et al. 1992) involves generating surrogate datasets, which share some characteristics with the original time series. The method is mostly used in nonlinear and chaotic time series analysis to test linearity null hypotheses (e.g. autoregressive moving-average ARMA processes) versus nonlinearity. The most common algorithm for surrogate data is phase randomisation and amplitude adjusted Fourier transform (Theiler et al. 1992). Basically, the original time series is Fourier transformed, the amplitudes of this transform are then used with new uniformly distributed random phases, and finally an inverse Fourier transform is applied to get the surrogate sample. For real time series, the phases are constrained to be antisymmetric. By construction, these surrogates preserve the linear structure of the original time series (e.g. autocorrelation function and power spectrum). Various improvements and extensions have been proposed in the literature (e.g. Breakspear et al. 2003, Lucio et al. 2012). The surrogate data method has also been applied in atmospheric science and oceanography. For example, Osborne et al. (1986) applied it to identify signatures of chaotic behaviour in the Pacific Ocean dynamics.
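A minimal Matlab sketch of a phase-randomised surrogate (assuming a real time series x of even length n; illustrative names) is:

% Sketch: phase-randomised Fourier-transform surrogate of a real series x.
n  = numel(x);                        % assumed even for simplicity
Xf = fft(x(:));
ph = 2*pi*rand(n/2 - 1, 1);           % random phases for positive frequencies
Xs = Xf;
Xs(2:n/2) = abs(Xf(2:n/2)) .* exp(1i*ph);   % keep amplitudes, randomise phases
Xs(n/2+2:n) = conj(Xs(n/2:-1:2));           % conjugate symmetry (real output)
xs = real(ifft(Xs));                  % surrogate with the same power spectrum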

Bootstrap Application to EOFs of Atmospheric Fields

The bootstrap method can easily be applied to obtain uncertainties on the EOFs of a given atmospheric field, as shown in the following algorithm:

(1) Select the number of bootstrap samples, and the sample size of these samples.
(2) For each drawn bootstrap sample:
    (2.1) Compute the EOFs (e.g. via SVD) and associated explained variance.
    (2.2) Rank the explained variances (and associated EOFs) in decreasing order.
(3) Calculate the mean, variance (and possibly histograms, etc.) of each explained variance and associated EOFs (at each grid point).
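A minimal Matlab sketch of this bootstrap (illustrative names; X is an n × p anomaly data matrix, nb the number of bootstrap samples) is:

% Sketch: bootstrap uncertainty of the leading EOF and explained variances.
nb = 1000; nsub = round(0.7*n); neof = 5;
[~, ~, Uref] = svds(X - mean(X,1), 1);        % reference EOF1 (fixes the sign)
expvar = zeros(nb, neof); eof1 = zeros(nb, p);
for b = 1:nb
    idx = randi(n, nsub, 1);                  % draw time steps with replacement
    Xb  = X(idx, :); Xb = Xb - mean(Xb, 1);   % re-centre the bootstrap sample
    [~, S, V] = svds(Xb, neof);               % leading EOFs of the sample
    lam = diag(S).^2;
    expvar(b, :) = 100 * lam' / sum(Xb(:).^2);     % explained variances (%)
    if V(:,1)' * Uref < 0, V(:,1) = -V(:,1); end   % remove arbitrary sign flips
    eof1(b, :) = V(:,1)';
end
% The spread of expvar and of eof1 (at each grid point) gives the uncertainties.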


The application of this algorithm yields an ensemble of EOFs and associated eigenvalues, which can be used to quantify the required uncertainties.

Remarks
1. It is also possible to apply the bootstrap without replacement. This can affect probabilities, but experience shows that the difference with the bootstrap with replacement is in general not large. In a number of applications, other, non-standard, sampling methods have also been used. An example would be to choose a subset of variables, scramble them by breaking the chronological order, then apply EOFs, and so on.
2. Another test, also used in atmospheric science and oceanography, consists of generating random samples, e.g. red noise data (see Appendix C) having the same spectral (or autocorrelation) characteristics as the original data, then computing the eigenvalues and the eigenvectors from the various samples to obtain an uncertainty estimate for the covariance matrix spectra.

The Monte-Carlo bootstrap method is commonly used in atmospheric science (e.g. von Storch and Zwiers 1999), and has been applied to study nonlinear flow regimes in atmospheric low frequency variability (e.g. Hannachi 2010), and climate change effects on teleconnections (e.g. Wang et al. 2014). For example, Wang et al. (2014) estimated uncertainties in the NAO and applied it to the winter sea level pressure from the twentieth century reanalyses. They used the bootstrap sampling to obtain spatial patterns of NAO uncertainties. The methodology is based on computing the standard deviation of the leading EOF of the sampled covariance matrix. Wang et al. (2014) used a slightly modified version of the bootstrap method. To replicate the correlation structure, they sampled blocks of data instead of individual observations. This is common practice in atmospheric science because of the non-negligible correlation structure in weather and climate data. In their analysis, Wang et al. (2014) used non-overlapping blocks of 20-yr winter time monthly SLP anomalies and 2- and 4-month bootstrap blocks. Their results indicate that the largest uncertainties are located between the centres of action of the NAO, particularly in the first half of the record. Figure 3.7 shows the evolution of the longitude of the northern (Fig. 3.7a) and southern (Fig. 3.7b) nodes of the NAO over 20-yr running windows for the period 1871-2008. There is a clear zonal shift of the NAO nodes across the record. For example, the poleward centre of action of the NAO shows an eastward shift during the last quarter of the record. Furthermore, the southern node shows larger uncertainties compared to the northern node.


Fig. 3.7 Evolution of the frequency distribution of the north (a) and south (b) centres of action of the NAO pattern computed over 20-yr running windows. The yellow line corresponds to the longitude of the original sample. Adapted from Wang et al. (2014). ©American Meteorological Society. Used with permission

3.4.2 Independent and Effective Sample Sizes

Serial Correlation

Given a univariate time series x_t, t = 1, ..., n, with serial correlation, the number of degrees of freedom (d.o.f.) n* is the independent sample size of the time series. Although it is not very easy to define exactly the effective sample size from a given sample, one can use approximations based on probabilistic models. For example, using an AR(1) process with autocorrelation ρ(.), Leith (1973) suggested

n* = \frac{n}{2 T_0} = -\frac{n}{2} \log(ρ(1)),    (3.15)

where T_0 is the e-folding time of ρ(.). The idea behind Leith (1973) is that if x_t, t = 1, 2, ..., n, is a realisation of independent and identically distributed (IID) random variables X_1, ..., X_n with variance σ², then the mean \bar{x} = n^{-1} \sum_{t=1}^{n} x_t has variance σ_{\bar{x}}² = n^{-1} σ². Now, consider a continuous time series x(t) defined for all


values of t, with lagged autocovariance γ(τ) = E[x(t) x(t + τ)] = σ² ρ(τ). An estimate of the mean \bar{x}_t for a finite time interval [0, T] is

\bar{x}_t = \frac{1}{T} \int_{t-T/2}^{t+T/2} x(u) du.    (3.16)

The variance of (3.16) can easily be derived and (3.15) can be recovered from a red noise.

Exercise
1. Compute the variance σ_T² of (3.16).
2. Derive σ_T² for a red noise and show that σ²/σ_T² = T/(2 T_0).

Hint
1. From (3.16) we have

T² σ_T² = E[ \int_{t-T/2}^{t+T/2} \int_{t-T/2}^{t+T/2} x(s_1) x(s_2) ds_1 ds_2 ]
        = E[ \int_{-T/2}^{T/2} \int_{-T/2}^{T/2} x(t+s_1) x(t+s_2) ds_1 ds_2 ]
        = σ² \int_{-T/2}^{T/2} \int_{-T/2}^{T/2} ρ(s_2 - s_1) ds_1 ds_2.

This integral can be computed by a change of variables u = s_1 and v = s_2 - s_1, which transforms the square [-T/2, T/2]² into a parallelogram R (see Fig. 3.8), i.e.

\frac{T²}{σ²} σ_T² = \int\int_R ρ(v) du dv
                  = \int_{-T}^{0} dv \int_{-v-T/2}^{T/2} ρ(v) du + \int_{0}^{T} dv \int_{-T/2}^{-v+T/2} ρ(v) du
                  = 2 \int_{0}^{T} (T - v) ρ(v) dv = 2T \int_{0}^{T} (1 - \frac{v}{T}) ρ(v) dv.

Hence σ_T²/σ² = \frac{2}{T} \int_{0}^{T} (1 - \frac{v}{T}) ρ(v) dv.

Remark Note that for a red noise or (discrete) AR(1) process, x_t = φ_1 x_{t-1} + ε_t, ρ(τ) = e^{-|τ|/T_0} (= φ_1^{|τ|}), see Appendix C, and the e-folding time T_0 is given by the integral \int_0^{∞} ρ(τ) dτ. In the above formulation, the time interval was assumed to be unity. If the time series is sampled every Δt then, as T_0 = -Δt / \log ρ(Δt), one gets n* = -\frac{n}{2} \log ρ(Δt).


Fig. 3.8 Domain change

Fig. 3.9 Effective sample size n* vs ρ(1) for Eq. (3.17) (continuous) and Eq. (3.18) (dashed)

Jones (1975) suggested an effective sample size of order-1, which, for a red noise, boils down to

n* = n \frac{1 - ρ(1)}{1 + ρ(1)},    (3.17)

while Kikkawa and Ishida (1988) suggested

n* = n \frac{1 - ρ(1)²}{1 + ρ(1)²},    (3.18)

which can be twice as large compared to (3.17) (Fig. 3.9).
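A minimal Matlab sketch of these estimates (assuming a time series stored in a column vector x) is:

% Sketch: effective sample size of a serially correlated series from its
% lag-1 autocorrelation, following Eqs. (3.15), (3.17) and (3.18).
x = x - mean(x); n = numel(x);
r1 = sum(x(1:n-1).*x(2:n)) / sum(x.^2);        % lag-1 autocorrelation
n_leith   = -0.5*n*log(r1);                    % Eq. (3.15), red noise assumption
n_jones   = n*(1 - r1)/(1 + r1);               % Eq. (3.17)
n_kikkawa = n*(1 - r1^2)/(1 + r1^2);           % Eq. (3.18)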


Time Varying Fields

For time varying fields, or multivariate time series, with N grid points or variables x(t) = (x_1(t), ..., x_N(t))^T observed over some finite time interval, Bretherton et al. (1999) discuss two measures of effective numbers of spatial d.o.f., or number of independently varying spatial patterns. For example, for isotropic turbulence a similar equation to (3.15) was given by Taylor (1921) and Keller (1935). Using the "moment matching" (mm) method of Bagrov (1969), derived from a χ² distribution, an estimate of the effective number of d.o.f. can be derived, namely

N*_{mm} = 2 \bar{E}^2 / \overline{E'^2},    (3.19)

where the overbar denotes a time mean, E is a quadratic measure of the field, e.g. the quadratic norm of x(t), E = x(t)^T x(t), and E' = E - \bar{E} its anomaly. An alternative way was also proposed by Bagrov (1969) and TerMegreditchian (1969) based on the covariance matrix of the field. This estimate, which is also discussed in Bretherton et al. (1999), takes the form

N*_{eff} = \left( \sum_{k=1}^{N} λ_k \right)^2 / \sum_{i=1}^{N} λ_i^2 = tr(Σ)² / tr(Σ²),    (3.20)

where λ_k, k = 1, ..., N, are the eigenvalues of the covariance matrix Σ of the field x(t). Bretherton et al. (1999) investigated the relationship between N*_{eff} and N*_{mm} in connection to non-Gaussianity, yielding

N*_{eff} = \frac{κ - 1}{2} N*_{mm},

where κ is the kurtosis, assumed to be the same for all PCs. This shows, in particular, that the two values can be quite different.
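A minimal Matlab sketch of the covariance-based estimate (3.20) (X assumed to be an n × N centred data matrix, time by grid points) is:

% Sketch: effective number of spatial degrees of freedom, Eq. (3.20).
lam  = eig(cov(X));                   % eigenvalues of the covariance matrix
Neff = sum(lam)^2 / sum(lam.^2);      % tr(Sigma)^2 / tr(Sigma^2)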

3.4.3 Dimension Reduction

Since the leading order EOFs explain more variance than the lowest order ones, one would then be tempted to focus on the few leading ones and discard the rest as being noise variability. This is better assessed by the percentage of the explained variance by the first, say, m retained EOFs:

γ_m = \frac{\sum_{k=1}^{m} λ_k^2}{\sum_{k=1}^{p} λ_k^2} = \frac{\sum_{k=1}^{m} var(X u_k)}{tr(Σ)}.    (3.21)


In this way one can choose a pre-specified percentage of explained variance, e.g. 70%, then keep the first m EOFs and PCs that explain altogether this amount.

Remark Although this seems a reasonable way to truncate the spectrum of the covariance matrix, the choice of the amount of explained variance remains, however, arbitrary.

We have seen in Chap. 1 two different types of transformations: scaling and sphering. The principal component transformation, obtained by keeping a subset of EOFs/PCs, is yet another transformation that can be used in this context to reduce the dimension of the data. The transformation is given by Y = XU. To keep the leading EOFs/PCs that explain altogether a specific amount of variability, say β, one uses

100 \frac{\sum_{k=1}^{m} λ_k^2}{tr(Σ)} ≥ β   and   100 \frac{\sum_{k=1}^{m-1} λ_k^2}{tr(Σ)} < β.

The reduced data is then given by Ym = [y1 , y2 , . . . , ym ] = XUm ,

(3.22)

where Um = [u1 , u2 , . . . , um ] is the matrix of the leading m EOFs. Remark If some of the original variables are linearly dependent the data matrix cannot be of full rank, which is min(n, p). In this case the covariance matrix is not invertible, and will have zero eigenvalues. If p0 is the number of zero eigenvalues, then min(n, p) − p0 is the dimension of the space containing the observations. NB As it was mentioned earlier, the EOFs are also known as loadings, and the loading coefficients are the elements of the EOFs.
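A minimal Matlab sketch of this truncation (assumed names; X is an n × p centred data matrix) is:

% Sketch: keep the smallest number m of EOFs explaining beta % of variance,
% then form the reduced data of Eq. (3.22).
beta = 70;
[~, S, U] = svd(X, 'econ');                   % third output: EOFs
lam2   = diag(S).^2;                          % eigenvalue spectrum
cumvar = 100*cumsum(lam2)/sum(lam2);          % cumulative explained variance (%)
m  = find(cumvar >= beta, 1, 'first');        % smallest m reaching beta
Ym = X * U(:, 1:m);                           % reduced data Y_m = X U_m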

3.4.4 Properties and Interpretation The main characteristic features of EOF analysis is the orthogonality of EOFs and uncorrelation of PCs. These are nice geometric properties that can be very useful in modelling studies using PCs. For example, the covariance matrix of any subset of retained PCs is always diagonal. These constraints, however, yield partially predictable relationships between an EOF and the previous ones. For instance, as pointed out by Horel (1981), if the first EOF has a constant sign over its domain, then the second one will generally have both signs with the zero line going through

3.4 Sampling, Properties and Interpretation of EOFs

55

the maxima of the first EOF (Fig. 3.3 ). The orthogonality constraint also makes the EOFs domain-dependent and can be too non-local (Horel 1981; Richman 1986). Perhaps one of the main properties of EOFs is mixing. Assume, for example, that our signal is a linear superposition of signals, not necessarily uncorrelated, then EOF analysis tends to mix these signals in order to achieve optimality (i.e. maximum variance) yielding patterns that are mixture of the original signals. This is known as the mixing problem in EOFs. This problem can be particularly serious when the data contain multiple signals with comparable explained variance (e.g. Aires et al. 2002; Kim and Wu 1999). Figure 3.10 shows the leading EOF of the monthly sea surface temperature (SST) anomalies over the region 45.5◦ S–45.5◦ N. The anomalies are computed with respect to the monthly mean seasonal cycle. The data are on a 1◦ × 1◦ latitude-longitude grid and come from the Hadley Centre Sea Ice and Sea Surface Temperature8 spanning the period Jan 1870–Dec 2014 (Rayner et al. 2003). The EOF shows a clear signal of El-Niño in the eastern equatorial Pacific. In addition we also see anomalies located on the western boundaries of the continents related to the western boundary currents. These are discussed in more detail in Chap. 16 (Sect. 16.9). Problems related to mixing are conventionally addressed using, e.g. EOF rotation (Chap. 4), independent component analysis (Chap. 12) and also archetypal analysis (see Chap. 16). Furthermore, although the truncated EOFs may explain a substantial amount of variance, there is always the possibility that some physical modes may not be represented by these EOFs. EOF analysis may lead therefore to an underestimation of the complexity of the system (Dommenget and Latif 2002). Consequently, these constraints can cause limitations to any possible physical interpretation of the obtained patterns (Ambaum et al. 2001; Dommenget and Latif 2002; Jolliffe


Fig. 3.10 Leading EOF of SST anomalies equatorward of 45◦

8 www.metoffice.gov.uk/hadobs/hadisst.



2003) because physical modes are not necessarily orthogonal. Normal modes derived for example from linearised dynamical/physical models, such as barotropic models (Simmons et al. 1983), are not orthogonal since physical processes are not uncorrelated. The Arctic Oscillation/North Atlantic Oscillation (AO/NAO) EOF debate is yet another example that is not resolved using (hemispheric) EOFs (Wallace 2000; Ambaum et al. 2001; Wallace and Thompson 2002). Part of the difficulty in interpretation may also be due to the fact that, although uncorrelated, the PCs are not independent, and this is particularly the case when the data are not Gaussian, in which case other approaches exist and will be presented in later chapters. It is extremely difficult and perhaps not possible to get, using techniques based solely on purely mathematical/statistical concepts, physical modes without prior knowledge of their structures (Dommenget and Latif 2002) or other dynamical constraints. For example, Jolliffe (2002, personal communication) points out that in general EOFs are unsuccessful in capturing modes of variability, in the case where the number of variables is larger than the number of modes, unless the latter are orthogonally related to the former. In this context we read the following quotation⁹ (Everitt and Dunn, 2001, p. 305; also quoted in Jolliffe 2002, personal communication): "Scientific theories describe the properties of observed variables in terms of abstraction which summarise and make coherent the properties of observed variables. Latent variables (modes), are, in fact one of this class of abstract statements and the justification for the use of these variables (modes) lies not in an appeal to their "reality" or otherwise but rather to the fact that these variables (modes) serve to synthesise and summarise the properties of the observed variables". One possible way to evaluate EOFs is to compare them with a first-order spatial autoregressive process (e.g. Cahalan et al. 1996), or more generally using a homogeneous diffusion process (Dommenget 2007; Hannachi and Dommenget 2009). The simplest homogeneous diffusion process is given by

\frac{d}{dt} u = -λ u + ν ∇² u + f    (3.23)

and is used as a null hypothesis to evaluate the modes of variability of the data. The above process represents an extension of the simple spatial first-order autoregressive model. In Eq. (3.23) λ and ν represent, respectively, damping and diffusion parameters, and f is a spatial and temporal white noise process. Figure 3.11 shows the leading EOF of SST anomalies along with its PC and the time series of the data at a point located in the south western part of the Indian Ocean. The data span the period 1870-2005. Figure 3.12 compares the data covariance matrix spectrum with that of a fitted homogeneous diffusion process and suggests consistency with the null hypothesis,

⁹ Attributed to D.M. Fergusson and L. J. Harwood.



Fig. 3.11 Leading EOF of the Indian Ocean SST anomalies (top), the associated PC (middle) and the SST anomaly time series at the centre of the domain (0.5◦ S, 56.5◦ E) (bottom). Adapted from Hannachi and Dommenget (2009)



Fig. 3.12 Spectrum of the covariance matrix of the Indian Ocean SST anomalies, with the approximate 95% confidence limits, along with the spectrum of the fitted homogeneous diffusion process following (3.17). Adapted from Hannachi and Dommenget (2009)

particularly for the leading few modes of variability. The issue here is the existence of a secular trend, which invalidates the test. For example, Fig. 3.13 shows the time series distribution of the SST anomalies averaged over the Indian Ocean, which shows significant departure from normality. This departure is ubiquitous in the basin as illustrated in Fig. 3.14. Hannachi and Dommenget (2009) applied a differencing operator to the data to remove the trend. Figure 3.15 shows the spectrum of the differenced fall SST anomalies compared to that of a similar diffusion process. The leading EOF of the differenced data (Fig. 3.16), reflecting the Indian Ocean dipole, can be interpreted as an intrinsic mode of variability. Another geometric interpretation of EOFs is possible with multinormal data. In fact, if the underlying probabilistic law generating the data matrix is the multivariate Gaussian, or multinormal, i.e. the probability density function of the vector x is

f(x) = \frac{1}{(2π)^{p/2} |Σ|^{1/2}} \exp\left[ -\frac{1}{2} (x - μ)^T Σ^{-1} (x - μ) \right],    (3.24)

where μ and Σ are respectively the mean and the covariance matrix of x, and |Σ| is the determinant of Σ, then the interpretation of the EOFs is straightforward. Indeed, in this case the EOFs represent the principal axes of the ellipsoid of the distribution.



Fig. 3.13 Time series of the SST anomalies averaged over the box (0–4◦ S, 62–66◦ E) (a), its histogram (b) and its quantile-quantile (c). Adapted from Hannachi and Dommenget (2009)

Fig. 3.14 Grid points where the detrended SST anomalies over the Indian Ocean are non-Gaussian, based on a Lilliefors test at the 5% significance level. Adapted from Hannachi and Dommenget (2009)



Fig. 3.15 Same as Fig. 3.12 but for the detrended Indian Ocean fall SST anomalies. Adapted from Hannachi and Dommenget (2009)

Fig. 3.16 Leading EOF of the detrended fall Indian Ocean SST anomalies. Adapted from Hannachi and Dommenget (2009)


This is discussed below in Sect. 3.7. The ellipsoid is given by the isolines10 of (3.24). Furthermore, the PCs in this case are independent.

3.5 Covariance Versus Correlation

EOFs from the covariance matrix find new variables that successively maximise variance. By contrast, the EOFs from the correlation matrix C, i.e. the sample version of the population correlation matrix, attempt to maximise correlation instead. The correlation-based EOFs are obtained using the covariance matrix of the standardised or scaled data matrix (2.22), X_s = X D^{-1/2}, where D = diag(S). Therefore all the variables have the same weight as far as variance is concerned. The correlation-based EOFs can also be obtained by solving the generalised eigenvalue problem:

D^{-1} S a = λ² a,    (3.25)

then u = D1/2 a is the correlation-based EOF corresponding to the eigenvalue λ2 . Exercise Derive the above Eq. (3.25). The individual eigenvalues of the correlation matrix cannot be interpreted in a simple manner like the case with the covariance matrix. Both analyses yield in general different information and different results. Consequently, there is no genuine and systematic way of choosing between covariance and correlation, and the choice remains a matter of individual preference guided, for example, by experience or driven by a particular need or focus. For example, Overland and Preisendorfer (1982) found, by analysing cyclone frequencies, that the covariance-based EOFs provide a better measure for cyclonic frequency variability whereas correlationbased EOFs provide a better measure to identify storm tracks, see also Wilks (2011) for more discussions.
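A minimal Matlab sketch of correlation-based EOFs via the standardised data (assumed names; X is an n × p centred data matrix) is:

% Sketch: correlation-based EOFs from the standardised data matrix Xs.
s  = std(X, 1);                      % standard deviation of each variable
Xs = X ./ s;                         % standardised (scaled) data, Xs = X*D^(-1/2)
[~, ~, Uc] = svd(Xs, 'econ');        % columns of Uc: correlation-based EOFs
PCc = Xs * Uc;                       % corresponding PCs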

3.6 Scaling Problems in EOFs

One of the main features of EOFs is that the PCs of a set of variables depend critically on the scale used to measure the variables, i.e. the variables' units. PCs change, in general, under the effect of scaling and therefore do not constitute a unique characteristic of the data. This problem does not occur in general when all the variables have the same unit. Note also that this problem does not occur when the

¹⁰ This interpretation extends also to a more general class of multivariate distributions, namely the elliptical distributions. These are distributions whose densities are constant on ellipsoids. The multivariate t-distribution is an example.


correlation matrix is used instead. This is particularly useful when one computes for example EOFs of combined fields such as 500 mb heights and surface temperature. Consider for simplicity two variables: geopotential height x_1 at one location, and zonal wind x_2 at another location. The variables x_1 and x_2 are expressed respectively in Pa and m s^{-1}. Let z_1 and z_2 be the obtained PCs. The PCs' units will depend on the original variables' units. Let us assume that one wants the PCs to be expressed in hPa and km/h; then one could think of either premultiplying x_1 and x_2 respectively by 0.01 and 3.6 and then applying EOF analysis, or simply post-multiplying the PCs z_1 and z_2 respectively by 0.01 and 3.6. Now the question is: will the results be identical? The answer is no. In fact, if C is the diagonal scaling matrix containing the scaling constants, the scaled variables are given by the data matrix X_s = XC, whose PCs are given by the columns of Z obtained from an SVD of the scaled data, i.e. X_s = A_s Λ_s Z^T. Now one can post-multiply by C the SVD decomposition of X to yield XC = U Λ (CV)^T. What we have said above is that Z ≠ CV, which is true since U Λ (CV)^T is no longer an SVD of XC. This is because CV is no longer orthogonal unless C is of the form a I_p, i.e. isotropic. This is known as the scaling problem in EOF/PCA. One simple way to get around the problem is to use the correlation matrix. For more discussion on the scaling problem in PCA refer, for example, to Jolliffe (2002), Chatfield and Collins (1980), and Thacker (1996).
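A minimal Matlab sketch illustrating this (synthetic two-variable example, illustrative names) is:

% Sketch: scaling the variables before the analysis is not equivalent to
% rescaling the PCs afterwards (the scaling problem).
X = randn(500, 2) * [1 0.6; 0 0.8];        % synthetic two-variable anomalies
C = diag([0.01 3.6]);                      % unit-conversion (scaling) matrix
[~, ~, V1] = svd(X*C, 'econ');             % EOFs of the rescaled data
[~, ~, V2] = svd(X,   'econ');             % EOFs of the original data
Zscaled = (X*V2)*C;                        % PCs of X, rescaled afterwards
% V1 generally differs from V2, and Zscaled are not the PCs of X*C.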

3.7 EOFs for Multivariate Normal Data

EOFs can be difficult to interpret in general. However, there are cases in which EOFs can be understood in a geometric sense, and that is when the data come from a multivariate normal random variable, e.g. Y, with distribution N(μ, Σ) and probability density function given by (3.24). Let λ_k and a_k, k = 1, . . . , p, be the eigenvalues and associated (normalised) eigenvectors of the covariance matrix Σ, i.e. Σ = AΛA^T, with A = (a_1, . . . , a_p), Λ = diag(λ_1, . . . , λ_p) and A^T A = I_p. Now, from a sample data matrix X, the sample mean μ̂ and sample covariance matrix S are maximum likelihood estimates of μ and Σ respectively. Furthermore, when the eigenvalues of Σ are all distinct, the eigenvalues and EOFs of S are also maximum likelihood estimates (MLE) of λ_k and a_k, k = 1, . . . , p, respectively (see e.g. Anderson 1984; Magnus and Neudecker 1995; Jolliffe 2002). Using the pdf f(y) of Y, see Eq. (3.24), the joint probability density function of the PCs Z = A^T(Y − μ) is given by

f(z) = (2π)^{-p/2} ∏_{k=1}^{p} λ_k^{-1/2} exp( −(1/2) Σ_{k=1}^{p} z_k^2/λ_k ),     (3.26)

which is the product of p independent normal probability density functions. The multivariate probability density function is constant over the ellipsoids

(y − μ)^T Σ^{-1} (y − μ) = α

for a given positive constant α. Using the PC coordinates, this equation simplifies to Σ_{k=1}^{p} z_k^2/λ_k = α. These ellipsoids therefore have the eigenvalues λ_k and the EOFs a_k, k = 1, . . . , p, as the lengths and directions, respectively, of their principal axes. Figure 3.17 shows an illustration of a two-dimensional Gaussian distribution with the two EOFs. EOFs therefore constitute a new rotated coordinate system passing through the data mean and directed along the principal axes of the distribution ellipsoid. Note that the PCs of a multivariate normal distribution are independent because they are uncorrelated. This is not true in general, however, if the data are not Gaussian, and other techniques exist to find independent components (see Chap. 12).

Fig. 3.17 Illustration of a two-dimensional Gaussian distribution along with the two EOFs

3.8 Other Procedures for Obtaining EOFs

It is shown above that EOFs are obtained as the solution of an eigenvalue problem. EOFs can also be formulated through a matrix optimisation problem. Let again X be a n × p data matrix which is decomposed using SVD as X = Σ_{k=1}^{p} √λ_k u_k v_k^T. The sample covariance matrix is then written as S = (1/(n−1)) Σ_{k=1}^{p} λ_k u_k u_k^T. Keeping the first r < p EOFs is equivalent to truncating the previous sum by keeping the first r terms to yield the filtered data matrix X_r = Σ_{k=1}^{r} √λ_k u_k v_k^T, and similarly for the associated covariance matrix S_r. The covariance matrix S_r of the filtered data can also be obtained as the solution to the following optimisation problem:

min_Y φ(Y) = tr[(S − Y)^2]     (3.27)

over the set of positive semi-definite matrices Y of rank r (see Appendix D). So S_r provides the best approximation to S in the above sense, and the minimum is in fact φ(S_r) = Σ_{k=r+1}^{p} λ_k.

The expression of the data matrix as the sum of the contributions from different EOFs/PCs provides a direct way of filtering the data. The idea of filtering the data


using PCs can also be formulated through finding an approximation of X of the form:

X = Z A^T + E     (3.28)

for a n × r matrix Z and a p × r semi-orthogonal matrix A of rank r, i.e. A^T A = I_r. The matrices Z and A can indeed be obtained by minimising the error variance tr(EE^T) from (3.27), i.e.

min φ(Z, A) = tr[(X − ZA^T)^T (X − ZA^T)].     (3.29)

The solution to (3.29) is obtained (see Appendix D) for A = (u_1, . . . , u_r) and Z = XA. The minimum of (3.29) is then given by

min φ(Z, A) = Σ_{k=r+1}^{p} λ_k.

In other words A is the matrix of the first r EOFs, and Z is the matrix of the associated PCs. This way of obtaining Z and A is referred to as one-mode component analysis (Magnus and Neudecker 1995), and attempts to reduce the number of variables from p to r. Magnus and Neudecker (1995) also extend it to two- and more-mode analysis.

Remark Let x_t, t = 1, . . . , n, be the data time series, which we suppose to be centred, and define z_t = A^T x_t, where A is a p × m matrix. Let also S = U Λ^2 U^T be the decomposition of the sample covariance matrix into eigenvectors U = (u_1, . . . , u_p) and eigenvalues Λ^2 = diag(λ_1^2, . . . , λ_p^2), where the eigenvalues are arranged in decreasing order. Then the following three optimisation problems are equivalent in that they yield the same solution:

• Least square sum of errors of the reconstructed data (Pearson 1901), i.e.

min_A Σ_{t=1}^{n} ||x_t − A A^T x_t||^2 = n Σ_{k=m+1}^{p} λ_k^2.

• Maximum variance in the projected space, subject to orthogonality (Hotelling 1933), i.e.

max_{A^T A = I_m} tr[ A^T ( Σ_{t=1}^{n} x_t x_t^T ) A ] = n Σ_{k=1}^{m} λ_k^2.

• Maximum mutual information, based on normality, between the original random variables x and their projection z, generating respectively x_t and z_t, t = 1, . . . , n (Kapur 1989, p. 502; Cover and Thomas 1991), i.e.

max_A [I(z, x)] = (1/2) Σ_{k=1}^{m} Log(2πe λ_k^2),

where I(z, x) = E[Log( f(x, z)/(f_x(x) f_z(z)) )] is the mutual information of z and x (Chap. 12), f_x() and f_z() are respectively the marginal probability density functions of x and z, and f(x, z) is the joint pdf of x and z. The application to the sample is straightforward and gives a similar expression. All these optimisations yield the common solution, namely, the leading m EOFs, A = (u_1, . . . , u_m), of S.
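To illustrate the equivalence stated in the remark, the following Matlab sketch (with a hypothetical, centred anomaly matrix) computes the leading m EOFs from an SVD and verifies that the reconstruction error and the projected variance add up to the total variance, so that minimising the former is equivalent to maximising the latter.

% EOFs from the SVD and the Pearson/Hotelling equivalence (sketch)
n = 300; p = 10; m = 3;
X = randn(n, p) * diag(linspace(3, 0.5, p));   % synthetic anomalies
X = X - mean(X);                               % centre
[U, L, V] = svd(X, 'econ');                    % columns of V are the EOFs
A = V(:, 1:m);                                 % leading m EOFs
Z = X * A;                                     % associated PCs
recon_err = sum(sum((X - Z*A').^2));           % sum_t ||x_t - A A' x_t||^2
proj_var  = trace(A' * (X'*X) * A);            % projected variance
total_var = trace(X'*X);
% recon_err + proj_var = total_var, so the two criteria select the same A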

3.9 Other Related Methods

3.9.1 Teleconnectivity

Teleconnection maps (Wallace and Gutzler 1981) are obtained using one-point correlations, where a base point is correlated with all other points. A teleconnection map is simply a map of a row (or column) of the correlation matrix C = (c_ij) and is characterised particularly by a (nearly) elliptical region of positive correlation around the base point, with correlation one at the base point, featuring a bullseye to use the term of Wallace and Gutzler (1981). Occasionally, however, this main feature can be augmented by another centre with negative correlations, hence forming a dipolar structure. It is this second centre that makes a big difference between base points. Figure 3.18 shows an example of correlation between All


Fig. 3.18 Correlation between All India Rainfall (AIR) and September Mediterranean evaporation


India monsoon Rainfall (AIR) index, a measure of the Asian summer monsoon strength, and September Mediterranean evaporation. AIR11 is an area average of 29 subdivisional rainfall amounts for all months over the Indian subcontinent. The data used in Fig. 3.18 are for Jun–Sep (JJAS) 1958–2014. There is a clear teleconnection between monsoon precipitation and Mediterranean evaporation with an east–west dipole. Stronger monsoon precipitation is normally associated with stronger (weaker) evaporation over the western (eastern) Mediterranean and vice versa. There will always be differences between teleconnection patterns even without the second centre. For instance some will be localised and others will be spread over a much larger area. One could also have more than one positive centre. Using the teleconnection map, one can define the teleconnectivity T_i at the ith grid point by

T_i = − min_j c_ij.

The obtained teleconnectivity map is a special pattern and provides a simple way to locate regions that are significantly inter-related in the correlation context. The idea of one-point correlation can be extended to deal with linear relationships between fields such as SST variable at a given grid point, or even any climate index, correlated with another field such as geopotential height. These simple techniques are widely used in climate research and do reveal sometimes interesting features.
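A teleconnectivity map is straightforward to compute, as in the following Matlab sketch, where X denotes a hypothetical n × p anomaly matrix with one column per grid point.

% Teleconnection and teleconnectivity maps (illustrative sketch)
C    = corrcoef(X);          % p x p correlation matrix
i0   = 1;                    % hypothetical base grid point
tmap = C(i0, :);             % one-point teleconnection map for grid point i0
T    = -min(C, [], 2);       % teleconnectivity T_i = -min_j c_ij at each grid point
% tmap and T are then mapped back onto the spatial grid for plotting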

11 http://www.m.monsoondata.org/india/allindia.html.

3.9.2 Regression Matrix

Another alternative to using the correlation or the covariance matrix is to use the regression matrix R = (r_ij), where r_ij is the regression coefficient obtained from regressing the jth grid point onto the ith grid point. The "EOFs" of the regression matrix R = D^{-1}S are the solution to the generalised eigenvalue problem:

D^{-1} S v = λ^2 v.     (3.30)

The regression matrix is no longer symmetric; however, it is still diagonalisable. Furthermore, R has the same spectrum as the correlation matrix C = D^{-1/2} S D^{-1/2}.

Exercise Show that R and C have the same spectrum and compute the eigenvectors of R.

Answer The above generalised eigenvalue problem can be transformed (see, e.g. Hannachi (2000)) to yield D^{-1/2} S D^{-1/2} a = λ^2 a, where a = D^{1/2} v. Hence the spectrum of R is the same as that of the symmetric matrix D^{-1/2} S D^{-1/2}. Furthermore, writing the eigendecomposition of this symmetric matrix as D^{-1/2} S D^{-1/2} = A Λ^2 A^T, with A orthogonal, Λ^2 provides the spectrum of R, and the eigenvectors of R are v_k = D^{-1/2} a_k, where a_k, k = 1, . . . , p, are the eigenvectors of D^{-1/2} S D^{-1/2}.

Remark The EOFs of the correlation matrix C are linearly related to the regression-based EOFs. The correlation matrix is C = D^{-1/2} S D^{-1/2}, and therefore the eigenvectors v_k of R are related to the eigenvectors a_k of C by v_k = D^{-1/2} a_k, k = 1, . . . , p.
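A short Matlab sketch of the above relation follows, assuming a hypothetical centred anomaly matrix X; it obtains the regression-matrix eigenvectors from those of the correlation matrix and checks the common spectrum numerically.

% Regression-matrix "EOFs" via the correlation matrix (sketch)
S = cov(X);                     % covariance matrix
d = sqrt(diag(S));              % standard deviations
C = S ./ (d * d');              % correlation matrix D^{-1/2} S D^{-1/2}
[A, L2] = eig(C);               % a_k and lambda_k^2
Vreg = A ./ d;                  % v_k = D^{-1/2} a_k, eigenvectors of R
R = S ./ diag(S);               % regression matrix D^{-1} S
max(max(abs(R*Vreg - Vreg*L2))) % close to 0: same spectrum as C, as claimed above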

3.9.3 Empirical Orthogonal Teleconnection Empirical orthogonal teleconnection (EOT) is reminiscent of the teleconnectivity map for a chosen base point. The method finds that specific base point in space that explains as much as possible of the variance of all other points (van den Dool et al. 2000). For example, the first EOT consists of the regression coefficients between the selected base point and all other points. The remaining EOTs are obtained similarly after removing the effect of the base point by linearly regressing every grid point onto it. The EOT algorithm is iterative and chooses such base grid points successively based on how well the grid point can explain residual variations at all other grid points. As pointed out by Jolliffe (2002, personal communication) the results are difficult to interpret because the regressions are made on non-centred data.

3.9.4 Climate Network-Based Methods The analysis based on similarity matrices, such as covariance or correlation matrices for single or coupled (see Chap. 15) fields, can be transformed into binary similarity or adjacency matrices, which are interpreted in terms of climate networks (Tsonis and Roebber 2004; Tsonis et al. 2006; Donges et al. 2009, 2015). In this framework, the grid points of the analysed field are taken as nodes of a network. Compared to linear methods, which rely effectively on dimensionality reduction, the wisdom behind these network techniques is that they allow full exploration of the complexity of the inter-dependencies in the data. Donges et al. (2015), for example, argue that the climate networks can provide additional information to those given by standard linear methods, particularly on the higher order structure of statistical interrelationships in climate data. If S = (Sij ) designates the pairwise measure of a given statistical association, e.g. correlation, the climate network adjacency matrix A = (Aij ) is given by


A_ij = 1_{S_ij − T_ij ≥ 0} (1 − δ_ij), where 1_X is the indicator function of the set X, δ_ij is the Kronecker symbol and T_ij is a threshold parameter, which may be constant. Note that self-interactions are not included in the adjacency matrix. A number of parameters are then defined from this adjacency matrix, such as closeness and betweenness, which can be compared to EOFs, for example, and hence identify processes and patterns which are not accessible from linear measures of association. Examples of those processes include synchronisation of climatic extreme events (Malik et al. 2012; Boers et al. 2014), and reconstruction of causal interactions, from a statistical information perspective, between climatic sub-processes (e.g. Ebert-Uphoff and Deng 2012; Runge et al. 2012, 2014). More discussion is given in Chap. 7 (Sect. 7.7) in relation to recurrence networks. An example of connection is shown in Fig. 3.19 (left panel) based on the monthly sea level pressure field (1950–2015) from the NCEP/NCAR reanalysis. The figure shows connections between two locations, one in Iceland (60N, 330E) and the other in the northeast Pacific (30N, 220E), and all other grid points. A connection is defined when the correlation coefficient is larger than 0.3. Note, in particular, the connections between the northern centre of the NAO (around Iceland) and remote places in the Atlantic, North Africa and Eurasia. The middle panel of Fig. 3.19 shows a measure of the total number of connections at each grid point. It is essentially proportional to the fraction of the total area that a point is connected to (Tsonis et al. 2008). This is similar to the degree defined in climate networks, see e.g. Donges et al. (2015). High connections are located in the NAO and PNA regions, and also central Asia. Note that if, for example, the PNA is removed from the SLP field (e.g. Tsonis et al. 2008), by regressing out the PNA time series, then the total number of connections (Fig. 3.19, right panel) mostly features the NAO pattern. We note here that the PNA pattern is normally defined, and better obtained, with the geopotential height anomalies at mid-tropospheric levels, and therefore results with, say, 500-hPa geopotential heights give clearer pictures (Tsonis et al. 2008).
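A minimal Matlab sketch of the adjacency matrix construction is given below; X is a hypothetical n × p anomaly matrix, correlation is used as the similarity measure and the threshold is taken constant.

% Climate network adjacency matrix and number of connections (sketch)
C   = corrcoef(X);              % similarity matrix S_ij (here correlations)
thr = 0.3;                      % constant threshold T_ij
A   = double(C >= thr);         % A_ij = 1 when S_ij - T_ij >= 0
A   = A - diag(diag(A));        % remove self-interactions, the (1 - delta_ij) factor
deg = sum(A, 2);                % number of connections (degree) at each grid point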

Fig. 3.19 Connections between two points (one in Iceland and one in north east Pacific) and all other gridpoints for which monthly SLP (1950–2015) correlation is larger than 0.3 superimposed on the SLP climatology (left), total number of connections (see text for details) defined at each grid point (middle), and same as middle but when the PNA time series was regressed out from the SLP field (right). Units (left) hPa


Another network-related method was also applied by Capua et al. (2020), based on causal inter-dependencies, using the so-called causal effect networks (CEN). Broadly speaking, CEN generalises correlation analysis by removing the confounding autocorrelation and common source effects. Capua et al. (2020) applied CEN to the analysis of tropical-midlatitude relationships. They point out, in particular, the general two-way causal interaction, with occasional time scale dependency of the causal effect.

Chapter 4

Rotated and Simplified EOFs

Abstract This chapter describes further the drawbacks of EOFs mentioned in Chap. 3. It also provides different ways to overcome those drawbacks, including EOF rotation and simplified EOFs. A number of applications to climate data are also provided. Keywords Simplification · Rotation · Varimax · Quartimin · LASSO-algorithm · Ordinary differential equations · North Atlantic Oscillation

4.1 Introduction

In the previous chapter we have listed some problems that can be encountered when working with EOFs, not least the difficulty of physical interpretation caused mainly by the geometric constraints imposed upon EOFs and PCs, such as orthogonality, uncorrelatedness, and domain dependence. Physical modes are inter-related and tend to be mostly non-orthogonal, or correlated. As an example, normal modes derived from linearised physical models (Simmons et al. 1983) are non-orthogonal, unlike EOFs. Furthermore, EOFs tend to be size and shape domain-dependent (Horel 1981; Richman 1986, 1993; Legates 1991, 1993). For instance, the first EOF pattern tends to have wavenumber one spanning the whole domain. The second EOF, on the other hand, tends to have wavenumber two and be orthogonal to EOF1, regardless of the nature of the physical process involved in producing the data, and this applies to subsequent EOFs. In his detailed review, Richman (1986) maintains that EOFs exhibit four characteristics that hamper their utility to isolate individual modes of variation. These are

• domain dependence,
• subdomain instability,
• sampling problems and
• inaccurate relationship to physical phenomena.


If the objective of EOFs is to reduce the data dimension, then the analysis can be acceptable. If, however, one is looking to isolate patterns for physical interpretation, then clearly, as stated above, EOFs may not be the best choice. To overcome some of the drawbacks caused by the geometric constraints, researchers have looked for an alternative through linear transformation of the EOFs. The concept of rotation emerged in factor analysis and has been proposed since the work of Thurstone (1947) in social science. In atmospheric science, rotated EOFs (REOFs) have been applied nearly three decades later and continue to be widely used (Horel 1981; Richman 1981, 1986; Preisendorfer and Mobley 1988; Cheng et al. 1995). The review of Richman (1986) provides a particularly detailed discussion of the characteristics of unrotated EOFs. REOFs yield simpler structures than EOFs by rotating the vectors of loadings or EOFs, hence losing some of the nice geometric properties of EOFs in favour of a better interpretation. REOFs, however, have their own shortcomings, such as how to choose the number of EOFs to be rotated and the rotation criteria that specify the simplicity. The objective of pattern simplicity is manifold. Most important is perhaps the fact that simple patterns avoid the trap of mixing, which is a main feature of EOFs. Simple patterns and their time amplitude series cannot be spatially orthogonal and temporally uncorrelated simultaneously. Furthermore, propagating planetary waves (Hoskins and Karoly 1981) tend to follow wave guides (Hoskins and Ambrizzi 1993; Ambrizzi et al. 1995) because of the presence of critical lines (Held 1983; Killworth and McIntyre 1985). Physically relevant patterns are therefore expected to be more local or simple, i.e. with zeros outside the main centres of action. A number of alternatives have been developed to construct simple structure patterns without compromising the nice properties of EOFs, namely variance maximisation and space–time orthogonality (Jolliffe et al. 2002; Trendafilov and Jolliffe 2006; Hannachi et al. 2006). This chapter discusses these methods and their usefulness in atmospheric science.

4.2 Rotation of EOFs

4.2.1 Background on Rotation

Horel (1981) and Richman (1981, 1986) argued that EOFs can be too non-local and dependent on the size and the shape of the spatial domain. Thurstone (1947, p. 360–61) applied rotated factors and pointed out that invariance or constancy of a solution, e.g. factors or EOFs, when the domain changes is a fundamental necessity if the solution is to be physically meaningful (see also Horel 1981). The previous problems encountered with EOFs have led atmospheric researchers to geometrically transform EOFs by introducing the concept of rotation in EOF analysis. The rotated EOF (REOF) technique is based on rotating the EOF patterns or the PCs, and has been adopted by atmospheric scientists since the early 1980s (Horel 1981;


Richman 1981, 1986). The technique, however, is much older and goes back to the early 1940s when it was first suggested and applied in the field of social science1 (Thurstone 1940, 1947; Carroll 1953). The technique is also known in factor analysis as factor rotation and aims at obtaining simple structures. In atmospheric science the main objective behind rotated EOFs is to obtain

• a relaxation of some of the geometric constraints,
• simple and more robust spatial patterns,
• simple temporal patterns and
• an easier interpretation.

In this context simplicity refers in general to patterns with compact/confined structure. It is in general accepted that simple/compact structures tend to be more robust and more physically interpretable. To aid interpretation one definition of simplicity is to drive the EOF coefficients (PC loadings) to have either small or large magnitude with few or no intermediate values. Rotation of EOFs, among other approaches, attempts precisely to achieve this.

4.2.2 Derivation of REOFs

Simply put, rotated EOFs are obtained by applying a rotation to a selected set of EOFs explaining, say, a given percentage of the total variance. Rotation has been applied extensively in social science and psychometry, see for example Carroll (1953), Kaiser (1958), and Saunders (1961), and later in atmospheric science (e.g. Horel 1981; Richman 1986). Let us denote by U_m the p × m matrix containing the first m EOFs u_1, u_2, . . . , u_m that explain a given amount of variance, i.e. U_m = (u_1, u_2, . . . , u_m). Rotating these EOFs yields m rotated patterns B_m given by

B_m = U_m R = (b_1, b_2, . . . , b_m),     (4.1)

where R = (r_ij) is a m × m rotation matrix. The obtained patterns b_k = Σ_{j=1}^{m} r_jk u_j, k = 1, . . . , m, are the rotated EOFs (REOFs). In (4.1) the rotation matrix R has to satisfy various constraints that reflect the simplicity criterion of the rotation, which will be discussed in the next section. As for EOFs, the amplitudes or time series associated with the REOFs are also obtained by projecting the data onto the REOFs, or equally by rotating the PC matrix using the same rotation matrix R. The rotated principal components C = (c_1, c_2, . . . , c_m) are given by

1 Before the availability of high speed computers, pattern rotation used to be done visually, which made it somewhat subjective because of the lack of a quantitative measure and the possibility of non-reproducibility of results.


C = X B_m = V Λ U^T U_m R = V_m Λ_m R,     (4.2)

where V_m is the matrix of the leading (standardised) PCs and Λ_m is the diagonal matrix containing the leading m singular values. It is also clear from (4.1) that

B_m^T B_m = R^T R,     (4.3)

and therefore the rotated patterns will be orthonormal if and only if R is unitary, i.e. R R^T = I_m. In this case the rotation is referred to as orthogonal, otherwise it is oblique. From Eq. (4.2) we also get a similar result for the rotated PCs. The covariance matrix of the rotated PCs is proportional to

C^T C = R^T Λ_m^2 R.     (4.4)

Equation (4.4) shows that if the rotation is orthogonal the rotated PCs (RPCs) are no longer uncorrelated. If one chooses the RPCs to be uncorrelated, then the REOFs are non-orthogonal. In conclusion, REOFs and the corresponding RPCs cannot simultaneously be orthogonal and uncorrelated, respectively. In summary, rotation compromises some of the nice geometric properties of EOFs/PCs to gain, perhaps, a better interpretation.

4.2.3 Computing REOFs

Rotation or Simplicity Criteria

Rotation of the EOF patterns can systematically alter the structures of EOFs. By constraining the rotation to maximise a simplicity criterion the rotated EOF patterns can be made simple in the literal sense. Given a p × m matrix U_m = (u_1, u_2, . . . , u_m) of the leading m EOFs (or loadings), the rotation is formally achieved by seeking a m × m rotation matrix R to construct the rotated EOFs B given by Eq. (4.1). The criterion for choosing the rotation matrix R is what constitutes the rotation algorithm or the simplicity criterion, and is expressed by the maximisation problem:

max f(U_m R)

(4.5)

over a specified subset or class of m × m square rotation matrices R. The functional f() represents the rotation criterion. Various rotation criteria exist in the literature (Harman 1976; Reyment and Jöreskog 1996). Richman (1986), for example, lists five simplicity criteria. Broadly speaking there are two large families of rotation: orthogonal and oblique rotations. In orthogonal rotation (Kaiser 1958; Jennrich 2001) the rotation matrix R in (4.1) is chosen to be orthogonal, and the problem is to solve (4.5) subject to the condition:


R R^T = R^T R = I_m,

(4.6)

where I_m is the m × m identity matrix. In oblique rotation (Jennrich 2001; Kiers 1994) the rotation matrix R is chosen to be non-orthogonal. Various simplicity criteria exist in the literature such as the VARIMAX and QUARTIMAX discussed below. Chapter 10 contains more rotation criteria.

The most well known and used rotation algorithm is the VARIMAX criterion (Kaiser 1958, see also Krzanowski and Marriott 1994). Let us designate by b_ij, i = 1, . . . , p, and j = 1, . . . , m, the elements of the rotated EOF matrix B in (4.1), i.e. b_ij = [B]_ij; then the VARIMAX orthogonal rotation maximises a simplicity criterion according to:

max f(B) = Σ_{k=1}^{m} [ p Σ_{j=1}^{p} b_jk^4 − ( Σ_{j=1}^{p} b_jk^2 )^2 ],     (4.7)

where m is the number of EOFs chosen for rotation. The quantity inside the square brackets in (4.7) is proportional to the (spatial) variance of the square of the rotated vector b_k = (b_1k, . . . , b_pk)^T. Therefore the VARIMAX attempts to simplify the structure of the patterns by driving the loading coefficients towards zero or ±1. In various cases, the loadings of the rotated EOFs B are weighted by the communalities of the different variables (Walsh and Richman 1981). The communalities h_j^2, j = 1, . . . , p, are directly proportional to Σ_{k=1}^{m} a_jk^2, i.e. the sum of squares of the loadings for a particular variable (grid point). Hence if C = [Diag(U_m U_m^T)]^{−1/2}, then in the weighted or normalised VARIMAX the matrix B in (4.7) is simply replaced by BC. This normalisation is generally used to reduce the bias toward the first EOF with the largest eigenvalue.

Another familiar orthogonal rotation method is based on the QUARTIMAX simplicity criterion (Harman 1976). It seeks to maximise the variance of the (squared) patterns, i.e.

f(B) = (1/(mp)) Σ_{k=1}^{m} Σ_{j=1}^{p} b_jk^4 − [ (1/(mp)) Σ_{k=1}^{m} Σ_{j=1}^{p} b_jk^2 ]^2.     (4.8)

Because of the orthogonality property (4.6) required of R, the rotated EOF matrix also satisfies B^T B = I_m. Therefore the sum of the squared elements of B is constant, and the QUARTIMAX simply boils down to maximising the fourth-order moment of the loadings, hence the term QUARTIMAX, and is based on the following maximisation problem:

max f(B) = (1/(mp)) Σ_{k=1}^{m} Σ_{j=1}^{p} b_jk^4.     (4.9)

Equations (4.7) or (4.9) are then to be optimised subject to the orthogonality constraint (4.6). The VARIMAX is in general preferred to the QUARTIMAX because it is slightly less sensitive to changes in the number of variables (Richman 1986), although the difference in practice is not significant.

In oblique rotation the matrix R need not be orthogonal, and in general the problem to be solved is

max f(B_m = U_m R)   subject to   Diag(R^T R) = I_m,     (4.10)

where f() is the chosen simplicity criterion. Most rotations used in atmospheric science are orthogonal. Few oblique rotations (Richman 1981) have been used in atmospheric science, perhaps because the algorithms involved are slightly more complex. The QUARTIMIN (Harman 1976) is the most widely used oblique rotation, particularly in fields such as psychometry. In QUARTIMIN the simplicity criterion is applied not to the rotated patterns themselves but to the transformed EOFs obtained using the inverse matrix R^{−T} of R^T, i.e. U_m R^{−T}. Denoting again by b_1, . . . , b_m the rotated patterns, with elements b_ij = [U_m R^{−T}]_ij, the rotation matrix R is in this case obtained using the following optimisation criterion:

min f(U_m R^{−T}) = Σ_{r≠s} Σ_i b_ir^2 b_is^2,     (4.11)

subject to the second condition in (4.10).

Computation of REOFs

Since the criterion to be optimised is non-quadratic and the optimisation cannot be performed analytically, numerical methods have to be applied. There are various algorithms to minimise (or equivalently maximise) a multivariate function f(x_1, x_2, . . . , x_m). There are two classes of minimisation: constrained and unconstrained. Any constrained minimisation problem can be transformed, using e.g. Lagrange multipliers, to yield an unconstrained problem. Appendix E reviews various optimisation algorithms. For the rotation problem the constraints are relatively simple since they are equalities. Our problem is of the following form:


min f (x) s.t. gk (x) = ck , k = 1, . . . p,

(4.12)

where f(.) and g_k(.), k = 1, . . . , p, are multivariate functions. To solve the above problem one first introduces a new extended multivariate function, called the Lagrangian, given by

H(x, λ) = f(x) + Σ_{k=1}^{p} λ_k (g_k(x) − c_k) = f(x) + λ^T g,     (4.13)

where g(x) = (g_1(x) − c_1, . . . , g_p(x) − c_p)^T and λ = (λ_1, . . . , λ_p)^T, which represents the Lagrange multipliers. Next the optimisation of the unconstrained problem (4.13) is carried out with respect to x and λ; see Appendix E for details on how to solve (4.13).

An example of the application of VARIMAX to winter monthly mean SLP over the northern hemisphere, north of 20°N, is shown in Fig. 4.1. The data come from NCEP/NCAR reanalyses and are described in Chap. 3. Figure 4.1 shows three REOFs obtained by rotating m = 6 SLP EOFs. It can be noted that REOF1 and REOF2 reflect features of the NAO. The northern centre of action is better reflected with REOF1 and the southern centre of action is better reflected with REOF2. The Pacific pattern is shown in REOF3, which is quite similar to EOF2 (Fig. 3.5). Another example, with m = 20, is shown in Fig. 4.2. The sensitivity to the parameter m can be seen. The reason behind this sensitivity is that the EOFs all have the same weight when the rotation is applied. A simple way out is to weight the EOFs by the square root of the corresponding eigenvalues. The obtained rotated patterns, shown in Fig. 4.3, are quite different. Figures 4.3a, b, c show, respectively, clear signals of the NAO, the Pacific pattern and the Siberian high (Panagiotopoulos et al. 2005). These rotated patterns are also quite robust to changes in m. It is noted by Hannachi et al. (2006, 2007) that orthogonal rotation is computationally more efficient than oblique rotation, owing to the matrix inversion involved in the latter. These authors also found that orthogonal and oblique rotations of (non-weighted or non-scaled) EOFs produce quite similar results. Figure 4.4 shows a scatter plot of rotated loadings using VARIMAX versus rotated loadings using QUARTIMIN, with m = 30. A similar feature is also obtained using other rotation criteria.

Matlab computes EOF rotation using different rotation criteria. If EOFs is a matrix containing say m EOFs, i.e. an array of size p × m, then the varimax rotated EOFs are given in REOFs:

>> REOFs = rotatefactors(EOFs, 'Method', 'varimax');
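A slightly fuller Matlab sketch follows (assuming a hypothetical n × p centred anomaly matrix X and the Statistics and Machine Learning Toolbox); it weights the leading EOFs by the square root of their eigenvalues before the VARIMAX rotation, as recommended above, and obtains the rotated time series by projection.

% VARIMAX rotation of weighted EOFs (illustrative sketch)
m = 20;                                         % number of EOFs to rotate
[U, L, V] = svd(X, 'econ');                     % columns of V are the EOFs
lam = diag(L).^2 / (size(X,1) - 1);             % eigenvalues of the covariance matrix
Em  = V(:, 1:m) * diag(sqrt(lam(1:m)));         % EOFs weighted by sqrt(eigenvalue)
REOFs = rotatefactors(Em, 'Method', 'varimax'); % rotated patterns
RPCs  = X * REOFs;                              % rotated time series by projection
[~, order] = sort(var(RPCs), 'descend');        % re-order patterns by explained variance
REOFs = REOFs(:, order);  RPCs = RPCs(:, order);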

Fig. 4.1 VARIMAX rotated EOFs using the leading m = 6 winter SLP EOFs. The order shown is based on the variance of the corresponding time series. Positive contours solid, and negative contours dashed. (a) REOF1. (b) REOF3. (c) REOF4. Adapted from Hannachi et al. (2007)


Fig. 4.2 Same as Fig. 4.1 but using m = 20. (a) REOF1. (b) REOF6. (c) REOF10. Adapted from Hannachi et al. (2007)


Fig. 4.3 Leading three VARIMAX rotated EOFs obtained based on the leading m = 20 EOFs weighted by the square root of the corresponding eigenvalues. (a) REOF1 (m = 20). (b) REOF2 (m = 20). (c) REOF3 (m = 20). Adapted from Hannachi et al. (2007)


Fig. 4.4 Scatter plot of VARIMAX REOFs versus QUARTIMIN REOFs using m = 30 EOFs. Note that scatter with negative slopes correspond to similar REOFs but with opposite sign. Adapted from Hannachi et al. (2007)

4.3 Simplified EOFs: SCoTLASS

4.3.1 Background

REOFs have been introduced mainly to improve interpretation through obtaining simpler patterns than EOFs. Building objective simplicity criteria, however, turns out to be a difficult problem. In fact, Jolliffe et al. (2002) point out that concentrating the EOF coefficients close to 0 or ±1 is not the only possible definition of simplicity. For example, a pattern with only ones is simple though it could rarely be of much interest in atmospheric science. Although REOFs attempt to achieve this using a simple and practical criterion, they have a number of difficulties which make the method quite controversial (Richman 1986, 1987; Jolliffe 1987, 1995; Mestas-Nuñez 2000). When we apply the rotation procedure we are usually faced with the following questions:

• How to fix the number of EOFs or PCs to be rotated?
• What type of rotation, e.g. orthogonal or oblique, should be used?
• Which of the large number of simplicity criteria should be used?
• How to choose the normalisation constraint (Jolliffe 1995)?


Another problem in REOFs, not often stressed in the literature, is that after rotation the order is lost, and basically all REOFs become equivalent2 in that regard. It is clear that addressing some of these concerns will depend to some extent on what the rotated patterns will be used for. A simplification technique that can overcome most of these problems, and which at the same time retains some of the nice properties of EOFs, is desirable. Such a technique is described next.

4.3.2 LASSO-Based Simplified EOFs

Various simplification techniques have been suggested, see also Jolliffe (2002, chapter 11). Most of these techniques attempt to reduce the two-step procedure of rotated PCA into just one step. Here we discuss a particularly interesting method of simplicity that is rooted in regression analysis. A common problem that arises in multiple linear regression is the instability of regression coefficients because of colinearity or high dimensionality. Tibshirani (1996) has investigated this problem and proposed a technique known as the Least Absolute Shrinkage and Selection Operator (LASSO). In a least-squares multiple linear regression y = Xβ + ε, the parameters β = (β_1, . . . , β_p)^T are estimated by minimising (y − Xβ)^T (y − Xβ) = Σ_{t=1}^{n} (y_t − Σ_j β_j x_tj)^2. The additional constraint Σ_{j=1}^{p} |β_j| ≤ τ has, for suitable choices of τ, the property of shrinking some of the regression coefficients to zero. The LASSO approach thus attempts to shrink some regression coefficients exactly to zero, hence implicitly selecting variables. The same idea was adapted to PCA later by Jolliffe et al. (2003), who used it to shrink loadings to zero. They label it the Simplified Component Technique-LASSO (SCoTLASS). For notational convenience, however, and to keep the acronym short we refer to the SCoTLASS EOF method as simplified3 EOFs (SEOFs). The SEOF method attempts to use the main properties of EOFs and REOFs simultaneously by successively maximising variance and constraining the patterns to be orthogonal and simple. Hence the objective of SEOFs is to seek directions a_k = (a_k1, a_k2, . . . , a_kp)^T, k = 1, . . . , p, maximising the quadratic function:

F(a_k) = a_k^T S a_k     (4.14)

subject to

a_k^T a_l = δ_kl.     (4.15)

2 An exception is with weighted EOFs, see Sect. 4.2.3.
3 The reader should note the use of the adjectives “simple” and “simplified” to describe other different techniques in the literature.

To achieve simplicity the LASSO technique requires the following extra constraint to be satisfied (Jolliffe et al. 2003):

||a_k||_1 = Σ_{j=1}^{p} |a_kj| = a_k^T sign(a_k) ≤ τ     (4.16)

for some tuning parameter τ. In Eq. (4.16) sign(a_k) = (sign(a_k1), . . . , sign(a_kp))^T is the sign of a_k. Because Σ_{j=1}^{p} a_kj^2 = 1 ≤ (Σ_{j=1}^{p} |a_kj|)^2, it is clear that the optimisation problem (4.14)–(4.16) is only possible for τ ≥ 1. Furthermore, since ||a||_1 is maximised over the unit sphere Σ_{i=1}^{p} a_i^2 = 1 only when all the components are equal, we get ||a||_1 ≤ √p. Hence if τ ≥ √p we regain conventional EOFs. Consequently EOFs can be regarded as a particular case of SEOFs. Figure 4.5 shows an example of the leading two SEOFs obtained with a threshold parameter τ = 8. These patterns are orthogonal and they represent respectively the NAO and the Pacific patterns. The centres of action are quite localised. These centres get broader as τ increases. This is discussed in the next section.
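The bounds on τ are easy to check numerically, as in the Matlab sketch below (with a hypothetical centred anomaly matrix X): the L1 norms of the conventional unit-norm EOFs must lie between 1 and √p, and only for τ below these values does the SCoTLASS constraint (4.16) become active.

% L1 norms of conventional EOFs versus the SCoTLASS threshold (sketch)
[~, ~, V] = svd(X, 'econ');        % conventional (unit-norm) EOFs
l1 = sum(abs(V), 1);               % L1 norm of each EOF
p  = size(X, 2);
fprintf('L1 norms lie in [%.2f, %.2f]; sqrt(p) = %.2f\n', min(l1), max(l1), sqrt(p));
% EOFs whose L1 norm exceeds a chosen tau would be shrunk by SCoTLASS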

4.3.3 Computing the Simplified EOFs

Since the optimisation problem (4.14)–(4.16) is non-quadratic and non-differentiable due to the LASSO condition (4.16), the solution can only be obtained numerically using a suitable descent algorithm. The non-differentiability of condition (4.16) is a particular nuisance for the optimisation, and it is desirable to smooth it out. Trendafilov and Jolliffe (2006) used the fact that tanh(x) ≈ |x|/x = sign(x) for large values of |x|, and transformed (4.16) to yield the smooth constraint

a_k^T tanh(γ a_k) − τ = Σ_{j=1}^{p} a_kj tanh(γ a_kj) − τ ≤ 0     (4.17)

for some fixed large number γ . The problem (4.14–4.16) was solved by Trendafilov and Jolliffe (2006), see also Hannachi et al. (2005), using the projected gradient approach (Gill et al. 1981). To ease the problem further and to make it look like the standard EOF problem (4.14–4.15), the nonlinear condition (4.17) is incorporated into the function F () in Eq. (4.14) as an exterior penalty function, see e.g. Gill et al. (1981). This means that this condition will be explicitly taken into account only if it is violated. Hence if we designate by Pe (x) = max(0, x),


Fig. 4.5 Leading two simplified EOFs of the winter SLP anomalies using τ = 8. (a) SEOF1 (τ = 8). (b) SEOF2 (τ = 8). Adapted from Hannachi et al. (2006)

the exterior penalty function, then condition (4.17) can be incorporated into (4.14) to yield the extended objective function

F_μ(a_k) = (1/2) a_k^T S a_k − μ P_e( a_k^T tanh(γ a_k) − τ )     (4.18)

to be maximised, where μ is a large positive number. It is now clear from (4.18) that (4.17) is not taken into account if it is satisfied, but when it is positive it is penalised and sought to be minimised. Note again that (4.18) is not differentiable, and to make it so we use the fact that max(x, y) = (1/2)(x + y + |x − y|), and hence


the exterior penalty function is replaced by P(x) = (1/2) x [1 + tanh(γx)]. Hence the smooth objective function to maximise becomes

F_μ(a_k) = (1/2) a_k^T S a_k − μ P( a_k^T tanh(γ a_k) − τ ),     (4.19)

subject to the orthogonality condition (4.15). Figure 4.6 shows a plot of F_μ(EOF1) versus γ for μ = 1000, where EOF1 is the leading EOF of the winter SLP field. The function becomes independent of γ for large values of this parameter. Hannachi et al. (2006) found that the solution is invariant to changes in μ (for large values). Various methods exist to solve the nonlinear constrained maximisation problem (4.15) and (4.19), such as steepest ascent and projected/reduced gradient methods (Gill et al. 1981). These methods look for linear directions of ascent to achieve the optimum solution. In various problems, however, the search for suitable step sizes (in line search) can be problematic, particularly when the objective function to be maximised is not quadratic, for which the algorithm can converge to the wrong local maximum. An elegant alternative to the linear search method is to look for a smooth curvilinear trajectory to achieve the optimum. For instance, the minimum of an objective function F(x) can be achieved by integrating the system of ordinary differential equations (ODE)

dx/dt = −∇F(x)

(4.20)

Fig. 4.6 Function Fμ (EOF 1) versus γ for μ = 1000. EOF1 is the leading EOF of winter SLP anomalies. Adapted from Hannachi et al. (2006)


forward in time for a sufficiently long time using a suitably chosen initial condition (Evtushenko 1974; Botsaris 1978; Brown 1986). In fact, if x* is an isolated local minimum of F(x), then x* is a stable fixed point of the dynamical system (4.20), see e.g. Hirsch and Smale (1974), and hence can be reached by integrating (4.20) from some suitable initial condition. Such methods have been around since the mid-1970s (Evtushenko 1974; Botsaris and Jacobson 1976, 1978) and can make use of efficient integration algorithms available for dynamical systems. Trajectories defined by second-order differential equations have also been suggested (Snyman 1982). In the presence of constraints the gradient of the objective function to be minimised (or maximised) has to be projected onto the tangent space of the feasible set, i.e. the manifold or hypersurface satisfying the constraints (Botsaris 1979, 1981; Evtushenko and Zhadan 1977; Brown 1986). This is precisely what the projected gradient stands for (Gill et al. 1981). Now if A_{k−1} = (a_1, a_2, . . . , a_{k−1}), k ≥ 2, is the set of the first k − 1 SEOFs, then the next kth SEOF a_k has to satisfy the following orthogonality constraints:

a_k^T a_l = δ_kl,   l = 1, . . . , k.

(4.21)

Therefore the feasible set is simply the orthogonal complement of the space spanned by the columns of A_{k−1}. This can be expressed conveniently using projection operators. In fact, the matrix

π_k = I_d − Σ_{l=1}^{k−1} a_l a_l^T     (4.22)

provides the projection operator onto this space. Furthermore, the condition a_k^T a_k = 1 is equivalent to (I_d − a_k a_k^T) a_k = 0. Therefore the projection onto the feasible set is achieved by applying the operator π_k (I_d − a_k a_k^T) to the gradient of the objective function (4.19). Hence the solution to the SEOF problem (4.14)–(4.16) is provided by the solution to the following system of ODEs:

d a_k/dt = π_k (I_d − a_k a_k^T) ∇F_μ(a_k) = π_{k+1} ∇F_μ(a_k).

(4.23)

The kth SEOF a_k is obtained as the limit, when t → ∞, of the solution to Eq. (4.23). This approach has been successfully applied by Jolliffe et al. (2003) and Trendafilov and Jolliffe (2006) to a simplified example, and by Hannachi et al. (2005) to the sea level pressure (SLP) field. Figure 4.7 shows the leading SLP SEOFs for τ = 18. The patterns get broader as τ increases. The SEOF patterns depend on τ and they converge to the EOFs as τ increases, as shown in Fig. 4.8. Figure 4.9 shows the third SLP SEOF pattern for τ = 12 and τ = 16 respectively. For the latter value the pattern becomes


Fig. 4.7 As in Fig. 4.5 but with τ = 18. (a) SEOF1. (b) SEOF2. Adapted from Hannachi et al. (2006)

nearly hemispheric. The convergence of SEOF1 to EOF1, as shown in Fig. 4.7, starts around τ = (2/3)√p.

Hannachi et al. (2006) modified slightly the above system of ODEs. The kth SEOF is obtained after removing the effect of the previous k − 1 SEOFs by computing the residuals:

Y_k = X ( I_d − Σ_{l=0}^{k−1} a_l a_l^T ) = X π_k

(4.24)


Fig. 4.8 Variance ratio of simplified PC1 to that of PC1 versus parameter τ . Adapted from Hannachi et al. (2007)

with the convention a_0 = 0. The covariance matrix of the residuals is

S_k = (1/n) Y_k^T Y_k = ( I_d − Σ_{l=0}^{k−1} a_l a_l^T ) S ( I_d − Σ_{l=0}^{k−1} a_l a_l^T ).     (4.25)

The k’th SEOF ak is then obtained as an asymptotic limit when t tends to infinity , i.e. stationary solution to the dynamical system:   d ak = Id − ak aTk ∇Fμ(k) (ak ), dt

(4.26)

where F_μ^{(k)} is defined as in (4.19) except that S is replaced by S_k.

Remark The variety of simplicity criteria that can be used in REOFs is appealing from a conceptual point of view. However, unless the simplicity criteria are chosen to reflect the physical significance of the patterns, the approach remains ad hoc. Examples of criteria that are physically relevant include the flow tendency induced by the patterns using, e.g., various simplified dynamical models (e.g. Haines and Hannachi 1995; Hannachi 1997).

Fig. 4.9 SEOF3 of winter SLP anomalies using τ = 12 (top) and τ = 16 (bottom). Adapted from Hannachi et al. (2007)

Chapter 5

Complex/Hilbert EOFs

Abstract Weather and climate data contain a myriad of processes including oscillating and propagating features. In general the EOF method is not suited to identify propagating patterns. This chapter describes a spectral method based on the Hilbert transform to identify propagating features, with application to the stratospheric quasi-biennial oscillation. Keywords Propagating patterns · Quasi-biennial oscillation · Cross-spectra · Complex EOFs · Hilbert transformation · Hilbert EOFs · Phase portrait

5.1 Background The introduction of EOF analysis into meteorology since the late 1940s (Obukhov 1947; Fukuoka 1951; Lorenz 1956) had a strong impact on the course of weather and climate research. This is because one major concern in climate research is the extraction of patterns of variability from observations or model simulations, and the EOF method is one such technique that provides a simple tool to achieve this. The EOF patterns are stationary patterns in the sense that they do not evolve or propagate but can only undergo magnitude and sign change. This is certainly a limitation if one is interested in inferring the space–time characteristics of weather and climate, since EOFs or REOFs, for example assume a time–space separation as expressed by the Karhunen–Loéve expansion (3.1). For instance, one does not expect in general EOFs to reveal to us the structure of the space–time characteristics of propagating phenomena1 such as Madden–Julian oscillation (MJO) or quasi-biennial oscillation (QBO), etc. The QBO, for example, represents a clear case of oscillating phenomenon that takes place in the stratosphere, which can be identified using stratospheric zonal

1 In reality all depends on the variance explained by those propagating patterns. If they have substantial variance these propagating patterns can actually be revealed by an EOF analysis, where they appear precisely as a degenerate pair of eigenvalues with associated eigenvectors in quadrature.


wind. This wind is nearly zonally symmetric. Figure 5.1 shows the climatology of the zonal wind for January and July from the surface to the 1-mb level using the European Reanalyses (ERA-40) from the European Centre for Medium-Range Weather Forecasts (ECMWF), for the period January 1958–December 2001. A number of features can be seen. The tropospheric westerly jets are located near the 250-mb height, around 30–35° latitude. In January the Northern Hemisphere (NH) jet is only slightly stronger than its Southern Hemisphere (SH) counterpart. In July,

Fig. 5.1 Climatology of the ERA-40 zonal mean zonal wind for January (a) and July (b) for the period Jan 1958–Dec 2001. Adapted from Hannachi et al. (2007)


however, the NH jet is weaker than the SH jet. This latter is stronger due in part to the absence of boundary layer friction caused by mountains and land masses. In the stratosphere, on the other hand, both easterly and westerly flows are present. Stratospheric westerlies (easterlies) exist over most winter (summer) hemispheres. The stratospheric westerly flow represents the polar vortex, which is stronger on the winter time because of the stronger equator-pole temperature gradient. Note also the difference in winter stratospheric wind speed between the northern hemisphere, around 40–50 m/s at about 1-mb and the southern hemisphere, around 90 m/s at the same height. The above figure refers mainly to the seasonality of the zonal flow. The variability of the stratospheric flow can be analysed after removing the seasonality. Figure 5.2 shows the variance of the zonal wind anomalies over the ERA-40 period. Most of the variance is concentrated in a narrow latitudinal band around the region equatorward of 15◦ and extends from around 70-mb up to 1-mb. Figure 5.3 shows a time–height plot of the zonal wind anomalies at the equator over the period January 1994–December 2001. A downward propagating signal is identified between about 3 and 70-mb. The downward propagating speed is around 1.2 km/month. The period at a given level varies between about 24 and 34 months, yielding an average of 28 months, hence quasi-biennial periodicity, see e.g. Baldwin et al. (2001) and Hannachi et al. (2007) for further references. To get better insight into space–time characteristics of various atmospheric processes one necessarily has to incorporate time information into the analysis. This is backed by the fact that atmospheric variability has significant auto- and cross-

Fig. 5.2 Variance of monthly zonal mean zonal wind anomalies, with respect to the mean seasonal cycle, over the ERA-40 period. Adapted from Hannachi et al. (2007)


Fig. 5.3 Time–height plot of equatorial zonal mean zonal wind anomalies for the period January 1992–December 2001. Adapted from Hannachi et al. (2007)

correlations in time (and space). Among the earliest contributions in meteorology along this line one finds methods based on cross-spectral analysis (Kao 1968; Madden and Julian 1971; Hayashi 1973), complex cross-covariances obtained using complex wind time series whose zonal and meridional components constitute respectively the real and imaginary parts of the complex field (Kundu and Allen 1976), and complex principal components in the frequency domain using cross-spectral covariances (Wallace and Dickinson 1972; Wang and Mooers 1977). Extended EOFs (Weare and Nasstrom 1982) constitute another time domain method that incorporates the lagged information in the data matrix before computing EOFs; it is discussed in the next chapter. Complex EOFs (CEOFs) in the time domain were introduced as an alternative to the CEOFs in the frequency domain (Brillinger 1981; Rasmusson et al. 1981; Horel 1984). Principal oscillation patterns (Hasselmann 1976) is another widely used method based also on the lagged covariance matrix; it finds propagating structures in a quasi-linear system. The time domain complex EOF method is conceptually close to EOFs, except that the field is complex or complexified. Conventional EOF analysis can be applied to a single space–time field or a combination of fields. EOF analysis finds “stationary” patterns in the sense that they are not evolving. It yields a varying time series for any obtained EOF pattern, which means that the spatial EOF pattern will only decrease or increase in magnitude whereas the spatial structure remains unchanged. Because EOFs are based on (simultaneous) covariances, the way time is arranged is irrelevant. In fact, if xt and


yt , t = 1, . . . n, are two univariate time series, then any permutation of xt and yt will yield the same covariance, i.e. ρxy = cov(xt , yt ) = cov(xπ(t) , yπ(t) ),

(5.1)

where π is any permutation of {1, 2, . . . , n}. This can sometimes lead to difficulties in capturing propagating structures by EOFs. Extended EOFs and principal oscillation pattern (POP) analysis (e.g. von Storch and Zwiers 1999) can in general easily extract these structures. These methods will be discussed in the coming chapters. Here we discuss another method similar to EOF analysis, which is based on the complexified field. The method does not explicitly involve the lagged information, hence avoiding the use of large (extended) data matrices as in EEOFs. It is known that any wave can be expressed using a complex representation as

x(t) = a e^{i(ωt+φ)},

(5.2)

where a is the wave amplitude and ω and φ are respectively its frequency and phase shift (at the origin). Complex EOFs (CEOFs) are based on this representation. There are, in principle, two ways to perform complex EOFs, namely “conventional” complex EOFs and “Hilbert” EOFs. When we deal with a pair of associated climate fields then conventional complex EOFs are obtained. Hilbert EOFs correspond to the case when we deal with a single field, and where we are interested in finding propagating patterns. In this case the field has to be complexified by introducing an imaginary part, which is a transform of the actual field.

5.2 Conventional Complex EOFs

5.2.1 Pairs of Scalar Fields

The method is similar to conventional EOFs except that it is applied to the complex field obtained from a pair of variables such as the zonal and meridional components u and v of the wind field U = (u, v) (Kundu and Allen 1976; Brink and Muench 1986; von Storch and Zwiers 1999; Preisendorfer and Mobley 1988). The wind field U_tl = U(t, s_l), defined at each location s_l, l = 1, . . . , p, and time t, t = 1, . . . , n, can be written using a compact complex form as

U_tl = u(t, s_l) + i v(t, s_l) = u_tl + i v_tl.     (5.3)

The complex covariance matrix is then obtained using the data matrix U = (U_tl) by

S = (1/(n−1)) U^{*T} U,     (5.4)


and the elements s_kl, k, l = 1, . . . , p, of S in (5.4) are given by

s_kl = (1/n) Σ_{t=1}^{n} U_tk^* U_tl,

where (*) is the complex conjugate operator. The (complex) covariance matrix, Eq. (5.4), is Hermitian, i.e. S^{*T} = S, and is therefore diagonalisable. The matrix has therefore a set of orthonormal complex eigenvectors U = (u_1, . . . , u_p) and a real non-negative2 eigenspectrum λ_1^2, . . . , λ_p^2. The complex amplitude of the kth EOF is the kth complex principal component (CPC) e_k and is given by

e_k = U u_k^*

(5.5)

This immediately yields non-correlation of the CPCs:

e_k^{*T} e_l = λ_k^2 δ_kl.

(5.6)

The complex EOFs and associated complex PCs are also obtained using the singular value decomposition of U. Any CEOF u_k has a pattern amplitude and phase. The pattern of phase information is given by

φ_k = arctan( Im(u_k) / Re(u_k) ),

(5.7)

where Re() and Im() stand respectively for the real and imaginary parts, and where the division is performed componentwise. The pattern amplitude of u_k is given by its componentwise amplitudes. This method of doing CEOFs seems to have been originally applied by Kundu and Allen (1976) to the velocity field of the Oregon coastal current. The conventional CEOFs are similar to conventional EOFs in the sense that time ordering is irrelevant, and hence the method is mostly useful to capture covarying spatial patterns between the two fields.
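The following Matlab sketch computes conventional complex EOFs of a pair of fields following Eqs. (5.3)–(5.7); u and v denote hypothetical n × p matrices of zonal and meridional wind anomalies.

% Conventional complex EOFs of a vector (wind) field (sketch)
W = u + 1i * v;                               % complexified field, Eq. (5.3)
S = W' * W / (size(W,1) - 1);                 % Hermitian covariance matrix, Eq. (5.4)
[E, L2] = eig(S);                             % complex EOFs and (real) eigenvalues
[lam2, idx] = sort(real(diag(L2)), 'descend');
CEOFs = E(:, idx);                            % ordered complex EOFs
CPCs  = W * conj(CEOFs);                      % complex PCs, Eq. (5.5)
amp   = abs(CEOFs);                           % pattern amplitudes
phs   = atan2(imag(CEOFs), real(CEOFs));      % pattern phases, Eq. (5.7)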

5.2.2 Single Field If one is dealing with a single field xt = (xt1 , . . . , xtp )T , t = 1, 2 . . . n, such as sea surface temperature, and one is interested in propagating patterns one can still use the conventional complex EOFs applied to the complexified field obtained from

2 Since u_k^{*T} S u_k = λ_k^2 = (1/(n−1)) [U u_k]^{*T} [U u_k] ≥ 0.


the pair of lagged variables (xt , xt+τ ) for some chosen lag τ . The complex field is defined by yt = xt + ixt+τ .

(5.8)

This is a natural way to define a homogeneous complexified field using lagged information. The corresponding complex data matrix defined from (5.8) is then given at each grid point s_l and each time t by (Y)_tl = x_tl + i x_{t+τ,l}. The obtained complex data matrix Y = (y_tl) can then be submitted to the same complex EOF analysis as in the previous section. The obtained CEOFs provide the propagating structures and the corresponding CPCs provide the phase information. This procedure is based on the choice of the time lag τ, which reflects the characteristic time of the propagating feature. In general, however, this parameter is not precisely known and requires some experience. The choice of this parameter remains, in practice, subject to some arbitrariness. One way to determine the approximate value of τ is to compute the CEOFs for many values of the lag τ, then plot the leading eigenvalue versus lag and look for the lag corresponding to the maximum value. An elegant alternative to choosing the lag in the time domain is to use the Hilbert transform, which is based on a phase shift in the frequency domain and is discussed in the coming sections.
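A simple way to explore the lag dependence is sketched below in Matlab, where x is a hypothetical n × p anomaly matrix: the field is complexified for a range of lags and the leading eigenvalue of the resulting Hermitian covariance matrix is recorded, its maximum indicating an appropriate τ.

% Scanning the lag of the complexified field, Eq. (5.8) (sketch)
maxlag = 20;  lead = zeros(1, maxlag);
for tau = 1:maxlag
    Y = x(1:end-tau, :) + 1i * x(1+tau:end, :);   % y_t = x_t + i x_{t+tau}
    S = Y' * Y / (size(Y,1) - 1);                 % Hermitian covariance matrix
    lead(tau) = max(real(eig(S)));                % leading eigenvalue at this lag
end
[~, tau_star] = max(lead);                        % lag with the largest leading eigenvalue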

5.3 Frequency Domain EOFs 5.3.1 Background Complex EOFs in spectral or time domain is a natural extension to EOFs and aims at finding travelling patterns. In spectral domain, the method is based on an eigendecomposition of the cross-spectral matrix and therefore makes use of the whole structure of the (complex) cross-spectral matrix. Ordinary EOFs method is simply an application of frequency domain EOFs (FDEOFs) to contemporaneous data only. It appears that the earliest introduction of complex frequency domain EOFs (FDEOFs) in atmospheric context dates back to the early 1970s with Wallace and Dickinson. Their work has stimulated the introduction of Hilbert EOFs, and we start by reviewing FDEOFs first. The spectrum gives a measure of the contribution to the variance across the whole frequency range. EOF analysis in the frequency domain (Wallace and Dickinson 1972; Wallace 1972; Johnson and McPhaden 1993), see also Brillinger (1981) for details, attempts to analyse propagating disturbances by concentrating on a specific frequency band allowing thus the decomposition of variance in this band while retaining phase relationships between locations.

98

5 Complex/Hilbert EOFs

5.3.2 Derivation of FDEOFs For a univariate stationary time series xt , t = 1, 2, . . . , we know the relationship between the auto-covariance function γ () and the spectral density function f (), i.e. (see Appendix C): f (ω) =

1  −iωk e γ (k), 2π

(5.9)

k

and  γ (τ ) =

π

−π

eiτ ω f (ω)dω.

(5.10)

T

For a multivariate time series xt = xt1 , xt2 , . . . xtp , t = 1, 2, . . . , the previous equations extend to yield respectively the cross-spectrum matrix F and the autocovariance or lagged covariance matrix  given by F(ω) =

1  −iωk e (k) 2π

(5.11)

k

and  (τ ) =

π

−π

eiτ ω F(ω)dω.

(5.12)

The elements ij (τ ), i, j = 1, . . . p, of the lagged covariance matrix are given by

ij (τ ) = cov xti , xt+τ,j

(5.13)

and gives the lagged covariance between the ith and jth variables. Because the crossspectrum matrix is Hermitian it is therefore diagonalizable, and can be factorised as F = EDE∗T ,

(5.14)

where E is a unitary complex matrix containing the (complex) eigenvectors, and D is a diagonal matrix containing the real positive eigenvalues of F. The idea behind spectral EOFs of a multivariate time series xt , t = 1, . . . , n, is to find a linear transformation that has the diagonal matrix D as cross-spectra. Wallace and Dickinson (1972) filtered3 the time series by keeping only a given frequency ω using a narrow band pass filter that retains only the frequencies in ω ± dω. A 3 To

have a spectral representation of a continuous time series x(t), Wallace and Dickinson (1972) used the concept of stochastic integrals as

5.3 Frequency Domain EOFs

99

complexified time series y(t) is then obtained, which involves the filtered time series and its time derivative as its real and complex parts respectively. The EOFs and PCs are then obtained from the real time series: zt = Re [E(yt )] ,

(5.15)

where E() and Re[] stand, respectively, for the expectation and the real part operators. In practice, FDEOFs are based on performing an eigenanalysis of the crossspectrum matrix calculated in a small frequency band. Let u(ω) be the Fourier transform (FT) of the (centred) field xt , t = 1, . . . n at frequency ω, i.e. u(ω) =

n 

xt e−iωt .

(5.16)

t=1

The cross-spectral matrix at ω is (ω) = u(ω)u(ω)T , and can be written in terms of the lagged covariance matrix Sτ =

n−τ 1  xt xTt+τ n−τ t=1

as (ω) =



Sτ e−iωτ .

τ

π Note that the covariance matrix satisfies S = −π (ω)dω, and therefore the spectrum gives a measure of the contribution to the variance across the whole frequency range. The average of the cross-spectral matrix over the frequency band [ω0 , ω1 ], i.e.  ω1 (ω)dω (5.17) C= ω0

provides a measure of the contribution to the covariance matrix in that frequency band. The spectral domain EOFs are given by the complex eigenvectors of F. The 



x(t) = Re

eiωt dε(ω) ,

0

where ε is an independent random noise and Re[.] stands for the real part. The filtered time f series components outside [ω, ω + dω])  (i.e. spectral   is thend obtained  by defining first x (t) = Re eiωt dε(ω) dω, from which they get z(t) = Re (1 − ωi dt )E(xf ) . This new time series then   satisfies E z(t)zT (t + τ ) = cos(ωτ )D(ω)dω.

100

5 Complex/Hilbert EOFs

“principal components” resulting from the FDEOF are obtained by projecting the complexified time series onto the spectral domain EOFs. Now since waves are coherent structures with consistent phase relationships at various lags, and given that FDEOFs represent patterns that are uniform across a frequency band, the leading FDEOF provides coherent structures with most wave variance. The FDEOFs are then obtained as the EOFs of C (Brillinger 1981). Johnson and McPhaden (1993) have applied FDEOFs to study the spatial structure of intraseasonal Kelvin wave structure in the Equatorial Pacific Ocean. They identified coherent wave structures with periods 59–125 days. Because most climate data spectra look reddish, FDEOF analysis may be cumbersome in practice (Horel 1984). This is particularly the case if the power spectrum of an EOF, for example is spread over a wide frequency band, requiring an averaging of the crossspectrum over this wide frequency range, where the theory behind FDEOFs is no longer applicable (Wallace and Dickinson 1972). To summarise the following bullet points provide the essence of FDEOFs: • Conventional EOFs are simply frequency domain EOFs applied to contemporaneous data only. • FDEOFs are defined as the eigenvectors of the cross-spectrum matrix defined at a certain frequency band ω ± dω. • This means that all frequencies outside an infinitesimal interval around ω have to be filtered. The method, however is difficult to apply in practice. For instance, if the power in the data is spread over a wide range, it is not clear how FDEOFs can be applied.4 There is also the issue related to the choice of the “right” frequency. Averaging the cross-spectrum over a wider range is desirable but then the theory is no longer valid (Wallace and Dickinson 1972). Note that averaging the cross-spectrum matrix over the whole positive/negative frequency domain simply yields ordinary EOFs. In addition to the previous difficulties there is also the problem of estimating the power spectrum at a given frequency, given that the spectrum estimate is in general highly erratic (see Chatfield 1996). Also and as pointed out by Barnett (1983), the interactions between various climate components involve propagation of information and irregular short term as well as cyclo-stationary, e.g. seasonality, interactions. This complicated (non-stationary) behaviour cannot be analysed using spectral techniques. These difficulties have led to the method being abandoned. Many of the above problems can be handled by Hilbert EOFs discussed next.

4 For

example, Horel (1984) suggests that many maps, one for each spectral estimate may be studied.

5.4 Complex Hilbert EOFs

101

5.4 Complex Hilbert EOFs An elegant alternative to FDEOFs is the complex EOFs in the time domain introduced into atmospheric science by Rasmusson et al. (1981), see also Barnett (1983) and Horel (1984), using Hilbert singular decomposition. The method has been refined later by Barnett (1983) and applied to the monsoon (Barnett 1984a,b), atmospheric angular momentum (Anderson and Rosen 1983), the QBO in northern hemispheric SLP (Trenberth and Shin 1984) and coastal ocean currents (Merrifield and Winant 1989). The method is based on Hilbert transform and is therefore referred to as Hilbert EOFs (HEOFs).

5.4.1 Hilbert Transform: Continuous Signals Let x(t) be a continuous time series, the integral

∞

may not exist in the λ ordinary sense, i.e. it may be divergent. However, one can calculate e.g. −λ x(t)dt for any finite value of λ. The limit of this integral, when λ → ∞,  ∞exists in many cases and is known as the Cauchy principal value of the integral −∞ x(t)dt, ∞ denoted as P −∞ x(t)dt, i.e.  P



−∞

 x(t)dt = lim

−∞ x(t)dt

λ

λ→∞ −λ

x(t)dt.

(5.18)

∞ λ For example, P −∞ tdt = limλ→∞ −λ tdt = 0. Note that when the integral is already convergent then it is identified to its Cauchy principal value. A direct application of this integral is the Hilbert transform (Thomas 1969; Brillinger 1981). Definition The Hilbert transform Hx(t), of the continuous signal x(t), is defined by Hx(t) =

1 P π



x(s) ds. t −s

(5.19)

Note that the inverse of this transform is simply its opposite. This transform is defined for every  ∞signal x(t) in Lp , the space of functions whose pth power is integrable, i.e. −∞ |x(t)|p dt < ∞. This result derives from Cauchy’s integral formula5 and the function z(t) = x(t) − iHx(t) = a(t)eiθ(t) is analytic. In fact, the Hilbert transform is the unique transform that defines an imaginary part so that

5 If

f () is analytic over a domain D in the complex plane containing a simple path C0 , then  1 f (u) f (z) = du 2iπ C0 u − z

102

5 Complex/Hilbert EOFs

the result is analytic. The Hilbert transform is defined as a convolution of x(t) with the function 1t emphasizing therefore the local properties of x(t). Furthermore, using the polar expression, it is seen that z(t) provides the best local fit of a trigonometric function to x(t) and yields hence an instantaneous frequency of the signal and provides information about the local rate of change of x(t). The Hilbert transform y(t) = Hx(t) is related to the Fourier transform Fx(t) by y(t) = Hx(t) =

1 Im π





 Fx(s)e−ist ds ,

(5.20)

0

where I m() stands for the imaginary part. ∞ Exercise Derive (5.20) keeping in mind that 0 sinu u du = π2 . N Hint Use the fact that x(t) = limN →∞ −N e−ist Fx(s)ds. The result is then 

N  λ  N x(s) −ist λ i(u−t)s e dtds= −N Fx(s)e−ius −λ e u−t dt obtained from the equality −λ −N Fu−t ds after tending N and λ to ∞. Remark Based on Eq. (5.20), the analytic Hilbert transform y(t) of x(t) can be obtained using: (i) Fourier transforming x(t), (ii) substituting the amplitude of negative frequencies with zero, and doubling the amplitude of positive frequencies and (iii) taking inverse Fourier transform. In the language of signals the Hilbert transform is a linear filter that removes precisely the zero frequency from the spectrum and has the simplest response function. In fact, the transfer, or frequency, response function (Appendix C) of the Hilbert filter is given by 1 1 h(λ) = P π π





−∞

eisλ ds = s



i sign(λ) if λ = 0 0 if λ = 0.

(5.21)

Equation (5.21) is obtained after remembering that the Hilbert transform is a simple convolution of the signal x(t) with the function 1t . The filter transfer function is therefore 1t , and the frequency response function is given by the principal value of the Fourier transform of 1t . It is therefore clear from (5.21) that the Hilbert filter6 precisely removes the zero frequency but does not affect the modulus of all others. The analytic signal z(t) has the same positive frequencies as x(t) but zero negative frequencies.

for any z inside C0 . Furthermore, the expression f (t) = 12 (f (t) + i H[f (t)]) + 1 2 (f (t) − i H[f (t)]) provides a unique decomposition of f (t) into the sum of two analytic functions, see, e.g., Polya and Latta (1974). 6 This transform is used also in other circumstances to define the envelop of a time series, (Hannan 1970), given by |x(t) − i Hx(t)|.

5.4 Complex Hilbert EOFs

103

Table 5.1 Examples of time series and their Hilbert transforms x(t)

sin t

cos t

Hx(t)

cos t

− sin t

sin t t −1+cos t t

δ(t) − π1t

1 1+t 2 t − 1+t 4

e−x

2

− √2π D(t)

Remark: Difference Between Fourier and Hilbert Transform Like Fourier transform the Hilbert transform provides the energy 12 a 2 (t) = |z(t)|2 , and frequency ω = − dθ(t) dt . In Hilbert transform these quantities are local (or instantaneous) where at any given time the signal has one amplitude and one frequency. In Fourier transform, however, the previous quantities are global in the sense that each component in the Fourier spectrum covers the whole time span uniformly. For instance, a spike in Fourier spectrum reflects a sine wave in a narrow frequency band in the whole time span. So if we represent this in a frequency-time plot one gets a vertical narrow band. The same spike in the Hilbert transform, however, indicates the existence of a sine wave somewhere in the time series, which can be represented by a small square in the frequency-time domain if the signal is short lived. The Hilbert transform is particularly useful for transient signals, and is similar to wavelet transform in this respect.7 Table 5.1 gives examples of familiar functions and their Hilbert transforms. The function D(t) in Table 5.1 is known as Dawson’s integral defined by D(t) = 2 t 2 e−t 0 ex dx, and the symbol δ() refers to the spike function or Dirac delta function, i.e. δ(0) = 1 and δ(t) = 0 for non-vanishing values of t.

5.4.2 Hilbert Transform: Discrete Signals We suppose here that our continuous signal x(t) has been discretised at various times tk = kt to yield the discrete time series xk , k = 0, ±1, ±2, . . . , where xk = x(kt). To get the Hilbert transform of this time series, we make use of the transfer function of the filter in (5.21) but now defined over [−π, π ] (why?), i.e.

7

The wavelet transform of a signal x(t) is given by    1 t −b x(t)ψ W (a, b, ψ, x(t)) = |a|− 2 dt, a

where ψ() is the basis wavelet, a is a dilation factor and b is the translation of the origin. In this transform higher frequencies are more localised and there is a uniform temporal resolution for all frequency scales. A major problem here is that the resolution is limited by the basis wavelet. Wavelets are particularly useful for characterising gradual frequency changes.

104

5 Complex/Hilbert EOFs

⎧ ⎨ −1 for −π ≤ λ < 0 h(λ) = i sign(λ)I[−π,π ] (λ) = 0 for λ = 0 ⎩ 1 for 0 < λ < π.

(5.22)

To get the filter coefficients (in the time domain) we compute the frequency response function, which is then expanded into Fourier series (Appendix C). The transfer function (5.22) can now be expanded into Fourier series, after extending it by periodicity to the real line then applying the discrete  π Fourier transform (e.g. Stephenson 1973). The Fourier coefficients, ak = π1 −π h(λ)eikλ dλ, k = 0, 1, 2, . . ., become:8  ak =

0 if k = 2p 4 if k = 2p + 1. − kπ

(5.23)

The frequency response function is therefore (λ) = −

 ak 2

k

eikλ

and hence the time series response is given by yk = −

 ak k

2

xt−k = −

2  x((t − (2k + 1))t) π 2k + 1 k

that is, yt = −

2 1 2  xt−(2k+1) = (xt+2k+1 − xt−2k−1 ) . π 2k + 1 π 2k + 1 k

(5.24)

k≥0

The discrete Hilbert transform formulae (5.24) was also derived by Kress and Martensen (1970), see also Weideman (1995), using the rectangular rule of integration applied to (5.19). Now the time series is expanded into Fourier series as xt =



a(ω)e−2iπ ωt

(5.25)

ω

8 Note

that this yields the following expansion of the transfer function into Fourier series as −ih(λ) =

4  sin (2k + 1)x 2  sin (2k + 1)x = . π 2k + 1 π 2k + 1 k≥0

k

5.4 Complex Hilbert EOFs

105

then its Hilbert transform is obtained by multiplying (5.25) by the transfer function to yield: Hxt = yt =

1 h(ω)a(ω)e−2π iωt . π ω

(5.26)

Note that in (5.26) the Hilbert transform has removed the zero frequency and phase rotated the time series by π2 . The analytic (complex) Hilbert transform zt = xt − iHxt =

 ω

 i a(ω) 1 − h(ω) e−2iπ ωt π 

(5.27)

has the same positive frequencies as xt and zero negative spectrum.

5.4.3 Application to Time Series Let xt , t = 1, 2, . . . , be a univariate stationary signal, and yt = Hxt , t = 1, 2, . . ., its Hilbert transform. Then using (5.21), and (5.22) we get respectively fx (ω) = fy (ω)

(5.28)

fxy (ω) = i sign(ω)fx (ω)

(5.29)

and

for ω = 0. Note that the first of these two equations is already known. It is therefore clear that the co-spectrum is zero. This means that the cross-correlation between the signal and its transform is odd, hence γxy (−τ ) = −γxy (τ ) = γyx (τ ).

(5.30)

Note in particular that (5.30) yields γxy (0) = 0, hence the two signals are uncorrelated. For the multivariate case similar results hold. Let us designate by xt , t = 1, 2, . . . a d-dimensional signal and yt = Hxt , its Hilbert transform. Then the cross-spectrum matrix, see Eqs. (5.28)–(5.29), is given by Fy (ω) = Fx (ω) Fxy (ω) = i sign(ω)Fx (ω),

(5.31)

for ω = 0. Note that the Hilbert transform for multivariate signals is isotropic. Furthermore, since the inverse of the transform is its opposite we get, using again

106

5 Complex/Hilbert EOFs

Eqs. (5.30) and (5.31), the following identity: Fyx (ω) = −isign(ω)Fy (ω) = −Fxy (ω).

(5.32)

Using (5.31), the latter relationship yields, for the cross-covariance matrix, the following property:   xy =

π −π



π

Fxy (ω)dω = −2 0

FIx = − yx .

(5.33)

Exercise Derive the above equation (5.33). (Hint Use (5.31) plus the fact that FR is even and FI is odd.). Let us now consider the complexified multivariate signal: zt = xt − iHxt = xt − iyt

(5.34)

then the Hermitian covariance matrix  z of zt is  

=  x +  y + i  xy −  yx = 2 x + 2i xy  z = E zt z∗T t

(5.35)

The covariance matrix of the complexified signal is also related to the cross-spectra Fx via  π  π   1 + sign(ω) Fx (ω)dω = 4 F∗x (ω)dω, (5.36) z = 2 −π

0

where (∗ ) stands for the complex conjugate. Exercise Derive (5.36). Hint Use (5.31), (5.33) and (5.35), in addition to the fact that the real and imaginary parts of Fx are respectively even and odd. π Exercise If in (5.34) zt is defined instead by xt +iyt , show that  z = 4 0 Fx (ω)dω

5.4.4 Complex Hilbert EOFs Complexified Field Given a scalar field xt = (xt1 , . . . , xtp )T , t = 1, . . . n, with Fourier representation: xt =

 ω

a(ω) cosωt + b(ω) sinωt,

(5.37)

5.4 Complex Hilbert EOFs

107

where a(ω) and b(ω) are vector Fourier coefficients, and since propagating disturbances require complex representation as in (5.2), Eq. (5.37) can be transformed to yield the general (complex) Fourier decomposition: zt =



c(ω) e−iωt ,

(5.38)

ω

where precisely Re(zt ) = xt , and c(ω) = a(ω) + ib(ω). The new complex field T

zt = zt1 , . . . ztp can therefore be written as zt = xt − iH(xt ).

(5.39)

The imaginary part of zt H(xt ) =



b(ω) cos ωt − a(ω) sin ωt,

(5.40)

ω

is precisely the Hilbert transform, or quadrature function of the scalar field xt and is seen to represent a simple phase shift by π2 in time. In fact, it can be seen that the Hilbert transform, considered as a filter, removes the zero frequency without affecting the modulus of all the others, and is as such a unit gain filter. Note that if the time series (5.37) contains only one frequency, then the Hilbert transform is simply proportional to the time derivative of the time series. Therefore, locally in the frequency domain H(xt ) provides information about the rate of change of xt with respect to time t.

Computational Aspects In practice, various methods exist to compute the finite Hilbert transform. For a scalar field xt of finite length n, the Hilbert transform H(xt ) can be estimated using   the discrete Fourier transform (5.40) in which ω becomes ωk = 2πn k , k = 1, . . . n2 . Alternatively, H(xt ) can be obtained by truncating the infinite sum in Eq. (5.24). This truncation can also be written using a convolution or a linear filter as (see e.g. Hannan 1970): H(xt ) =

L 

αk xt−k

k=−L

with the filter weights αk =

πk 2 sin2 . kπ 2

(5.41)

108

5 Complex/Hilbert EOFs

Barnett (1983) found that 7 ≤ L ≤ 25 provides adequate values for L. For example for L = 23 the frequency response function (Appendix C) is a band pass filter with periods between 6 and 190 time units with a particular excellent response obtained between 19 and 42 time units (Trenberth and Shin 1984). The Hilbert transform has also been extended to vector fields, i.e. two or more fields, through concatenation of the respective complexified fields (Barnett 1983). Another interesting method to compute Hilbert transform of a time series is presented by Weideman (1995), using a series expansion in rational eigenfunctions of the Hilbert transform operator (5.19). The HEOFs uk , k = 1, . . . p, are then obtained as the eigenvectors of the Hermitian covariance matrix

1  ∗T zt zt = Sxx + Syy + i Sxy − Syx , n−1 n

S=

(5.42)

t=1

where Sxx and Syy are covariance matrices of xt and yt respectively, and Sxy is the cross-covariance between xt and yt and similarly for Syx . Alternatively, these HEOFs can also be obtained as the right complex singular vectors of the data matrix Z = (ztk ) using SVD, i.e. Z = UV∗T =

p 

λk uk v∗T k .

k=1

Note that (5.33) is the main difference between FDEOFs, where the integration is performed over a very narrow (infinitesimal) frequency range ω0 ± dω, and Hilbert EOFs, where the whole spectrum is considered. The previous SVD decomposition T

expresses the complex map zt = zt1 , zt2 , . . . , ztp at time t as zt =

r 

λk vtk u∗k ,

(5.43)

k=1

where vtk is the value of the k’th complex PC (CPC) vk at time t and r is the rank of X. The computation of Hilbert EOFs is quite similar to conventional EOFs. Given the gridded data matrix X(n, p12), the Hilbert transform in Matlab is given by  XH = hilbert(X); and the Hilbert EOFs are given by the EOFs of XH . Figure 5.4 shows the leading 30 eigenvalues, expressed in percentage of explained variance, of the spectrum of the Hermitian matrix S, Eq. (5.42), using the zonal mean zonal wind anomalies over the ERA-40 period. The vertical bars refer to the approximate 95% confidence interval based on the rule-of-thumb given by Eq. (3.13). The spectrum has an outstanding feature reflected by the high percentage of the leading eigenvalue, with a substantial amount of variance of the order of 70%.

5.4 Complex Hilbert EOFs

109

Fig. 5.4 Spectrum of the Hermitian covariance matrix given by Eq. (5.42), of the zonal mean zonal wind anomalies of the ERA-40 data. Vertical bars represent approximate 95% confidence limits. Adapted from Hannachi et al. (2007)

The leading Hilbert EOF (real and complex parts) is shown in Fig. 5.5. The patterns are in quadrature reflecting the downward propagating signal. The time and spectral structure can be investigated further using the associated complex (or Hilbert) PC. Figure 5.6 shows a plot of the real and complex parts of the Hilbert PC1 along with the power spectrum of the former. The period of the propagating signal comes out about 30 months. In a conventional EOF analysis the real and complex parts of Hilbert EOF1 would come out approximately as a pair of degenerate EOFs. Table 5.2 shows the percentage of the cumulative explained variance of the leading 5 EOFs and Hilbert EOFs. The leading EOF pair explain respectively about 39% and 37% whereas the third one explains about 8% of the total. Table 5.2 also shows the efficiency of Hilbert EOFs in reducing the dimensionality of the data compared to EOFs. This comes of course at a price, namely the double size of the Hilbert covariance matrix. Remark The real and imaginary parts of the CPC’s are Hilbert transform of p ∗ each other. In fact using the identity λk vk = j =1 ukj zj , where zk is the k’th complexified variable (or time series at the kth grid point) and uk = (uk1 , . . . , ukp ) is the kth HEOF, we can apply the Hilbert transform to yield λk Hvk =

p 

p      ukj Hx ∗j + iHy ∗j = ukj y ∗j − ix ∗j = iλk vk ,

j =1

which completes the proof.

j =1

110

5 Complex/Hilbert EOFs

Fig. 5.5 Real (a) and imaginary (b) parts of the leading Hilbert EOF of the ERA-40 zonal mean zonal wind anomalies. Adapted from Hannachi et al. (2007)

From this decomposition we get the spatial amplitude and phase functions respectively:

ak = uk  u∗k = Diag uk u∗T k  I m(uk ) . θ k = arctan Re(uk )

(5.44)

5.4 Complex Hilbert EOFs

111

Fig. 5.6 Time series of the Hilbert PC1 (a), phase portrait of Hilbert PC1 and the power spectrum of the real part of Hilbert PC1 of the Era-40 zonal mean zonal wind anomalies. Adapted from Hannachi et al. (2007). (a) Complex PC1: Real and imaginary parts. (b) Phase portrait of CPC1. (c) Spectrum of real (CPC1) Table 5.2 Percentage of explained variance of the leading 5 EOFs and Hilbert EOFs of the ERA40 zonal mean zonal wind anomalies Eigenvalue rank EOFs Hilbert EOFs

1 39.4 71.3

2 37.4 10.0

3 7.7 7.7

4 5.5 2.8

5 2.3 1.9

Similarly, one also gets the temporal amplitude and phase functions as bk = vk  v∗k = Diag(vk v∗T k )  I m(vk ) , φ k = arctan Re(vk )

(5.45)

where the vector product and division in (5.44) and (5.45) are performed componentwise. For each eigenmode, the amplitude map can be interpreted as a variability map as in ordinary EOFs. The function θ k gives information on the relative phase. For “simple” fields, its spatial derivative provides a measure of the local wavenumber. Its interpretation for moderately complex fields/waves can be difficult

112

5 Complex/Hilbert EOFs

(Wallace 1972), and can be made easier by applying a prior filtering (Barnett 1983). Also for simple waves, the time derivative of the temporal phase gives a measure of the instantaneous frequency. Note that the phase speed of the wave of the kth mode k (x)/dx at time t and position x can be measured by dθ dφ k (t)/dt . The amplitude and phase of the leading Hilbert EOF Eq. (5.44) are shown in Fig. 5.7. The spatial amplitude shows that the maximum of wave amplitude is round 25-mb on the equator. It also shows the asymmetry of the amplitude in the vertical

Fig. 5.7 Spatial modes of the amplitude and phase of leading Hilbert EOF of the ERA-40 zonal mean zonal wind anomalies. Adapted from Hannachi et al. (2007). (a) Spatial amplitude of complex EOF1. (b) Spatial phase of complex EOF1

5.4 Complex Hilbert EOFs

113

direction. The spatial phase shows banded structure from 1-mb height down to around 50-mb, where propagation stops, and indicates the direction of propagation of the disturbance where the phase changes between −180◦ and 180◦ in the course of complete cycle. The temporal amplitude and phase, Eq. (5.45), are shown in Fig. 5.8 of the first Hilbert PC of the ERA-40 zonal mean zonal wind anomalies for the period January 1992–December 2001. For example, the amplitude is larger in the middle of the wave life cycle. The temporal phase, on the other hand, provides information on the phase of the wave. For every wave lifecycle the phase is nearly quasi-linear, with the slope providing a measure of the instantaneous frequency. As with conventional EOFs, Hilbert EOFs can be used to filter the data. For the example of the ERA-40 zonal mean zonal wind, the leading Hilbert EOF/PC can be used to filter out the QBO signal. Figure 5.9 shows the obtained filtered anomalies for the period Jan 1992–Dec 2001. The downward signal propagating is clearer than the signal shown in Fig. 5.3. The average downward phase propagation is about 1 km/month. The covariance matrix S in (5.42) is related to the cross-spectrum matrix 1  −iωτ zt+τ z∗T t e n τ n−τ

(ω) =

t=1

Fig. 5.8 Time series of the amplitude and phase of the leading Hilbert EOF of ERA-40 zonal mean zonal wind anomalies. Adapted from Hannachi et al. (2007). (a) Temporal amplitude of complex PC 1. (b) Temporal phase of complex PC 1

114

5 Complex/Hilbert EOFs

Fig. 5.9 Filtered field of ERA-40 zonal mean zonal wind anomalies using Hilbert EOF1 for the period January 1992–December 2001. Adapted from Hannachi et al. (2007)

via 

ωN

S=2

(ω)dω,

(5.46)

0 1 and represents the Nyquist frequency and t is the time where ωN = 2t interval between observations. This can be compared to (5.36). Note that since the covariance matrix of xt is only related to the co-spectrum (i.e. the real part of the spectrum) of the time series it is clear that conventional EOFs, based on covariance or correlation matrix, does not take into consideration the quadrature part of the cross-spectrum matrix, and therefore EOFs based on the cross-spectrum matrix generalise conventional EOFs. It is also clear from (5.46) that HEOFs are equivalent to FDEOFs with the crossspectrum integrated over all frequencies. Note that the frequency band of interest can be controlled by prior smoothing. Horel (1984) points out that HEOFs can fail to detect irregularly occurring progressive waves, see also Merrifield and Winant (1989). Merrifield and Guza (1990) have shown that complex EOF analysis in the time domain (HEOFs) is not appropriate for non-dispersive and broad-banded waves in wavenumber κ relative to the largest separation measured (array size x). In fact Merrifield and Guza (1990), see also Johnson and McPhaden (1993), identified κx as the main parameter causing spread of propagating variability into more than one HEOF mode, and the larger the parameter the lesser the captured data variance. Barnett (1983) applied HEOFs to study relationship between the monsoon

5.5 Rotation of HEOFs

115

and the Pacific trade winds and found strong coupling particularly at interannual time scales.

5.5 Rotation of HEOFs Although HEOFs constitute a very useful tool to study and identify propagating phenomena (Barnett 1983; Horel 1984; Lazante 1990; Davis et al. 1991) such as Kelvin waves in sea level or forced Rossby waves, the method suffers various drawbacks. For example, HEOFs is unable to isolate, in one single mode, irregular disturbance progressive waves Horel (1984). This point was also pointed out by Merrifield and Guza (1990), who showed that the method can be inappropriate for non-dispersive waves that are broadband in wavenumber relative to the array size. More importantly, the drawbacks of EOFs, e.g. non-locality and domain dependence (see Chap. 3) are inherited by HEOFs. Here again, rotation can come to their rescue. Horel (1984) suggested the varimax rotation procedure to rotate HPCs using real orthogonal rotation matrices, which can yield more realistic modes of variation. The varimax rotation was applied later by Lazante (1990). Davis et al. (1991) showed, however, that the (real) varimax procedure suffers a drawback related to a lack of invariance to arbitrary complex rephasing of HEOFs. Bloomfield and Davis (1994) proposed a remedy to rotation by using a complex unitary rotation matrix. Bloomfield and Davis (1994) applied the complex orthogonal rotation to synthetic examples and to sea level pressure. They argue that the rotated HPCs are easier to interpret than the varimax rotation.

Chapter 6

Principal Oscillation Patterns and Their Extension

Abstract EOF method is essentially an exploratory method to analyse the modes of variability of multivariate weather and climate data, with no model is involved. This chapter describes a different method, Principal Oscillation Pattern (POP) analysis, that seeks the simplest dynamical system that can explain the main features of the space–time data. The chapter also provides further extension of POPs by including nonlinearity. Examples from climate data are also given. Keywords Autoregressive model · Feedback matrix · Principal oscillation pattern · Fluctuation–dissipation · Cyclo-stationarity · Baroclinic structure · e-folding time · Principal interaction patterns

6.1 Introduction As was pointed out in the previous chapters, EOFs and closely related methods are based on contemporaneous information contained in the data. They provide therefore patterns that are by construction stationary in the sense that they do not allow in general the detection of propagating disturbances. In some cases, little dynamical information can be gained. The PCs contain of course temporal information except that they only reflect the amplitude of the corresponding EOF.1 Complex Hilbert EOFs, on the other hand, have been conceived to circumvent this shortcoming and allow for the detection of travelling waves. As pointed out in Sect. 5.3.2, HEOFs are not exempt from drawbacks, including, for example, difficulty in the interpretation of the phase information such as in the case of nondispersive waves or nonlinear dynamics, in addition to the drawbacks shared with EOFs. It is a common belief that propagating disturbances can be diagnosed using the lagged structure of the observed field. For example, the eigenvectors of the lagged

1 There

are exceptions, and these happen when, for example, there is a pair of equal eigenvalues separated from the rest of the spectrum.

© Springer Nature Switzerland AG 2021 A. Hannachi, Patterns Identification and Data Mining in Weather and Climate, Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_6

117

118

6 Principal Oscillation Patterns and Their Extension

covariance matrix at various lags can provide information on the dynamics of the propagating structures. The atmospheric system is a very complex dynamical system par excellence, whose state can be well approximated by an evolution equation or dynamical system: d x = F(x, t), dt

(6.1)

where the vector x(t) represents the state of the atmosphere at time t, and F is a nonlinear function containing all physical and dynamical processes such as nonlinear interactions like nonlinear advection, and different types of forcing such as radiative forcing, etc. Our knowledge of the state of the system is and will be always partial, and this reflects back on our exploration of (6.1). For example, various convective processes will only be parametrised, and subscale processes are considered as noise, i.e. known statistically. A first and important step towards exploring (6.1) consists in looking at a linearised version of this dynamical system. Hence a more simplified system deriving from (6.1) reads d x = Bx + εt , dt

(6.2)

where ε t is a random forcing taking into account the non-resolved processes, which cannot be represented by the deterministic part of (6.2). These include subgrid scales and nonlinear effects. This model goes along with the assumption of Hasselmann (1988), namely that the climate system can be split into two components: a deterministic or signal part and a nondeterministic or noise part. Equation (6.2) is a simple linear stochastic system that can be studied analytically and can be compared to observed data. This model is known as continuous Markov process or continuous first-order (multivariate) autoregressive (AR(1)) model and has nice properties. It has been frequently used in climate studies (Hasselmann 1976, 1988; Penland 1989; Frederiksen 1997; Frederiksen and Branstator 2001, 2005). In practice Eq. (6.2) has to be discretised to yield a discrete multivariate AR(1), which may look like: xt+1 = (I + Bt)xt + tεt = Axt + ε t+1 ,

(6.3)

in which t can be absorbed in B and can assume that t = 1, see also next sections for time-dependent POPs. Remark Note that when model data are used the vector xt may be complex, containing, for example, spectral coefficients derived, for instance, from spherical harmonics. Similarly, the operator A may also be complex. In the sequel, however, it is assumed that the operator involved in Eq. (6.2) is real.

6.2 POP Derivation and Estimation

119

The AR(1) model (6.3) is very useful from various perspectives, such as the exploration of observed data or the analysis of climate models or reduced atmospheric models. This model constitutes the corner stone of principal oscillation pattern (POP) analysis (Hasselmann 1988; von Storch et al. 1988, 1995; Schnur et al. 1993). According to Hasselmann (1988), the linear model part in (6.3), within the signal subspace, is the POP model. If the simplified model takes into account the system nonlinearity, then it is termed principal interaction pattern (PIP) model by Hasselmann (1988). The main concern of POP analysis consists of an Eigen analysis of the linear part of (6.3), see e.g. Hasselmann (1988), von Storch et al. 1988), and Wikle (2004). POP analysis has been applied mostly to climate variability but has started recently to creep towards other fields such as engineering and biomedical science, see e.g. Wang et al. (2012).

6.2 POP Derivation and Estimation 6.2.1 Spatial Patterns We consider again the basic multivariate AR(1) model (6.3), in which the multivari ate process (εt ) is a white noise (in time) with covariance matrix Q = E εε T . The matrix A is known as the feedback matrix (von Storch et al. 1988) in the discrete case and as Green’s function in the continuous case (Riskin 1984; Penland 1989). POP analysis attempts to infer empirical space–time characteristics of the climate system using a simplified formalism expressed by (6.3). These characteristics are provided by the normalised eigenvectors of the matrix A (or the empirical normal modes of the multivariate AR(1) model (6.3)) and are referred to as the POPs. In (6.3) we suppose that xt , like ε t , is zero mean. The autoregression matrix is then given by    −1  E xt xTt =  1  −1 A = E xt+1 xTt 0 ,

(6.4)

T

where  1 is the lag-1 autocovariance matrix of xt = x1t , x2t , . . . xpt . That is, if γij (τ ) is the lagged autocovariance between xit and xi,t+τ , then [ 1 ]ij = γij (1). Recall that  1 is not symmetric. If we now have a finite sample xt , t = 1, 2, . . . n, then an estimate of A is given by ˆ = S1 S−1 , A 0

(6.5)

where S1 and S0 are, respectively, the sample lag-1 autocovariance matrix and the covariance matrix of the time series. Precisely, we have the following result.

120

6 Principal Oscillation Patterns and Their Extension

ˆ minimises the residual: Proposition The matrix A F (A) =

n−1  t=1

xt+1 − Axt 2 =

n−1 

Ft (A),

t=1

where x 2 = E(xT x). Proof We first note the following few simple identities. For any p × 1 vectors a and

b and p × p matrix A, we have aT b = tr abT and aT (Ab) = tr abT AT =

tr AbaT . Now Ft () and also F () are quadratic forms in A and   Ft (A) = E xTt xt − xTt Axt−1 − xTt−1 AT xt + xt−1 AT Axt−1 . Let us forget for a moment the expectation operator E(.). The differential of Ft (A) is obtained by computing Ft (A + H) − Ft (A) for any matrix H, with small norm. We get Ft (A + H) − Ft (A)

  = −xTt Hxt−1 − xTt−1 HT xt + xTt−1 AT Hxt−1 + xTt−1 HT Axt−1 + O H 2   = DFt (A).H + O H 2 ,

where DFt (A) is the differential of Ft () at A, which can be simplified to yield

DFt (A).H = −2xTt−1 HT xt +2xTt−1 HT Axt−1 = −2tr xt−1 xTt H − xt−1 xTt−1 AT H . Now we can bring back either the expectation operator E() or the summation. 

If, for example, we  use the expectation operator, we get DFt (A).H = −2tr  −1 −  0 AT H . If the summation over the sample is used instead, we obtain the same expression except that  1 and  0 are, respectively, replaced by S1 and S0 . Hence the minimum of F () satisfies DFt (A).H = 0 for all H, and this yields (6.4). The next step consists of computing the eigenvalues and the corresponding eigenvectors.2 Let us denote by λk and uk , k = 1, . . . p, the eigenvalues and the associated eigenvectors of A, respectively. The eigenvectors can be normalised to have unit norm but are not orthogonal.

2 This

decomposition can be sometimes problematic due to the possible existence of small eigenvalues of  0 . In general it is common practice to filter the data first using, for example, EOFs and keeping the leading PCs (Barnett and Preisendorfer 1987).

6.2 POP Derivation and Estimation

121

Remarks • Because A is not symmetric, the eigenvalues/eigenvectors can be complex, in which case they come in conjugate pairs. ˆ are also solution of a • The eigenvalues/eigenvectors of A (or similarly A) generalised eigenvalue problem. Exercise Let L be an invertible linear transformation, and yt = Lxt . Show that the eigenvalues of the feedback matrix are invariant under this transformation. Hint We have yt+1 = LAL−1 yt , and the eigenvalues of A and LAL−1 are identical. γij (τ ) γii (0)γjj (0)

Exercise Let ρij (τ ) = √ xj,t+τ , then |ρij (τ )| ≤ 1.

be the lagged cross-correlation between xit and

Hint Use |E(XY )|2 ≤ E(X2 )E(Y 2 ). In the noise-free case (6.3) yields xt = At x0 , from which we can easily derive the condition of stationarity. The state xt can be decomposed into a linear combination of the eigenvectors of A as xt =

r 

at(i) ai ,

i=1

where r is the rank of A and ai , i = 1, . . . r, are the eigenvectors of A. Equation (6.3) then yields (i)

at+1 = ci λti , where ci is a constant. It is clear therefore that under stationarity conditions the eigenvalues of A satisfy |λi | < 1, i = 1, . . . p. Exercise Assume that λ is a real eigenvalue of A. Show, using a different method, that (under stationarity condition) the eigenvalues of A satisfy |λ| < 1. Furthermore, ˆ then these sample when the eigenvalues are estimated from the data, i.e. using A, eigenvalues are in fact inside the unit circle. Hint Let λ be an eigenvalue of A =  1  −1 0 and u the corresponding eigenvector, then we have  1 v = λ 0 v, where v =  −1 0 u. Now consider the random variable zt = vT xt . We have var (zt ) = vT  0 v and γz (1) = vT  1 v. That is γz (1) = λvar (zt ), which, under stationarity assumption, yields −1 ≤ λ = ρz (1) ≤ 1. The last condition is also satisfied for the sample feedback matrix. In particular, the above inequality becomes strict in general since in a finite sample the lag1 autocorrelation is in general less that one. Even for the population, the above

122

6 Principal Oscillation Patterns and Their Extension

inequality tends to be strict. In fact the equality ρij (τ ) = ±1 holds only when xi,t+τ = αxj,t .

The POPs U = u1 , . . . , up are the normalised (right) eigenvectors of A satisfying AU = U, where  = Diag λ1 , . . . , λp . Since the feedback matrix

is not symmetric, it also has left eigenvectors V = v1 , . . . , vp satisfying VT A = VT . These left eigenvectors are the eigenvectors of the adjoint AT of A and are known as the adjoint vectors of U (Appendix F). They satisfy VT U = UVT = Ip . Precisely, U and V are the left and right singular vectors of A, i.e. A = UVT =

p 

λk uk vTk .

k=1

The interpretation of POPs is quite different from EOFs. For example, unlike POPs, the EOFs are orthogonal and real. von Storch et al. (1988) interpret the real and imaginary parts of POPs as standing oscillatory and propagating patterns, respectively. Also, in EOFs the importance of the patterns is naturally dictated by their explained variance, but this is not the case for the POPs. In fact, there is no a priori unique rule of pattern selection in the latter case. One way forward is to look at the time series of the corresponding POP.

6.2.2 Time Coefficients As for EOFs, each POP has an associated time series, or POP coefficients, showing their amplitudes as a function of time. The k’th complex coefficient zk (t) at time t, associated with the k’th POP,3 is the projection of xt onto the adjoint vk of the k’th POP uk zk (t) = zkt = xTt vk .

(6.6)

Unlike PCs, the POP coefficients satisfy a stochastic model dictated by Eq. (6.3). In fact, it can be seen by post-multiplying (6.3) by vk that zk (t) yields a (complex) AR(1) model: zk (t + 1) = λk zk (t) + k,t+1 .

(6.7)

Remark The expected values of (6.7) decouple completely, and the dynamics are described by damped spirals.

3 Note

that there is no natural order for the POPs so far.

6.2 POP Derivation and Estimation

123

One would like to find out the covariance matrix of the new noise term in Eq. (6.7). To this end, one assumes that the feedback matrix is diagonalisable so A = UU−1 . The state vector xt is decomposed in the basis U of the state space following: xt =

p 

xt uk = Ux+ t . (k)

k=1 (k)

Note that the components xt are the coordinates of xt in this new basis. They are not, in general, identical to the original coordinates. Similarly we get for the noise term: εt =

p 

εt uk = Uε + t , (k)

k=1 + T where x+ t = (xt , . . . , xt ) and similarly for ε t . After substituting the above expression into (6.3), one obtains (1)

(p)

+ + x+ t+1 = xt + ε t+1 .

Component-wise this takes the form: (k)

(k)

xt+1 = λk xt

(k)

+ εt+1 k = 1, . . . p.

Exercise Derive the noise characteristics of ε+ t . −1 Hint Use the expression ε+ t = U ε t to get the covariance matrix C = (cij ) = (k) (l) U−1 QU−T . Hence E(εt εs ) = δts ckl .

Because λk = |λk |e−iωk is inside the unit circle (for the observed sample), the evolution of zk (t), in the noise-free case (zk (t) = λtk zk (0)), describes in the complex 1 plane a damped spiral with period Tk = 2π ωk and e-folding time τk = − log |λk | .

Within the normal modes, the variance of the POP coefficients, i.e. σk2 = E(zk2 (t)), reflects the dynamical importance of the k’th POP. It can be shown that σk2 =

ckl . 1 − |λk |2

(6.8)

Exercise Derive Eq. (6.8). Answer Use the identity  0 = A 0 AT + Q plus the decompositions A = UU−1 and C = U−1 QU−T . The covariance matrix then satisfies  0 = U 00 UT where  00 is the solution of the matrix equation:  00 =  00 T + C, which can be solved component-wise leading (for |λk | < 1) to ( 00 )kl = ckl /(1 − λk λ∗l ).

124

6 Principal Oscillation Patterns and Their Extension

Remark The above excitation, which reflects a measure of the dynamics of eigenmodes, is at odds with the traditional measure in which the mode with the least damping (or largest |λk |) is the most important. The latter view is based on the noisefree dynamics, whereas the former takes into consideration the stochastic forcing and therefore seems more relevant. Let us write the k’th POP uk as uk = urk + iuik and similarly for the POP coefficient zk (t) = zkr (t) + izki (t). The noise-free evolution of the k’th POP mode (taking for simplicity zk (0) = 1) is given by   zk (t)uk + zk∗ (t)u∗k = λtk uk + (λ∗k )t u∗k = 2|λk |t urk cos ωk t + uik sin ωk t . Therefore, within the two-dimensional space spanned by urk and uik , the evolution can be described by a succession of patterns, with decreasing amplitudes, given by urk → uik → −urk → −uik → urk . . . , and these represent the states occupied by the POP mode at successive times tm = mπ 2ωk , for m = 0, 1, 2, . . . (Fig. 6.1), see also von Storch et al. (1995). The result of travelling features is a consequence of the above equations, and therefore, a AR(1) model is inappropriate to model a standing wave. A familiar property of the AR(1) model (6.7) is the form of its spectral density function (Appendix C). In fact, if from (6.7) we regard t as the output of a linear digital filter (Chap. 2), then we get the following relationship between the spectral density functions f () and fz () of (t) and zk (t), respectively: fz (ω) =

Fig. 6.1 The evolution of the k’th POP within the plane (urk , uik ), showing the positions of uk (t) at successive times t0 , t1 , . . . , tm

f (ω) . |λk − eiω |2

(6.9)

Evolution of one POP ui k uk(t2)

u (t ) k 1

ur

k

O uk(tm)

u (t ) k 0

6.2 POP Derivation and Estimation

125

Exercise Derive the above relationship, Eq. (6.9).

 Hint k (t) can be regarded as a “filtered” version of zk (t), i.e. k (t) = h(u)zk (t − u)du, where the transfer function is given by h(u) = δ−1 (u)−λk δ(u), where δ(u) is the Dirac (spike) distribution.4 After applying Fourier transform (Chap. 2), one gets f (ω) = |(ω)|2 fz (ω), and hence (6.9). The function (ω) is the Fourier transform of the transfer function h(). The spectrum of zk (t) is a rational function of λk . If the noise k (t) is white, then it is clear that when λk is real (6.9) is the usual AR(1) model with its red spectrum. However, when λk is complex, then (6.9) has a spectral peak whose width increases as |λk | decreases and represents a second-order autoregressive model AR(2), see e.g. von Storch and Zwiers (1999, chap 15). The interpretation of POPs can be made easy by analysing the spectral information from the POP coefficients (or amplitudes). An example where tropical surface wind and SST are combined with stratospheric zonal wind and submitted to a POP analysis is investigated by Xu (1993). See also von Storch and Xu (1990), von Storch and Baumhefner (1991), Xue et al. (1994) and von Storch et al. (1995) (and references therein) who used POPs for prediction purposes. POP analysis with cyclo-stationary time series has also been addressed in von Storch et al. (1995).

6.2.3 Example POP analysis (Hasselmann 1988) was applied by a number of authors, e.g. von Storch et al. (1988), von Storch et al. (1995), Xu (1993). Schnur et al. (1993), for example, investigated synoptic- and medium-scale (3–25 days) wave activity in the atmosphere. They analysed geopotential heights from ECMWF analyses for the Dec–Jan–Feb (DJF) period in 1984–1987 for wavenumbers 5–9. Their main finding was a close similarity between the most significant POPs and the most unstable waves, describing the linear growth of unstable baroclinic structures with period 3– 7 days. Figure 6.2 shows POP1 (phase and amplitude) of twice-daily geopotential height for zonal wavenumber 8. Significant amplitude is observed around 45◦ N associated with a 90◦ phase shift with the imaginary part westward of the real part. The period of POP1 is 4 days with an e-folding of 8.6 days. Combined with the propagating structure of POP evolution, as is shown in Sect. 6.2.2, the figure manifests an eastward propagating perturbation with a westward phase tilt with height. The eastward propagation is also manifested in the horizontal cross-section of the upper level POP1 pattern shown in Fig. 6.3. Frederiksen and Branstator (2005), for example, investigated POPs of 300hPa streamfunction fields from NCAR/NCEP reanalysis and general circulation

4 The

Dirac function is defined by the property simply noted as δ(u) .



δa (u)f (u)du = f (a), and in general, δ0 (u) is

126

6 Principal Oscillation Patterns and Their Extension

Fig. 6.2 Leading POP, showing phase (upper row) and amplitude (lower row), of twice-daily ECMWF geopotential height field during DJF 1984–1987 for zonal wavenumber 8. Adapted from Schnur et al. (1993). ©American Meteorological Society. Used with permission

Fig. 6.3 Cross-section at the 200-hPa level of POP1 of Fig. 6.2. Adapted from Schnur et al. (1993). ©American Meteorological Society. Used with permission

6.3 Relation to Continuous POPs

127

Fig. 6.4 Leading EOF (a) and POP (b) of the NCAR/NCEP 300-hPa streamfunction for March. Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permission

model (GCM) simulations. Figure 6.4 shows the leading EOF and POP patterns of NCAR/NCEP March 300-hPa streamfunction field. There is similarity between EOF1 and POP1. For example, both are characterised by approximate large scale zonal symmetry capturing midlatitude and subtropical variability. The leading POPs are in general real with decay e-folding times. Figure 6.5 shows the POP2 for January. It shows a Pacific North America (PNA) pattern signature and is similar to EOF2. The leading POPs can be obtained, in general, from a superposition of the first 5 to 10 EOFs as pointed out by Frederiksen and Branstator (2005). Figure 6.6 shows the average global growth rate of the leading 5 POPs across the different months. The figure shows in particular that the leading POPs are all damped.

6.3 Relation to Continuous POPs 6.3.1 Basic Relationships Various interesting relationships can be derived from (6.3), and the relationship given in Eq. (6.4) is one of them. Furthermore by computing xt+1 xTt+1 and taking expectation, one gets  0 = A 0 AT + Q.

(6.10)



Also, expanding (xt+1 − ε t+1 ) xTt+1 − ε Tt+1 and taking expectation, after



using (6.10), yield E ε t xTt + E xt ε Tt = 2Q. On a computer, the continuous

128

6 Principal Oscillation Patterns and Their Extension

Fig. 6.5 POP2 (a) and EOF2 (b) of January 300-hPa NCEP/NCAR streamfunction field. Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permission –0.045

–0.050

–0.055

∼ (t) ω i day–1

–0.060

–0.065

–0.070

–0.075 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Jan

Fig. 6.6 Average global growth rate of the leading 5 POPs (continuous) and FTPOPs (dashed). Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permission

AR(1) model (6.2) has to be discretised, and when this is done, one gets a similar equation to (6.3). So if we discretise (6.2) using a unit time interval, which can be made after some scaling is applied, then Eq. (6.2) can be roughly approximated by

6.3 Relation to Continuous POPs

129

xt+1 = Ip + B xt + εt+1 ,

(6.11)

which is equivalent to (6.3) with A = Ip + B. The relationship (6.10) now becomes B 0 +  0 BT + B 0 BT + Q = O.

(6.12)

This last relationship can be regarded as a generalisation of the fluctuation– dissipation relation, derived in the context of linear inverse modelling, LIM (e.g. Penland 1989), B 0 +  0 BT + Q = O,

(6.13)

which can be obtained using (6.12) by dropping the nonlinear term in B. A more accurate discrete approximation of (6.2) can be obtained using an infinitesimal time step δτ to obtain the following process:

xt+nδτ = Ip + δτ B xt+(n−1)δτ + ε t+nδτ .

(6.14)

The iteration of (6.14) yields

n xt+nδτ = Ip + δτ B xt + ηt,τ ,

(6.15)

where ηt,τ is now an autocorrelated noise but uncorrelated with xt . Now, for large enough n, such that τ = nδτ remains finite, (6.15) can be approximated by xt+τ = eBτ xt + ηt,τ .

(6.16)

Note that now (6.16) is similar to (6.3) except that the noise is autocorrelated. Multiplying both sides of (6.16) by xTt and taking expectation yield  τ = eBτ  0 .

(6.17)

The matrix G(τ ) = exp (Bτ ) =  τ  −1 0 is known as the Green’s function or resolvent. Furthermore, from (6.16), we can also obtain the covariance matrix of the noise, i.e.   σ (τ ) = E ηt,τ ηTt,τ =  0 − G(τ ) 0 [G(τ )]T . (6.18) Relation (6.18) can be useful, for example, in deriving the conditional probability5 p (xt+τ |xt ). If we suppose that B is diagonalisable, and we decompose it using its

the noise term εt is Gaussian, then (6.15) implies that if xt is given the state vector xt+τ is also normal with mean G(τ )xt and covariance matrix σ (τ ), i.e.

5 If

130

6 Principal Oscillation Patterns and Their Extension

left and right eigenvectors, i.e. B = LR, then the “continuous” POPs are given by the right eigenvectors R of B. Note that this is not exactly the SVD decomposition of B. Because RLT = Ip , we also have a similar decomposition of the Green’s function: G(τ ) = Leτ R. Note that, in practice, the (feedback) matrix calculated first before the matrix B. Note also example, in  forecasting and  provides a natural namely E xt+τ − xˆ t+τ /tr ( 0 ), where xˆ t+τ and von Storch et al. (1995).

(6.19)

G(τ ) of the continuous POP is that Eq. (6.18) can be useful, for measure of forecasting accuracy, = G(τ )xt , see e.g. Penland (1989)

6.3.2 Finite Time POPs Finite time POPs (FTPOPs) or empirical finite time normal modes were introduced by Frederiksen and Branstator (2005) as the analogue of finite time normal modes in which the linear operator or feedback matrix A is obtained by direct linearisation of the nonlinear equations (Frederiksen 1997; Frederiksen and Branstator 2001). In FTPOPs, the linear operator in Eq. (6.2) is time-dependent, i.e. d xt = B(t)xt + ε t . dt

(6.20)

The solution to Eq. (6.20) is obtained as an integral (Appendix G):  xt = S(t, t0 )xt0 +

t

S(t, u)ε u du,

(6.21)

t0

where S(., .) is the propagator. Note that when the operator B is time independent, then S(t, t0 ) = e(t−t0 )B .

(6.22)

An explicit expression of S(t, u) can be obtained when B(t) and B(u) commute, i.e. B(t)B(u) = B(u)B(t), for all t and u, then (Appendix G): S(t, u) = e

t u

B(τ )dτ

.

(6.23)

 p 1 1 P r (xt+τ = x|xt ) = (2π )− 2 |σ (τ )|− 2 exp − (x − G(τ )xt )T [σ (τ )]−1 (x − G(τ )xt ) , 2 which, under stationarity, tends to the multinormal distribution N (0,  0 ) when τ tends to infinity.

6.3 Relation to Continuous POPs

131

The propagator satisfies the property (Appendix G): S(t, u)S(u, v) = S(t, v).

(6.24)

The FTPOPs are defined as the eigenvectors of S(t, 0). Over an interval [0, T ], the propagator can be approximated via Eq. (6.24) using a second-order finite difference scheme of S(tk , tk−1 ), for tk = T − (n − k)δt, k = 1, . . . n, and δt is a halfhour time step. The eigenvalues λ = λr + iλi (and associated eigenvectors) of the propagator S(T , 0) are used to compute the global growth rate ωi and phase frequency ωr following: 1 ωi = 2T log |λ| ωr = − T1 arctan( λλri ).

(6.25)

In their analysis, Frederiksen and Branstator (2005) considered one year period (T = 1 yr) for the global characteristics and examined also local characteristics for each month using daily data. As for POPs, the eigenvalues determine the nature of the FTPOPs, namely travelling when λi = 0 or recurring when λi = 0. Figure 6.7 shows the leading 300-hPa streamfunction FTPOP during March using NCEP/NCAR reanalysis for the northern and southern hemispheres. As for the leading POP, the leading FTPOP has an approximate zonally symmetric state with a particular high correlation between EOF1, FTPOP1 and POP1 (Fig. 6.4). There is also a similarity between the growth rates of the leading POPs and leading FTPOPs (Fig. 6.6). The leading FTPOP for January (Fig. 6.8) shows many common features with EOF2 and POP2 (Fig. 6.5) especially over the PNA region and has high correlation with both EOF2 and POP2.

Fig. 6.7 Leading NCEP/NCAR 300-hPa streamfunction FTPOP for March for the northern (left) and southern (right) hemispheres. Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permission

132

6 Principal Oscillation Patterns and Their Extension

Fig. 6.8 January FTPOP1 obtained from the NCAR/NCEP 300-hPa streamfunction. Adapted from Frederiksen and Branstator (2005). ©American Meteorological Society. Used with permission

6.4 Cyclo-Stationary POPs Cyclo-stationary POPs have been implicitly mentioned in the previous section when finite time POPs were discussed. In this section more details on cyclo-stationarity are given. In the POP model (6.3) the time series was assumed to be second-order stationary.6 This can be a reasonable assumption when we analyse, for example, data on an intraseasonal time scale for a given, e.g. winter, season. When the data contain a (quasi-) periodic signal, such as the annual or biennial cycles, then the AR(1) model can no longer be supported. An appropriate extension of the POP model, which takes into account the periodic signal, is the cyclo-stationary POP (CPOP) model. CPOP analysis was published first7 by Blumenthal (1991), who applied it to analyse El-Niño Southern Oscillation (ENSO) from a climate model, and was applied later by various authors, such as von Storch et al. (1995), who applied it to the Madden Julian Oscillation (MJO), and Kim and Wu (1999), who compared it to other EOF techniques. We assume now that our data contain, say, T cycles, and in each cycle we have l sample making the total sample size n = T l. For example, with a 10-year data of monthly SST, we have T = 10 and l = 12. Given a time series x1 , . . . , xn ,

6 In practice POP analysis has been applied to many time series that are not necessarily strictly stationary.
7 See von Storch et al. (1995) for other unpublished works on CPOPs.


with n = Tl, any observation x_s, s = 1, …, n, belongs to the t'th cycle with t = t' − δ_{τ'0}, where t' and τ' are obtained from the relation:

s = (t' − 1) l + τ',    (6.26)

with 0 ≤ τ' ≤ l − 1 and 1 ≤ t' ≤ T + 1, and the nested time within the cycle is τ = τ' + l δ_{τ'0}, where δ_{ij} denotes the Kronecker symbol. The observation x_s can be identified alternatively by t and τ, i.e. x_s = x_{t,τ}. Note that the noise term is not cyclo-stationary. The CPOP (or cyclo-stationary AR(1)) model then reads

x_{t,τ+1} = A(τ) x_{t,τ} + ε_{t,τ+1}    (6.27)

with the property x_{t,τ+l} = x_{t+1,τ} and A(τ + l) = A(τ). Iterating Eq. (6.27) yields

x_{t+1,τ} = x_{t,τ+l} = B(τ) x_{t,τ} + ε_{t+1,τ},    (6.28)

where the system matrix B(τ) is given by

B(τ) = A(τ + l − 1) \cdots A(τ) = \prod_{k=1}^{l} A(τ + l − k).    (6.29)

The CPOPs are the eigenvectors of the system matrix B(τ) for τ = 1, …, l, i.e.

B(τ)\, u = λ\, u.    (6.30)

It can be shown that the eigenvalues of (6.30) do not depend on τ.

Exercise Let λ(τ) and u(τ) denote an eigenvalue and associated eigenvector of B(τ), respectively. Show that λ(τ) is also an eigenvalue of B(τ + q) for any q.

Hint Just take q = 1 and compute B(τ + 1)[A(τ)u(τ)], using Eq. (6.29) and the periodicity of A(τ).

The above exercise shows that if u(τ) is an eigenvector of B(τ) with eigenvalue λ, then A(τ)u(τ) is an eigenvector of B(τ + 1) with the same eigenvalue. The periodicity of A(τ) is inherited by the eigenvectors of B(τ), see the next exercise.

Exercise Assume u(τ) to be an eigenvector of B(τ) with eigenvalue λ, and let u(τ + k) = A(τ + k − 1)u(τ + k − 1), for k = 1, …, l. Show that u(τ + l) = λu(τ).

Hint Use the result u(τ + l) = B(τ)u(τ) plus the fact that u(τ) is an eigenvector of B(τ).

From the above exercise, it can be seen that using a proper normalisation we can make u(τ) periodic. Precisely, let the unit-length vector u(τ) satisfy (6.30) with λ = |λ|e^{−iφ}. Then the vector A(τ)u(τ) is an eigenvector of B(τ + 1), with the same eigenvalue λ. Let u(τ + 1) = c A(τ)u(τ), where c is a complex coefficient. We can


choose c = ρ(τ)e^{iθ}, so that u(τ + 1) is unit-length and also periodic. The choice ρ(τ) = ‖A(τ)u(τ)‖^{−1} yields ‖u(τ + 1)‖ = 1, and to achieve periodicity, we compute u(τ + l) recursively, yielding

u(τ + l) = \prod_{k=1}^{l} ρ(τ + l − k)\, e^{ilθ}\, B(τ)u(τ) = |λ| \prod_{k=1}^{l} ρ(τ + l − k)\, e^{i(lθ−φ)}\, u(τ).    (6.31)

Now, since ‖u(τ + l)‖ = 1 by construction, the above equation yields |λ| \prod_{k=1}^{l} ρ(τ + l − k) = 1, and then u(τ + l) = e^{i(lθ−φ)} u(τ). By choosing θ = φ/l, the eigenvectors u(τ) become periodic. To summarise, the CPOPs are obtained from the set of simultaneous eigenvalue problems B(τ)u(τ) = λu(τ), τ = 1, …, l, and one only needs to solve the problem for τ = 1. Once the CPOPs are obtained, we proceed as for the POPs and compute the CPOP coefficients z(t, τ) = v(τ)^{*T} x(t, τ), by projecting the data x(t, τ) onto the adjoint pattern v(τ), i.e. the eigenvector of B^T(τ) with the eigenvalue λ, yielding

z(t + 1, τ) = λ\, z(t, τ) + ε_{t+1,τ}.    (6.32)

Remark The adjoint patterns v(τ), τ = 1, …, l − 1, can be calculated recursively starting from those of B(l), in a similar manner to the eigenvectors u(τ), but going backward, see the exercise below.

Exercise Let v(τ + 1) be an adjoint vector of B(τ + 1) with eigenvalue λ and v(τ) the adjoint vector of B(τ) with the same eigenvalue. Show that v(τ) = α A^T(τ)v(τ + 1), where α is a complex number. Find α that makes the adjoint unit-length and periodic.

Hint Keeping in mind that B^T(τ + 1)v(τ + 1) = λ v(τ + 1), along with (6.29) and the periodicity of A(τ), we get B^T(τ)[A^T(τ)v(τ + 1)] = A^T(τ)[B^T(τ + 1)v(τ + 1)] = λ A^T(τ)v(τ + 1). Therefore, A^T(τ)v(τ + 1) is an eigenvector of B^T(τ), and v(τ) = α A^T(τ)v(τ + 1) (where we have assumed that all the eigenvalues are distinct). For the normalisation, assuming ‖v(τ + 1)‖ = 1, we can obtain v(τ) = r(τ + 1)^{−1} e^{iθ} A^T(τ)v(τ + 1), where r(τ + 1) = ‖A^T(τ)v(τ + 1)‖ and θ is as before.

Remark The estimation of A(τ) is obtained, for each τ = 1, …, l, as S_{τ,1} S_{τ,0}^{−1}, where S_{τ,0} is the sample covariance matrix of x_{t,τ}, and S_{τ,1} is the sample lag-1 cross-covariance matrix between x_{t,τ} and x_{t,τ+1}, t = 1, …, T.
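The following MATLAB sketch illustrates, under assumed conventions, one way to carry out this estimation; the data are supposed to be stored in a hypothetical array X of size p × l × T, with X(:,tau,t) the anomaly state at nested time tau of cycle t.

[p, l, T] = size(X);
A = zeros(p, p, l);
for tau = 1:l
    if tau < l
        X0 = squeeze(X(:, tau,   1:T));     % x_{t,tau}
        X1 = squeeze(X(:, tau+1, 1:T));     % x_{t,tau+1}
    else
        X0 = squeeze(X(:, l, 1:T-1));       % x_{t,l}
        X1 = squeeze(X(:, 1, 2:T));         % x_{t,l+1} = x_{t+1,1}
    end
    nt = size(X0, 2);
    S0 = (X0 * X0') / nt;                   % S_{tau,0}
    S1 = (X1 * X0') / nt;                   % S_{tau,1}
    A(:, :, tau) = S1 / S0;                 % A(tau) = S_{tau,1} * inv(S_{tau,0})
end
B1 = eye(p);
for tau = 1:l
    B1 = A(:, :, tau) * B1;                 % B(1) = A(l)...A(1), Eq. (6.29)
end
[CPOPs, D] = eig(B1);                       % CPOPs for tau = 1, Eq. (6.30)

The CPOPs for the remaining nested times τ = 2, …, l can then be obtained recursively with A(τ), as described above.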

6.5 Other Extensions/Interpretations of POPs

6.5.1 POPs and Normal Modes

Here we attempt to compare two different but related techniques, POPs and normal modes, to evaluate the effects of various feedback systems on the dynamics of


waves and disturbances. POP is an empirically based technique that attempts to gain knowledge of spatial–temporal characteristics of a complex system. The normal modes approach, on the other hand, is a (physical) model-based technique that aims at analysing the (linearly) fastest growing modes. The model could be a full GCM or another simple model, such as the quasi-geostrophic vorticity equations. The normal modes are obtained as the eigenfunctions of the linearised system equations by performing a stability analysis. Note that in this case there is no need for data but only the physical model. The linearisation is generally performed using a basic state8 (Simmons et al. 1983). The most unstable modes are the normal modes (eigenfunctions) of the linear system:

dx/dt = F′(x_0)\, x    (6.33)

associated with eigenvalues λ of F′(x_0) satisfying |λ| > 1. In (6.33) the vector x_0 is the basic state and F′(x_0) is the differential of F(·) at x_0, i.e. F′(x_0) = ∂F/∂x (x_0). Note that the eigenvalues and the corresponding normal modes are in general complex and therefore have standing and propagating patterns. It is commonly accepted that POPs also provide estimates of normal modes. Schnur et al. (1993), see also von Storch and Zwiers (1999), investigated normal modes from a quasi-geostrophic model and computed POPs using data simulated by the model. They also investigated and compared normal modes and POPs using the quasi-geostrophic vorticity equations on the sphere and concluded that POPs can be attributed to the linear growing phase of baroclinic waves. Unstable normal modes have eigenvalues with magnitude greater than unity. We have seen, however, that the eigenvalues of A = Σ_1 Σ_0^{−1}, or its estimate Â, are inside the unit circle. Part of this inconsistency is due to the fact that the normal modes are (implicitly) derived from a continuous system, whereas (6.3) is a discrete AR(1). A useful way is perhaps to compare the modes of (6.19) with the continuous POPs.

6.5.2 Complex POPs

In POP analysis the state vector of the system is real. POPs are not appropriate to model and identify standing oscillations (Bürger 1993). A standing oscillation corresponds to a real POP associated with a real eigenvalue. But this implies a (real) multivariate AR(1) model, i.e. a red spectrum or damped system. Since the eigenvalues are inside the unit circle, a linear first-order system is unable to model standing oscillations. Two alternatives can be considered to overcome this shortcoming, namely:

8 Often taken to be the mean state or climatology, although conceptually it should be a stationary state of the system. This choice is adopted because of the difficulties in finding stationary states.


(i) increase the order of the system,
(ii) incorporate further information by including the state of the system and its tendency.

The first solution is clearly not parsimonious since it requires more unknowns and can be expensive to run and difficult to interpret. The second alternative is more appropriate since the other required information is contained in the momentum or tendencies of the system in the continuous case. In the discrete case, the "momentum" or conjugate information is readily provided by the Hilbert transform. The complexified field

z_t = x_t + i\, y_t,    (6.34)

where y_t = H(x_t), is then used to compute the POPs, yielding Hilbert POPs (HPOPs). The Hilbert transform H(x_t) contains information about the system state tendency. Both x_t and H(x_t) play, respectively, the roles of position and momentum in Hamiltonian systems. Hence the use of the complexified system states becomes equivalent to using a second-order system but without increasing the number of unknown parameters. The HPOPs are therefore obtained as the eigenvectors of the new complex matrix

A = Σ_1 Σ_0^{−1},    (6.35)

where Σ_0 = E(z_t z_t^{*T}) and Σ_1 = E(z_{t+1} z_t^{*T}). The matrix A in (6.35) represents the equivalent of the feedback matrix of a multivariate complex AR(1) model z_{t+1} = A z_t + ε_{t+1}. Unlike usual POPs, HPOPs do not come in conjugate pairs and are able to resolve a maximum number of independent oscillations equal to the space dimension.
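As a minimal sketch (assuming an anomaly data matrix Xa of size n × p, time along rows, and MATLAB's hilbert function from the Signal Processing Toolbox), the HPOPs can be obtained along the following lines.

Z  = hilbert(Xa);                         % analytic signal: z_t = x_t + i*H(x_t)
Z0 = Z(1:end-1, :);  Z1 = Z(2:end, :);
n0 = size(Z0, 1);
S0 = (Z0.' * conj(Z0)) / n0;              % Sigma_0 = E[z_t z_t^{*T}]
S1 = (Z1.' * conj(Z0)) / n0;              % Sigma_1 = E[z_{t+1} z_t^{*T}]
A  = S1 / S0;                             % complex feedback matrix, Eq. (6.35)
[HPOPs, Lam] = eig(A);                    % HPOPs and their complex eigenvalues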

6.5.3 Hilbert Oscillation Patterns

POP analysis is based on a simple model representing a linearised version of a complex nonlinear model expressed in discrete form given by (6.3), which can also be written as x_{t+1} − x_t = B x_t + ε_{t+1}, in which one recognises the left hand side as a "time derivative". Since the Hilbert transform can also be interpreted as a (special) time derivative, an alternative way is to consider a similar model except that the left hand side is now a Hilbert transform of the original field (Irene Oliveira, personal communication). The model reads

H(x_t) = A x_t + ε_{t+1},    (6.36)


where y_t = H(x_t) is the Hilbert transform of x_t. The matrix A is estimated using the sample covariance matrix S of the data and the sample cross-covariance matrix S_xy between x_t and y_t, t = 1, …, n, as

Â = S_xy S^{−1}.    (6.37)

Hilbert oscillation patterns (HOPs) are obtained as the eigenvectors of Â. If S_zz is the (Hermitian) covariance matrix of the complexified field z_t = x_t + i y_t, then one has

A = i \left( I_p − \frac{1}{2} S_zz S^{−1} \right);    (6.38)

hence, the eigenvalues of A are pure imaginary since the eigenvalues of S_zz S^{−1} are real (non-negative). It is also straightforward to check that if S_zz is not invertible, then i is an eigenvalue of A with associated eigenvectors given by SN, where N is the null space of S_zz. Now let u be an eigenvector of S_zz S^{−1} with eigenvalue λ; then S_zz v = λ S v, where v = S^{−1} u. Taking the scalar projections x̃_t = v^{*T} x_t, ỹ_t = v^{*T} y_t and z̃_t = v^{*T} z_t, one gets, since var(z̃_t) = var(x̃_t) + var(ỹ_t) = 2 var(x̃_t),

var(z̃_t) = v^{*T} S_zz v = 2 v^{*T} S v;    (6.39)

hence λ = 2, and the corresponding eigenvalue of A, namely β = i(1 − λ/2), is zero. Hence the spectrum of A contains 0. In addition, when S_zz is not invertible, the spectrum also contains λ = i. These HOPs can be used to define Hilbert canonical correlation analysis (HCCA). Classical CCA (see Chap. 15) looks for the most correlated patterns between two fields x_t and y_t. In HCCA, the field y_t is taken to be H(x_t). The equations involved in HCCA are similar to those of CCA and are given by

S^{−1} S_{xy} S^{−1} S_{yx}\, a = λ\, a,    (6.40)

and a similar equation for the other patterns, with S_{xy} and S_{yx} exchanged. Defining u = Sa, and noting that S_{yx} = −S_{xy}, the above equation becomes

A^2 u = −λ\, u.    (6.41)

Hence the eigenvalues of HCCA are the (opposite of the) square of the eigenvalues associated with HOPs, and the associated eigenvectors are given by S−1 uk , where uk , k = 1, . . . , p, are the HOPs.
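A corresponding sketch for the HOPs, under the same assumptions on the data matrix Xa as above, estimates Eq. (6.37) by regressing the Hilbert transform on the field itself; the cross-covariance is taken here between H(x_t) and x_t.

Y    = imag(hilbert(Xa));                 % Hilbert transform y_t = H(x_t)
n    = size(Xa, 1);
S    = (Xa' * Xa) / n;                    % sample covariance of x_t
Syx  = (Y'  * Xa) / n;                    % cross-covariance between y_t and x_t
Ahat = Syx / S;                           % estimate of A, cf. Eq. (6.37)
[HOPs, Lam] = eig(Ahat);                  % HOPs; eigenvalues nearly pure imaginary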


6.5.4 Dynamic Mode Decomposition

An interesting method of analysing the dynamics of (linear or nonlinear) dynamical systems was introduced by Schmid (2010), under the name "dynamic mode decomposition" (DMD), to deal originally with fluid flow. The method was extended later by a number of authors (e.g. Tu et al. 2014; Williams et al. 2015). The method seeks to decompose the data using a set of modes characterised by oscillation frequencies and growth rates. The DMD modes are the analogues of normal modes for linear systems. These modes are obtained through analysing the eigenfunctions and associated eigenvalues of the composition operator, also known as the Koopman operator in dynamical system theory. Because the DMD modes have temporal characteristics (e.g. growth/decay rates) and are not in general orthogonal, they can outperform EOF analysis in terms of data representation or dimensionality reduction. Briefly, if x_1, …, x_n represent the set of d-dimensional time series, assumed to be related via an operator A, i.e. X_{1,n−1} = A X_{2,n} + E, where X_{k,l} is the d × (l − k + 1) matrix [x_k, x_{k+1}, …, x_l], and E is an error term, then the DMD eigenvalues and modes are provided, respectively, by the eigenvalues and the associated eigenvectors of A. The DMD modes can be computed using the Krylov method, via, for example, the Arnoldi algorithm (Appendix D), or simply using the SVD algorithm. Tu et al. (2014) present the theory behind DMD and extend it to a large class of datasets along with a number of improved algorithms. They also showed the utility of the method using numerical examples.
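A minimal SVD-based sketch of a DMD computation is given below (not the authors' code); it uses the common convention in which the operator maps each snapshot onto its successor, and the truncation level r is purely illustrative.

% X is a d x n matrix whose columns are the snapshots x_1,...,x_n
X1 = X(:, 1:end-1);   X2 = X(:, 2:end);
[U, S, V] = svd(X1, 'econ');              % reduced SVD of the first snapshot matrix
r  = 10;                                  % illustrative truncation (<= rank of X1)
Ur = U(:, 1:r);  Sr = S(1:r, 1:r);  Vr = V(:, 1:r);
Atil = Ur' * X2 * Vr / Sr;                % low-rank representation of the operator
[W, Lam] = eig(Atil);                     % DMD eigenvalues
Modes = X2 * Vr / Sr * W;                 % DMD modes ("exact DMD", Tu et al. 2014)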

6.6 High-Order POPs

The POP model based on the multivariate AR(1), Eq. (6.3), is based on the lag-1 autocorrelation of the process. This model can be extended by including m consecutive lags:

x_t = \sum_{l=1}^{m} A_l x_{t−l} + ε_t.    (6.42)

Equation (6.42) is the vector autoregressive, VAR(m), model of order m. The matrices A_k, k = 1, …, m, can be estimated using various approaches such as (stepwise) least squares (Neumaier and Schneider 2001) or state space models (e.g. Lütkepohl 2006). As for the AR(1) POP model, the VAR(m) model can be decomposed using pm p-dimensional normal (or eigen) modes of (6.42), e.g. Neumaier and Schneider (2001) and Schneider and Neumaier (2001). These multivariate POPs are characterised as damped oscillators having characteristic features and e-folding or damping times. This decomposition yields a system of mp coupled univariate


AR(1) models in which the coupling is through the noise covariance. The idea is to use the extended, or delay, state space as for extended EOFs, which is presented in Chap. 7. Denoting by x̃_t the delayed state vector using m lags, i.e. x̃_t = (x_t, x_{t−1}, …, x_{t−m+1})^T, and ε̃_t = (ε_t, 0, …, 0)^T, the model (6.42) can be written as a generalised AR(1) model:

x̃_t = A x̃_{t−1} + ε̃_t,    (6.43)

where

A = \begin{pmatrix} A_1 & A_2 & \cdots & A_{m-1} & A_m \\ I & O & \cdots & O & O \\ O & I & \cdots & O & O \\ \vdots & & \ddots & & \vdots \\ O & O & \cdots & I & O \end{pmatrix}.    (6.44)

The same decomposition can now be applied as for the VAR(1) case. Note, however, that because of the Frobenius structure of the mp × mp matrix A in (6.44) (e.g. Wilkinson 1988, chap. 1.3), the eigenvectors v_k, k = 1, …, mp, have a particular structure:

v_k = \begin{pmatrix} \lambda^{m-1} u_k \\ \vdots \\ \lambda u_k \\ u_k \end{pmatrix},    (6.45)

where u_k is a p-dimensional vector. It can be seen, by partitioning the vector v_k into m p-dimensional vectors, that u_k satisfies

\left( A_m + \lambda A_{m-1} + \ldots + \lambda^{m-1} A_1 \right) u_k = \lambda^m u_k.    (6.46)
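For illustration, the sketch below (with a hypothetical input Acoef, a p × p × m array of estimated VAR(m) coefficient matrices) builds the companion matrix of Eq. (6.44) and reads off the high-order POP modes using the structure (6.45).

[p, ~, m] = size(Acoef);
A = zeros(m*p);
for l = 1:m
    A(1:p, (l-1)*p+1:l*p) = Acoef(:, :, l);   % first block row: A_1 ... A_m
end
A(p+1:end, 1:(m-1)*p) = eye((m-1)*p);         % identity blocks below the first block row
[V, D] = eig(A);                              % eigen-decomposition of the companion matrix
lam  = diag(D);                               % mp eigenvalues (damped oscillators)
POPs = V(end-p+1:end, :);                     % u_k: last p entries of each v_k, Eq. (6.45)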

6.7 Principal Interaction Patterns

The principal interaction pattern (PIP) method was proposed originally by Hasselmann (1988). Slight modifications of the PIP method were introduced later by, e.g., Kwasniok (1996, 1997, 2004) and Wu (1996). A description is given below; for more technical details, the reader is referred to Kwasniok (1996) and later papers. The PIP method takes into account the (nonlinear) dynamics of a nonlinear system. In its simplest form, the PIP method attempts to project the dynamics of an N-dimensional (autonomous) dynamical system living in a Hilbert space E with basis e_i, i = 1, …, N:


du/dt = F(u),    (6.47)

where u = \sum_{i=1}^{N} u_i e_i, onto a lower L-dimensional Hilbert space P. This space is spanned by the PIP patterns p_1, …, p_L, with

p_i = \sum_{k=1}^{N} p_{ki}\, e_k,  i = 1, …, L,    (6.48)

where the N × L matrix P = (p_{ij}) contains the PIP patterns. The state vector u is then projected onto the PIP patterns to yield

z = Proj(u) = \sum_{i=1}^{L} z_i\, p_i.    (6.49)

The PIPs are normally chosen to be orthonormal, with respect to a predefined scalar product, i.e.

[p_i^*, p_j] = δ_{ij},  i, j = 1, …, L.    (6.50)

The dynamics within the reduced PIP space is then given by

ż_i = [p_i^*, Proj(F(u))],  i = 1, …, L.    (6.51)

The patterns are then sought by minimising a costfunction F measuring the discrepancy between the (true) tendency of the PIP coefficients, ż_i^t, and the projection of the tendency u̇ onto the PIP space, i.e.

PIPs = argmin_P F,    (6.52)

i=1

where ⟨·⟩ denotes an ensemble average. In simple terms, given an initial condition u_0, system (6.47) is integrated for a specified time interval τ. The obtained trajectory is then projected onto the PIPs. Similarly, the dynamics of Eq. (6.47) is projected onto the PIPs, which is then integrated forward using the projection Proj(u_0) = u_0^P of u_0 onto the PIPs as initial condition. The norm of the difference between the obtained two trajectories is then computed. More specifically, let u_τ^P be the state of the trajectory of (6.47) at time τ, starting from u_0, and projected onto the PIP space. Let also z_τ be the state of the trajectory of (6.51) starting from u_0^P. The discrepancy between the two trajectories at time τ, ‖u_τ^P − z_τ‖, is then computed, and a global measure ε² of the discrepancy is obtained. Kwasniok (2004), for example, used ε² = ‖u_{τ_max}^P − z_{τ_max}‖², for some integration time τ_max, whereas Crommelin and Majda (2004) used the total integral of this discrepancy, i.e. ε² = ∫_0^{τ_max} ‖u_τ^P − z_τ‖² dτ. The costfunction to be minimised with respect to the matrix P is then

6.7 Principal Interaction Patterns

F(P) = ⟨ ε²(u_0, τ_max, P) ⟩.    (6.53)

The minimisation of (6.53) with conditions of orthonormality of PIPs, and a prespecified τmax , is performed numerically with a quasi-Newton algorithm. Kwasniok (2004) applied PIP analysis to a barotropic quasi-geostrophic model. Figure 6.9 shows an example of the leading two PIPs along with the leading two rotated EOFs of the streamfunction field. The rotated EOFs are quite similar to the leading PIPs. The reason for this similarity, as suggested by Kwasniok (2004), is the possible dominance of the linear terms by the forcing chosen in the model. In general, however, this is not the case as shown in the example discussed by Crommelin and Majda (2004). These authors considered the six-dimensional

Fig. 6.9 Leading two PIPs (a,b) and two rotated EOFs (c,d) of the streamfunction from a barotropic quasi-geostrophic model on the sphere. Adapted from Kwasniok (2004). ©American Meteorological Society. Used with permission

142

6 Principal Oscillation Patterns and Their Extension

truncation model of the beta-plane quasi-geostrophic model of Charney and Devore (1979), also discussed in De Swart (1988). Crommelin and Majda (2004) compared several optimal bases applied to this six-dimensional system. Figure 6.10 shows the phase portrait and the model trajectory based on the leading 4 EOFs compared to the reference model trajectory. A similar plot is also obtained with the leading four PIPs (Fig. 6.11). The figures show clearly that the PIP trajectory represents better the reference model trajectory compared to EOFs. The PIP model was also found to be superior to the optimal persistent patterns-based models.

Fig. 6.10 Trajectory of the integration of the Charney and Devore (1979) model using the leading four-EOF model projected onto the leading two EOFs (top left) and onto the leading EOF (middle), and the reference trajectory projected onto the leading two EOFs (top right) and onto the leading EOF (bottom). Adapted from Crommelin and Majda (2004). ©American Meteorological Society. Used with permission

6.7 Principal Interaction Patterns

143

Fig. 6.11 Same as Fig. 6.10 but using a four-PIP model. The trajectories from the original reference model are also shown for comparison. Adapted from Crommelin and Majda (2004). ©American Meteorological Society. Used with permission

Chapter 7

Extended EOFs and SSA

Abstract Hilbert EOFs presented in Chap. 5 are based on a spectral method to identify propagating or oscillating features. This chapter describes a time domain method, the extended EOFs, to identify propagating patterns from spatio-temporal data sets. The method is similar to the EOF method except that the spatial dimension is extended to include lagged information. Examples from the Madden–Julian oscillation are also provided. Keywords Time–space EOFs · Singular spectrum analysis · Multichannel SSA · Extended EOFs · Dynamical reconstruction · OLR · Madden–Julian oscillation · Recurrence networks · Harmonic decomposition

7.1 Introduction

Atmospheric fields are very often significantly correlated in both the space and time dimensions. The EOF technique, for example, finds patterns that are both spatially and temporally uncorrelated. Such techniques make use of the spatial correlation but do not take into account the significant auto- and cross-correlations in time. As a result, travelling waves, for example, cannot be identified easily using these techniques, as was pointed out in the previous chapter. Complex (or Hilbert) EOFs (HEOFs) (Rasmusson et al. 1981; Barnett 1983; Horel 1984; von Storch and Zwiers 1999), presented in Chap. 5, have been introduced to detect propagating structures. In HEOFs, phase information is introduced using the conjugate part of the field, which is provided by its Hilbert transform. Chapter 5 provides an illustration of Hilbert EOFs with the QBO signal. So the information regarding the "propagation" is contained in the phase-shifted complex part of the system. However, despite this extra information provided by the Hilbert transform, the HEOF approach does not take into consideration the temporal auto- and cross-correlation in the field (Merrifield and Guza 1990). POPs and HPOPs (Hasselmann 1988; von Storch et al. 1995; Bürger 1993) are methods that aim at empirically inferring space–time characteristics of a complex field. These methods are based on a first-order autoregressive model and can


therefore be used to identify travelling structures and in some cases forecast the future system states. The eigenfunctions of the system feedback matrix in POPs and HPOPs, however, do not provide an orthogonal and complete basis. Besides being linear, another drawback of the AR(1) model, which involves only lag-1 autocorrelations, is that it may sometimes be inappropriate for modelling higher-order systems. The extended EOFs introduced by Weare and Nasstrom (1982) combine both the spatial correlation aspect of EOFs and the temporal auto- and cross-correlation obtained from the lagged covariance matrix. Subsequent works by Broomhead and colleagues (e.g. Broomhead and King 1986a,b) and Fraedrich (1986) focussed on the dynamical aspects of extended EOFs as a way of dynamical reconstruction of the attractor of a dynamical system that is partially observed, and termed it singular system analysis (SSA). Multichannel singular spectrum analysis (MSSA) was used later by Vautard et al. (1992) and Plaut and Vautard (1994), who applied it to atmospheric fields. Hannachi et al. (2011) applied extended EOFs to stratospheric warming, whereas an application to Rossby wave breaking and Greenland blocking can be found in Hannachi et al. (2012), see also the review of Hannachi et al. (2007). An appropriate example where propagating signals are prominent and where extended EOFs can be applied to reveal these signals is the Madden–Julian oscillation (MJO). The MJO is an eastward propagating planetary-scale wave of tropical convective anomalies and is a dominant mode of intraseasonal tropical variability. The oscillation is broadband with a period between 40 and 60 days and has been identified in zonal and divergent wind, sea level pressure, and outgoing longwave radiation (OLR) (Madden and Julian 1994). Figure 7.1 shows the OLR field over the tropical band on 25 December 1996. Note, for example, the low-OLR region over the warm pool, which is an area of large convective activity, in addition to other regions such as Amazonia and tropical Africa. The OLR data come from NCEP/NCAR reanalyses over the tropical region

Fig. 7.1 Outgoing long wave radiation distribution over the tropics in 25 December 1996. Units w/m2 . Adapted from Hannachi et al. (2007)

7.2 Dynamical Reconstruction and SSA

147

Fig. 7.2 Leading EOF of OLR anomalies. Units arbitrary. Adapted from Hannachi et al. (2007)

equatorward from 30◦ latitude. Seasonality of OLR is quite complex and depends on the latitude band (Hannachi et al. 2007). Figure 7.2 shows the leading EOF pattern explaining about 15% of the total variability. It has opposite signs north and south of the equator and represents the seasonal cycle, mostly explained by the inter-tropical convergence zone (ITCZ).

7.2 Dynamical Reconstruction and SSA

7.2.1 Background

The issue of dynamical reconstruction of attractors is rooted in the theory of dynamical systems and aims at reconstructing the multidimensional system's trajectory, or more widely the attractor of a chaotic system. In a nutshell, a chaotic system is a system that can be described by a set1 of ordinary differential equations dx/dt = F(x) which cannot be analytically integrated. A chaotic trajectory is the trajectory of the chaotic system within its phase space. A characteristic feature of a chaotic system is its sensitivity to initial conditions, i.e. trajectories corresponding to two very close initial conditions diverge exponentially in a finite time. A chaotic system gives rise to a chaotic attractor, a set with extremely complex topology. Figure 7.3 shows an example of the popular Lorenz (1963) system (Fig. 7.3a) and a contaminated (or

1 Of at least 3 variables.


Fig. 7.3 Time series of the Lorenz (1963) model (a) and its contamination obtained by adding a red noise. Adapted from Hannachi and Allen (2001)

noisy) time series (Fig. 7.3b). SSA can be used to obtain the hidden signal from the noisy data. The general problem of dynamical reconstruction consists in inferring dynamical and geometric characteristics of a chaotic attractor from a univariate time series, x_1, x_2, …, x_n, sampled from the system. The solution to this problem is based on transforming the one-dimensional sequence into a multivariate time series using the so-called method of delays, or delay coordinates, first proposed by Packard et al. (1980), and is obtained using a sliding window through the time series to yield

x_t = \left( x_t, x_{t+\tau}, \ldots, x_{t+(M-1)\tau} \right)^T,    (7.1)

where τ is the time delay or lag and M is known as the embedding dimension. A basic result in the theory of dynamical systems indicates that the characteristics of the dynamical attractor can be faithfully recovered using the delay coordinate method provided that the embedding dimension is large enough (Takens 1981).
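A short MATLAB sketch of the delay-coordinate construction of Eq. (7.1) is given below; x, tau and M are assumed inputs, and the values chosen here are only illustrative.

x   = x(:);                             % univariate series as a column vector
tau = 1;  M = 30;                       % illustrative delay and embedding dimension
n   = length(x);
K   = n - (M-1)*tau;                    % number of delay vectors
Xd  = zeros(K, M);
for j = 1:M
    Xd(:, j) = x((1:K) + (j-1)*tau);    % j-th delayed copy of the series, Eq. (7.1)
end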

7.2.2 Dynamical Reconstruction and SSA In the sequel we suppose for simplicity that τ = 1. The analysis of the multivariate time series xt , t = 1, 2, . . . n − M + 1, for dynamical reconstruction is once more faced with the problem of dimensionality. Attractors of low-order chaotic systems are in general low-dimensional and can in principle be explored within a smaller dimension than that of the embedding space. A straightforward approach is to first reduce the space dimension using, for example, EOFs of the data matrix obtained from the multivariate time series given in Eq. (7.1), i.e. time EOFs. This approach has been in fact adopted by Broomhead and King (1986a,b). They used SVD to calculate an optimal basis for the trajectory of the reconstructed attractor. If the dynamic is in fact chaotic, then the spectrum will be discontinuous or singular, with the first few large singular values well separated from the floor (or


background) spectrum associated with the noise level, hence the label singular system or spectrum analysis (SSA). At the same time, Fraedrich (1986) applied SSA, using a few climate records, in an attempt to analyse their chaotic nature and estimate their fractal dimensions. Vautard et al. (1992) analysed the spectral properties of SSA and applied it to the Central England temperature (CET) time series in an attempt to find an oscillation buried in a noise background. They claim that CET contains a 10-year cycle. Allen and Smith (1997) showed, however, that the CET time series consists of a long-term trend in a coloured background noise. SSA is in fact a useful tool to find a periodic signal contaminated with a background white noise. For example, if the time series consists of a sine wave plus a white noise, then asymptotically, the spectrum will have a leading pair of equal eigenvalues and a flat spectrum. The time EOFs corresponding to the leading eigenvalues will consist of two sine waves (in delay space) in quadrature. The method, however, can fail when, for example, the noise is coloured or the system is nonlinear. A probability-based approach is proposed in Hannachi (2000), see also Hannachi and Allen (2001). The anomaly data matrix, referred to as the trajectory matrix by Broomhead and King (1986a,b), obtained from the delayed time series of Eq. (7.1), which we suppose already centred, is given by

X = \begin{pmatrix} x_1 & x_2 & \ldots & x_M \\ x_2 & x_3 & \ldots & x_{M+1} \\ \vdots & \vdots & & \vdots \\ x_{n-M+1} & x_{n-M+2} & \ldots & x_n \end{pmatrix}.    (7.2)

The trajectory matrix expressed by Eq. (7.2) has a special form, namely with constant second diagonals. This property gets transmitted to the covariance matrix:

C = \frac{1}{n-M+1} X^T X = \frac{1}{n-M+1} \sum_{t=1}^{n-M+1} \mathbf{x}_t \mathbf{x}_t^T,    (7.3)

which is a Toeplitz matrix, i.e. with constant diagonals corresponding to the same lags. This Toeplitz structure is known to have useful properties, see e.g. Graybill (1969). If σ² is the variance of the time series, then C becomes

\frac{1}{\sigma^2} C = \Gamma = \begin{pmatrix} 1 & \rho_1 & \ldots & \rho_{M-1} \\ \rho_1 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \rho_1 \\ \rho_{M-1} & \ldots & \rho_1 & 1 \end{pmatrix},    (7.4)

where Γ is the lagged autocorrelation matrix. If U = (u_1, …, u_M) is the set of (time) EOFs, or similarly the right singular vectors of X, the PC matrix is given by XU. The i'th PC c_i = (c_i(1), c_i(2), …, c_i(n − M + 1)) is then given by

c_i(t) = \mathbf{x}_{M+t-1}^T \mathbf{u}_i = \sum_{l=1}^{M} u_{il}\, x_{M+t-l}    (7.5)

for t = 1, …, n − M + 1. Hence the PCs are obtained as weighted moving averages of the time series. Note, however, that this is not a conventional moving average since the coefficients are functions of the time series itself, and therefore the PCs and also the EOFs are somehow "special" moving averages of the time series. One is tempted to say that the attribute "special" here points to nonlinearity, but of course once the weights are computed the filtering takes the form of an ordinary moving average. The method, however, looks rather like an adaptive filter. It is pointed out in von Storch and Zwiers (1999) that the filter (7.5) causes a frequency-dependent phase shift because the weights are not in general symmetric. However, the symmetric Toeplitz matrix C has an interesting property, namely the symmetry of its eigenvectors (see the exercise below). Figure 7.4 shows the spectrum and the leading pair of EOFs of the covariance matrix of the extended data with a window lag M = 30, similar to Eq. (7.3), except that it is weighted by the inverse of a similar covariance matrix of the noise (Hannachi and Allen 2001). The leading two eigenvalues are equal and well separated from the rest of the spectrum. The associated (extended) EOFs show two sine waves in quadrature reflecting the (quasi-)periodic nature of the hidden signal.

Exercise Show that the eigenvectors of the covariance matrix (7.3) are symmetric, i.e. the elements of each eigenvector are symmetric.

Hint Let a = (a_0, a_1, …, a_{M−1})^T be an eigenvector of the lagged autocorrelation matrix Γ with eigenvalue λ, i.e. Γa = λa. Now let us define the vector b = (a_{M−1}, a_{M−2}, …, a_0)^T, i.e. the "reverse" of a, and let us compute Γb. Clearly the first element [Γb]_1 of Γb is precisely [Γa]_M, i.e.


Fig. 7.4 Spectrum of the grand covariance matrix, Eq. (7.3), of the noisy time series of Fig. 7.3b weighted by the inverse of the same covariance matrix of the noise (a) and the leading two extended EOFs. Adapted from Hannachi and Allen (2001)

[\Gamma b]_1 = b_0 + \sum_{k=1}^{M-1} \rho_k b_k = a_{M-1} + \sum_{k=1}^{M-1} \rho_k a_{M-k-1} = \lambda a_{M-1} = \lambda b_0

M 

ck (t)uk ,

(7.6)

k=1

one gets xt+l−1 =

M 

ck (t)uk,l

(7.7)

k=1

for l = 1, . . . M. It is interesting to note that for a given xt , t = 1, . . . n − M + 1, Eq. (7.7) yields different reconstructions as it will be discussed in section 7.5.3. Each decomposition, however, has its own variance distribution among the different SSA vectors.

7.3 Examples 7.3.1 White Noise For a white noise time series with variance σ 2 , the covariance matrix of the delayed vectors is simply σ 2 IM , and the eigenvectors are simply the degenerate unitary vectors.

152

7 Extended EOFs and SSA

7.3.2 Red Noise The red noise is defined by xt+1 = ρxt + εt+1 ,

(7.8)

for t = 1, 2, . . ., with independent and identically distributed (IID) noise with zero mean and variance σ 2 . The autocorrelation function is ρx (τ ) = ρ |τ | ,

(7.9)

and the Toeplitz covariance matrix C of xt = (xt , xt+1 , . . . , xt+M−1 )T takes the form ⎛ ⎜ ⎜ 1 C=⎜ ⎜ 2 σ ⎝

1 ρ .. . ρ M−1

⎞ ρ . . . ρ M−1 . ⎟ . 1 . . .. ⎟ ⎟. ⎟ .. .. . . ρ ⎠ . . . ρ1 1

(7.10)

To compute the eigenvalues of (7.10), one could, for example, start first by inverting C and then compute its eigenvalues. The inverse of C has been computed, see e.g. Whittle (1951) and Wise (1955). For example, Wise (1955) provided a general way to compute the autocovariance matrix of a general ARMA(p, q) time series. Following Wise (1955), Eq. (7.8) can be written in a matrix form as (I − ρJ) xt = ε t ,

(7.11)

where xt = (xt , xt+1 , . . .)T , and ε t = (εt , εt+1 , . . .)T are semi-infinite vectors, I is the semi-infinite identity matrix, and J is the semi-infinite auxiliary matrix whose finite counterpart is ⎛

0 1 ... ⎜ ⎜ 0 0 ... Jn = ⎜ ⎜. . . ⎝ .. . . . . 0 ... 0

⎞ 0 .. ⎟ .⎟ ⎟, ⎟ 1⎠ 0

that is with ones on the superdiagonal and zeros elsewhere. The matrix Jn is nilpotent with Jnn = O. Using Eq. (7.11), we get   −1  E xt xTt =  = σ 2 (I − ρJ)−1 I − ρJT ,

7.3 Examples

153

that is   σ 2  −1 = I − ρJT (I − ρJ) .

(7.12)

By taking the finite counterpart of Eq. (7.12) corresponding to a finite sample x1 , x2 , . . . , xn , we get the following tridiagonal inverse matrix of C: ⎛

σ 2 C−1

⎞ 1 −ρ 0 ... 0 ⎜ −ρ 1 + ρ 2 −ρ ... 0 ⎟ ⎜ ⎟ ⎜ ⎟ . ⎜ ⎟ . . 0 ⎟ ⎜ 0 −ρ 1 + ρ 2 =⎜ . ⎟. .. .. .. ⎜ . ⎟ . . . ⎜ . −ρ ⎟ ⎜ ⎟ 2 ⎝ 0 ... −ρ 1 + ρ −ρ ⎠ 0 ... 0 −ρ 1 + ρ 2

(7.13)

Another direct way to compute C−1 is given, for example, in Graybill (1969, theorem 8.3.7, p. 180). Now the eigenvalues of C−1 are given by solving the polynomial equation: Dn = |C−1 − λIn | = 0. Again, a simple way to solve this determinantal equation is to decompose Dn as Dn = (1 − λ)n−1 + ρδn−1 , where

n−1

) ) ) 1 + ρ2 − λ ) −ρ ... 0 ) ) 2 − λ ... ) ) 0 −ρ 1 + ρ ) ) =) ) . . . .. .. .. ) ) −ρ ) ) ) 2 0 ... −ρ 1 + ρ − ρ )

is the determinant of C−1 − λI after removing the first line and first column, and

δn−1

) ) ) −ρ ) 0 0 ... 0 ) ) ) −ρ 1 + ρ 2 − λ −ρ ) ... 0 ) ) ) .. ) .. .. .. =) . ) . . . −ρ ) ) ) 0 ) 2 −ρ ... −ρ 1 + ρ − λ ) ) ) 0 2 ... 0 −ρ 1 + ρ − λ)

is the determinant of C−1 − λI after removing the second line and first column. We now have the recurrence relationships: δn = −ρn−1

, n = 1 + ρ 2 − λ n−1 + ρδn−1

(7.14)

154

7 Extended EOFs and SSA

so that n = 1 + ρ 2 − λ n−1 − ρ 2 n−1 , which yields, in general, a solution of the form: + bμn−2 n = aμn−2 1 2 , where μ1,2 are the roots (supposed to be distinct) of the quadratic equation x 2 − (1 + ρ 2 − λ)x + ρ 2 = 0. Note that the constants a and b are obtained from the initial conditions such as 2 and 3 .

7.4 SSA and Periodic Signals The use of SSA to find periodic signals from time series has been known since the early 1950s with Whittle (1951). The issue was reviewed by Wise (1955), who showed that for a stationary periodic (circular) time series of period n, the autocovariance matrix of the time series contains at least one eigenvalue with multiplicity greater or equal than two. Wise2 (1955) considered a stationary zeromean random vector x = (xn , xn−1 , . . . , x1 ) having a circular covariance matrix, i.e. γn+l = γn−l = γl , where γn = E (xt xt+n ) = σ 2 ρn , and σ 2 is the common variance of the xk ’s. The corresponding circular autocorrelation matrix is given by ⎛

1 ⎜ ρ1 ⎜ =⎜ ⎝ ρ2 ρ1

ρ1 ρ2 . . . 1 ρ1 . . . .. .. .. . . . ρ2 . . . ρ1

⎞ ρ1 ρ2 ⎟ ⎟ ⎟. ρ1 ⎠

(7.15)

1

Exercise 1. Show that the matrix ⎛

⎞ 0 0⎟ ⎟ ⎟ ⎟ ⎟ 0 0 ... 1⎠ 1 0 0 ... 0

0 ⎜0 ⎜ ⎜ W = ⎜ ... ⎜ ⎝0

1 0 .. .

0 1 .. .

... ... .. .

is unitary, i.e. WT W = WWT = I. 2 Wise

also extended the analysis to calculate the eigenvalues of a circular ARMA process.

7.4 SSA and Periodic Signals

155

   

T 2. Show that  = I+ρ1 W + WT +ρ2 W2 + W2 +. . .+qρp Wp + Wp T where p = [n/2], i.e. the integer part of n/2 and q equals 1 if n is odd, and 12 otherwise. 3. Compute the eigenvalues of W and show that it is diagonalisable, i.e. W = A A−1 . Hint The eigenvalues of W are the unit roots ωk , k = 1, . . . n of ωn − 1 = 0. Hence W = A A−1 , and Wα + W−α = A α + −α A−1 . This means in particular that Wα + W−α is diagonalisable with ωk + ωk−1 as eigenvalues. Exercise (Wise 1955) Compute the eigenvalues (or latent roots) λk , k = 1, 2 . . . n of  and show, in particular, that they can be expressed as an expansion into Fourier sine–cosine functions as λk = 1 + 2ρ1 cos

2π k 4π k 2πpk + 2ρ2 cos + . . . + 2qρp cos n n n

(7.16)

and that λk = λn−k . Remark The eigenvalues of  can also be obtained directly by calculating the determinant  = | − λI| by writing (see e.g. Mirsky 1955, p. 36) ⎡  = nk=1 ⎣(1 − λ) +

n−1 

⎤ ρj ωk ⎦ . j

j =1

The eigenvalues of  are therefore given by λk = f (θk ), θk = 2πn k , and f (ω) is the spectral density function of the circular process. As a consequence, we see that λn−k = λk . After the eigenvalues have been found, it can now be checked that the vector   2π k 2π k 2π nk 2π nk 1 cos uk = √ + sin , . . . , cos + sin n n n n n

(7.17)

is the unitary eigenvector associated with λk . Furthermore, any two eigenvectors of a degenerate eigenvalue are in quadrature. Note that when n is even (7.16) becomes λk = 1 + 2

n2/−1  j =1

ρj cos

2π kj + ρ n2 cos π k. n

(7.18)

The same procedure can be applied when the time series contains, for example, a periodic signal but is not periodic. This was investigated by Basilevsky (1983) and Basilevsky and Hum (1979) using the Karhunen-Loéve decomposition, which consists precisely of an EOF analysis of the lagged covariance (Toeplitz) matrix, i.e.

156

7 Extended EOFs and SSA

SSA. Basilevsky (1983) applied the procedure to UK unemployment time series. Note that, as in the circular case, the eigenvalues of the lagged covariance matrix provide a discrete approximation of the power spectrum of the time series. The eigenvectors, on the other hand, are time lagged series. Remark (SSA and Wavelets) Wavelet analysis has been used to study time series with intermittent phenomena including chaotic signals. A characteristic feature of wavelet transform is that it can focus on specific parts of the signal, to examine local structure by some sort of algebraic zooming (see Sect. 5.4.1, footnote 7.) For a continuous signal, the wavelet transform is a special integral transform whose kernel is obtained by a translation and dilation of a localised function, which is the mother wavelet (Meyer 1992; Daubechies 1992). Wavelet transform is comparable to SSA, with main differences related to locality. For example, SSA modes, such as EOFs and Fourier transform, contain information from the whole time series, whereas wavelet transform is localised and can therefore analyse complex signals, such as self-similar signals and singularities. A detailed account of the difference between the two methods, with more references, can be found in Ghil et al. (2002), who also discuss local SSA as a substitute to conventional SSA. We now consider the (lagged) autocovariance matrix of a stationary time series xt , t = 1, 2 . . . with autocovariance function γ () and spectral density f (), ⎛ ⎜ ⎜ =⎜ ⎝

γ0 γ1 .. . γn−1

⎞ γ1 . . . γn−1 γ0 . . . γn−2 ⎟ ⎟ .. ⎟ , .. .. . . . ⎠ γn−2 . . . γ0

(7.19)

and we suppose, without loss of generality, n to be odd. We also let u0 = T  1  1 2− 2 , 1, 0, 1, . . . , 0 , and for k = 1, . . . n − 1, uk = 2− 2 , cos 2πn k , sin 2πn k , T cos 4πn k , sin 4πn k , . . . , sin 2π(n−1)k . We now form the following two arrays: 2n U=

*

2/n (u0 , u1 , . . . , un−1 )T

(7.20)

and  = Diag (λ1 , λ2 , . . . , λn ) , where λ1 = 1, 2, . . . n−1 2 .

1 2π

 j

γj , and λ2k = λ2k+1 = 1

1 2π

(7.21)  j

γj e−2iπj k/n for k =

Remark When n is even, an additional row, n− 2 (1, −1, 1, . . . , 1, −1), is added in 1 the last row of U, and the last entry of  is λn = 2π j γj cos πj .

7.5 Extended EOFs or Multivariate SSA

157

Now for  defined in (7.15) the matrix UT U is diagonal and converges to 2π  as n increases. The nice thing is that the same result also extends to the lagged autocovariance matrix  in (7.19). Precisely, we have the following theorem, see e.g. Fuller (1976, p. 138). Theorem Let  be defined by (7.19), with absolutely summable autocovariance function γ (), and U and  defined by (7.20) and (7.21), respectively, and then UT U − 2π  converges to O as n goes to infinity. Hence for large sample sizes n, the eigenvalues λj of the lagged autocovariance matrix  of x1 , . . . , xn are approximately equal to 2πf (θj ), where θj = 2πj/n, j = 0, 1, . . . , n − 1.

7.5 Extended EOFs or Multivariate SSA 7.5.1 Background The extension of SSA to two or more variables was performed by Broomhead and King (1986a,b), who applied multichannel SSA (MSSA) to the Lorenz system for dynamical reconstruction of the attractor. It was also applied later by Kimoto et al. (1991) to look for propagating structures from 500-mb geopotential height reanalyses. MSSA makes use of space and time information simultaneously to find coherent structures, referred to sometimes as space–time EOFs. Extended EOF analysis is another name for MSSA and was introduced in the meteorological literature some years earlier by Weare and Nasstrom (1982). They applied it to the 300-mb relative vorticity and identified westward propagating Rossby waves with phase speeds of the order 0.4 m/s. They also applied it to tropical Pacific SST and identified strong persistence in El-Niño. Below we discuss the computation and use of extended EOFs, see also Hannachi et al. (2007).

7.5.2 Definition and Computation of EEOFs

In EEOF analysis the atmospheric state vector at time t, i.e. xt = xt1 , . . . xtp , t = 1, . . . , n, used in traditional EOF, is extended to include temporal information as

x t = xt1 , . . . xt+M−1,1 , xt2 , . . . xt+M−1,2 , . . . xt,p , . . . xt+M−1,p

(7.22)

158

7 Extended EOFs and SSA

with t = 1, . . . , n − M + 1. The new data matrix takes now the form: ⎡ ⎢ ⎢ X =⎢ ⎣



x1 x2 .. .

⎥ ⎥ ⎥. ⎦

(7.23)

x n−M+1 It is now clear from (7.22) that time is incorporated in the state vector side by side with the spatial dimension. If we denote by

x st = xts , xt+1,s . . . xt+M−1,s ,

(7.24)

then the extended state vector (7.22) is written in a similar form to the conventional state vector, i.e.   p (7.25) x t = x 1t , x 2t , . . . , x t except that now the elements x kt , k = 1, . . . p, of this grand state vector, Eq. (7.24), are themselves temporal-lagged values. The data matrix X in Eq. (7.23) now takes the form ⎡ ⎢ X =⎣

x 11 .. .

x 21 .. .

x 1n−M+1

x 2n−M+1

p

...

x1 .. .

...

p x n−M+1

⎤ ⎥ ⎦,

(7.26)

which is again similar to traditional data matrix X, see Chap. 2, except that now its elements are (temporal) vectors. The vector x st in (7.24) is normally referred to as the delayed vector obtained from the time series (xts ), t = 1, . . . n of the field value at grid point s. The new data matrix (7.26) is now of order (n − M + 1) × pM, which is significantly larger than the original matrix dimension n × p. We suppose that X in (7.26) has been centred and weighted, etc. The covariance matrix of time series (7.25) obtained using the grand data matrix (7.26) is ⎡

⎤ C11 C12 . . . C1M ⎢ C21 C22 . . . C2M ⎥ 1 ⎢ ⎥ = XT X = ⎢ . .. .. ⎥ , ⎣ .. n−M +1 . . ⎦ CM1 CM2 . . . CMM

(7.27)

7.5 Extended EOFs or Multivariate SSA

159

where each Cij , 1 ≤ i, j ≤ M, is the lagged covariance matrix between the i’th and j’th gridpoints and is given by3 Cij =

n−M+1  1 T j x it x t . n−M +1

(7.28)

t=1

If the elements of the data matrix (7.26) were random variables, then each submatrix 

T Cij = E x i x j , from the covariance matrix  = E X T X , will exactly take a symmetric Toeplitz form, i.e. with constant diagonals, and consequently,  will be block Toeplitz. Due to finite sampling; however, Cij is approximately Toeplitz for large values of n, compared to the window length M. This is in general the case when we deal with high frequency data, e.g. daily observations or even monthly averages from long climate model simulations. The symmetric covariance matrix  is therefore approximately block Toeplitz for large values of n. An alternative form of the data matrix is provided by re-writing the state vector (7.22) in the form

x t = xt1 , . . . xt,p , xt+1,1 , . . . xt+1,p , . . . xt+M−1,1 , . . . xt+M−1,p ,

(7.29)

that is x t = (xt , xt+1 , . . . , xt+M−1 )

(7.30)

where xt is the state vector at time t, t = 1, . . . n − M + 1, i.e.

xt = xt1 , . . . , xt,p . Hence the matrix (7.26) now takes the alternative form4 ⎡ ⎢ X1 = ⎣

x1 .. .

x2 .. .

⎤ . . . xM .. ⎥ . . ⎦

(7.31)

xn−M+1 xn−M+2 . . . xn This form is exactly equivalent to (7.26) since it is obtained from (7.26) by a permutation of the columns as X 1 = X P,

3 Other alternatives to compute C ij

(7.32)

also exist, and they are related to the way the lagged covariance between two time series is computed, see e.g. Priestly (1981) and Jenkins and Watts (1968). 4 used by Weare and Nasstrom (1982).

160

7 Extended EOFs and SSA

where P = (pij ), i, j = 1, . . . Mp, is a permutation matrix5 given by pij = δi,α ,

(7.33)

where α is a function of j given by α = rM + pj + 1 where j − 1 ≡ r(p), and [x] is the integer part of x. In order to compute the extended EOFs, the OLR anomalies, discussed in Sect. 7.1, with respect to the long-term climatology for the period 1 Jan 1996–31 Dec 2000, are analysed. The dimension of the data is reduced by keeping the leading 10 EOFs/PCs. Figure 7.5 shows the obtained spectrum of the grand covariance matrix of Eq. (7.27) using the leading 10 OLR PCs, with a window lag M = 80 days. The anomalies are computed with respect to the daily climatology over the period 1 Jan 1996–31 Dec 2000. The leading two eigenvalues correspond to the annual cycle. They do not look nearly equal and separated from the rest of the spectrum. This is due to the relatively small sample size and the choice of the window lag, which is much smaller than the length of the seasonal cycle. EEOFs are the EOFs of the extended data matrix (7.23), i.e. the eigenvectors of the grand covariance matrix  given in (7.27). They can be obtained directly by computing the eigenvalues/eigenvectors of (7.26). Alternatively, we can use again the SVD of the grand data matrix X in (7.26). This yields X = VUT ,

(7.34)

where U = (uij ) = (u1 , u2 , . . . , ud ) represents the matrix of the d extended EOFs or left singular vectors of X . Here, d = Mp represents now the number

Fig. 7.5 Spectrum of the grand covariance matrix, Eq. (7.3), of the leading 10 OLR PCs using a window lag M = 80 days. The vertical bars show the approximate 95% confidence interval using an hand-waving effective sample size of 116. Adapted from Hannachi et al. (2007)

5 That

is a matrix containing exactly 1 in every line and every column and zeros elsewhere. A permutation matrix P is orthogonal, i.e. PPT = PT P = I.

7.5 Extended EOFs or Multivariate SSA

161

of the newly obtained variables, i.e. the number of columns of the grand data matrix. The diagonal matrix contains the singular values θ_1, …, θ_d of X, and V = (v_1, v_2, …, v_d) is the matrix of the right singular vectors or extended PCs, where the k'th extended PC is v_k = (v_k(1), …, v_k(n − M + 1))^T. The computation of extended EOFs is again similar to conventional EOFs. Given the gridded data matrix X(n, p12) and the window length m, the cornerstone of EEOFs is to compute the extended data matrix EX, as shown in the simple Matlab code:

>> EX = [ ];                             % extended (grand) data matrix
>> for t = 1:(n-m+1)
>>    test0 = [ ]; test1 = [ ];
>>    for s = 1:p12                      % loop over grid points (columns of X)
>>       test1 = X(t:(t+m-1), s)';       % m lagged values at grid point s
>>       test0 = [test0 test1];          % append to the extended state vector
>>    end
>>    EX = [EX; test0];                  % row t of the extended data matrix
>> end

The extended EOFs and extended PCs, along with the associated explained variance, are then computed as for EOFs using the new data matrix EX from the above piece of Matlab code. These extended EOFs and PCs can be used to filter the data by removing the contribution from nonsignificant components and also for reconstruction purposes, as detailed below and illustrated with the OLR example.
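A possible continuation of the above snippet (a sketch only, with variable names chosen to echo Eq. (7.34)) computes the extended EOFs, extended PCs and explained variances from EX.

[V, S, U] = svd(EX, 'econ');                   % EX = V*S*U', cf. Eq. (7.34)
EEOFs  = U;                                    % extended EOFs (columns of U)
EPCs   = V * S;                                % extended PCs
expvar = 100 * diag(S).^2 / sum(diag(S).^2);   % explained variance (%)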

7.5.3 Data Filtering and Oscillation Reconstruction The extended EOFs U can be used as a filter exactly like EOFs. For instance, the SVD decomposition (7.34) yields the expansion of each row x t of X in (7.26) x Tt =

d 

(7.35)

θk vk (t)uk ,

k=1

for t = 1, . . . n − M + 1, or in terms of the original variables xt as xTt+j −1 =

d 

j

θk vk (t)uk

(7.36)

k=1

for t = 1, . . . n − M + 1, and j = 1, . . . M, and where T

j uk = uj,k , uj +M,k , . . . , uj +(p−1)M,k .

(7.37)

162

7 Extended EOFs and SSA j

Note that the expression of the vector uk in the expansion (7.36) depends on the form of the data matrix. The one given above corresponds to (7.26), whereas for the data matrix X 1 in Eq. (7.31) we get T

j uk = u(j −1)p+1,k , u(j −1)p+2,k , . . . , ujp,k .

(7.38)

Note also that when we filter out higher EEOFs, expression (7.36) is to be truncated to the required order d1 < d. Figure 7.6, for example, shows PC1 and its reconstruction based on Eq. (7.40) using the leading 5 extended PCs. Figure 7.5 shows another pair of approximately equal eigenvalues, namely, eigenvalues 4 and 5. Figure 7.7 shows a time plot of extended PC4/PC5, and their phase plot. This figure reflects the semi-annual oscillation signature in OLR. Figure 7.8 shows the extended EOF8 pattern along 10◦ N as a function of the time lag. Extended EOF8 shows an eastward propagating

Fig. 7.6 Time series of the raw and reconstructed OLR PC1. Adapted from Hannachi et al. (2007)

Fig. 7.7 Time series of OLR extended PC4 and PC5 and their phase plot. Adapted from Hannachi et al. (2007)

7.5 Extended EOFs or Multivariate SSA

163

Fig. 7.8 Extended EOF 8 of the OLD anomalies along 10o N as a function of time lag. Units arbitrary. Adapted from Hannachi et al. (2007)

wave with an approximate phase speed of about 9 m/s, comparable to that of Kelvin waves. The expansion (7.36) is exact by construction. However, when we truncate it by keeping a smaller number of EEOFs for filtering purposes, e.g. when we reconstruct the field components from a single EEOF or a pair of EEOFs corresponding, for example, to an oscillation, then the previous expansion does not give a complete picture. This is because when (7.36) is truncated to a smaller subset K of EEOFs yielding, e.g. yTt+j −1 =



j

θk vk (t)uk ,

(7.39)

k in K

where yt = yt,1 , . . . , yt,p is the filtered or reconstructed state space vector, then one obtains a multivalue function. For example, for t = 1 and j = 2, we get one value of yt,1 , and for t = 2 and j = 1, we get another value of yt,1 . Note that this is due to the fact that the EEOFs have time lagged components. To get a unique reconstructed value, we simply take the average of those multiple

164

7 Extended EOFs and SSA

values. The number of multiple values depends6 on the value of time t = 1, . . . n. The reconstructed variables using a subset K of EEOFs are then easily obtained from (7.39) by ⎧ 1 t  j ⎪ ⎪ t j =1 K θk vk (t − j + 1)uk for 1 ≤ t ≤ M − 1 ⎪ ⎪ ⎪ ⎨  j T M  1 yt = M j =1 K θk vk (t − j + 1)uk for M ≤ t ≤ n − M + 1 ⎪ ⎪ ⎪ ⎪ ⎪  ⎩ 1 M j j =t−n+M K θk vk (t − j + 1)uk for n − M + 2 ≤ t ≤ n n−t+1 (7.40) The eigenvalues related to the MJO are reflected by the pair (8,9), see Fig. 7.5. The time series and phase plots of extended EOFs 8 and 9 are shown in Fig. 7.9. This oscillating pair has a period of about 50 days. This MJO signal projects onto many PCs. Figure 7.10 shows a time series of the reconstructed PC1 component using extended EOFs/PCs 8 and 9. For example, PCs 5 to 8 are the most energetic components regarding MJO. Notice, in particular, the weak projection of MJO onto the annual cycle. Figure 7.11 shows the reconstructed OLR field for the period 3 March 1997 to 14 March 1997, at 5◦ N, using the extended 8th and 9th extended EOFs/PCs. The figure reveals that the MJO is triggered near 25–30◦ E over the African jet region and matures over the Indian Ocean and Bay of Bengal due to the moisture excess there. MJO becomes particularly damp near 150◦ E. Another feature that is

Fig. 7.9 Time series of OLR extended PCs 8 and 9 and their phase portrait. Adapted from Hannachi et al. (2007)

numbers can be obtained by constructing a M × n array A = (aj t ) with entries aj t = t − j + 1. Next, all entries that are non-positive or greater than n − M + 1 are to be equated to zero. Then for each time, t takes all the indices j with positive entries.

6 These

7.5 Extended EOFs or Multivariate SSA

165

Fig. 7.10 Reconstructed OLR PCs 1 to 8 using the extended EOFs 8 and 9. Adapted from Hannachi et al. (2007)

clear from Fig. 7.11 is the dispersive nature of the MJO, with a larger phase speed during the growth phase compared to that during the decay phase. Note that these reconstructions can also be obtained using least squares (see e.g. Vautard et al. 1992, and Ghil et al. 2002). The reconstructed components can also be restricted to any subset of the Eigen elements of the grand data matrix (7.26) or similarly the grand covariance matrix . For example, to reconstruct the time series associated with an oscillatory Eigen element, i.e. a pair of degenerate eigenvalues, the subset K in the sum (7.39) is limited to that pair. The reconstructed multivariate time series yt , t = 1, . . . n, can represent the reconstructed (or filtered) values of the original field at the original p grid points. In general, however, the number of grid points is too large to warrant an eigen decomposition of the grand data or covariance matrix. In this case a dimension reduction of the data is first sought by using say the leading p0 PCs and then apply a MSSA to these retained PCs. In this case the dimension of X becomes (n − M + 1) × Mp0 , which may be made considerably smaller than the original dimension. To get the reconstructed space–time field, one then use the reconstructed PCs in conjunction with the p0 leading EOFs.

166

7 Extended EOFs and SSA

Fig. 7.11 Reconstructed OLR anomaly field using reconstructed EOFs/PCs 1 to 8 shown in Fig. 7.10. Adapted from Hannachi et al. (2007)

Remark The previous sections discuss the fact that extended EOFs are used essentially for identifying propagating patterns, filtering and also data reduction. Another example where extended EOFs can be used is when the data contain some kind of breaks. This includes studies focussing on synoptic and large-scale processes in a particular season, where the usual procedure is to restrict the analysis to the chosen period, for example winter (e.g. DJF) daily data over a number of years. If care is not taken, an artificial oscillatory signal emerges. An example was given in Hannachi et al. (2011), who used geometric moments of the polar vortex derived from ERA-40 reanalyses, namely December–March (DJFM) daily data of the aspect ratio, the centroid latitude and the area of the vortex for the period 1958–2002. Figure 7.12a shows the spectrum of the extended time series in the delay coordinates of the vortex area time series using a window lag M = 400 days. A pair of nearly identical eigenvalues emerges and is well separated from the rest of the spectrum. The associated extended EOFs are shown in Fig. 7.12b, and they show a clear pair of sine waves in quadrature. The associated extended PCs are shown in Fig. 7.12c, revealing again a phase quadrature supported also by the phase plot (Fig. 7.12d). The time series is then filtered by removing the leading few extended EOFs/PCs. The result is shown in Fig. 7.13. Note that in Fig. 7.13 the leading four extended EOFs/PCs were filtered out from the original vortex area time series.


Fig. 7.12 Spectrum of the grand covariance matrix, Eq. (7.3), of the northern winter (DJFM) polar vortex (a) using a window lag M = 400 days, the leading two extended EOFs (b), the extended PC1/PC2 (c) and the phase portrait of the extended PC1/PC2 (d). Adapted from Hannachi et al. (2007)

Fig. 7.13 Raw polar vortex area time series and the reconstructed signal using the leading four extended EOFs/PCs (a) and the filtered time series obtained by removing the reconstructed signal of (a). Adapted from Hannachi et al. (2011). ©American Meteorological Society. Used with permission


7.6 Potential Interpretation Pitfalls

EEOFs are useful tools to detect propagating features, but some care needs to be taken when interpreting the patterns. There are two main difficulties of interpretation related, respectively, to standing waves and to the relationship between the EEOF substructures. The method finds one or more EEOFs where each EEOF is composed of a number of patterns or substructures. These patterns are taken to represent propagating features, and this would imply some sort of coherence between individual features within a given EEOF. The fact that the method attempts to maximise the variance of each EEOF (without considering extra constraints on the correlation, or any other measure of association, between the substructures of a given EEOF) suggests the existence of potential caveats. Chen and Harr (1993) used a two-variable model to show that the partition of the loadings is much more sensitive to the variance ratio than to the correlation between the two variables. This may yield some difficulties in interpretation, particularly when some sort of relationship is expected between the substructures of a given EEOF. Chen and Harr (1993) also constructed a six-variable toy model dataset to show that the interpretation of EEOF patterns can be misleading. By the same token, and like POPs, EEOF interpretation can also be difficult when the data contain a standing wave. The problem arises for particular choices of the delay parameter τ. Monahan et al. (1999) showed that if the dataset contains a standing wave, the EEOFs describing this wave will be degenerate if τ coincides with a zero of the autocovariance function of the wave's time series. The model used by Monahan et al. (1999) takes the simple form:

$$\mathbf{x}_t = \mathbf{a}\,y_t + \boldsymbol{\varepsilon}_t,$$

where $E(\boldsymbol{\varepsilon}_t\boldsymbol{\varepsilon}_{t+\tau}^T) = \eta(\tau)\mathbf{I}$, $\mathbf{a}^T\mathbf{a} = 1$, $E(y_t y_{t+\tau}) = a(\tau)$, and $\boldsymbol{\varepsilon}_t$ and $y_t$ are uncorrelated. For example, if $\mathbf{z}_t = (\mathbf{x}_t^T, \mathbf{x}_{t+\tau}^T)^T$, the corresponding covariance matrix is

$$\boldsymbol{\Sigma}_z = \begin{pmatrix} \boldsymbol{\Sigma}(0) & \boldsymbol{\Sigma}(\tau) \\ \boldsymbol{\Sigma}(\tau) & \boldsymbol{\Sigma}(0) \end{pmatrix},$$

where $\boldsymbol{\Sigma}(\tau) = a(\tau)\mathbf{a}\mathbf{a}^T + \eta(\tau)\mathbf{I}$ is the covariance matrix function of $\mathbf{x}_t$. Denoting $\gamma(\tau) = a(\tau) + \eta(\tau)$, the eigenvalue $\lambda = \gamma(0) + \gamma(\tau)$ is degenerate if $\gamma(\tau) = 0$. Using more lags, the degeneracy condition becomes slightly more complicated. For example, when using two time lags, i.e. $\mathbf{z}_t = (\mathbf{x}_t^T, \mathbf{x}_{t+\tau}^T, \mathbf{x}_{t+2\tau}^T)^T$, one gets two obvious eigenvalues, $\lambda_1 = \gamma(0) + \gamma(\tau) + \gamma(2\tau)$ and $\lambda_2 = \gamma(0) - \gamma(2\tau)$, with respective eigenvectors $(\mathbf{a}^T, \mathbf{a}^T, \mathbf{a}^T)^T$ and $(\mathbf{a}^T, \mathbf{0}^T, -\mathbf{a}^T)^T$. If $\gamma(\tau) = \gamma(2\tau)$, then the second eigenvalue degenerates, and if in addition $\gamma(\tau) = 0$, then the first eigenvalue degenerates. When this happens, the substructures within a single EEOF can be markedly different, similar to the case mentioned above.


Monahan et al. (1999) performed EOF and single-lag EEOF analyses of the monthly averages of the Comprehensive Ocean-Atmosphere Data Set (COADS) SLP from January 1952 to June 1997. The first EOF was identified as an east–west dipole and was interpreted as a standing mode, with its PC representing the ENSO time series. They obtained degeneracy of the leading eigenvalue when the lag τ is chosen near the first zero of the sample autocovariance function of the PC of the standing mode, i.e. at 13 months. The result was a clear degradation, and in some substructures a suppression, of the standing wave signal, leading to a difficulty of interpretation. Therefore, in the particular case of an (unknown) standing wave, it is recommended to try various lags and check the consistency of the EEOF substructures.
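The degeneracy issue is easy to reproduce on synthetic data. The minimal Python sketch below generates a standing wave plus white noise and compares the leading eigenvalues of the lagged (EEOF-type) covariance matrix for two choices of the lag; all parameter values are illustrative and are not taken from Monahan et al. (1999).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, period = 4000, 6, 24            # sample size, dimension, wave period
a = np.ones(p) / np.sqrt(p)           # fixed spatial pattern (standing wave)
y = np.sqrt(2) * np.cos(2 * np.pi * np.arange(n) / period
                        + 2 * np.pi * rng.random())
x = np.outer(y, a) + 0.5 * rng.standard_normal((n, p))

def eeof_eigvals(x, tau):
    """Eigenvalues of the covariance of z_t = (x_t, x_{t+tau})."""
    z = np.hstack([x[:-tau], x[tau:]])
    return np.sort(np.linalg.eigvalsh(np.cov(z.T)))[::-1]

# tau = period/4 sits at a zero of the wave's autocovariance -> near-degenerate
# leading pair; tau = period/2 concentrates the wave in a single eigenvalue.
print(eeof_eigvals(x, period // 4)[:3])
print(eeof_eigvals(x, period // 2)[:3])
```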

7.7 Alternatives to SSA and EEOFs

7.7.1 Recurrence Networks

Recurrence networks are defined in a similar way to climate networks, discussed in Sect. 3.9 of Chap. 3. Recurrence in phase space was explored by Reik Donner and colleagues in a different direction (Marwan et al. 2009; Donner et al. 2010). They viewed and interpreted the recurrence matrix as the adjacency matrix of a complex network. The adjacency (binary) matrix is defined, for a specific threshold distance ε, as $D_{ij} = \mathbf{1}_{\{\varepsilon - \|\mathbf{x}_{t_i} - \mathbf{x}_{t_j}\| \ge 0\}} - \delta_{ij}$, where $\mathbf{x}_{t_i}$ and $\mathbf{x}_{t_j}$ are M-dimensional time delay embeddings of the time series $x_t$ (Eq. 7.1). Donner et al. (2010) showed, in particular, that the recurrence network can provide quantitative information on the statistical properties of the dynamical system trajectory in phase space. Marwan et al. (2009) applied the approach to marine climate proxy records over Africa during the last 4.5 Ma and identified various regime transitions.
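As a simple illustration, the Python sketch below builds the recurrence (adjacency) matrix from a time-delay embedding of a scalar series and computes the node degrees; the threshold ε, the window M and the test series are arbitrary choices, not values used in the studies cited above.

```python
import numpy as np

def recurrence_network(x, M, eps):
    """Adjacency matrix D_ij = 1{eps - ||x_ti - x_tj|| >= 0} - delta_ij
    from an M-dimensional time-delay embedding of x."""
    n = len(x) - M + 1
    emb = np.column_stack([x[j:j + n] for j in range(M)])     # delay vectors
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    D = (dist <= eps).astype(int) - np.eye(n, dtype=int)      # drop self-loops
    return D

x = np.sin(0.2 * np.arange(300)) + 0.1 * np.random.randn(300)
D = recurrence_network(x, M=5, eps=0.5)
degree = D.sum(axis=1)      # a simple network measure: node degree
```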

7.7.2 Data-Adaptive Harmonic Decomposition

Another EEOF-related method, the data-adaptive harmonic decomposition (DAHD), was proposed by Chekroun and Kondrashov (2017), and also by Kondrashov et al. (2018a,b), to deal with univariate and multivariate time series. The DAHD seeks to decompose the data in terms of elementary modes. In its simplest form, the method constructs a convolutional linear (integral) operator with the time series autocorrelation function as kernel. For multivariate time series, the kernel is replaced by the lagged autocorrelation matrix. The DAH modes are then given by the eigenelements of the integral operator. For a d-dimensional time series $\mathbf{x}_1, \ldots, \mathbf{x}_n$ ($\mathbf{x}_t = (x_{t,1}, \ldots, x_{t,d})$, t = 1, ..., n), with (lagged) window length M, the numerical approximation of the operator yields a grand block, symmetric $d(2M-1) \times d(2M-1)$ Hankel matrix $\mathbf{H} = (\mathbf{H}^{(p,q)})$, p, q = 1, ..., d.
Each $\mathbf{H}^{(p,q)}$ is a $(2M-1) \times (2M-1)$ Hankel matrix. It is defined using the $(2M-1) \times (2M-1)$ cyclic permutation matrix $\mathbf{P} = (p_{ij})$, $(i, j = 1, \ldots, 2M-1)$, given by

$$p_{ij} = \delta_{i+1,j} + \delta_{i,2M-1}\,\delta_{1,j}, \qquad (7.41)$$

where $\delta_{ij}$ is the Kronecker symbol. Precisely, if $\mathbf{c}$ is the vector containing the lagged correlations between the pth and qth variables, i.e. $\mathbf{c} = (\rho^{(p,q)}_{-M+1}, \rho^{(p,q)}_{-M+2}, \ldots, \rho^{(p,q)}_{M-1})^T$, where $\rho^{(p,q)}_m = \mathrm{corr}(x_{t,p}, x_{t+m,q})$, $m = -M+1, \ldots, M-1$, then

$$\mathbf{H}^{(p,q)} = [\mathbf{c}, \mathbf{P}\mathbf{c}, \ldots, \mathbf{P}^{2M-2}\mathbf{c}]. \qquad (7.42)$$

The DAH modes and associated eigenvalues are then given by the eigenelements of the grand (symmetric) Hankel matrix $\mathbf{H}$. In a similar fashion to EEOFs, the eigenvalues of $\mathbf{H}$ come in pairs of equal magnitude but opposite sign, and the associated mode coefficients (time series) are shifted by a quarter of a period. The obtained modes, and their coefficients, are used, in a similar way to EEOFs, to identify oscillating features and to reconstruct the data. Kondrashov et al. (2018a) used DAH to analyse Arctic sea ice and predict September sea ice extent, whereas Kondrashov et al. (2018b) used it to analyse wind-driven ocean gyres. The authors argue, in particular, that the model has some predictive skill.
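A minimal Python sketch of the construction in Eqs. (7.41) and (7.42) is given below for a single pair of variables; the window length and the test data are arbitrary, and the final eigendecomposition (shown here for d = 1, where the grand matrix reduces to a single block) is only indicative of the procedure.

```python
import numpy as np

def cyclic_permutation(M):
    """P with p_ij = delta_{i+1,j} + delta_{i,2M-1} delta_{1,j}, Eq. (7.41)."""
    L = 2 * M - 1
    P = np.zeros((L, L))
    P[np.arange(L - 1), np.arange(1, L)] = 1.0   # super-diagonal
    P[L - 1, 0] = 1.0                            # cyclic wrap-around
    return P

def lagged_corr(xp, xq, m):
    """Sample estimate of corr(x_{t,p}, x_{t+m,q})."""
    if m >= 0:
        a, b = xp[:len(xp) - m], xq[m:]
    else:
        a, b = xp[-m:], xq[:len(xq) + m]
    return np.corrcoef(a, b)[0, 1]

def hankel_block(xp, xq, M):
    """H^(p,q) = [c, Pc, ..., P^(2M-2) c], Eq. (7.42)."""
    c = np.array([lagged_corr(xp, xq, m) for m in range(-M + 1, M)])
    P = cyclic_permutation(M)
    cols, v = [], c.copy()
    for _ in range(2 * M - 1):
        cols.append(v)
        v = P @ v
    return np.column_stack(cols)

x = np.random.randn(1000)
H = hankel_block(x, x, M=20)          # d = 1: the grand matrix is H itself
eigvals, modes = np.linalg.eigh(H)    # eigenvalues come in (+/-) pairs
```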

Chapter 8

Persistent, Predictive and Interpolated Patterns

Abstract Previous chapters discuss methods with no particular predictive power. This chapter describes further advanced linear methods that are more related to prediction. Three related methods, involving extrapolation (or prediction) and interpolation, with climate applications are discussed in this chapter. Keywords Power spectrum · Decorrelation time · Kolmogorov formula · Optimal persistence · Average predictability · Predictive patterns · Interpolated patterns · Southern oscillation · Forecastable component

8.1 Introduction Time-varying atmospheric fields are characterised, beside their high dimensionality and complexity, by high spatial and significant temporal correlations. We have encountered in the previous chapters various methods that take into consideration either or both of the last two characteristics, i.e. spatial and temporal correlations. EOFs, for example, take into account the spatial correlation, whereas EEOFs, or MSSA, take into account spatial as well as temporal correlations. The temporal correlation is examined, in principle, using auto- and crosscorrelations in time between different variables and involves the concept of persistence or serial correlation. Persistence, for example, is a local property and reflects the fact that atmospheric fields do not change substantially from 1 day to the next. Given the nature of the atmosphere,1 this local property propagates with time, and we end up with fields that persist for much longer than the local time scale of persistence. Sometimes, we talk about the system’s memory, or decorrelation time, given by the time lag beyond which atmospheric variables become uncorrelated. Persistence is a very useful concept in forecasting and can therefore be used in a similar fashion to EOFs.

¹ As a moving fluid mass.


Early attempts along this direction were explored by Renwick and Wallace (1995) to determine patterns that maximise the correlation between forecast and analysis. This is an example of a procedure that attempts to determine persistent patterns. Persistent patterns are very useful in prediction but are not a substitute for predictable patterns. In fact, prediction involves in general (statistical and/or dynamical) models, whereas persistence requires no modelling. This chapter reviews techniques that find the most persistent and the most predictable patterns, respectively. The former attempt to maximise the decorrelation time, whereas the latter attempt to minimise the forecast error. We can also increase forecastability by reducing uncertainties of time series. A further technique based on smoothing is also presented, which attempts to find the smoothest patterns.

8.2 Background on Persistence and Prediction of Stationary Time Series

8.2.1 Decorrelation Time

Here we consider the case of a univariate stationary time series (see Appendix C) $x_t$, t = 1, 2, ..., with zero mean and variance $\sigma^2$. The autocorrelation function $\rho(\tau)$ is defined by $\rho(\tau) = \sigma^{-2}E(x_t x_{t+\tau})$, where $x_t$, t = 1, 2, ..., are considered as random variables. Suppose now that the centred time series is a realisation of these random variables, i.e. an observed time series of infinite sample; then an alternative definition of the autocorrelation is

$$\rho(\tau) = \frac{1}{\sigma^2}\lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}x_t x_{t+\tau}, \qquad (8.1)$$

where $\sigma^2 = \lim_{n\to\infty}\frac{1}{n}\sum_{t=1}^{n}x_t^2$ is the variance of this infinite sequence. The connection between the two definitions is ensured via what is known as the ergodic theorem. Note that the first definition involves a probabilistic framework, whereas the second is merely functional. Once the autocorrelation function is determined, the power spectrum is defined in the usual way (see Appendix C) as

$$f(\omega) = \sum_{\tau=-\infty}^{\infty}\rho(\tau)\,e^{2\pi i\tau\omega} = 1 + 2\sum_{\tau=1}^{\infty}\rho(\tau)\cos 2\pi\omega\tau. \qquad (8.2)$$

The autocorrelation function of a stationary time series goes to zero at large lags. The decorrelation time is defined theoretically as the smallest lag τ0 beyond which the time series becomes decorrelated. This definition may be, however, meaningless
since in general the autocorrelation does not have a compact support,² and therefore the decorrelation time is in general infinite. Alternatively, the decorrelation time can be defined using the integral of the autocorrelation function when it exists:

$$T = \int_{-\infty}^{\infty}\rho(\tau)\,d\tau = 2\int_{0}^{\infty}\rho(\tau)\,d\tau \qquad (8.3)$$

or similarly, for the discrete case,

$$T = 1 + 2\sum_{\tau=1}^{\infty}\rho(\tau). \qquad (8.4)$$

It is clear from Eq. (8.2) that

$$T = f(0). \qquad (8.5)$$

The integral T in Eqs. (8.3) and (8.4) represents in fact a characteristic time between effectively independent sample values (Leith 1973) and can therefore be used as a measure of the decorrelation time. This can be seen, for example, when one deals with an AR(1) or Markovian time series. The autocorrelation function of the red noise is $\rho(\tau) = \exp(-|\tau|/\tau_0)$, which yields $T = 2\tau_0$. Therefore, in this case the integral T plays the role of twice the e-folding time of $\rho(\tau)$, as presented in Sect. 3.4.2 of Chap. 3. Some climatic time series are known, however, not to have a finite decorrelation time. These are known to have long memory, and the corresponding time series are also known as long-memory time series. By contrast, short-memory time series have autocorrelation functions that decay exponentially with increasing lag, i.e.

$$\lim_{\tau\to\infty}\tau^k\rho(\tau) = 0 \qquad (8.6)$$

for every integer k ≥ 0. Long-memory time series have autocorrelations that decay hyperbolically, i.e.

$$\rho(\tau) \sim a\tau^{-\alpha} \qquad (8.7)$$

for large lag τ and for some 0 < α < 1, and hence T = ∞. Their power spectrum also behaves in a similar way as

$$f(\omega) \sim b\omega^{\alpha-1} \qquad (8.8)$$

as ω → 0. In Eqs. (8.7) and (8.8) a and b are constants (e.g. Brockwell and Davis 2002, chap. 10).

² A function with compact support is a function that is identically zero outside a bounded subset of the real axis.
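Returning to the decorrelation time, the truncated sum of Eq. (8.4) is straightforward to evaluate in practice. The short Python sketch below does so for a simulated AR(1) series, for which the theoretical value is T = (1 + φ)/(1 − φ); the truncation lag and the AR(1) coefficient are arbitrary choices for the illustration.

```python
import numpy as np

def decorrelation_time(x, max_lag):
    """T = 1 + 2 * sum_{tau=1}^{max_lag} rho(tau), Eq. (8.4), truncated at max_lag."""
    x = x - x.mean()
    var = np.mean(x * x)
    rho = np.array([np.mean(x[:-tau] * x[tau:]) / var
                    for tau in range(1, max_lag + 1)])
    return 1.0 + 2.0 * rho.sum()

# AR(1) example: x_t = phi * x_{t-1} + eps_t, theoretical T = (1 + phi)/(1 - phi)
phi, n = 0.7, 20000
eps = np.random.randn(n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

print(decorrelation_time(x, max_lag=100), (1 + phi) / (1 - phi))
```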

8.2.2 The Prediction Problem and Kolmogorov Formula

Consider again a stationary zero-mean time series $x_t$, t = 1, 2, ..., with variance $\sigma^2$ and autocorrelation function $\rho(\cdot)$. A familiar problem in time series analysis is the prediction of $x_{t+\tau}$ from previous values, i.e. from $x_s$, s ≤ t. The τ-step ahead prediction $\hat{x}_{t+\tau}$ of $x_{t+\tau}$ in the mean square sense is given by the conditional expectation:

$$\hat{x}_{t+\tau} = E\left[x_{t+\tau}\,|\,x_s, s \le t\right] \qquad (8.9)$$

and minimises the mean square error $E\left(x_{t+\tau} - \hat{x}_{t+\tau}\right)^2$. It is expressed as a linear combination of previous values as

$$\hat{x}_{t+\tau} = \sum_{k\ge 0}\alpha_k x_{t-k} = h(B)x_t, \qquad (8.10)$$

where $h(z) = \sum_{k\ge 0}\alpha_k z^k$, and B is the backward shift operator, i.e. $Bz_t = z_{t-1}$. The prediction error variance is given by

$$\sigma_1^2 = \min E\left(x_{t+\tau} - \hat{x}_{t+\tau}\right)^2. \qquad (8.11)$$

The prediction theory based on the entire past of the time series using least square estimation was developed in the early 1940s by Kolmogorov and Wiener and is known as the Kolmogorov–Wiener approach. Equation (8.10) represents a linear filter with response function $H(\omega) = h(e^{i\omega})$, see Appendix C. Now the prediction error $\varepsilon_\tau = x_{t+\tau} - \hat{x}_{t+\tau}$ has $(1 - H(\omega))$ as response function. Hence the prediction error variance becomes

$$\sigma_{\varepsilon_\tau}^2 = \int_{-\pi}^{\pi}|1 - H(\omega)|^2 f(\omega)\,d\omega, \qquad (8.12)$$

where $f(\omega)$ is the power spectrum of $x_t$, t = 1, 2, ..., at frequency ω. Equation (8.12) requires an explicit knowledge of the filter (8.10) satisfying (8.11). Note, however, that if we use a finite past to predict $x_{t+1}$, e.g.

$$\hat{x}_{t+1} = a_1 x_t + a_2 x_{t-1} + \ldots + a_p x_{t-p+1} + \varepsilon_{t+1}, \qquad (8.13)$$


then the coefficients $a_1, a_2, \ldots, a_p$ can be obtained explicitly using the Yule–Walker equations, and therefore $\sigma_\varepsilon^2$ in (8.12) becomes explicit and is a function of the lagged covariance matrix of $x_t$, t = 1, ..., n. The one-step ahead prediction error variance can be computed, in fact, without the need to know the optimal filter. It is given by an elegant formula, the Kolmogorov formula, which is based solely on the knowledge of the power spectrum $f(\omega)$ of the time series. The Kolmogorov formula reads (Kolmogorov 1939, 1941)

$$\sigma_1^2 = \exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\log\left(2\pi f(\omega)\right)d\omega\right), \qquad (8.14)$$

which can also be written as $2\pi\exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\log f(\omega)\,d\omega\right)$. The logarithm used here is the natural logarithm with base e. Note that the expression given by Eq. (8.14) is always finite for a stationary process.

Exercise Show that for a stationary time series, $0 \le \exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\log f(\omega)\,d\omega\right) < \infty$.

Hint Use the fact that $\log x \le x$ along with the identity $\sigma^2 = \int_{-\pi}^{\pi}f(\omega)\,d\omega$, yielding $-\infty \le \int_{-\pi}^{\pi}\log f(\omega)\,d\omega \le \sigma^2$.

Note that $\int_{-\pi}^{\pi}\log f(\omega)\,d\omega$ can be identical to −∞, in which case $\sigma_1 = 0$, and the time series $x_t$, t = 1, 2, ..., is known as singular or deterministic; see e.g. Priestley (1981) or Brockwell and Davis (1991, 2002). The Kolmogorov formula can be extended to the multivariate case, see e.g. Whittle (1953a,b) and Hannan (1970). Given a nonsingular spectral density matrix $\mathbf{F}(\omega)$, i.e. $|\mathbf{F}(\omega)| \ne 0$, the one-step ahead prediction error is given by (Hannan 1970, theorem 3'' p. 158, and theorem 3''' p. 162)

$$\sigma_1^2 = \exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\mathrm{tr}\left(\log 2\pi\mathbf{F}(\omega)\right)d\omega\right) = \exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\log|2\pi\mathbf{F}(\omega)|\,d\omega\right). \qquad (8.15)$$

Recall that in (8.15) $|2\pi\mathbf{F}(\omega)| = (2\pi)^p|\mathbf{F}(\omega)|$, where p is the dimension of the multivariate time series. Furthermore, if $\hat{\mathbf{x}}_{t+\tau}$ is the minimum square error τ-step ahead prediction of $\mathbf{x}_t$ and $\boldsymbol{\Sigma}_{\varepsilon_\tau}$ the error covariance matrix³ of $\boldsymbol{\varepsilon}_\tau = \mathbf{x}_{t+\tau} - \hat{\mathbf{x}}_{t+\tau}$, then $\sigma_1^2 = \det(\boldsymbol{\Sigma}_{\varepsilon_1})$.

³ The error covariance matrix $\boldsymbol{\Sigma}_{\varepsilon_\tau}$ can also be expressed explicitly using an expansion in power series of the spectral density matrix. In particular, if $\mathbf{F}(\omega)$ is factorised as $\mathbf{F}(\omega) = \frac{1}{2\pi}\boldsymbol{\Phi}_0^{*}(e^{i\omega})\boldsymbol{\Phi}_0(e^{i\omega})$, then $\boldsymbol{\Sigma}_{\varepsilon_1}$ takes the form $\boldsymbol{\Sigma}_{\varepsilon_1} = \boldsymbol{\Phi}_0^{*}(0)\boldsymbol{\Phi}_0(0)$; see Hannan (1970, theorem 1, p. 129).


Table 8.1 Comparison between time domain and spectral domain characteristics of multivariate stationary time series

  Time domain:
    $\tau = 0$:   $\boldsymbol{\Sigma} = \int\mathbf{F}(\omega)\,d\omega$
    $\tau \ne 0$: $\boldsymbol{\Sigma}_\tau = \frac{1}{2\pi}\int_{-\pi}^{\pi}\mathbf{F}(\omega)\,e^{-2i\pi\omega\tau}\,d\omega$
    $[\boldsymbol{\Sigma}_0]_{ii} = \sigma_i^2$ is the ith variance
  Spectral domain:
    $\omega = 0$:   $\mathbf{F}(0) = \sum_\tau\boldsymbol{\Sigma}_\tau$
    $\omega \ne 0$: $\mathbf{F}(\omega) = \sum_\tau\boldsymbol{\Sigma}_\tau\,e^{2\pi i\omega\tau}$
    $[\mathbf{F}(0)]_{ii} = T_i$ is the ith decorrelation time

Remark Stationary time series can be equally analysed in physical space or spectral space, yielding the same results. There is a duality between these two spaces. Table 8.1 shows a comparison between time and spectral domains. It shows that spectral domain and time domain analyses of stationary time series are mirrors of each other, linked by the Fourier transform. It is also clear from the table that the image of EOFs is persistent patterns, whereas the image of frequency domain EOFs is POPs.
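To make Eq. (8.14) concrete, the short Python sketch below evaluates the Kolmogorov formula numerically for an AR(1) process, whose one-step prediction error variance is known to equal the innovation variance; the spectral density expression and the parameter values used are standard textbook choices, not taken from this chapter.

```python
import numpy as np

phi, var_eps = 0.6, 1.5
omega = np.linspace(-np.pi, np.pi, 20001)

# Spectral density of AR(1): f(w) = var_eps / (2*pi*|1 - phi*exp(-i w)|^2)
f = var_eps / (2 * np.pi * np.abs(1 - phi * np.exp(-1j * omega)) ** 2)

# Kolmogorov formula, Eq. (8.14): one-step prediction error variance
sigma1_sq = np.exp(np.trapz(np.log(2 * np.pi * f), omega) / (2 * np.pi))

print(sigma1_sq, var_eps)   # the two values should agree closely
```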

8.3 Optimal Persistence and Average Predictability

8.3.1 Derivation of Optimally Persistent Patterns

The objective here is similar to that of EOFs where, instead of looking for patterns that maximise the observed variance of the space–time field, one seeks patterns that persist most. The method has been introduced and applied by DelSole (2001). Formally speaking, the spatial patterns themselves, like EOFs, are stationary. It is the time component of the patterns that is most persistent, i.e. has the largest decorrelation time. Given a space–time field $\mathbf{x}_t$, t = 1, 2, ..., the objective is to find a pattern $\mathbf{u}$, the optimally persistent pattern (OPP), whose time series coefficients display the maximum decorrelation time $T = 2\int_0^\infty\rho(\tau)\,d\tau$. Precisely, given a p-dimensional time series $\mathbf{x}_t$, t = 1, 2, ..., we let $y_t$, t = 1, 2, ..., n, be the univariate time series obtained by projecting the field onto the pattern $\mathbf{u}$, i.e. $y_t = \mathbf{u}^T\mathbf{x}_t$. The autocovariance function of this time series is given by

$$\gamma(\tau) = E(y_{t+\tau}y_t) = \mathbf{u}^T\boldsymbol{\Sigma}_\tau\mathbf{u}, \qquad (8.16)$$

where $\boldsymbol{\Sigma}_\tau$ is the autocovariance matrix function of $\mathbf{x}_t$. Hence the autocorrelation function of the time series is

$$\rho(\tau) = \frac{\mathbf{u}^T\boldsymbol{\Sigma}_\tau\mathbf{u}}{\mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u}}, \qquad (8.17)$$

where $\boldsymbol{\Sigma}$ is the covariance matrix of the time series $\mathbf{x}_t$, t = 1, 2, .... Using the identity $\boldsymbol{\Sigma}_{-\tau} = \boldsymbol{\Sigma}_\tau^T$, the decorrelation time of $(y_t)$ reduces to

$$T = \frac{\mathbf{u}^T\left[\int_0^\infty\left(\boldsymbol{\Sigma}_\tau + \boldsymbol{\Sigma}_\tau^T\right)d\tau\right]\mathbf{u}}{\mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u}}. \qquad (8.18)$$

The maximum of T in (8.18) is given by the generalised eigenvalue problem:

$$\boldsymbol{\Sigma}^{-1}\mathbf{M}\mathbf{u} = \lambda\mathbf{u}, \qquad (8.19)$$

where $\mathbf{M} = \int_0^\infty\left(\boldsymbol{\Sigma}_\tau + \boldsymbol{\Sigma}_\tau^T\right)d\tau$. Note that Eq. (8.19) can also be transformed to yield a simple eigenvalue problem of a symmetric matrix as

$$\tilde{\mathbf{M}}\mathbf{v} = \mathbf{C}^{-T}\mathbf{M}\mathbf{C}^{-1}\mathbf{v} = \lambda\mathbf{v}, \qquad (8.20)$$

where $\mathbf{C} = \boldsymbol{\Sigma}^{1/2}$ is a square root of the covariance matrix $\boldsymbol{\Sigma}$. The optimal patterns are then given by

$$\mathbf{u}_k = \boldsymbol{\Sigma}^{-1/2}\mathbf{v}_k, \qquad (8.21)$$

where $\mathbf{v}_k$, k = 1, ..., p, are the eigenvectors of the symmetric matrix $\tilde{\mathbf{M}}$ in Eq. (8.20). This matrix is easily obtained via an SVD decomposition of $\boldsymbol{\Sigma}$, and hence the eigenvalue problem, Eq. (8.20), becomes easier to solve than the generalised eigenvalue problem in Eq. (8.19). Note that, by analogy with $\boldsymbol{\Sigma}$, it is appropriate to refer to the matrix $\mathbf{M} = \mathbf{F}(0)$ as the co-decorrelation time matrix. Finally, it is clear from Eq. (8.18) that when the process is not long memory, the decorrelation time is proportional to the ratio of the power at zero frequency (i.e. very low frequency) to the total integrated power, i.e. variance. Equation (8.20) can be written in terms of the power spectral matrix $\mathbf{F}(\omega)$, and the equivalent generalised eigenvalue problem is

$$\mathbf{F}(0)\mathbf{u} = \lambda\boldsymbol{\Sigma}\mathbf{u}. \qquad (8.22)$$

So the eigenvalues maximise the Rayleigh quotient, and the eigenvectors maximise the ratio of the zero-frequency power to the total power.

Exercise Show that $\mathbf{M}$ is positive semi-definite, and consequently $\tilde{\mathbf{M}}$ is also symmetric and positive semi-definite.

Hint Use the autocovariance function $\gamma(\cdot)$ of $y_t$.

Remark If $\mathbf{F}(\omega)$ is the spectral density matrix of $\mathbf{x}_t$, then $\mathbf{M} = \mathbf{F}(0)$ and is symmetric positive semi-definite. Let $\mathbf{z}_t$, t = 1, 2, ..., n, be a stationary multivariate time series with $\mathbf{M}$ as covariance matrix; then the generalised eigenvalue problem, Eq. (8.19), represents the solution to a filtering problem based on signal-to-noise maximisation, in which the input is $\mathbf{x}_t$ and the output is $\mathbf{x}_t + \mathbf{z}_t$.

The eigenvalue problem, Eq. (8.20), produces a set of optimal patterns that can be ordered naturally according to the magnitude of the (non-negative) eigenvalues of $\mathbf{M}$. The patterns $\mathbf{u}_k$, k = 1, ..., p, are not orthogonal, but $\mathbf{v}_k$, k = 1, ..., p, are.
That is, the optimally persistent patterns $\mathbf{u}_k$, k = 1, ..., p, are orthogonal with respect to the inner product $\langle\mathbf{a}, \mathbf{b}\rangle = \mathbf{a}^T\boldsymbol{\Sigma}\mathbf{b}$. The existence of the two sets of patterns is a classical result from generalised eigenvalue problems, where $\mathbf{u}_k$ and $\mathbf{v}_k$, k = 1, ..., p, are known, respectively, as the signal patterns and the filter patterns. The time series $\mathbf{x}_t$, t = 1, 2, ..., n, can be decomposed using the orthogonal basis $\mathbf{v}_k$, k = 1, ..., p, to yield

$$\mathbf{x}_t = \sum_{k=1}^{p}\alpha_k(t)\mathbf{v}_k = \sum_{k=1}^{p}\left(\mathbf{x}_t^T\mathbf{v}_k\right)\mathbf{v}_k. \qquad (8.23)$$

In this case the time series coefficients are not uncorrelated. Alternatively, it is possible to compromise the orthogonality of the filter patterns and obtain instead uncorrelated time coefficients, in a similar way to REOFs. This is achieved using again the orthogonality of the filter patterns, or in other words the biorthogonality of the optimally persistent patterns $\mathbf{u}_k$ and the associated filters $\mathbf{w}_k = \boldsymbol{\Sigma}\mathbf{u}_k$, i.e. $\mathbf{w}_k^T\mathbf{u}_l = \delta_{kl}$ for k, l = 1, ..., p. This property can be used to obtain the alternative expansion:

$$\mathbf{x}_t = \sum_{k=1}^{p}\beta_k(t)\mathbf{w}_k, \qquad (8.24)$$

where now

$$\beta_k(t) = \mathbf{x}_t^T\mathbf{u}_k. \qquad (8.25)$$

Note that the patterns (or filters) $\mathbf{w}_k$, k = 1, ..., p, are not orthogonal, but the new time series coefficients $\beta_k(t)$, k = 1, ..., p, are uncorrelated. In fact, we have

$$E\left[\beta_k(t)\beta_l(t)\right] = \mathbf{u}_k^T\boldsymbol{\Sigma}\mathbf{u}_l = \delta_{kl}. \qquad (8.26)$$

The sampling errors associated with the decorrelation times T = λ, resulting from a finite sample of data, can be calculated in a manner similar to that of EOFs (DelSole 2001; Girshik 1939; Lawley 1956; Anderson 1963; North et al. 1982). In some time series with oscillatory autocorrelation functions, such as those produced by an AR(2) model, the correlation integral T in (8.3) can tend to zero as the theoretical decorrelation time goes to infinity.⁴ DelSole (2001) proposes using the integral of the square of the autocorrelation function, i.e.

$$T2 = \int_0^\infty\rho^2(\tau)\,d\tau. \qquad (8.27)$$

⁴ The example of an AR(2) model where $\rho(\tau) = e^{-|\tau|/\tau_0}\cos\omega_0\tau$ and $T_1 = \int_0^\infty\rho(\tau)\,d\tau = \tau_0\left(1 + \omega_0^2\tau_0^2\right)^{-1}$ was given in DelSole (2001).

In this case, the maximisation problem cannot be solved analytically as in Eq. (8.18), and the solution has to be found numerically. Note also that the square of the autocorrelation function does not solve the problem related to the infinite decorrelation time for a long-memory time series. A comparison between the performance of T-optimals, Eq. (8.3), and T2-optimals, Eq. (8.27), applied to the Lorenz (1963) model by DelSole (2001) shows that the T2-optimal remains correlated well beyond 10 time units, compared to that obtained using EOFs or Eq. (8.3). The latter patterns become uncorrelated after three time units, as shown in Fig. 8.1. This may have implications for forecasting models. For example, the optimal linear prediction of the Lorenz (1963) system (Penland 1989) had skill of about 12 time units, which makes the T2-optimal a potential candidate for statistical forecast models. The T2-optimal mode (Fig. 8.1), however, cannot be well modelled by a first-order Markov model. It can be better modelled by a second-order Markov model or AR(2), as suggested by DelSole (2001). Furthermore, the T2-optimal mode also cannot be produced by the POP model.

8.3.2 Estimation from Finite Samples

In practice we deal with discrete time series, and the matrix $\mathbf{M}$ is normally written as a sum $\boldsymbol{\Sigma} + \boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_1^T + \boldsymbol{\Sigma}_2 + \boldsymbol{\Sigma}_2^T + \ldots$, which is normally limited to some lag $\tau_0$ beyond which the autocorrelations of the (individual) variables become nonsignificant, i.e.

[Fig. 8.1 panels: First EOF (62%), T-optimal (T1 = 0.5) and T2-optimal (T2 = 1.9); autocorrelation ρ(τ) versus lag τ]

Fig. 8.1 Autocorrelation function of the leading PC (left), the leading time series of the T-optimal (middle) and that of the T2 -optimal (right) of the Lorenz model. Adapted from DelSole (2001). ©American Meteorological Society. Used with permission


$$\mathbf{M} = \boldsymbol{\Sigma} + \boldsymbol{\Sigma}_1 + \boldsymbol{\Sigma}_1^T + \cdots + \boldsymbol{\Sigma}_{\tau_0} + \boldsymbol{\Sigma}_{\tau_0}^T. \qquad (8.28)$$

Remark Note that, in general, $\tau_0$ need not be large, and in some cases it can be limited to the first few lags. For example, for daily geopotential heights $\tau_0$ is around a week. For monthly sea level pressure, $\tau_0 \approx 9{-}12$ months. For sea surface temperature, one expects a larger value of $\tau_0$. Again, here we suppose that the data are short memory. There are of course exceptions with variables that may display a signature of long memory, such as wind speed or perhaps surface temperature. In those cases the lag $\tau_0$ will be significantly large. Some time series may have long-term trends or periodic signals, in which case it is appropriate to filter out those signals prior to the analysis.

Persistent patterns from atmospheric fields, e.g. reanalyses, using T or T2 measures may not be too different, particularly for the prominent modes. In fact, DelSole (2001) finds that the leading few optimal T- or T2-persistent patterns are similar for allowable choices of EOF truncation (about 30 EOFs), but with possible differing ordering between the two methods. For example, the leading T2-optimal of daily NCEP 500-mb heights for the period 1950–1999 is shown in Fig. 8.2. This pattern is also the second T-optimal pattern. The pattern bears resemblance to the Arctic Oscillation (AO; Thompson and Wallace 1998). The trend signature in the time series is not as strong as in the AO pattern as the T2-optimal is based on all days, not only winter days. Note also the 12–15-day decorrelation time from the autocorrelation function of the time series (Fig. 8.2). An interesting result of OPP is that it can also identify other low-frequency signatures such as trends and discontinuities. The trailing OPP patterns, on the other hand, are found to be associated with synoptic eddies along the storm track (Fig. 8.3).

In reality, the above truncation in Eq. (8.28) can run into problems when we compute the optimal decorrelation time. In fact, a naive truncation of Eq. (8.18), giving equal weights to the different lagged covariances, can yield a negative decorrelation time, as the lagged covariance matrix is not reliable due to the small sample size available when the lag τ is large. Precisely, to obtain the finite sample version of Eq. (8.22), a smoothing of the spectrum is required as the periodogram is not a consistent estimator of the power spectrum. When we have a finite sample $\mathbf{x}_t$, t = 1, ..., n, we use the sample lagged covariance matrix

$$\mathbf{C}_\tau = \begin{cases} \frac{1}{n}\sum_{t=1}^{n-\tau}\mathbf{x}_{t+\tau}\mathbf{x}_t^T & 0 \le \tau < n \\[4pt] \frac{1}{n}\sum_{t=1}^{n+\tau}\mathbf{x}_t\mathbf{x}_{t-\tau}^T & -n < \tau < 0 \end{cases} \qquad (8.29)$$

In this case, Eq. (8.22) takes the form

$$\hat{\mathbf{F}}(0)\mathbf{u} = \lambda\mathbf{C}_0\mathbf{u}, \qquad (8.30)$$


Fig. 8.2 The leading filter and T2-optimal persistent patterns (top), the associated time series (middle) and its autocorrelation function (bottom) of the daily 500-hPa geopotential height anomalies for the period 1950–1999. The analysis is based on the leading 26 EOFs/PCs. Adapted from DelSole (2001). ©American Meteorological Society. Used with permission

where the sample power spectrum is given by

$$\hat{\mathbf{F}}(\omega) = \sum_{\tau=-M}^{M}\alpha_\tau\mathbf{C}_\tau e^{-i\omega_k\tau} \qquad (8.31)$$


Fig. 8.3 Same as Fig. 8.2 but for the trailing filter and T2-optimal persistent pattern. Adapted from DelSole (2001). ©American Meteorological Society. Used with permission

with $\omega_k = 2\pi k/n$, and $\alpha_\tau$ is a lag window at lag τ. DelSole (2006) considered $M = \sqrt{n}$ (Chatfield 1989) and used a Parzen window,

$$\alpha_\tau = \begin{cases} 1 - 6\left(\frac{\tau}{M}\right)^2 + 6\left(\frac{\tau}{M}\right)^3 & \text{if } 0 \le \tau \le M/2 \\[4pt] 2\left(1 - \frac{\tau}{M}\right)^3 & \text{if } M/2 \le \tau \le M, \end{cases} \qquad (8.32)$$

because it cannot give negative power spectra, in contrast, for example, to the Tukey window. Note that here again, through an inverse Fourier transform of Eq. (8.31), the finite sample eigenvalues maximise a Rayleigh quotient similar to that derived from Eq. (8.22). DelSole (2006) applied OPP analysis to surface temperature using the observed HadCRUT2 dataset, compiled jointly by the Climatic Research Unit and the Met Office's Hadley Centre, and 17 IPCC AR4 (Intergovernmental Panel on Climate Change 4th Assessment Report) climate models. He found, in particular, that the leading two observed OPPs are statistically distinguishable from noise and can explain most changes in surface temperature. On the other hand, most model simulations produce one single physically significant OPP.

Remark A similar method based on the lag-1 autocorrelation ρ(1) was proposed by Wiskott and Sejnowski (2002). They labelled it slow feature analysis (SFA), and it is discussed more in Sect. 8.6.
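A compact Python sketch of the finite-sample OPP estimation, Eqs. (8.29)–(8.32), is given below: lagged covariances are weighted by a Parzen window to form F̂(0), and the generalised eigenproblem F̂(0)u = λC₀u is solved. The placeholder data, window length and use of scipy are illustrative choices only.

```python
import numpy as np
from scipy.linalg import eigh

def lagged_cov(X, tau):
    """C_tau = (1/n) sum_t x_{t+tau} x_t^T, Eq. (8.29), for tau >= 0."""
    n = X.shape[0]
    return X[tau:].T @ X[:n - tau] / n

def opp(X, M):
    """Optimally persistent patterns: solve F_hat(0) u = lambda C_0 u, Eq. (8.30)."""
    X = X - X.mean(axis=0)
    C0 = lagged_cov(X, 0)

    def parzen(tau):
        # Parzen lag window, Eq. (8.32)
        r = tau / M
        return 1 - 6 * r**2 + 6 * r**3 if tau <= M / 2 else 2 * (1 - r)**3

    F0 = C0.copy()
    for tau in range(1, M + 1):
        C = lagged_cov(X, tau)
        F0 += parzen(tau) * (C + C.T)      # F_hat(0) = sum of windowed lagged covs
    lam, U = eigh(F0, C0)                  # generalised symmetric eigenproblem
    order = np.argsort(lam)[::-1]          # largest decorrelation time first
    return lam[order], U[:, order]

X = np.random.randn(3000, 10)              # placeholder data (time x space)
T, patterns = opp(X, M=int(np.sqrt(3000)))
```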

8.3.3 Average Predictability Patterns

Average predictability pattern (APP) analysis was presented by DelSole and Tippett (2009a,b) based on the concept of average predictability time (APT) decomposition, which is a metric for the average predictability. Let us designate by $\boldsymbol{\Sigma}_\tau$ the covariance matrix of the forecast of a p-dimensional time series $\mathbf{x}_t$, t = 1, 2, ..., at lead time τ, i.e. $E\left[(\hat{\mathbf{x}}_{t+\tau} - \mathbf{x}_{t+\tau})(\hat{\mathbf{x}}_{t+\tau} - \mathbf{x}_{t+\tau})^T\right]$; the APT is then given by

$$S = 2\sum_{\tau=1}^{\infty}S_\tau, \qquad (8.33)$$

where $S_\tau = \frac{1}{p}\mathrm{tr}\left[\left(\boldsymbol{\Sigma}_\infty - \boldsymbol{\Sigma}_\tau\right)\boldsymbol{\Sigma}_\infty^{-1}\right]$, also known as the Mahalanobis signal, and $\boldsymbol{\Sigma}_\infty$ is the climatological covariance. APT decomposition analysis seeks vectors $\mathbf{v}$ such that the scalar time series $y_t = \mathbf{x}_t^T\mathbf{v}$, t = 1, 2, ..., has maximum APT. Keeping in mind that for these univariate time series $\sigma_\tau^2 = \mathbf{v}^T\boldsymbol{\Sigma}_\tau\mathbf{v}$ and $\sigma_\infty^2 = \mathbf{v}^T\boldsymbol{\Sigma}_\infty\mathbf{v}$, the pattern $\mathbf{v}$ is obtained as the solution of the following generalised eigenvalue problem:

$$2\sum_{\tau=1}^{\infty}\left(\boldsymbol{\Sigma}_\infty - \boldsymbol{\Sigma}_\tau\right)\mathbf{v} = \lambda\boldsymbol{\Sigma}_\infty\mathbf{v}. \qquad (8.34)$$

Note that Eq. (8.34) can be transformed to yield the following eigenvalue problem:


$$2\sum_{\tau=1}^{\infty}\left(\mathbf{I} - \boldsymbol{\Sigma}_\infty^{-1/2}\boldsymbol{\Sigma}_\tau\left(\boldsymbol{\Sigma}_\infty^{-1/2}\right)^T\right)\mathbf{w} = \lambda\mathbf{w}, \qquad (8.35)$$

where $\mathbf{w} = \boldsymbol{\Sigma}_\infty^{1/2}\mathbf{v}$ and is taken to be unit norm. The APP $\mathbf{u}$ is then obtained by projecting the time series onto $\mathbf{v}$, i.e. $E(\mathbf{x}_t y_t)$, and

$$\mathbf{u} = \boldsymbol{\Sigma}_\infty\mathbf{v}. \qquad (8.36)$$

A similar argument to the OPP can be applied here to get the decomposition of the time series $\mathbf{x}_t$, t = 1, ..., n, using Eqs. (8.24) and (8.25), after substituting $\mathbf{u}_k$ and $\mathbf{v}_k$ of Eq. (8.36) for $\mathbf{w}_k$ and $\mathbf{u}_k$ in Eqs. (8.24) and (8.25), respectively, i.e. $\mathbf{x}_t = \sum_{k=1}^{p}\left(\mathbf{v}_k^T\mathbf{x}_t\right)\mathbf{u}_k$. To estimate the APPs, the linear model $\mathbf{x}_{t+\tau} = \mathbf{A}\mathbf{x}_t + \boldsymbol{\varepsilon}_t$, with $\mathbf{A} = \mathbf{C}_\tau\mathbf{C}_0^{-1}$, is used, for which $\boldsymbol{\Sigma}_\tau = \mathbf{C}_0 - \mathbf{C}_\tau\mathbf{C}_0^{-1}\mathbf{C}_\tau^T$. The patterns are then solution to the eigenproblem:

$$\sum_\tau\mathbf{C}_\tau\mathbf{C}_0^{-1}\mathbf{C}_\tau^T\mathbf{v} = \lambda\mathbf{C}_0\mathbf{v}. \qquad (8.37)$$

The estimation from a finite sample is obtained in a similar manner to the OPPs. In fact, to avoid getting negative APT values, which could result from a "naive" truncation of Eq. (8.37), DelSole and Tippett (2009b) suggest using a Parzen lag window $\alpha_\tau$, given in Eq. (8.32), to weight the lagged covariance matrices, which does not produce negative spectra. The APT is then estimated using

$$S = 2\sum_{\tau=1}^{M}\alpha_\tau S_\tau. \qquad (8.38)$$

DelSole and Tippett (2009b) applied APP analysis to the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) 6-hourly 1000-hPa zonal velocity during the 50-year period from 1 January 1956 to 31 December 2005, providing a sample size of 73,052. Their analysis reveals that the prominent predictable patterns reflect the dominant low-frequency modes, including a climate change signal (leading pattern) and an ENSO-related signal (second pattern), in addition to the annular mode (third pattern). For example, the second predictable component (Fig. 8.3) has an average predictability time of about 5 weeks, and its predictability is mostly captured by the ENSO signal. They also obtained the MJO signal when the zonal wind was reconstructed based on the leading seven predictable patterns. The remaining patterns were identified with weather predictability having time scales of less than a week.

Fischer (2015, 2016) provides an alternative expression of the OPP and APP analyses based on the reduced rank multivariate regression $\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}$, where $\mathbf{E}$ is an n × p residual matrix and $\mathbf{B}$ a p × d matrix of regression coefficients. For OPP and APP analyses, the tth row of $\mathbf{Y}$ is given by $\mathbf{y}_t = \sum_{\tau=-M}^{M}\alpha_\tau\mathbf{x}_{t+\tau}$, where $\alpha_\tau$ represents the Parzen lag window, Eq. (8.32), at lag τ.
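A rough Python sketch of the finite-sample APP estimation under the linear (regression) model assumption is given below, with the Parzen weights of Eq. (8.38) folded into the lagged terms of Eq. (8.37); variable names, the maximum lag and the data are placeholders, and the returned values are only APT-like diagnostics for the single components.

```python
import numpy as np
from scipy.linalg import eigh

def lagged_cov(X, tau):
    n = X.shape[0]
    return X[tau:].T @ X[:n - tau] / n

def app(X, M):
    """Average predictability patterns: sum_tau C_tau C0^-1 C_tau^T v = lambda C0 v."""
    X = X - X.mean(axis=0)
    C0 = lagged_cov(X, 0)
    C0inv = np.linalg.inv(C0)
    G = np.zeros_like(C0)
    for tau in range(1, M + 1):
        # Parzen weights (Eq. 8.32) guard against negative APT values
        r = tau / M
        a = 1 - 6 * r**2 + 6 * r**3 if tau <= M / 2 else 2 * (1 - r)**3
        C = lagged_cov(X, tau)
        G += a * (C @ C0inv @ C.T)
    lam, V = eigh(G, C0)                    # generalised eigenproblem, Eq. (8.37)
    order = np.argsort(lam)[::-1]
    V = V[:, order]
    U = C0 @ V                              # patterns, u = Sigma v (Eq. 8.36 analogue)
    return 2 * lam[order], V, U             # APT-like values, filters v, patterns u

X = np.random.randn(5000, 8)
apt, filters, patterns = app(X, M=50)
```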

8.4 Predictive Patterns

8.4.1 Introduction

Each of the methods presented so far serves a particular goal. EOFs, for example, are defined without constraint on the temporal structure (since they are based on using contemporaneous maps) and are forced to be correlated with the original field. POPs, on the other hand, compromise the last property, i.e. correlation with the field, but use temporal variation. EEOFs, or MSSA, use both the spatial and the temporal structure and produce instead a set of patterns that can be used to filter the field or find propagative patterns. They do not, however, make use of the temporal structure, e.g. autocovariance, in an optimal way. This means that there is no constraint on predictability. Because they are formulated using persistence, the OPPs use the autocovariance structure of the field. They achieve this by finding patterns whose time series evolution decays very slowly. These patterns, however, are not particularly predictable. Patterns that maximise covariance or correlation between forecast and analysis (Renwick and Wallace 1995) deal explicitly with predictability since they involve some measure of forecast skill. These patterns, however, are not the most predictable since they are model-dependent. An alternative way to find predictable patterns is to use a conventional measure of forecast skill, namely the prediction error variance. In fact this is a standard measure used in prediction and yields, for example, the Yule–Walker equations for a stationary univariate time series. Predictive Oscillation Patterns (PrOPs; Kooperberg and O'Sullivan 1996) achieve precisely this. PrOPs are patterns whose time coefficients minimise the one-step ahead prediction error and as such can be considered the most predictable patterns. When this approach was introduced, Kooperberg and O'Sullivan (1996) were not motivated by predicting the weather but mainly by working out a hybrid of EOFs and POPs that attempts to retain desirable aspects of both.

8.4.2 Optimally Predictable Patterns

Let $\mathbf{x}_t$, t = 1, 2, ..., be a multivariate stationary time series with spectral density matrix $\mathbf{F}(\omega)$. The objective is to find a pattern $\mathbf{u}$ with the most predictable time series $y_t = \mathbf{u}^T\mathbf{x}_t$, t = 1, 2, ..., in terms of the one-step ahead prediction. That is,
the time series $y_t$, t = 1, 2, ..., has the smallest one-step ahead prediction error variance. Using the autocovariance of this time series, as a function of the lagged covariance matrix of $\mathbf{x}_t$, see Eq. (8.16), the spectral density function $f(\omega)$ of $y_t$, t = 1, 2, ..., becomes

$$f(\omega) = \mathbf{u}^T\mathbf{F}(\omega)\mathbf{u}, \qquad (8.39)$$

and the one-step ahead prediction error, Eq. (8.15), yields

$$\sigma_1^2 = \exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\log\left(2\pi\,\mathbf{u}^T\mathbf{F}(\omega)\mathbf{u}\right)d\omega\right), \qquad (8.40)$$

which has to be minimised. Because log is a monotonically increasing function, it is simpler to minimise $\log(\sigma_1^2)$ instead. The required pattern $\mathbf{u}$ is then obtained as the solution to the optimisation problem:

$$\min_{\mathbf{u}}\int_{-\pi}^{\pi}\log\left(2\pi\,\mathbf{u}^T\mathbf{F}(\omega)\mathbf{u}\right)d\omega \quad \text{s.t.} \quad \mathbf{u}^T\mathbf{u} = 1, \qquad (8.41)$$

which can be transformed to yield the unconstrained problem:

$$\min_{\mathbf{u}}\left[F(\mathbf{u}) = \int_{-\pi}^{\pi}\log\frac{\mathbf{u}^T\mathbf{F}(\omega)\mathbf{u}}{\mathbf{u}^T\mathbf{u}}\,d\omega\right]. \qquad (8.42)$$

Remark An equivalent formula can also be derived, such as

$$\min_{\mathbf{u}}\;\frac{\mathbf{u}^T\overline{\mathbf{F}}\,\mathbf{u}}{\mathbf{u}^T\mathbf{u}}\exp\left(\frac{1}{2\pi}\int_{-\pi}^{\pi}\log\frac{\mathbf{u}^T\mathbf{F}(\omega)\mathbf{u}}{\mathbf{u}^T\overline{\mathbf{F}}\,\mathbf{u}}\,d\omega\right), \qquad (8.43)$$

where $\overline{\mathbf{F}} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\mathbf{F}(\omega)\,d\omega = \frac{1}{2\pi}\boldsymbol{\Sigma}$.

Higher order PrOPs are obtained in a similar manner under the constraint of being orthogonal to the previous PrOPs. For instance, the (k+1)th PrOP is given by

$$\mathbf{u}_{k+1} = \underset{\mathbf{u},\;\mathbf{u}^T\mathbf{u}_\alpha = \delta_{\alpha,k+1}}{\mathrm{argmin}}\int_{-\pi}^{\pi}\log\frac{\mathbf{u}^T\mathbf{F}(\omega)\mathbf{u}}{\mathbf{u}^T\mathbf{u}}\,d\omega. \qquad (8.44)$$

Suppose, for example, that the first k PrOPs, k ≥ 1, have been identified. The next one is obtained as the first PrOP of the residuals:

$$\mathbf{z}_t = \mathbf{x}_t - \sum_{j=1}^{k}y_{t,j}\mathbf{u}_j \qquad (8.45)$$

for t = 1, 2, ..., and where $y_{t,j} = \mathbf{x}_t^T\mathbf{u}_j$, j = 1, 2, ..., k. In fact, Eq. (8.45) is obtained by removing recursively the contributions from previous PrOPs and can be rewritten as


$$\mathbf{z}_t = \left(\mathbf{I}_p - \sum_{j=1}^{k}\mathbf{u}_j\mathbf{u}_j^T\right)\mathbf{x}_t = \mathbf{A}_k\mathbf{x}_t, \qquad (8.46)$$

and $\mathbf{A}_k$ is simply the projection operator onto the orthogonal complement of the space spanned by the first k PrOPs. The PrOP optimisation problem derived from the residual time series $\mathbf{z}_t$, t = 1, 2, ..., yields

$$\min_{\mathbf{u}}\int_{-\pi}^{\pi}\ln\frac{\mathbf{u}^T\mathbf{A}_k^T\mathbf{F}(\omega)\mathbf{A}_k\mathbf{u}}{\mathbf{u}^T\mathbf{u}}\,d\omega, \qquad (8.47)$$

which provides the (k+1)th PrOP $\mathbf{u}_{k+1}$. Note that $\mathbf{A}_k\mathbf{u}_{k+1} = \mathbf{u}_{k+1}$, so that $\mathbf{u}_{k+1}$ belongs to the null space of $(\mathbf{u}_1, \ldots, \mathbf{u}_k)$.

8.4.3 Computational Aspects

The optimisation problem in Eq. (8.42) or Eq. (8.47) can only be carried out numerically, using some sort of descent algorithm (Appendix E), because of the nonlinearity involved. Given a finite sample $\mathbf{x}_t$, t = 1, 2, ..., n, the first step consists in estimating the spectral density matrix using, for example, the periodogram (see Appendix C):

$$\mathbf{I}(\omega_p) = \frac{1}{n\pi}\left(\sum_{t=1}^{n}\mathbf{x}_t e^{-it\omega_p}\right)\left(\sum_{t=1}^{n}\mathbf{x}_t e^{it\omega_p}\right)^T \qquad (8.48)$$

for $\omega_p = \frac{2\pi p}{n}$, $p = -[\frac{n-1}{2}], \ldots, [\frac{n}{2}]$, where [x] is the integer part of x. Note that the first sum in the rhs of Eq. (8.48) is the Fourier transform $\hat{\mathbf{x}}(\omega_p)$ of $\mathbf{x}_t$, t = 1, ..., n, at frequency $\omega_p$, and $\mathbf{I}(\omega_p) = \frac{1}{\pi}\hat{\mathbf{x}}(\omega_p)\hat{\mathbf{x}}^{*T}(\omega_p)$, where the notation $x^*$ stands for the complex conjugate of x. However, since the periodogram is not a consistent estimator of the power spectrum, a better estimator of the spectral density matrix $\mathbf{F}(\omega)$ can be obtained by smoothing the periodogram (see Appendix C):

$$\hat{\mathbf{F}}(\omega) = \sum_{p=-[\frac{n-1}{2}]}^{[\frac{n}{2}]}\Lambda(\omega - \omega_p)\,\mathbf{I}(\omega_p), \qquad (8.49)$$

where $\Lambda(\cdot)$ is a spectral window. The smoothing makes $\hat{\mathbf{F}}(\omega)$ asymptotically consistent (see e.g. Jenkins and Watts 1968 and Chatfield 1996). Furthermore, the smoothed periodogram $\hat{\mathbf{F}}(\omega)$ will in general be full rank and prevents the integrals in Eq. (8.42) or Eq. (8.47) from being singular. The function to be optimised in Eq. (8.42) is then approximated by


$$F(\mathbf{u}) = \frac{\pi}{n}\sum_{k=-[\frac{n-1}{2}]}^{[\frac{n}{2}]-1}\log\left(\frac{\mathbf{u}^T\hat{\mathbf{F}}(\omega_k)\mathbf{u}}{\mathbf{u}^T\mathbf{u}} + \frac{\mathbf{u}^T\hat{\mathbf{F}}(\omega_{k+1})\mathbf{u}}{\mathbf{u}^T\mathbf{u}}\right) \qquad (8.50)$$

and similarly for Eq. (8.47). A gradient-type algorithm (Appendix E) can be used in the minimisation. The gradient of $F(\mathbf{u})$ is given by

$$-\frac{1}{4\pi}\nabla F(\mathbf{u}) = \left(\mathbf{I}_p - \frac{1}{2\pi}\int_{-\pi}^{\pi}\frac{\mathbf{F}(\omega)}{\mathbf{u}^T\mathbf{F}(\omega)\mathbf{u}}\,d\omega\right)\mathbf{u} \approx \left(\mathbf{I}_p - \frac{1}{2n}\sum_{k=-[\frac{n-1}{2}]}^{[\frac{n}{2}]-1}\left[\frac{\hat{\mathbf{F}}(\omega_k)}{\mathbf{u}^T\hat{\mathbf{F}}(\omega_k)\mathbf{u}} + \frac{\hat{\mathbf{F}}(\omega_{k+1})}{\mathbf{u}^T\hat{\mathbf{F}}(\omega_{k+1})\mathbf{u}}\right]\right)\mathbf{u}. \qquad (8.51)$$

The application of PrOPs to 47 years of 500-mb height anomalies from the National Meteorological Center (NMC) daily analyses (Kooperberg and O'Sullivan 1996) shows some similarities with EOFs. For example, the first PrOP is quite similar to the leading EOF, and the second PrOP is rather similar to the third EOF. An investigation of forecast errors (Fig. 8.4) suggests that PrOPs perform better than, for example, POPs and have similar performance to EOFs.

Fig. 8.4 Forecast error versus the number of the patterns using PCA (continuous), POPs (dotted) and PrOPs (dashed) of the NMC northern extratropical daily 500-hPa geopotential height anomalies for the period of Jan 1947–May 1989. Adapted from Kooperberg and O’Sullivan (1996)
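The sketch below illustrates one possible numerical route to the first PrOP (it is not the authors' code): the cross-spectral matrix is estimated by smoothing the periodogram with a simple running mean (a crude stand-in for the spectral window of Eq. (8.49)), and the log prediction error variance of Eq. (8.40) is minimised over unit vectors with a generic quasi-Newton optimiser. The smoothing width, the placeholder data and the choice of optimiser are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

def smoothed_cross_spectrum(X, half_width=5):
    """Periodogram matrices I(omega_p), Eq. (8.48), smoothed across frequencies."""
    n, p = X.shape
    Z = np.fft.fft(X - X.mean(axis=0), axis=0) / np.sqrt(n)   # x_hat(omega_p)
    I = np.einsum('fi,fj->fij', Z, Z.conj()) / np.pi
    kernel = np.ones(2 * half_width + 1) / (2 * half_width + 1)
    F = np.empty_like(I)
    for i in range(p):
        for j in range(p):
            F[:, i, j] = np.convolve(I[:, i, j], kernel, mode='same')
    return F

def log_pred_error(u, F):
    """log(sigma_1^2) for y_t = u^T x_t, cf. Eqs. (8.40) and (8.42)."""
    u = u / np.linalg.norm(u)
    fy = np.real(np.einsum('i,fij,j->f', u, F, u))
    return np.mean(np.log(2 * np.pi * np.maximum(fy, 1e-12)))

X = np.random.randn(2000, 6)                 # placeholder data (time x space)
F = smoothed_cross_spectrum(X)
res = minimize(log_pred_error, np.ones(6), args=(F,), method='BFGS')
prop1 = res.x / np.linalg.norm(res.x)        # first predictive pattern (PrOP)
```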


8.5 Optimally Interpolated Patterns

8.5.1 Background

Let again $\mathbf{x}_t$, t = 1, 2, ..., be a zero-mean stationary multivariate time series with covariance matrix $\boldsymbol{\Sigma}$ and spectral density matrix $\mathbf{F}(\omega)$. In prediction theory, the τ-step ahead prediction $\mathbf{x}_{t,\tau}$ of $\mathbf{x}_t$ given $\mathbf{x}_{t-k}$, for k ≥ τ, is given by the conditional expectation:

$$\mathbf{x}_{t,\tau} = E\left[\mathbf{x}_t\,|\,\mathbf{x}_{t-\tau}, \mathbf{x}_{t-\tau-1}, \ldots\right]. \qquad (8.52)$$

Equation (8.52) is known to minimise the mean square error variance $E\left[\left(\mathbf{x}_t - \mathbf{x}_{t,\tau}\right)^T\left(\mathbf{x}_t - \mathbf{x}_{t,\tau}\right)\right]$. The prediction formula in Eq. (8.52) corresponds also to a linear prediction using the past values of the time series, i.e.

$$\mathbf{x}_{t,\tau} = \mathbf{h}(B)\mathbf{x}_t, \qquad (8.53)$$

where $\mathbf{h}(z) = \sum_{k\ge 1}\mathbf{A}_k z^{\tau+k}$ and $\mathbf{A}_k$, k = 1, 2, ..., are p × p matrices. Equation (8.53) is a linear filter with frequency response function given by

$$\mathbf{H}(\omega) = \mathbf{h}\left(e^{i\omega}\right). \qquad (8.54)$$

Accordingly, the error $\mathbf{x}_t - \mathbf{x}_{t,\tau}$ has $\mathbf{I}_p - \mathbf{H}(\omega)$ as frequency response function, and hence the error covariance matrix takes the form (see Sect. 2.6, Chap. 2):

$$\boldsymbol{\Sigma}_\tau = \int_{-\pi}^{\pi}\left(\mathbf{I}_p - \mathbf{H}(\omega)\right)\mathbf{F}(\omega)\left(\mathbf{I}_p - \mathbf{H}(\omega)\right)^{*T}d\omega. \qquad (8.55)$$

8.5.2 Interpolation and Pattern Derivation

This section discusses a method for identifying patterns using interpolation: optimally interpolated patterns (OIP; Hannachi 2008). Interpolation problems are related, but not identical, to the prediction problem. The objective here is to interpolate a stationary time series, i.e. obtain a replacement of one or several missing values. More details on interpolation in stationary time series can be found in Grenander and Rosenblatt (1957), Bonnet (1965) and Hannan (1970). We suppose that we are given $\mathbf{x}_{t-j}$ for all j ≠ 0; the objective is then to seek an estimate $\hat{\mathbf{x}}_t$ of $\mathbf{x}_t$, which is estimated by


$$\hat{\mathbf{x}}_t = \sum_{j\ne 0}\alpha_j\mathbf{x}_{t-j} = h(B)\mathbf{x}_t, \qquad (8.56)$$

where $h(z) = \sum_{j\ne 0}\alpha_j z^j$, such that the mean square error

$$\|\mathbf{x}_t - \hat{\mathbf{x}}_t\|^2 = E\left[\left(\mathbf{x}_t - \hat{\mathbf{x}}_t\right)^T\left(\mathbf{x}_t - \hat{\mathbf{x}}_t\right)\right] \qquad (8.57)$$

is minimised. This is equivalent (see Sect. 2.6) to minimising

$$\mathrm{tr}\left[\int_{-\pi}^{\pi}\left(\mathbf{I}_p - \mathbf{H}(\omega)\right)\mathbf{F}(\omega)\left(\mathbf{I}_p - \mathbf{H}(\omega)\right)^{*T}d\omega\right] \qquad (8.58)$$

under reasonable assumptions of continuity of the spectral density matrix $\mathbf{F}(\omega)$ of the stochastic process $\mathbf{x}_t$, t = 0, ±1, ±2, ..., and integrability of $\mathbf{F}^{-1}(\omega)$. The covariance matrix $\boldsymbol{\Sigma} = E\left[\left(\hat{\mathbf{x}}_t - \mathbf{x}_t\right)\left(\hat{\mathbf{x}}_t - \mathbf{x}_t\right)^T\right]$ of the interpolation error $\mathbf{x}_t - \hat{\mathbf{x}}_t$, which is also represented by Eq. (8.55), can be computed and is given by

$$\boldsymbol{\Sigma} = 2\pi\left(\int_{-\pi}^{\pi}\left(2\pi\mathbf{F}(\omega)\right)^{-1}d\omega\right)^{-1}. \qquad (8.59)$$

Furthermore, $\boldsymbol{\Sigma}$ is nonsingular, and the optimal interpolation filter is given by

$$\mathbf{H}(\omega) = \mathbf{I}_p - (2\pi)^{-1}\boldsymbol{\Sigma}\,\mathbf{F}^{-1}(\omega). \qquad (8.60)$$

We give below an outline of the proof and, for more details, refer to the above references. We let $\mathbf{x}_k = (X_{k1}, \ldots, X_{kp})$, k = 0, ±1, ±2, ..., be a sequence of zero-mean second-order random vectors, i.e. with components having finite variances. Let also $H_t$ be the space spanned by the sequence $\{X_{kj},\ j = 1, \ldots, p,\ k = 0, \pm 1, \pm 2, \ldots,\ k \ne t\}$, known also as a random function. Basically, $H_t$ is composed of finite linear combinations of elements of this random function. Then $H_t$ has the structure of a Hilbert space with respect to a generalised scalar product (see Appendix F). The estimator $\hat{\mathbf{x}}_t$ in Eq. (8.56) can be seen as the projection⁵ of $\mathbf{x}_t$ onto this space. Therefore $\mathbf{x}_t - \hat{\mathbf{x}}_t$ is orthogonal to $\hat{\mathbf{x}}_t$ and also to $\mathbf{x}_s$ for all s ≠ t. The first of these two properties yields

⁵ Not exactly so, because $H_t$ is the set of all finite linear combinations of elements from the sequence. However, this can be easily overcome by considering partial sums of the form $\mathbf{h}_N(B)\mathbf{x}_t = \sum_{k=-N,\,k\ne 0}^{N}\alpha_k\mathbf{x}_{t-k}$ and then making N approach infinity. The limit $h(\cdot)$ of $h_N(\cdot)$ is then obtained from

$$\lim_{N\to\infty}\int_{-\pi}^{\pi}\mathrm{tr}\left[\left(\mathbf{H}(\omega) - \mathbf{H}_N(\omega)\right)\mathbf{F}(\omega)\left(\mathbf{H}(\omega) - \mathbf{H}_N(\omega)\right)^{*}\right]d\omega = 0, \qquad (8.61)$$

where $\mathbf{H}_N(\omega) = \mathbf{h}_N(e^{i\omega})$.

$$E\left[\left(\mathbf{x}_t - \hat{\mathbf{x}}_t\right)\mathbf{x}_s^{*T}\right] = \mathbf{O} \quad \text{for all } s \ne t, \qquad (8.62)$$

where the notation (*) stands for the complex conjugate. This equation can be expressed using the spectral density matrix $\mathbf{F}(\omega)$ and the multivariate frequency response function $\mathbf{I}_p - \mathbf{H}(\omega)$ of $\mathbf{x}_t - \hat{\mathbf{x}}_t$; refer to Sect. 2.6 in Chap. 2. Note that t can be set to zero because of stationarity. This necessarily implies

$$\int_{-\pi}^{\pi}\left(\mathbf{I}_p - \mathbf{H}(\omega)\right)\mathbf{F}(\omega)\,e^{is\omega}\,d\omega = \mathbf{O} \quad \text{for } s \ne 0, \qquad (8.63)$$

which in turn implies that the matrix inside the integral is independent of ω, i.e.

$$\left(\mathbf{I}_p - \mathbf{H}(\omega)\right)\mathbf{F}(\omega) = \mathbf{A}, \qquad (8.64)$$

where $\mathbf{A}$ is a constant matrix. The second orthogonality property, i.e. $E\left[\left(\mathbf{x}_t - \hat{\mathbf{x}}_t\right)\hat{\mathbf{x}}_t^{*T}\right] = \mathbf{O}$, implies a similar relationship; namely,

$$\int_{-\pi}^{\pi}\left(\mathbf{I}_p - \mathbf{H}(\omega)\right)\mathbf{F}(\omega)\mathbf{H}^{*T}(\omega)\,d\omega = \mathbf{O}. \qquad (8.65)$$

Now, by expanding the expression of the covariance matrix $\boldsymbol{\Sigma}_\tau$ in Eq. (8.55) and using the last orthogonality property in Eq. (8.65), one gets

$$\boldsymbol{\Sigma} = \int_{-\pi}^{\pi}\left(\mathbf{I}_p - \mathbf{H}(\omega)\right)\mathbf{F}(\omega)\,d\omega = 2\pi\mathbf{A}. \qquad (8.66)$$

Finally, substituting the expression of $\mathbf{I}_p - \mathbf{H}(\omega)$ from Eq. (8.64) into the expression of $\boldsymbol{\Sigma}$ in Eq. (8.55) and noting that $\mathbf{A}$ is real (see Eq. (8.66)) yield

$$\mathbf{A}^{-1} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\mathbf{F}^{-1}(\omega)\,d\omega, \qquad (8.67)$$

where the invertibility of $\boldsymbol{\Sigma}$ is guaranteed by the integrability of $\mathbf{F}^{-1}$. Now, given a unit vector $\mathbf{u}$, the previous interpolation problem can be cast in terms of the univariate time series $y_t = \mathbf{u}^T\mathbf{x}_t$, t = 1, 2, .... The mean square interpolation error for $y_t$ is

$$E\left(y_t - \hat{y}_t\right)^2 = \mathbf{u}^T\boldsymbol{\Sigma}\mathbf{u}. \qquad (8.68)$$

The optimally interpolated pattern (OIP) $\mathbf{u}$ is the vector minimising the interpolation error variance, Eq. (8.68). Hence the OIPs are given by the eigenvectors of the interpolation error covariance matrix $\boldsymbol{\Sigma}$ in Eq. (8.59), associated with its successively increasing eigenvalues.


Exercise In general, Eq. (8.66) can only be obtained numerically, but a simple example where $\boldsymbol{\Sigma}$ can be calculated analytically is given by the following model (Hannachi 2008): $\mathbf{x}_t = \mathbf{u}\alpha_t + \boldsymbol{\varepsilon}_t$, where $\mathbf{u}$ is a constant p-dimensional vector, $\alpha_t$ a univariate stationary time series with spectral density function $g(\omega)$ and $\boldsymbol{\varepsilon}_t$ a p-dimensional random noise, uncorrelated in time and independent of $\alpha_t$, with covariance $\boldsymbol{\Sigma}_\varepsilon$. Assume $\mathrm{var}(\alpha_t) = 1 = \|\mathbf{u}\|$. Show that

$$\boldsymbol{\Sigma} = \boldsymbol{\Sigma}_\varepsilon + \frac{\beta}{1 - \beta\,\mathbf{u}^T\boldsymbol{\Sigma}_\varepsilon^{-1}\mathbf{u}}\,\mathbf{u}\mathbf{u}^T,$$

and find the expression of β.

8.5.3 Numerical Aspects

Given a sample of an observed multivariate time series $\mathbf{x}_t$, t = 1, 2, ..., n, to find the OIPs of these data one first estimates the interpolation error covariance matrix using an estimate of the spectral density matrix as in the previous section, i.e.

$$\hat{\boldsymbol{\Sigma}}^{-1} = \frac{1}{4\pi^2}\int_{-\pi}^{\pi}\hat{\mathbf{F}}(\omega)^{-1}\,d\omega, \qquad (8.69)$$

where $\hat{\mathbf{F}}(\omega)$ is an estimate of the spectral density matrix given, for example, by Eq. (8.49). Note that, as mentioned above, smoothing the periodogram makes $\hat{\mathbf{F}}(\omega)$ full rank. A trapezoidal rule can then be used to approximate this integral to yield

$$\hat{\boldsymbol{\Sigma}}^{-1} \approx \frac{1}{4\pi n}\sum_{k=-[\frac{n-1}{2}]}^{[\frac{n}{2}]-1}\left(\hat{\mathbf{F}}^{-1}(\omega_k) + \hat{\mathbf{F}}^{-1}(\omega_{k+1})\right), \qquad (8.70)$$

where [x] is the integer part of x, and $\omega_k = 2\pi k/n$ represents the kth discrete frequency. Any other form of quadrature (e.g. Gaussian) or interpolation (e.g. Linz and Wang 2003) can also be used to evaluate the previous integral. In Hannachi (2008) the spectral density was estimated using the smoothed periodogram given in Eq. (8.49), where $\Lambda(\cdot)$ is a smoothing spectral window and $\mathbf{I}(\omega_k) = \frac{1}{\pi}\hat{\mathbf{x}}(\omega_k)\hat{\mathbf{x}}^{*T}(\omega_k)$, where $\hat{\mathbf{x}}(\omega_k)$ is the Fourier transform of $\mathbf{x}_t$, t = 1, ..., n, at $\omega_k$, that is $\hat{\mathbf{x}}(\omega_k) = n^{-1/2}\sum_{t=1}^{n}\mathbf{x}_t e^{-i\omega_k t}$.
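The Python sketch below strings these steps together: a smoothed periodogram (as a crude stand-in for Eq. (8.49)), the trapezoidal estimate of Σ̂⁻¹ in Eq. (8.70), and the eigenvectors of Σ̂ associated with the smallest eigenvalues as leading OIPs. The smoothing width, the truncation and the placeholder data are arbitrary choices.

```python
import numpy as np

def smoothed_spectra(X, half_width=5):
    """Cross-periodogram matrices I(omega_k), smoothed over neighbouring frequencies."""
    n, p = X.shape
    Z = np.fft.fft(X - X.mean(axis=0), axis=0) / np.sqrt(n)   # x_hat(omega_k)
    I = np.einsum('fi,fj->fij', Z, Z.conj()) / np.pi
    F = np.empty_like(I)
    for k in range(n):
        idx = np.arange(k - half_width, k + half_width + 1) % n
        F[k] = I[idx].mean(axis=0)
    return F

def oips(X, n_patterns=2, half_width=5):
    """Optimally interpolated patterns via Eqs. (8.69)-(8.70)."""
    n, p = X.shape
    F = smoothed_spectra(X, half_width)
    Sigma_inv = np.zeros((p, p), dtype=complex)
    for k in range(n):
        Sigma_inv += np.linalg.inv(F[k]) + np.linalg.inv(F[(k + 1) % n])
    Sigma_inv = np.real(Sigma_inv) / (4 * np.pi * n)
    Sigma = np.linalg.inv(Sigma_inv)                 # interpolation error covariance
    vals, vecs = np.linalg.eigh(Sigma)
    return vals[:n_patterns], vecs[:, :n_patterns]   # smallest interpolation error first

X = np.random.randn(1000, 5)
err_vars, patterns = oips(X)
```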


8.5.4 Application

The following low-dimensional example was analysed by Hannachi (2008). The system is a three-variable time series given by

$$\begin{cases} x_t = \alpha\,t^{3/2} + 1.6\,\varepsilon_t^1 \\ y_t = \frac{3}{2}\alpha\,t + 2.4\,\varepsilon_t^2 \\ z_t = \frac{1}{2} + 1.5\,\varepsilon_t^3, \end{cases} \qquad (8.71)$$

where α = 0.004, and $\varepsilon_t^1$, $\varepsilon_t^2$ and $\varepsilon_t^3$ are first-order AR(1) models with respective lag-1 autocorrelations of 0.5, 0.6 and 0.3. Figure 8.5 shows a simulated example from this model. The trend is shared between PC1 and PC2 but is explained solely by OIP1, as shown in Fig. 8.6, which shows the histograms of the correlation coefficients between the linear trend and the PCs and OIP time series.

Fig. 8.5 Sample time series of system, Eq. (8.71), giving xt , yt and zt (upper row), the three PCs (middle row) and the three OIPs (bottom row). Adapted from Hannachi (2008). ©American Meteorological Society. Used with permission. (a) xt . (b) yt . (c) zt . (d) PC 1. (e) PC 2. (f) PC 3. (g) OIP 1. (h) OIP 2. (i) OIP 3


Fig. 8.6 Histograms of the absolute value of the correlation coefficient between the linear trend in Eq. (8.71) and the PCs (upper row) and the OIP time series (bottom row). Adapted from Hannachi (2008). ©American Meteorological Society. Used with permission

Another example of OIP application is discussed next using the northern hemispheric sea level pressure (Hannachi 2008). Given the large dimensionality involved in the computation of the spectral density matrix, an EOF truncation is used. For example, with the retained leading m = 5 EOFs/PCs, the leading OIP has 95% total interpolation error variance, whereas the next two OIPs have, respectively, 2% and 1%. The leading two OIPs are shown in Fig. 8.7 along with the leading two EOFs. The leading OIP, in particular, shows clearly the NAO signal. Compare this with the mixed AO/NAO EOF1. The second OIP represents (neatly) the Pacific oscillation pattern, compared again to the mixed EOF2 pattern. It was noted in Hannachi (2008) that as the number of retained EOFs/PCs increases, the spectrum of the interpolation covariance matrix isolates the leading two OIPs, reflecting the low-frequency patterns. An example of spectrum is shown in Fig. 8.8 with m = 20 EOFs/PCs. The stability of these two OIP patterns for different values of m can be checked by comparing, for example, the patterns shown in Fig. 8.7, with the same patterns for different values of m. This is shown in Fig. 8.9.


Fig. 8.7 Leading two OIPs (upper row) and EOFs (bottom row). The OIPs are based on the leading 5 EOFs/PCs of northern hemispheric SLP anomalies. Adapted from Hannachi (2008). ©American Meteorological Society. Used with permission

Fig. 8.8 Leading 20 eigenvalues of the interpolation error covariance matrix shown in percentage of total interpolation error variance based on the leading 20 EOFs/PCs of northern hemispheric SLP anomalies. Adapted from Hannachi (2008). ©American Meteorological Society. Used with permission


Fig. 8.9 Spatial and temporal correlation of the leading 3 OIP patterns (thin) and associated IPC time series (bold), obtained based on the leading 5 EOFs/PCs, with the same OIPs and IPCs, for m=5,6, . . . 25 EOFs/PCs of northern hemispheric SLP anomalies. Adapted from Hannachi (2008). ©American Meteorological Society. Used with permission

Power spectra of the leading five OIP time series, i.e. interpolated PCs (IPCs), are shown in Fig. 8.10 based on two estimation methods, namely the Welch and Burg methods. The decrease of low-frequency power is clear as one goes from IPC1 to IPC5. Another example where the OIP method was applied is the tropical SLP. The leading two OIPs (and EOFs) are shown in Fig. 8.11. The leading OIP of tropical SLP anomalies (with respect to the seasonal cycle) represents the Southern Oscillation mode. Compare, for example, this mode with the leading EOF, which is a monopole pattern. The second EOF is more associated with OIP1.

8.6 Forecastable Component Analysis

Forecastable patterns are patterns that are derived based on uncertainties in time series. Forecastable component analysis (ForeCA) was presented by Goerg (2013) and is based on minimising a measure of uncertainty represented by the (differential) entropy of a time series. For a second-order stationary time series $x_t$, t = 1, 2, ..., with autocorrelation function ρ(·) and spectral density f(·), a measure of uncertainty is given by the "spectral" entropy⁶

⁶ The idea behind this definition is that if U is a uniform distribution over [−π, π] and V a random variable, independent of U, with probability density function g(·), then the time series $y_t = \sqrt{2}\cos(2\pi V t + U)$ has precisely g(·) as its power spectrum (Gibson 1994).


Fig. 8.10 Power spectra of the leading five IPCs based on the leading 5 EOFs/PCs of northern hemispheric SLP anomalies using the Welch and Burg methods. Adapted from Hannachi (2008). ©American Meteorological Society. Used with permission. (a) Power spectral density (Welch). (b) Power spectral density (Burg)

$$H(x) = -\frac{1}{\log 2\pi}\int_{-\pi}^{\pi}f(\omega)\log f(\omega)\,d\omega. \qquad (8.72)$$

The forecastability of the time series is given by

$$F(x) = 1 - H(x). \qquad (8.73)$$

Note that the factor log 2π in Eq. (8.72) corresponds to the spectral entropy of a white noise.


Fig. 8.11 Leading two OIPs and EOFs of the northern hemispheric SLP anomalies. The OIPs are obtained based on the leading 5 EOFs/PCs. Adapted from Hannachi (2008). ©American Meteorological Society. Used with permission. (a) Tropical SLP OIP 1. (b) Tropical SLP OIP 2. (c) EOF 1. (d) EOF 2


Given an n × p data matrix $\mathbf{X}$, the ForeCA pattern is defined as the vector $\mathbf{w}$ that maximises the forecastability of the time series $\mathbf{x} = \mathbf{X}\mathbf{w}$, subject to the constraint $\mathbf{w}^T\mathbf{S}\mathbf{w} = 1$, with $\mathbf{S}$ being the sample covariance matrix. Before proceeding, the data matrix is first whitened, and then the multivariate spectrum matrix $\mathbf{F}(\omega)$ is obtained. The univariate spectrum along the direction $\mathbf{w}$ is then given by

$$F(\omega) = \mathbf{w}^T\mathbf{F}(\omega)\mathbf{w}. \qquad (8.74)$$

The ForeCA patterns are then obtained by minimising the uncertainty, i.e.

$$\mathbf{w}^* = \underset{\mathbf{w},\;\mathbf{w}^T\mathbf{w}=1}{\mathrm{argmax}}\int_{-\pi}^{\pi}F(\omega)\log F(\omega)\,d\omega. \qquad (8.75)$$

In application the integral in Eq. (8.75) is approximated by a sum over the discrete frequencies $\omega_j = j/n$, j = 0, 1, ..., n − 1. Goerg (2013) used the weighted overlapping segment averaging (Nuttal and Carter 1982) to compute the power spectrum. The estimate from Eq. (8.49) can be used to compute the multivariate spectrum. Fischer (2015, 2016) compared five methods pertaining to predictability, namely OPP, APP, ForeCA, principal trend analysis (PTA) and slow feature analysis (SFA). PTA, or trend EOF analysis (Hannachi 2007), is presented in Chap. 15, and SFA (Wiskott and Sejnowski 2002), mentioned at the end of Sect. 8.3, is based on the lag-1 autocorrelation. In Fischer (2016) these methods were applied to a global dataset of speleothem δ18O for the last 22 ka, whereas in Fischer (2015) the methods were applied to the Australian daily near-surface minimum air temperature for the period 1910–2013. He showed, in particular, that the methods give comparable results with some subtle differences. Figure 8.12 shows the leading three predictable component time series of the δ18O dataset for the five methods. It is found, for example, that OPP analysis, SFA and PTA tend to identify low-frequency persistent components, whereas APP analysis and ForeCA are more directly measures of predictability. Of particular interest from this predictability analysis is the association of these signals with North American deglaciation, summer insolation and the Atlantic meridional overturning circulation (AMOC), see Fig. 8.12.
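The following minimal Python sketch computes the forecastability measure of Eqs. (8.72)–(8.73) for a single time series from a normalised spectral estimate; the simple smoothed periodogram used here, rather than weighted overlapping segment averaging, and the normalisation by the number of discrete frequencies are arbitrary simplifications for the illustration.

```python
import numpy as np

def forecastability(x, half_width=5):
    """Forecastability 1 - H(x), with H a discrete analogue of Eq. (8.72)."""
    n = len(x)
    x = x - x.mean()
    pgram = np.abs(np.fft.fft(x)) ** 2 / n
    # crude smoothing of the periodogram over neighbouring frequencies
    kernel = np.ones(2 * half_width + 1) / (2 * half_width + 1)
    spec = np.convolve(pgram, kernel, mode='same')
    f = spec / spec.sum()                      # normalised spectral density (sums to 1)
    H = -np.sum(f * np.log(f)) / np.log(n)     # entropy relative to white noise
    return 1.0 - H

white = np.random.randn(4000)
ar1 = np.zeros(4000)
for t in range(1, 4000):
    ar1[t] = 0.9 * ar1[t - 1] + np.random.randn()
print(forecastability(white), forecastability(ar1))   # ~0 for noise, larger for AR(1)
```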

[Figure 8.12: panels of the predictable component time series; rows OPP, APP, ForeCA, SFA and PTA; columns Components 1–3; vertical axis standardized δ18O; horizontal axis time (ka BP).]

Fig. 8.12 The leading three predictable component time series (black) obtained from the five methods, OPP (first row), APP (second row), ForeCA (third row), SFA (fourth row) and PTA (last row) applied to the global δ18O dataset. The yellow bars refer, respectively, to the timing of Heinrich Stadial 1, centred around 15.7 ka BP, and the Younger Dryas, centred around 12.2 ka BP. The red curves represent the percentage of deglaciated area in North America (left), the summer insolation at 65°N (middle) and the AMOC reconstruction (right). Adapted from Fischer (2016)

Chapter 9

Principal Coordinates or Multidimensional Scaling

Abstract This chapter describes patterns obtained based on proximity or similarity measures, i.e. multidimensional scaling (MDS). Conventional EOFs correspond to the case of quadratic distance. In this chapter other forms of similarity are discussed, with climate applications, which can yield structures that cannot be revealed by classical MDS.

Keywords Principal coordinate analysis · Dissimilarity measure · Multidimensional scaling · Classical scaling · Stress function · Non-metric scaling · Isomap · Kernel smoothing · Asian summer monsoon

9.1 Introduction

Most multivariate data analysis techniques involve, in one way or another, a measure of proximity or similarity between variables. Take, for example, ordinary EOFs or PCA: the covariance (or correlation) matrix involves a measure of covariability between all pairs of variables. The correlation between two variables provides in fact a measure of proximity between them. A correlation value of one means that the two variables are proportional, whereas a zero correlation means no linear association. This proximity, however, is not a standard distance in the coordinate space but provides a measure of proximity with respect to time (for time series) or realisations in general. In fact, a high correlation between two time series means that the two variables evolve in a coherent (or similar) manner in time, hence the concept of similarity. On various occasions data are given in terms of distances between the different variables instead of their actual coordinates or their time variations. A familiar problem is then to find a configuration of the points in a low-dimensional space using these pairwise distances. This has given rise to the concept of multidimensional scaling. Multidimensional scaling (MDS) is a geometric method for reconstructing a configuration from its interpoint distances, and enables one to uncover the structure embedded in high-dimensional data. It is an exploratory technique that aims at visualising proximities in low-dimensional spaces. It attempts therefore to preserve,


as much as possible, the observed proximities in the high-dimensional data space. In visualisation MDS plots similar objects close together. For example, EOFs can find a low-dimensional embedding of the data points that preserves their variances as measured in the high-dimensional input space. MDS, on the other hand, finds the embedding which preserves the interpoint distances. As will be shown later, these two methods are equivalent when the distances are Euclidean. The MDS method was originally developed in psychology in the mid-1930s by Schoenberg (1935). The fundamental and classical papers on MDS are those of Young and Householder (1938, 1941). Later, the method was developed further by Torgerson (1952, 1958) by incorporating ideas due to Jucker and others, see, e.g., Sibson (1979), and is sometimes referred to as classical scaling. Gower (1966) popularised the method under the name of principal coordinate analysis. The method was later extended to other fields such as sociology, economics, meteorology, etc. MDS is not only a method of visualisation of proximities, but also a method of dimension reduction like PCA and others (see e.g. Carreira-Perpiñán 2001; Tenenbaum et al. 2000). Various textbooks present MDS in more detail (Cox and Cox 1994; Young 1987; Mardia et al. 1979; Borg and Groenen 2005; Chatfield and Collins 1980).

9.2 Dissimilarity Measures

The starting point of MDS is a set of interpoint distances, or precisely a matrix consisting of all pairwise similarities. The Euclidean distance is a special case of dissimilarity measure. In general, however, dissimilarities need not be distances in the usual Euclidean sense.

Definition A n × n matrix D = (dij) is a distance matrix if it is symmetric and satisfies dij ≥ 0 and dii = 0, for i, j = 1, 2, ..., n.

Remark Some authors define a distance matrix as the opposite of the matrix given in the above definition, i.e. a symmetric matrix with zero diagonal and non-positive off-diagonal elements.

The above definition provides the general form of a distance or dissimilarity matrix. For example, the metric inequality dij ≤ dik + dkj, for all i, j and k, is not required. The choice of a dissimilarity measure depends in general on the type of data and the problem at hand. For example, in the case of quantitative measurements, such as continuous data, the most common dissimilarity measures include the Minkowski distance, where the distance between two points xi = (xi1, ..., xin) and xj = (xj1, ..., xjn) is given by

dij = ‖xi − xj‖λ = ( Σ_{k=1}^{n} |xik − xjk|^λ )^{1/λ}    (9.1)


for λ > 0. The Minkowski distance is a generalisation of the ordinary Euclidean distance obtained with λ = 2. Note also that when λ = 1 we get the L1-norm and when λ → ∞ we get the L∞-norm, i.e. ‖xi − xj‖∞ = maxk |xik − xjk|. Other dissimilarity measures also exist such as the Canberra and Mahalanobis metrics. The latter is closely related to the Euclidean metric and is given by dij = ( (xi − xj)ᵀ S⁻¹ (xi − xj) )^{1/2}, where S is the data covariance matrix. Table 9.1 provides some of the most widely used metrics for continuous variables.

Euclidean and related distances are not particularly useful when, for example, the measurements have different units, or when they are categorical, i.e. not of numerical type. For binary data, a similarity coefficient is usually defined as the rate of common attributes to the total number of attributes. Assume each individual takes the values 0 or 1, and let a and d designate the number of common attributes between individuals i and j, i.e. a attributes when i = j = 1 (i.e. the individuals co-occur), and d attributes when i = j = 0, as summarised in the contingency table, see Table 9.2. Then the similarity coefficient is defined by

sij = (a + d) / (a + b + c + d),    (9.2)

and the dissimilarity (or distance) can be defined by dij = 1 − sij. As an illustration we consider a simple example consisting of six patients labelled A, B, ..., F. The variables (or attributes) considered are sex (M/F), age category (old/young), employment (employed/unemployed) and marital status (married/single), and we use the values 1 and 0 to characterise the variables. We then construct the data matrix shown in Table 9.3. The similarity between, for example, A and B is (a + d)/(a + b + c + d) = (2 + 1)/(2 + 1 + 1 + 0) = 3/4. For categorical data the most common dissimilarity measure is given by Sneath's coefficient:

sij = 1 − (1/p) Σ_k δijk,

where δijk is 1 if i and j agree on variable k and 0 otherwise. An overview of other coefficients used to obtain various types of proximities can be found, e.g. in Cox and Cox (1994).

Table 9.1 Examples of the most widely used metrics for continuous variables

Metric                          Formula d(xi, xj)
Euclidean                       ( Σk (xik − xjk)² )^{1/2}
Minkowski                       ( Σk |xik − xjk|^λ )^{1/λ}
Mahalanobis                     ( (xi − xj)ᵀ S⁻¹ (xi − xj) )^{1/2}
Canberra                        Σk |xik − xjk| / (|xik| + |xjk|)
One minus Pearson correlation   1 − corr(xi, xj)

Table 9.2 Contingency table

            j = 1   j = 0
i = 1         a       c
i = 0         b       d

Table 9.3 Example of a binary data matrix

Attributes        A   B   C   D   E   F
Sex               1   0   1   0   0   1
Age               1   1   0   0   1   1
Employment        1   1   1   1   1   1
Marital status    0   0   1   0   1   1
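The simple-matching similarity of Eq. (9.2) for the patients of Table 9.3 can be computed in a few lines. The following minimal sketch (NumPy assumed; all names are illustrative) reproduces the A–B similarity of 3/4 quoted in the text.

```python
# Simple-matching similarity, Eq. (9.2), for the binary data of Table 9.3
import numpy as np

# columns: A, B, C, D, E, F; rows: sex, age, employment, marital status
X = np.array([[1, 0, 1, 0, 0, 1],
              [1, 1, 0, 0, 1, 1],
              [1, 1, 1, 1, 1, 1],
              [0, 0, 1, 0, 1, 1]])

def similarity(u, v):
    a = np.sum((u == 1) & (v == 1))      # shared 1s
    d = np.sum((u == 0) & (v == 0))      # shared 0s
    return (a + d) / len(u)              # (a + d) / (a + b + c + d)

S = np.array([[similarity(X[:, i], X[:, j]) for j in range(6)] for i in range(6)])
D = 1.0 - S                              # dissimilarity d_ij = 1 - s_ij
print(S[0, 1])                           # A vs B: 0.75, as in the text
```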

9.3 Metric Multidimensional Scaling

9.3.1 The Problem of Classical Scaling

We assume that we are given a n × p data matrix, i.e. a set of n measurements in a p-dimensional Euclidean space. It is always possible to compute a p × p (or n × n) symmetric association matrix between the variables. Now, given a matrix of distances D = (dij) between the different n points, the converse problem consists in finding the original coordinates or configuration of the points as well as the dimension of the space embedding the data. This is the original metric or classical scaling problem (Torgerson 1952; Young and Householder 1938) when the data are quantitative. We suppose that our data matrix is X = (x1, x2, ..., xn)ᵀ, where xk = (xk1, ..., xkp)ᵀ is the kth point with its p coordinates. The objective is then to find X given the distance matrix D. Note that all we need for MDS is the matrix of proximities between the units, without worrying about the measurements on individual variables.

Before we carry on, let us go back a bit to EOFs. We know that EOFs correspond to an orthogonal projection of the data from a p-dimensional Euclidean space Ep onto a lower r-dimensional subspace Er where the overall variability is maximised. Now we let dij and dij* designate respectively the distances in Ep and Er. Then it is possible to show that the subspace Er defined by taking the leading r PCs is also the space for which Σij (dij² − dij*²) is minimised. The idea in MDS is also similar. We seek a low-dimensional representation of the data such that the obtained distances give a faithful representation of the true distances or dissimilarities between units.


9.3.2 Principal Coordinate Analysis

So far we did not impose extra constraints on the distance matrix D. However, to be able to solve the problem, we need the nondegenerate¹ distance or dissimilarity matrix Δ = (δij) to satisfy the triangular metric inequality:

δij ≤ δik + δkj    (9.3)

for all i, j and k. In fact, when (9.3) is satisfied one can always find n points in a (n − 1)-dimensional Euclidean space En−1 in which the interpoint distances satisfy dij = δij for all pairs of points. For what follows, and also for simplicity, we suppose that the data have been centred by requiring that the centroid of the set of n points lie at the origin of the coordinates:

Σ_{k=1}^{n} xk = 0.    (9.4)

Note that if the dissimilarity matrix Δ does not fulfil nondegeneracy or the triangular metric inequality (9.3), then the matrix has to be processed to achieve these properties. In the sequel we assume, unless otherwise stated, that these two properties are satisfied. It is possible to represent the information contained in Δ by n points in En−1, i.e. using a n × (n − 1) matrix of coordinates. An application of PCA to this set of points can yield a lower dimensional representation in which the dissimilarity matrix Δ* minimises Σ_{i,j} (δij − δij*)². Since EOF analysis involves an eigen decomposition of a covariance-type matrix, we will express the distances δij, which we suppose Euclidean, using the n × n scalar product matrix A = XXᵀ, where the element at the ith line and jth column of A is aij = Σ_{k=1}^{p} xik xjk. Using Eq. (9.4) and recalling that δij² = Σ_{k=1}^{p} (xik − xjk)², we get

aij = −(1/2) (δij² − δi.² − δ.j² + δ..²),    (9.5)

where δi.² = δ.i² = (1/n) Σ_{j=1}^{n} δij² and δ..² = (1/n²) Σ_{i,j=1}^{n} δij².

Exercise Derive Eq. (9.5).

Hint By summing δij² = aii + ajj − 2aij over one index and then over both indexes, one gets δ.j² = δj.² = ajj + (1/n) Σ_{i=1}^{n} aii and Σ_{i=1}^{n} aii = (n/2) δ..². Hence the diagonal terms are aii = δi.² − (1/2) δ..², which yields −2aij = δij² − δi.² − δ.j² + δ..².

¹ That is, zero diagonal: δii = 0 for all i.


The expression aij = (1/2)(aii + ajj − δij²) represents the cosine law from triangles.²

Also, the matrix with elements (δij² − δi.² − δ.j² + δ..²), i, j = 1, ..., n, is known as the double-centred dissimilarity matrix. Denote by 1 = (1, 1, ..., 1)ᵀ the vector of length n containing only ones; the scalar product matrix A in (9.5) can be written in the more compact form

A = −(1/2) (In − (1/n) 1 1ᵀ) Δ2 (In − (1/n) 1 1ᵀ),    (9.6)

where Δ2 = (δij²), and In is the identity matrix.

Remark The operator (In − (1/n) 1 1ᵀ) is the projection operator onto the orthogonal complement 1⊥ of 1. Furthermore, Eq. (9.6) can be inverted to yield Δ2. From the expression of δij² shown in the previous exercise we have

Δ2 = 1 1ᵀ Diag(A) + Diag(A) 1 1ᵀ − 2A,    (9.7)

which can be regarded formally as the inverse of Eq. (9.6). By construction, A is symmetric positive semi-definite, and can be decomposed as A = U Λ² Uᵀ, where U is orthogonal. Hence the principal coordinate matrix X is given by³

X = U Λ,    (9.8)

of order n × d, if d is the rank of A. Hence we get a configuration in d dimensions, and the components (λ1 u1i, λ2 u2i, ..., λd udi) provide the coordinates of the ith point. Note in passing that the solution X in (9.8) minimises the matrix norm ‖A − XXᵀ‖.

Remark: Relation to EOFs Given a Euclidean distance matrix, the classical MDS solution is obtained analytically and corresponds to an SVD decomposition of A, i.e. A = XXᵀ. In EOFs one seeks an eigenanalysis of XᵀX. But these two methods are equivalent, see Sect. 3.3 of Chap. 3. One could anticipate that for a data matrix X, the EOFs correspond to the right singular vectors of X, and the principal coordinates correspond to the left singular vectors of X, i.e. the PCs. Although this is true in principle, it is also misleading because in EOFs we have the data matrix and we seek EOFs, whereas in MDS we have the distance matrix and we seek the data matrix. One should, however, ask: “how do we obtain EOFs from the same dissimilarity matrix?” The eigenvectors of the matrix of scalar products provide, in fact, the PCs, see Sect. 3.3 of Chap. 3. So one could say the PCs are the principal coordinates, and the EOFs are therefore given by a linear transformation of the principal coordinates.

² This is because in a triangle with vertices indexed by i, j and k, and side lengths dij, dik and djk, the angle θi at vertex i satisfies cos θi = (dij² + dik² − djk²)/(2 dij dik). Hence, if we define bijk = (dij² + dik² − djk²)/2, then bijk = dij dik cos θi, i.e. a scalar product.
³ The decomposition of a positive semi-definite symmetric matrix A as A = QQᵀ is known as the Young–Householder decomposition.
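The double-centring and eigen-decomposition steps of Eqs. (9.6) and (9.8) translate directly into code. The following is a minimal sketch of principal coordinate analysis (NumPy assumed; the function name and the random test configuration are illustrative); for a genuinely Euclidean distance matrix the interpoint distances are recovered exactly, up to rotation and reflection.

```python
# Principal coordinate analysis (classical scaling): Eqs. (9.6) and (9.8)
import numpy as np

def principal_coordinates(D, ndim=2):
    """Recover a point configuration from an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring operator I - (1/n) 1 1^T
    A = -0.5 * J @ (D ** 2) @ J                  # scalar-product matrix, Eq. (9.6)
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(vals)[::-1]               # largest eigenvalues first
    vals, vecs = vals[order], vecs[:, order]
    lam = np.sqrt(np.maximum(vals[:ndim], 0.0))
    return vecs[:, :ndim] * lam                  # X = U Lambda, Eq. (9.8)

# check: Euclidean distances of a random 2-d configuration are reproduced
rng = np.random.default_rng(1)
X = rng.standard_normal((10, 2))
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
Y = principal_coordinates(D, ndim=2)
D2 = np.sqrt(((Y[:, None, :] - Y[None, :, :]) ** 2).sum(-1))
print(np.allclose(D, D2))                        # True (up to rotation/reflection)
```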

Classical Scaling in Presence of Errors When the matrix A has zero eigenvalues the classical procedure considers only the part of the spectrum corresponding to positive eigenvalues. The existence of zero eigenvalues is implicitly assured by the double-centred structure of A since A1 = 0. A natural question arises here, which is related to the robustness of this procedure when errors are present. Sibson (1972, 1978, 1979) has investigated the effect of errors on scaling. It turns out, in particular, that classical scaling is quite robust, where observed dissimilarities remain approximately linearly related to distances (see also Mardia et al. 1979).

9.3.3 Case of Non-Euclidean Dissimilarity Matrix When the dissimilarity matrix is not Euclidean the matrix A obtained from (9.5) may cease to be positive semi-definite, hence some of its eigenvalues may be negative. The classical solution to this problem is simply to choose the first k largest (positive) eigenvalues of A, and the corresponding normalised eigenvectors, which are taken as principal coordinates. Another situation that appears in practice corresponds to the case when one is provided with similarities rather than dissimilarities. A n × n symmetric matrix C = (cij ) is a similarity matrix if cij ≤ cii for all i and j . A standard way to obtain a distance matrix D = (dij ) from the similarity matrix C is to use the transformation: dij2 = cii − 2cij + cjj .

(9.9)

It turns out that if C is positive semi-definite, then D is Euclidean, and the corresponding scalar product matrix A is given by

A = (In − (1/n) 1 1ᵀ) C (In − (1/n) 1 1ᵀ).    (9.10)

Note that the matrix A in (9.10) is positive semi-definite, and it is straightforward to show that (9.9) leads to (9.10) as in the previous case. When the matrix C is not positive semi-definite, a solution is presented in Sect. 5.3 towards the end of this chapter.


Remark: Invariance to Linear Transformations The MDS configuration is indeterminate with respect to translation, and rotation. The most famous example of MDS application is the road map where distances between cities are provided and the objective is to find a map showing the cities location. The obtained MDS solution is the same if for example we reflect the map. More generally, the MDS solution is invariant to any affine transformation of the form y = Bx + b.

9.4 Non-metric Scaling

In classical MDS the dissimilarity between pairs of objects is assumed to be linearly or quadratically related to the distances between the corresponding pairs of individual points in some geometric configuration. However, perfect reproduction of the Euclidean distance is not always the best choice, particularly when we are dealing with ordinal data. In fact, it appears that in many situations the actual numerical values of similarities/dissimilarities have little intrinsic meaning, e.g. in ordinal data. In this case the rank order of the distances between objects becomes more meaningful. The objective of non-metric or ordinal MDS is to attempt to recover the order of the proximities, and not the proximities or their linear transformation. The projection we are seeking should therefore attempt to match the rank order of the distances in the projected (quite often two-dimensional) space to the rank order of the distances in the original space. This can be achieved by introducing monotonically increasing mappings that act on the original dissimilarities, hence preserving the ordering of the dissimilarities. The method of non-metric MDS, developed by Shepard (1962a,b) and Kruskal (1964a,b), is based on the assumption that the interpoint distances dij in the projection space are monotonically related to the dissimilarities δij in the original configuration. In order to find a configuration such that the ordering of the distances dij and the dissimilarities δij are as close as possible, Shepard (1962a,b) and Kruskal (1964a,b) developed a way to measure how well the dissimilarities and distances depart from monotonicity. This measure is provided by the so-called stress function, and measures the difference between the distances dij and, not the dissimilarities, but the best monotonic transformation d̂ij of the dissimilarities δij. The stress is given by

S = ( Σ_{i<j} (dij − d̂ij)² / Σ_{i<j} dij² )^{1/2}.

…, is bounded from above, and for any even p > 0, it is bounded from below as (see e.g. Cadzow 1996)

|κy(p, q)| ≤ |κx(p, q)| for all p > q, and |κx(p, q)| ≤ |κy(p, q)| for all q > p.

(12.4)

An important particular case is obtained with p = 4 and q = 2, which yields the kurtosis κy(4, 2) = μy(4)/σy⁴ − 3. The objective is then to maximise the kurtosis since this corresponds to the case p > q in the first inequality of Eq. (12.4) above. Note that in this procedure X is supposed to be non-Gaussian. Using the sample y1, ..., yn, the kurtosis of (12.3) is then computed and is a function of the deconvolving coefficients g = (g1, ..., gK)ᵀ. The gradient of the kurtosis can be easily computed, and the problem is solved using a suitable gradient-based method (Appendix E), see e.g. Cadzow (1996), Cadzow and Li (1995), Haykin (1999) and Hyvärinen (1999) for further details.

¹ Writing (12.2) as yt = ψ(B)xt, where ψ(z) = Σk fk z^k and B is the backward shift operator, i.e. Bxt = xt−1, one gets fy(ω) = |Ψ(ω)|² fx(ω) = |ψ(e^{iω})|² fx(ω). In this expression |Ψ(ω)| is the gain of the filter, and fx() and fy() are the power spectra of xt and yt, respectively. The Fourier transform of the deconvolving filter, i.e. its frequency response function, is then given by [Ψ(ω)]⁻¹.
² Of order (p, q) of a random variable Y, κy(p, q) is defined by κy(p, q) = cy(p) |cy(q)|^(−p/q), where cy(p) is the cumulant of order p of Y, and where it is assumed that cy(q) ≠ 0. The following are examples of cumulants: cy(1) = μ (the mean), cy(2) = μ(2) = σ², cy(3) = μ(3), cy(4) = μ(4) − 3σ⁴ and cy(5) = μ(5) − 10μ(3)μ(2), where the μ's are the centred moments.
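A minimal numerical sketch of this idea is given below: the FIR deconvolution coefficients are chosen to maximise the magnitude of the excess kurtosis of the filtered output. This is only an illustration of the principle, not Cadzow's algorithm; the filter length, the Nelder–Mead optimiser and the synthetic Laplace source are all assumptions made for the example.

```python
# Kurtosis-based blind deconvolution: pick FIR coefficients g so that the
# excess kurtosis of y_t = sum_k g_k x_{t-k} is as large as possible.
import numpy as np
from scipy.optimize import minimize

def excess_kurtosis(y):
    y = y - y.mean()
    return np.mean(y**4) / np.mean(y**2)**2 - 3.0

def deconvolve(x, K=5):
    def neg_abs_kurt(g):
        y = np.convolve(x, g, mode="valid")
        return -abs(excess_kurtosis(y / y.std()))
    g0 = np.zeros(K); g0[0] = 1.0
    return minimize(neg_abs_kurt, g0, method="Nelder-Mead").x

# toy example: a spiky (non-Gaussian) source smoothed by an unknown filter
rng = np.random.default_rng(0)
s = rng.laplace(size=3000)
x = np.convolve(s, [1.0, 0.8, 0.5], mode="same")     # observed, blurred series
g = deconvolve(x)
print(excess_kurtosis(x), excess_kurtosis(np.convolve(x, g, mode="valid")))
```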

12.2.2 Blind Source Separation The previous discussion concerns univariate time series. In the multivariate case there is also a similar problem of data representation, which is historically related to ICA. In blind source separation (BSS) we suppose that we observe a multivariate time series xt = (x1t , . . . , xmt )T , t = 1, . . . , n, which is assumed to be a mixture of m source signals. This is also equivalent to saying that xt is a linear transformation of an  m-dimensional unobserved time series st = (s1t , . . . , smt )T , t = 1, . . . , n, i.e. xt = m i=1 aik sit , or in matrix form: xt = Ast ,

(12.5)

where A = (aij ) and represents the matrix of unknown mixing coefficients. The components of the vector st are supposed to be independent. The objective is then to estimate this time series st , t = 1, . . . , n, or similarly the mixing matrix A. This problem is also known in speech separation as the cocktail-party problem. Briefly, BSS is a technique that is used whenever one is in the presence of an array of m receivers recording linear mixtures of m signals. Remark An interesting solution to this problem is obtained from non-Gaussian signals st , t = 1, . . . , n. In fact, if st is Gaussian so is xt , and the solution to this problem is trivially obtained by pre-whitening xt using principal components, i.e. using only the covariance matrix of xt , t = 1, . . . , n. The solution to the BSS problem is also obtained as a linear combination of xt , t = 1, . . . , n, as presented later. Note also that the BSS problem can be analysed as an application to ICA.

12.2.3 Definition of ICA We assume that we are given an m-dimensional unobserved non-Gaussian random variable s = (S1 , . . . , Sm )T , with independent components, and we suppose instead that one observes the mixed random variable x = (X1 , . . . , Xm )T : x = As,

(12.6)


where A = (aij ) and represents the m×m mixing matrix. The objective of ICA is to estimate both the underlying independent components3 of s and the mixing matrix A. If we denote the inverse of A by W = A−1 , the independent components are obtained as a linear combination of x: u = Wx.

(12.7)

In practice one normally observes a sample time series xt , t = 1, . . . , n, and the objective is to find the linear transformation st = Wxt , such that the components st1 , . . . , stm are “as statistically independent as possible”. Because the basic requirement is the non-normality of st , ICA is sometimes considered as nonGaussian factor analysis (Hyvärinen et al. 2001). Historically, ICA originated around the early 1980s as a BSS problem in neurophysiology (see e.g. Jutten and Herault 1991), in neural networks and spectral analysis (Comon et al. 1991; Comon 1994) and later in image analysis and signal processing (Mansour and Jutten 1996; Bell and Sejnowski 1997; Hyvärinen 1999). A detailed historical review of ICA can be found in Hyvärinen et al. (2001). In atmospheric science, ICA has been applied much later, see e.g. Basak et al. (2004). Since non-Gaussianity is a basic requirement in ICA, methods used in ICA are necessarily based on higher (than two) order moments using numerical methods, see e.g. Belouchrani et al. (1997) for second-order methods in BSS. Hence, as in projection pursuit, criteria for non-Gaussianity as well as appropriate algorithms for optimisation are required to do ICA. These issues are discussed next.

12.3 Independence and Non-normality

12.3.1 Statistical Independence

The objective of ICA is that the transformation s = Wx, see Eq. (12.7), be as statistically independent as possible. If one denotes by f(s) = f(s1, ..., sm) the joint probability density of s, then independence means that f(s) may be factorised as

f(s1, ..., sm) = Π_{k=1}^{m} fk(sk),    (12.8)

where fk(sk) is the marginal density function of sk, k = 1, ..., m, given by

fk(sk) = ∫_{R^{m−1}} f(s1, ..., sm) Π_{i=1, i≠k}^{m} dsi.    (12.9)

³ Also hidden factors or variables.


It is well known that independence implies uncorrelatedness, but the converse is in general incorrect.⁴ The Gaussian random variable constitutes an exception where both properties are equivalent. Uncorrelatedness between any two components si and sj means cov(si, sj) = 0. Independence, however, yields non-correlation between g(si) and h(sj) for any⁵ functions g() and h() (Hyvärinen 1999, and references therein), as

E[g(si) h(sj)] − E[g(si)] E[h(sj)] = 0,

i.e. si and sj are nonlinearly uncorrelated for all types of nonlinearities. This is clearly difficult to satisfy in practice when using independence. We seek instead a simpler way to measure independence, and as will be seen later, this is possible using information-theoretic approaches.

12.3.2 Non-normality We have already touched upon non-normality in the previous chapter in relation to projection pursuit, which is also related to independence as detailed below. In order to have a measure of non-normality, it is useful to know the various properties of the Gaussian distribution. This distribution (see Appendix B) is completely specified by its first- and second-order moments, i.e. its mean and its covariance matrix. In statistical testing theory various ways exist to test whether a given finite sample of data comes from a multivariate normal distribution, see e.g. Mardia (1980).6 The most classical property of the Gaussian distribution, apart from its symmetry (i.e. zero skewness), is related to its fourth-order cumulant or kurtosis. If X is a zeromean Gaussian random variable, then the (excess) kurtosis

κ4(X) = E(X⁴) − 3 (E(X²))²

(12.10)

vanishes. A random variable with positive kurtosis is called super-Gaussian or leptokurtic.7 It is typically characterised by a spike at the origin and with fatter tail than the normal distribution with the same variance. A random variable with

X to be a standard normal random variable and define Y = X 2 , then X and Y are uncorrelated but not independent. 5 Measurable. 6 Who referred to the statement “Normality is a myth; there never was, and never will be a normal distribution” due to Geary (1947). As Mardia (1980) points out, however, this is an overstatement from a practical point of view, and it is important to know when √ a sample departs from normality. 7 Such as the Laplace probability density function f (x) = √1 e− 2|x| . 4 Take


negative kurtosis is known as sub-Gaussian or platykurtic.8 It is characterised by a relatively flat density function at the origin and decays fast for large values. It is clear that the squared kurtosis can serve as a measure of non-normality and can be used to find independent components from a sample of time series. Note that this is similar to blind deconvolution (Shalvi and Weinstein 1990) where the kurtosis is maximised. The serious drawback of kurtosis is that it is dominated by a few outliers. A simple alternative is to use robust measures of kurtosis based on order statistics. The only problem in this case, however, is that this measure becomes nondifferentiable, and gradient type methods cannot be used. One way forward is to smooth the empirical distribution function by fitting a smooth cumulative distribution function and then estimate the order statistics using this smoothed distribution. There are also other measures that characterise a normal random variable, such as maximising differential entropy, which is presented next.
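As a quick illustration of Eq. (12.10) as a non-normality diagnostic, the short sketch below (NumPy assumed; sample sizes and seeds are arbitrary) shows the sign of the excess kurtosis for Gaussian, Laplace (super-Gaussian) and uniform (sub-Gaussian) samples.

```python
# Excess kurtosis, Eq. (12.10), as a simple non-normality diagnostic
import numpy as np

def excess_kurtosis(x):
    x = x - np.mean(x)
    return np.mean(x**4) / np.mean(x**2)**2 - 3.0

rng = np.random.default_rng(2)
print(excess_kurtosis(rng.standard_normal(100000)))   # ~ 0    (Gaussian)
print(excess_kurtosis(rng.laplace(size=100000)))      # ~ +3   (super-Gaussian)
print(excess_kurtosis(rng.uniform(-1, 1, 100000)))    # ~ -1.2 (sub-Gaussian)
```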

12.4 Information-Theoretic Measures Most measures applied in ICA use information-theoretic quantities, which are rooted in information theory. The most important quantity involved in those measures is the differential entropy defined for continuous random variables. Entropy has already been presented in the previous chapter, and, for convenience, we give below a brief description of it.

12.4.1 Entropy

The entropy of a random variable can be defined as the average amount of information given by observing the variable. The more predictable the variable, the smaller the entropy, and vice versa. For a discrete variable U taking on the values u1, ..., uq, with respective probabilities p1, ..., pq, the (Shannon) entropy is given by

H(U) = − Σ_{k=1}^{q} pk ln pk.    (12.11)

For a continuous variable with probability density function f(u), the differential entropy or Boltzmann H-function is given by

H(U) = − ∫_R f(u) ln f(u) du.    (12.12)

⁸ Such as the uniform distribution over [−1, 1].


The entropy can also be extended naturally to the multivariate case. If u = (u1, ..., um)ᵀ is an m-dimensional random vector with probability density function f(u), the entropy of u is

H(u) = − ∫_{R^m} f(u) ln f(u) Π_{k=1}^{m} duk.    (12.13)

A particular property of the differential entropy is that it is maximised for a normal distribution among all the distributions with the same covariance structure, see Sect. 3 of the previous chapter. The entropy also enjoys another property related to independence, namely if the components u1, ..., um of u are independent, then H(u) = Σ_{k=1}^{m} H(uk). The entropy, however, is not invariant under covariance changes. More generally, if u = g(v) is an invertible transformation between the multivariate random variables u and v whose probability density functions are, respectively, fu() and fv(), then using the fact that H(u) = −E[ln fu(u)], one gets

H(u) = H(v) + E[ln |J|] = − ∫_{R^m} fv(v) ln fv(v) dv + ∫_{R^m} ln |J| fu(u) du,    (12.14)

where J is the determinant of the Jacobian of the transformation, i.e. J = det ∂u ∂v . Equation (12.14) derives from the relationship between the densities fu () and fv () of u and v, respectively,9 i.e. |J |fu (u) = fv (v). In particular, for a linear transformation u = Av, one gets H (u) = H (v) + ln |detA|.

(12.15)

The entropy gives rise to a number of important measures useful for ICA as detailed next, see also their usefulness in projection pursuit.

12.4.2 Kullback–Leibler Divergence

We consider here the set of all continuous probability density functions of m-dimensional random vectors. The Kullback–Leibler (K–L) divergence for two probability density functions f() and g() is given by

Df‖g = ∫ f(u) ln ( f(u) / g(u) ) du.    (12.16)

⁹ Writing first E[h(u)] = ∫ h(u) fu(u) du and E[h(g(v))] = ∫ h(g(v)) fv(v) dv = ∫ h(u) fv(v) |J| du, one obtains the required result.


Properties of the K–L Divergence • Df f = 0. • Df g ≥ 0 for all f () and g(). Proof The classical proof for the second property uses the so-called Gibbs inequality10 (Gibbs 1902, chap XI, theorem 2), see also Fraser and Dimitriadis (1994). Here we give another sketch of the proof using, as before, ideas from the calculus of variations. Let f () be any given probability density function, and we set the task to compute the minimum of Df g considered as a functional of g(). Let ε() be a “small perturbation” function such that [g + ε] () is still a probability density for a given probability density g(), that is, f + ε/ge0. This necessarily means that ε(u)du = 0. Given the constraint  satisfied by g(), we consider the objective function G(g) = Df g − λ(1 − g(u)du), where λ is a Lagrange multiplier. 

ε(x) ε(x) ≈ g(x) Now using the approximation ln 1 + g(x) , one gets G (g + ε) − G (g) ≈    f (x) − g(x) + λ ε(x)dx. The necessary condition of optimum yields f = −λg, i.e. f = g since f () and g() integrate to 1. It remains to show that g = f is indeed  (x) dx. For any “small” perturbation ε(), we the minimum. Let F (g) = f (x) ln fg(x) have, keeping in mind that log(1 + ε)−1 ≈ −ε + ε2 /2, F (f + ε) − F (f ) =  ε2 (x) 1 2 2 f (x) dx + o(ε ) ≥ 0. Hence g = f minimises the functional F ().

The K–L divergence is sometimes regarded as a distance between probability density functions where in reality it is not because it is not symmetric.11 The K–L divergence yields an important measure used in ICA, namely mutual information.

12.4.3 Mutual Information We have seen that if the components ofu = (u1 , . . . , um ) are independent, then the entropy of u, H (u), reduces to m i=1 H (ui ). For a general multivariate random variable u, the difference between the two quantities is precisely the mutual information:12 I (u) =

m 

H (uk ) − H (u).

(12.17)

k=1

10 Some

authors attribute this inequality to Kullback and Leibler (1951). way to make it symmetric is to consider the extended K–L divergence: D(f, g) = Df g +  = (f − g) Log fg .

11 One

Dg f 12 Bell and Sejnowski (1995) define the mutual information I

(X, Y ) between input X and output Y of a neural network by I (X, Y ) = H (Y ) − H (Y |X), where H (Y |X) is the entropy of the output that does not come from the input X. The objective in this case is to maximise I (X, Y ).


The mutual information is also known as redundancy (Bell and Sejnowski 1995), and the process of minimising I () is known as redundancy reduction (Barlow 1989). The mutual information I (u) provides a natural measure of the dependence between the components of u. In fact, I (u) ≥ 0 and that I (u) = 0 if and only if (u1 , . . . , um ) are independent. Exercise Show that I (u1 , . . . , um ) ≥ 0. Hint See below. The mutual information can also be defined using the K–L divergence. In fact, if f () is the probability density ( of u, fk () is the marginal density function of uk , k = 1, . . . , m, and f˜(u) = m k=1 fk (uk ), then  I (u) = Df f˜ =

f (u) du, f (u) ln f˜(u) 

(12.18)

and hence I (u) ≥ 0. Exercise Derive the last equation (12.18).

 ˜(u)du. The Hint We can rewrite (12.18) to yield I (u) = −H (u) − f (u) ln f(    lastterm can be expanded to yield m ln f (u )du f (u) i i i Rm−1 j =i duj = i=1 R − i H (ui ).

12.4.4 Negentropy Since the entropy is not invariant to variable changes, see Eq. (12.15), it is desirable to construct a similar function that is invariant to linear transformations. Such measure is provided by the negentropy. If u is an m-dimensional random variable with covariance matrix  and u is the multinormal random variable with the same covariance, then the negentropy of u is given by J (u) = H (u ) − H (u) .

(12.19)

  Recall that u is the closest to u and that H (u ) = 12 ln || (2π e)m . Given that the multinormal distribution has maximum entropy, among distribution with same covariance, the negentropy is zero only for a multinormal distribution. The negentropy has already been presented in projection pursuit as a measure of nonnormality, see from (12.19) one Eq. (12.34)in the previous chapter. Furthermore,  gets J (u) − i J (ui ) = i H (ui ) − H (u) + H (u ) − i H (uσ 2 ), where uσ 2 is i

the normal variable with zero mean and variance σi2 = ()ii ; hence,

i

12.4 Information-Theoretic Measures

I (u) = J (u) −

275

m 

J (ui ) +

i=1

1 ln 2

(m

2 i=1 σi

||

.

(12.20)

Note that if u1 , . . . , um are uncorrelated,  the last term vanishes. If u = Wx, then since I (u) = i H (ui ) − H (x) − ln |detW |, one gets when the covariance matrix of u is the identity matrix, i.e. WT W =  −1 x , equivalence between negentropy and mutual information: I (u) = −

m 

J (uk ) + c,

(12.21)

k=1

where c is a constant not depending on W. Equation (12.21) is very useful when optimising the objective function with respect to W.

12.4.5 Useful Approximations One of the main drawbacks of the previously defined information-theoretic measures is that they all rely on an estimate of the probability density function of the data. Probability density function estimation is a well known difficult problem particularly in high dimension. Some useful approximations have been constructed for the entropy/mutual information based on the Edgeworth expansion, see Eq. (11.15) in the preceding chapter. Hyvärinen (1998) presents a more robust approximation of the negentropy: J (y) ≈



ki [E (Gi (y)) − E (Gi (ν))]2

(12.22)

i

with ν a standard normal variable, where ki are positive constants, y is supposed to be scaled (with zero mean and unit variance) and Gi () are some non-quadratic functions. Note that Gi () cannot be quadratic, otherwise J (y) would be identically zero. Example of such functions includes (Hyvärinen 1998; Hyvärinen and Oja 2000) 2 G1 (u) = ln cosh(au), and G2 (u) = − exp − y2

(12.23)

for some 1 ≤ a ≤ 2. For the multivariate case, Hyvärinen (1998) also provides an approximation to the mutual information; involving the third- and fourth-order cumulants, namely


I(u) = c + (1/48) Σ_{i=1}^{m} [ 4 (κ3(ui))² + (κ4(ui))² + 7 (κ4(ui))⁴ − 6 (κ3(ui))² κ4(ui) ]    (12.24)

for uncorrelated ui, i = 1, ..., m, and c is a constant.
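To make the one-unit negentropy approximation of Eqs. (12.22)–(12.23) concrete, the sketch below evaluates a single-term version with G(u) = log cosh(u); the positive constant ki is set to one here since it only rescales the index, and the Monte Carlo estimate of E[G(ν)] and all names are illustrative assumptions.

```python
# One-term version of the negentropy approximation (12.22), G(u) = log cosh(u)
import numpy as np

def negentropy_approx(y, nu_samples=100000, seed=3):
    y = (y - y.mean()) / y.std()                 # Eq. (12.22) assumes scaled y
    G = lambda u: np.log(np.cosh(u))
    nu = np.random.default_rng(seed).standard_normal(nu_samples)  # N(0,1) reference
    return (G(y).mean() - G(nu).mean()) ** 2

rng = np.random.default_rng(4)
print(negentropy_approx(rng.standard_normal(50000)))   # ~ 0 for Gaussian input
print(negentropy_approx(rng.laplace(size=50000)))      # clearly positive
```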

12.5 Independent Component Estimation Given a multivariate time series of dimension m, xt , t = 1, . . . , n, obtained by mixing p independent components of unknown times series st , t = 1, . . . , n, the objective is to find the mixing matrix A and the independent components satisfying xt = Ast , or in matrix form: X = AS,

(12.25)

where S = (s1 , . . . , sn ). The usual procedure to solve this problem is to find instead a matrix W, known as matrix of filters, such that ut = Wxt , or in matrix form: U = WX

(12.26)

recovers the underlying independent components13 S. The matrix W is also known as the weight or decorrelating matrix. In the ICA model (12.6) and (12.7) the filter

matrix is obtained in such a way that the components u1 , . . . , up of U are as statistically independent as possible. A basic requirement for the ICA problem to be identifiable is that the components of st are non-Gaussian, see e.g. Hyvärinen et al. (2001). Hence, in the search process the vector ut = Wxt will have, in particular, to be maximally non-Gaussian. Various objective functions exist to obtain independent components, and these are discussed next.

12.5.1 Choice of Objective Function for ICA Unlike EOFs/PCA where the objective function is quadratic and the solution is easily obtained, in ICA various objective functions exist along with various algorithms to compute them. A great deal of objective functions is based on an information-theoretic approach, but there are also other objective functions using ideas from neural networks. Furthermore, these objective functions can be either “one unit” where each ICA component is estimated at a time or “several”/“whole units” where several/all ICA

13 Possibly

in a different order and rescaled.

12.5 Independent Component Estimation

277

components are obtained at once (Hyvärinen 1999). The one-unit case is particularly similar to projection pursuit where an index of non-normality is maximised. A particularly interesting independence index for one-unit ICA is provided by the following criteria:

Negentropy For a given direction w, the negentropy of the index y = wT X

(12.27)

is obtained using Eq. (12.19) and estimated using, for example, Eq. (12.22). The maximisation of J (y) yields the first ICA direction. Subsequent ICA directions can be computed in a similar manner after removing the effect of the previously identified directions as in PP. Alternatively, one can also estimate negentropy using a non-parametric estimation of the data probability density function discussed below.

Non-normality There is a subtle relationship between projection pursuit and ICA. This subtlety can be made clear using an argument based on the central limit theorem (CLT). The linear expression in Eq. (12.27) can be expressed as a linear combination of the (unknown) independent components using Eq. (12.6). A linear combination of these components is more Gaussian than any of the individual (non-Gaussian) components. Hence to achieve the objective, one has to maximise non-normality of the index (12.27) using, e.g. kurtosis, Eq. (12.10) or any other index of nonnormality from the preceding chapter.

Information-Theoretic Approach Information-theoretic-based approaches, presented in Sect. 12.4, can be used for both: the one-unit and multi-unit search method. The mutual information, which incorporates entropy and the Kullback–Leibler divergence, can be used particularly for the multi-unit case where all independent components are estimated simultaneously. This can be made possible owing to the approximation shown in Eq. (12.24).

Likelihood Maximisation Approach Another approach, based on maximum likelihood estimation (MLE), was presented by Pham et al. (1992) using the likelihood of model, Eq. (12.7). Using the same arguments as for variable changes in Sect. 12.4.1, e.g. Eq. (12.14), and recalling that

278

12 Independent Component Analysis

the components u1 , . . . , um of u are independent, the log-likelihood of model (12.7), based on the sample, is L=

n  t=1

ln fx (xt ) =

m n  

ln fk (wTk xt ) + n ln |detW|,

(12.28)

t=1 k=1

where wk is the kth column of WT , and fk () is the probability density function of the kth component uk of u. One major difficulty with (28) is that fu () is unknown and has to be estimated, see e.g. Pham et al. (1992). It is possible to overcome this difficulty by using non-parametric methods for probability density estimation, as discussed next.

Information Maximisation Approach A different approach to solve the ICA problem was developed by Bell and Sejnowski (1995) by maximising the input–output information (info-max) from a neural network system. The info-max approach goes as follows. An input x, with probability density f (), is passed through a (m → m) neural network system with weight matrix W and bias vector w0 and a sigmoid function g() = (g1 (), . . . , gm ()). In practice the sigmoids are taken to be identical.14 The output is then given by y = g (Wx + w0 )

(12.29)

with probability density fy (). The objective is to maximise the information transferred. This can be achieved by minimising redundancy, a generalisation of mutual information, or by maximising the output entropy. Noting y = (g1 (u1 ), . . . , gm (um )), where (u1 , . . . , um )T = Wx + w0 , and using Eq. (12.14) along( with the fact that the Jacobian of transformation (12.29) is J = d |W| m k=1 du gk (uk ), the maximisation of H (y) with respect to W is then achieved by maximising the second term in the right hand side of Eq. (12.14), i.e. using the gradient of this term with respect to W. The updating rule is then given by ) m ) ! )d ) ∂ ∂ ∂ ) ) ln |J | = ln |W| + ln W ∼ ) du gk (uk )) , ∂W ∂W ∂W

(12.30)

k=1

where we have noted u = (u1 , . . . , um )T = Wx + w0 . The first term in the right hand side of Eq. (12.30) has been calculated (see Appendix D) and is W−T . For the second term, the derivative with respect to the individual element wij of W is given by

14 For

example, tanh() or the logistic function g(u) =

eu 1+eu .

12.5 Independent Component Estimation m  k=1

279

1

∂ d gk (uk ) = d ∂wij du du gk (uk )

d2 g (u ) du2 i i xj . d du gi (ui )

(12.31)

Exercise Derive Eq. (12.31). Similarly we get the derivative with respect to the second argument w0 = d2

T gk (uk ) ln |J | du2 w1,0 , . . . , wm,0 in Eq. (12.29): ∂∂w = . The learning rule for the d k,0 du gk (uk )

network is then

W = W−T + axT w0 = a,

(12.32)

where ⎛ aT = ⎝



d2 d2 g (u ) g (u ) du2 1 1 du2 m m ⎠ ,..., d . d du g1 (u1 ) du gm (um )

For example, if the logistic function is used, then a = 1 − 2y. Note that when one has a sample x1 , . . . , xn , the function to be minimised is E [ln |J |], where the expectation of ln |J | and also the gradient, i.e. E [W] and E [w0 ], is simply a sample mean. Remark Using Eq. (12.14) along with Eq. (12.17), the mutual information I (u) is related to the entropy of the output v = g(u) from a sigmoid function g() via I (u) = −H (g(u)) + E

 m  k=1

 d gk (uk )| | du ln , fk (uk )

(12.33)

where fk () is the marginal probability density of uk . Exercise Derive the above equation (12.33).

  Hint First, use Eqs. (12.14) and ( (12.17) , I (u) = H (ui ) − H (u) = H (ui ) − d ∂u H (v) + E(ln |J |), where J = du gk (uk ), (i.e. J = | ∂v |). Next use H (ui ) = −E[ln fi (ui )]. Equation (12.33) indicates that maximum information transmission (i.e. redundancy reduction) can be achieved by optimally aligning or matching the sloping part d of the squashing (sigmoid) function with the input density function, i.e. du gk () = fk (). This means that the sigmoid is precisely the cdf of the input probability density

280

12 Independent Component Analysis

function. This coding principle applied in a neurobiological context (Laughlin 1981) has been found to maximise neuron’s information capacity.15 The same equation also points out to the fact that if uk , k = 1, . . . , m, are d gk () as respective pdf, then maximising the joint entropy is independent with du equivalent to minimising mutual information; hence, Infomax and ICA become equivalent. As pointed out by Jung et al. (2001), there is a difficulty in practice since this means that the sigmoids have to be estimated. They argue, however, that for super Gaussian independent component sources, super-Gaussian sigmoids, e.g. logistic function, give good results. Similarly, for sub-Gaussian independent component sources, a sub-Gaussian sigmoid is also recommended. Remarks It has been shown by Cardoso (1997) that Infomax is equivalent to MLE when the input cumulative distribution functions match the sigmoid functions, see also Hyvärinen (1999). The previous approach can be applied simply to just one unit to find individual independent components at a time, which results precisely in a PP problem.

A Non-parametric Approach A major difficulty in some of the previous methods is related to the estimation of the pdf. This is a well known difficult problem particularly in high dimensions. A useful and practical way to overcome this problem is to use a non-parametric approach for the pdf estimation using kernel  methods. The objective is to minimise the mutual information of y = Wx, I (y) = H (yi ) − ln |W| − H (x), see Eq. (12.15). This is equivalent to minimising F (W) =

m 

H (yk ) − ln |W| = −

k=1

m 



E ln fyk (wTk x) − ln |W|,

(12.34)

k=1

where wk , k = 1, . . . , m, are the columns of WT . Note that Eq. (12.34) is identical to the likelihood (12.28). Also the inclusion of ln |W| in (12.34) means that this term will be maximised making W full rank. It is also recommended to choose wk , k = 1, . . . , m, to be unitary, which makes the objective function bounded, as in projection pursuit. The most common way to estimate the marginal probability densities fyk () is to use the kernel smoother. Given a sample xt , t = 1, . . . , n, the kernel density estimate p() is 1  φ nh n

p(x) =

t=1

15 By



x − xt h

 ,

ensuring that all response levels are used with equal frequency.

(12.35)

12.5 Independent Component Estimation

281

where φ() is a kernel density  function  usually taken to be the standard Gaussian kernel φ(x) = (2π )−1/2 exp −x 2 /2 , and h is a smoothing parameter. Consequently, using this estimator, the minimisation of Eq. (12.34) becomes max{F (W) =

n

   wk (xt −xk ) n 1 + nLog|W|} s.t. wj = 1, Log φ k=1 k=1 nh h (12.36)

m

t=1

for j = 1, . . . , m. The advantage of this approach is that one needs only to estimate the marginal pdfs and not the joint pdf. Note that Eq. (12.36) can also be used in projection pursuit to estimate individual directions.

Other Methods Various other methods exist that deal with ICA, see e.g. Hyvärinen et al. (2001) for details. But before closing this section, it is worth mentioning a particularly interesting and easy method to use, based on a weighted covariance matrix,

see e.g. Cardoso (1989). The method is based on finding the eigenvectors of E x 2 xxT , which can be estimated by  ˆ = 1  xk 2 xk xTk , n n

k=1

where the data have been sphered prior to computing this covariance matrix. The method is based on the assumption that the kurtosis of the different components is different. The above nonlinear measures of association between variables can also be used, in a similar way to the linear covariances or Pearson correlation, to define climate networks, see Sect. 3.9 (Chap. 3). For example, mutual information (Donges et al. 2009; Barreiro et al. 2011) and transfer entropy (Runge et al. 2012) have been used to define climate networks and quantify statistical association between climate variables.

12.5.2 Numerical Implementation Sphering/Whitening Prior to any numerical procedure, in ICA it is important to preprocess the data. The most common way is to sphere the data, see Chap. 2. After centring, the sphered (or whitened) variable is obtained from the transformation z = Q (x − E(x)), such that the covariance matrix of z is the identity. For our sample covariance matrix

282

12 Independent Component Analysis

S = E2 ET , one can take for Q the inverse of the square root16 of S. i.e. Q = −1 ET . From the sample data matrix X = (x1 , . . . , xn ), the sphered data matrix Z corresponds to the standardised PC matrix. The benefit of sphering has been nicely illustrated in various places, e.g. Hyvärinen et al. (2001) and Jenssen (2000), and helps simplify calculations. For example, if the new mixing matrix B, for which z = Bs = QAs, is orthogonal, then the number of degrees of freedom is reduced from m2 to m(m − 1)/2.

Optimisation Algorithms Once the data have been sphered, the ICA problem can be solved using one of the previous objective functions. When the objective function is relatively simple, e.g. the kurtosis (12.10) in the unidimensional case, the optimisation can be achieved using any algorithm such as gradient type algorithms. For other objective functions such as those based on information-theoretic approaches, commonly used algorithms include the Infomax (Bell and Sejnowski 1995), the maximum likelihood estimation and the FastICA algorithm (Hyvärinen and Oja 2000). • Infomax The gradient of the objective function based on Infomax has already been given in Eq. (12.32). For example, when the logistic sigmoid is used, the learning rule or algorithm is given by W = W−T + [1m − 2g (Wx)] xT ,

(12.37)

whereas for the tangent hyperbolic, particularly useful for super-Gaussian independent components (Hyvärinen 1999), the learning rule becomes W−T − 2 tanh (Wx) xT . The algorithm used in Infomax is based on the (stochastic) gradient ascent of the objective function.17 • FastICA The FastICA algorithm was introduced by Hyvärinen and Oja (2000) in order to accelerate convergence, compared to some cases with Infomax using stochastic gradient. FastICA is based on a fixed point algorithm similar to the Newton iteration procedure (Hyvärinen 1999). For the one-unit case, i.e. one ICA at a time, FastICA finds directions of maximal non-normality. FastICA was developed in conjunction with information-theoretic approaches based on the negentropy approximation in (12.22), using a non-quadratic function G(), with derivative g(), and yields, after discarding the constant E (G(ν)),

16 Note

that the square root is not unique; E and EET are two square roots, and the last one is symmetric. 17 Hyvärinen (1999) points out that algorithm (12.37) can be simplified by a right multiplication by WT W to yield the relative gradient method (Cardoso and Hvam Laheld 1996).

12.5 Independent Component Estimation

283

  J (w) = E G wT xt , 2

where w is unitary. Because of sphering, one has E wT xt = wT w = 1. Using a Lagrange multiplier, the solution corresponds

to the stationary points of the extended objective function J (w) − λ w 2 − 1 , given by   F (w) = E xt g wT xt − λw = 0.

(12.38)



dg The Jacobian of F (w) is JF (w) = E xt xTt du (wT xt ) − λIm . Hyvärinen (1999)

dg T

dg uses the approximation E xt xTt du (wT xt ) ≈ E xt xTt E du (w xt ) , which is isotropic and easily invertible. Now, using the Newton algorithm method, this yields the iterative form: w∗k+1 = wk − wk+1 =

  E xt g(wTk xt ) −λwk

dg E du (wTk xt ) −λ

w∗k+1 w∗k+1 .

(12.39)

The parameter λ can using Eq. (12.38) through multiplication by  be estimated

wT , to yield λ = E wTk xt g wTk xt at each iteration. To estimate several independent components, one applies the same procedure using projected gradient, as in simplified EOFs (Chap. 4), by imposing an extra constraint of orthogonality. Note also that if the matrix W is chosen to be orthogonal, the Infomax learning rule (12.37) becomes comparable to the algorithm (12.39) written in matrix form. Hyvärinen (1999) argues that the FastICA algorithm seems to have quadratic and cubic convergence compared to the linear convergence of gradient-based algorithms. Furthermore, unlike other gradient methods, the algorithm has no step size to fix. More details on the properties of the algorithm can be found in Hyvärinen et al. (2001). • Non-parametric methods Apart from neural-based algorithms, e.g. Infomax and FastICA, there is also the gradient-based approach, which can be applied to the non-parametric estimation (12.36). In this approach the smoothing is normally a function of sample size, and its optimal value is ho = 1.06n−1/5 (Silverman 1986). The algorithm can be made more efficient by using a FFT transform to estimate where the first expression is for the evaluation of the objective function, and the second one is for gradient computation. The optimisation of (12.36) can be transformed into an unconstrained problem via the change of variable wk = pk −1 pk and using the  −T ∂ identity ∂pij (ln |W|) = P − pi −1 pij , i.e. ij   −1 ∂ ln |W| = diag p1 2 , . . . , pm 2 P + P−T , ∂P

284

12 Independent Component Analysis

which can be reported back into the gradient of (12.36). Alternatively, if W is orthogonal, the last term in (12.36) vanishes, and one can use the projected gradient algorithm as in simplified EOFs. A few other algorithms have also been developed, and the reader is directed to Hyvärinen (1999) for further details.
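The one-unit FastICA update is compact once the data have been whitened. The sketch below uses the common fixed-point form of the iteration, which follows from Eq. (12.39) after absorbing λ, with g = tanh; it is a simplified illustration rather than the full deflation scheme of Hyvärinen and Oja (2000), and the toy uniform-source example, names and seeds are assumptions.

```python
# Compact one-unit FastICA iteration on whitened data Z (n x m), g = tanh
import numpy as np

def fastica_one_unit(Z, max_iter=200, tol=1e-8, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(Z.shape[1])
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        u = Z @ w                                   # projections w^T x_t
        g, dg = np.tanh(u), 1.0 - np.tanh(u) ** 2
        w_new = (Z * g[:, None]).mean(axis=0) - dg.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:         # converged up to sign
            return w_new
        w = w_new
    return w

# example: two uniform (sub-Gaussian) sources mixed linearly, then whitened
rng = np.random.default_rng(6)
S = rng.uniform(-1, 1, size=(5000, 2))
X = S @ np.array([[2.0, 1.0], [1.0, 1.5]])
Xc = X - X.mean(axis=0)
evals, E = np.linalg.eigh(np.cov(Xc, rowvar=False))
Z = Xc @ (E / np.sqrt(evals))                       # whitening
print(fastica_one_unit(Z))                          # one recovered unmixing direction
```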

12.6 ICA via EOF Rotation and Weather and Climate Application

12.6.1 The Standard Two-Way Problem

The use of ICA in climate research is quite recent, and the number of research papers is limited. Philippon et al. (2007) employ ICA to extract independent modes of interannual and intraseasonal variability of the West African vegetation. Mori et al. (2006) applied ICA to monthly sea level pressures to find the main independent contributors to the AO signal. See also Basak et al. (2004) for an analysis of the NAO, Fodor and Kamath (2003) for an application of ICA to global temperature series and Aires et al. (2000) for an analysis of tropical sea surface temperatures.
The ICA technique has the potential to avoid the PCA “mixing problem”. PCA has the tendency to mix several modes of comparable magnitude, often generating spurious regional overlaps or teleconnections where none exists or distorting existing overlaps or teleconnections (Aires et al. 2002). There is a wide class of ICA algorithms that achieve approximate independence by optimising criteria involving higher order cumulants; for example, the JADE criterion proposed by Cardoso and Souloumiac (1993) performs joint diagonalisation of a set of fourth-order cumulant matrices. The orthomax-based criteria proposed in Kano et al. (2003) are, respectively, quadratic and linear functions of fourth-order statistics. Unlike higher order cumulant-based methods, the popular FastICA algorithm chooses a single non-quadratic smooth function (e.g. g(x) = log cosh x), such that the expectations of this function yield a robust approximation to negentropy (Hyvärinen et al. 2001). In the next section a criterion is introduced, which requires the minimisation of the sum of squared fourth-order statistics formed by covariances computed from squared components.
ICA can be interpreted in terms of EOF rotation. Aires et al. (2002) used a neural network-based approach to obtain independent components. Hannachi et al. (2009) presented a more analytical way to ICA via rotation by minimising a criterion based on the sum of squared fourth-order statistics. The optimal rotation matrix is then used to rotate the matrix of initial EOFs to enhance interpretation. The data are first pre-whitened using EOFs, and the ICA problem is then solved by rotating the matrix of the uncorrelated component scores (PCs), i.e.

Ŝ = Y T,     (12.40)


for some orthogonal matrix T. The method then uses the fact that if the components s_1, ..., s_k are independent, their squares s_1^2, ..., s_k^2 are also independent. Therefore the model covariance matrix of the squared components is diagonal. Given any orthogonal matrix V, and letting G = YV, the sample covariance matrix between the element-wise squares of G is

R = (1/(n−1)) (G ∘ G)^T H (G ∘ G),     (12.41)

where H is the centring operator, and ∘ denotes the element-wise (Hadamard) matrix product. Hannachi et al. (2009) minimised the rotation criterion:

F(V) = (1/2) ( ‖R‖_F^2 − ‖diag(R)‖_F^2 ),     (12.42)

where ‖R‖_F^2 = trace(RR^T) is the squared Frobenius norm of R. The matrix T in (12.40) satisfies F(T) = 0 and can therefore be estimated by minimising the functional F(.). The following is a simple illustrative example that compares two models of the Arctic Oscillation (AO), see Hannachi et al. (2009) for more details and further references:

Model AO1: x = A_1 s     (12.43)

and

Model AO2: x = A_2 s,     (12.44)

where x is a three-dimensional vector of the observed time series representing the SLP anomalies at the centres of action of the AO, namely, Arctic, North Pacific and North Atlantic, s is a two-element vector of independent variables and A_1 and A_2 are two mixing matrices given by

A_1 = [ 2 0 ; −1 1 ; −1 −1 ]   and   A_2 = [ 1 1 ; −1 0 ; 0 −1 ].

Assuming the components of s to be uniformly distributed over [−1, 1], Fig. 12.1 shows PC1 vs PC2 of the above models. The figure shows clearly that the two models are actually distinguishable. However, this is not the case when we compute the covariance matrices of these models, which are proportional since

A_1 A_1^T = 2 A_2 A_2^T.     (12.45)
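The following is a minimal sketch (not the code of Hannachi et al. 2009) of this illustrative example: it generates the two AO models with uniform sources, confirms that their covariance matrices are proportional, and minimises the rotation criterion (12.42) by a simple grid search over a planar rotation of the two leading whitened PCs. All names and the grid resolution are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n = 5000
s = rng.uniform(-1.0, 1.0, size=(n, 2))           # independent uniform sources
A1 = np.array([[2., 0.], [-1., 1.], [-1., -1.]])  # mixing matrix of model AO1
A2 = np.array([[1., 1.], [-1., 0.], [0., -1.]])   # mixing matrix of model AO2
x1, x2 = s @ A1.T, s @ A2.T                       # three "centre-of-action" series

# second-order statistics cannot distinguish the models: A1 A1^T = 2 A2 A2^T
print(np.allclose(A1 @ A1.T, 2 * A2 @ A2.T))      # True

def whiten(x):
    """Leading two whitened (unit-variance, uncorrelated) PC scores."""
    x = x - x.mean(axis=0)
    u, sv, vt = np.linalg.svd(x, full_matrices=False)
    return u[:, :2] * np.sqrt(x.shape[0] - 1)

def rotation_criterion(Y, theta):
    """F(V) of Eq. (12.42) for a planar rotation V(theta) of the PC scores Y."""
    c, s_ = np.cos(theta), np.sin(theta)
    V = np.array([[c, -s_], [s_, c]])
    G2 = (Y @ V) ** 2
    Rm = np.cov(G2, rowvar=False)                 # covariance of squared components
    off = Rm - np.diag(np.diag(Rm))
    return 0.5 * np.sum(off ** 2)                 # half the sum of squared off-diagonal terms

for x in (x1, x2):
    Y = whiten(x)
    thetas = np.linspace(0.0, np.pi / 2, 181)
    fvals = [rotation_criterion(Y, th) for th in thetas]
    print("optimal rotation angle:", thetas[int(np.argmin(fvals))])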



Fig. 12.1 Scatter plot of PC1 vs PC2 (a, b) and IPC1 vs IPC2 (c, d) of models AO1 (a, c) and AO2 (b, d). Adapted from Hannachi et al. (2009) ©American Meteorological Society. Used with permission

The joint density of the latent signals s_1 and s_2 is uniform on a square. In Fig. 12.1a the joint density of the scores of EOF 1 and EOF 2 for Model AO1 is uniform on a square, and PC 1 and PC 2 are independent. In Fig. 12.1b, the joint density of the scores of EOF 1 and EOF 2 cannot be expressed as the product of the marginal densities, and the components are not independent. When the above procedure is applied, the independent components are extracted successfully (Fig. 12.1c, d).
An example of application of ICA via rotation to monthly SLP from NCEP-NCAR was presented by Hannachi et al. (2009). Monthly SLP anomalies for the period Jan 1948–Nov 2006 were used. Figure 12.2 shows grid points where the time series of SLP anomalies depart significantly, at the 5% level, from normality. It is clear that non-Gaussianity is a dominant feature of the SLP anomalies. The data were first reduced by applying an EOF analysis, and ICA rotation was applied to various numbers of EOFs. Figure 12.3 shows the first five independent principal components (IPCs) obtained by rotating the first five component scores. Their cross-correlations have been checked and found to be zero to within machine precision. The cross-correlations of various nonlinear transformations of the IPCs have also been computed and compared to those obtained using the PCs.


Fig. 12.2 Grid points in the northern hemisphere where sea level pressure anomalies depart from Gaussianity at the 1% significance level using a KS test. ©American Meteorological Society. Used with permission


Fig. 12.3 Independent principal components obtained via EOF rotation of the first five PCs of the monthly mean SLP anomalies. Adapted from Hannachi et al. (2009). ©American Meteorological Society. Used with permission

The upper triangular part of the matrix in Table 12.1 shows the absolute values of the correlations between the element-wise fourth power of IPCs 1–5. The lower part shows the corresponding correlations using the PCs instead of the IPCs. Significant


Table 12.1 Correlation matrix of the fourth power elements of ICs 1 to 5 (above the diagonal) and the same correlation but for the PCs (below the diagonal). The sign of the correlations has been dropped. Bold faces and underlined values refer to significant correlations at 1% and 5% levels, respectively

            IPC1/PC1  IPC2/PC2  IPC3/PC3  IPC4/PC4  IPC5/PC5
IPC1/PC1    1         0         0.01      0         0
IPC2/PC2    0.08      1         0.02      0.02      0.01
IPC3/PC3    0         0.01      1         0         0.01
IPC4/PC4    0.01      0.01      0.08      1         0.02
IPC5/PC5    0.1       0.05      0.04      0.01      1

Table 12.2 As in Table 12.1 but using the third power of the absolute value function

            IPC1/PC1  IPC2/PC2  IPC3/PC3  IPC4/PC4  IPC5/PC5
IPC1/PC1    1         0.02      0         0.02      0.03
IPC2/PC2    0.13      1         0.04      0.05      0.03
IPC3/PC3    0.02      0.05      1         0.03      0
IPC4/PC4    0.05      0.04      0.11      1         0.05
IPC5/PC5    0.14      0.09      0.07      0         1

correlations at the 1% and 5% levels, respectively, are indicated below the diagonal. Table 12.2 is similar to Table 12.1 except that now the nonlinear transformation is the absolute value of the third power. Note again the significant correlations in the transformed PCs in the lower triangular part of Table 12.2, whereas no such significant correlations are obtained with the IPCs. Hannachi et al. (2009) computed those same correlations using various other nonlinear functions and found results consistent with Tables 12.1 and 12.2.
Note that the IPCs also reflect large non-normality, as can be seen from the skewness of the IPCs. For example, Fig. 12.4 shows the q–q plots of all SLP IPCs. The straight diagonal lines in Fig. 12.4 are for the normal distribution, and any departure from these lines reflects non-normality. Clearly, all the q–q plots display strong nonlinearity, i.e. non-normality. A formal KS test reveals that the null hypothesis of normality is rejected for the first three IPCs at the 1% significance level and for the last two IPCs at the 5% level. The non-normality of the PCs has also been checked and compared with that of the IPCs using the q–q plot (not shown). It is found that the IPCs are more non-normal than the PCs.
The spatial patterns associated with the IPCs are shown in Fig. 12.5. In principle, the rotated EOFs have no natural ranking, but the order of the rotated EOFs in Fig. 12.5 is based on the amount of variance explained by those patterns. The first REOF looks like the Arctic Oscillation (AO) pattern,18 and the second REOF represents the NAO. The fourth pattern, for example, is reminiscent of the Scandinavian teleconnection pattern. Figure 12.6 shows, respectively, the correlation map between

18 Note, however, that the middle centre of action is displaced from the pole and shifted towards northern Russia.



Fig. 12.4 Quantile plots on individual IPCs of the monthly SLP anomalies. Adapted from Hannachi et al. (2009). ©American Meteorological Society. Used with permission

Fig. 12.5 Spatial patterns associated with the leading five IPCs. The order is arbitrary. Adapted from Hannachi et al. (2009). ©American Meteorological Society. Used with permission


Fig. 12.6 Correlation map of monthly SLP anomaly IPC1 (top) and IPC2 (bottom) with HadISST monthly SST anomaly for the period January 1948–November 2006. Only significant values, at 1% level, are shown. Correlations are multiplied by 100. Adapted from Hannachi et al. (2009). ©American Meteorological Society. Used with permission

SLP IPC1 (Fig. 12.6, top) and SLP IPC2 (Fig. 12.6, bottom) with the Hadley Centre Sea Ice and Sea Surface Temperature (HadISST) data set. It can be seen, in particular, that the correlation pattern with IPC1 reflects the North Pacific oscillation, whereas the correlation with IPC2 reflects well the NAO pattern. The same rotation was also applied in the context of exploratory factor analysis by Unkel et al. (2010), see Chap. 10.


12.6.2 Extension to the Three-Way Data

The above independent component analysis applied to two-way data was extended to three-way climate data by Unkel et al. (2011). Three-way data consist of data that are indexed by three indices, such as time, horizontal space and vertical levels. For this four-dimensional case (three spatial dimensions plus time), with measurements on J horizontal grid points at K vertical levels for a sample size n, the data are represented by a third-order tensor:

X = (x_jtk),   j = 1, ..., J, t = 1, ..., n, k = 1, ..., K.     (12.46)

A standard model for the three-way data, Eq. (12.46), is the three-mode Parafac model with R components (e.g. Caroll and Chang 1970; Harshman 1970):

x_jtk = Σ_{r=1}^{R} a_jr b_tr c_kr + ε_jtk,   j = 1, ..., J, t = 1, ..., n, k = 1, ..., K.     (12.47)

In Eq. (12.47) A = (a_jr), B = (b_tr) and C = (c_kr) are the component matrices (or modes) of the model, and ε = (ε_jtk) is an error term. A slight modification of the Parafac model, Eq. (12.47), was used by Unkel et al. (2011), based on the Tucker3 model (Tucker 1966), which yields the following model:

X_A = A (C |⊗| B)^T + E_A,     (12.48)

where the J × (nK) matrices X_A and E_A are reshaped versions of X = (x_jtk) and ε = (ε_jtk), obtained from the K frontal slices of those tensors, and |⊗| is the column-wise Kronecker (or Khatri–Rao) matrix product. The matrices A, B and C are obtained by minimising the cost function

F = ‖ X_A − A A^T X_A ( C C^T ⊗ B B^T ) ‖_F^2,     (12.49)

where ⊗ is the Kronecker matrix product (see Appendix D), by using an alternating least squares optimisation algorithm. The estimate Â of A is then rotated towards independence based on the algorithm of the two-way method discussed above, see Eq. (12.41). Unkel et al. (2011) applied the method to the NCEP/NCAR geopotential heights by using 5 components in modes A and C and 6 components in B. The data represent winter (Nov–Mar) means for the period 1949–2009 on a 2.5° × 2.5° horizontal grid, north of 20°N, sampled over 17 vertical levels, so J = 4176, n = 61 and K = 17. Figure 12.7 shows the rotated patterns. Patterns (i) and (iii) suggest modes related to stratospheric activity, showing two different phases of the winter polar vortex during sudden stratospheric warming (e.g. Hannachi and Iqbal 2019). Pattern (ii)


Fig. 12.7 Rotated three-way independent component analysis of the winter means geopotential heights from NCEP/NCAR. The order is arbitrary. Adapted from Unkel et al. (2011)


is dominated by the Scandinavian pattern, prominent in mid-troposphere, while the last mode represents the North Atlantic Oscillation, a prominent pattern in the lower and mid-troposphere.
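To make the matricisation in Eqs. (12.47)–(12.48) concrete, the following small sketch (toy dimensions, not the geopotential height data of Unkel et al. 2011) builds the column-wise Kronecker (Khatri–Rao) product and checks the unfolded reconstruction against the element-wise Parafac formula; all names and sizes are illustrative.

import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker (Khatri-Rao) product used in Eq. (12.48)."""
    R = C.shape[1]
    return np.vstack([np.kron(C[:, r], B[:, r]) for r in range(R)]).T

# toy dimensions: J grid points, n times, K levels, R components
J, n, K, R = 50, 40, 5, 3
rng = np.random.default_rng(1)
A = rng.standard_normal((J, R))
B = rng.standard_normal((n, R))
C = rng.standard_normal((K, R))

# noise-free Parafac reconstruction in matricised form, X_A = A (C |x| B)^T
X_A = A @ khatri_rao(C, B).T          # J x (nK), frontal slice k in columns k*n ... (k+1)*n - 1

# check one entry against the element-wise formula x_jtk = sum_r a_jr b_tr c_kr
j, t, k = 3, 7, 2
assert np.isclose(X_A[j, k * n + t], np.sum(A[j] * B[t] * C[k]))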

12.7 ICA Generalisation: Independent Subspace Analysis

Like projection pursuit, conventional ICA is a one-dimensional algorithm; that is, the algorithm attempts to identify unidimensional (statistically) independent components. Given that the climate system is characterised by complex nonlinear interactions between many processes, the concept of (conventional) ICA may not be realistic. A more reasonable approach is to relax the ICA assumption in favour of looking for groups of independent sources, leading to the concept of independent subspace analysis (Pires and Ribeiro 2017; Pires and Hannachi 2017). In Pires and Hannachi (2017), for example, high-order cumulant tensors of coskewness and cokurtosis, given, respectively, by

S(y)_ijk = E(Y_i Y_j Y_k)
K(y)_ijkl = E(Y_i Y_j Y_k Y_l) − E(Y_i Y_j) E(Y_k Y_l) − E(Y_i Y_k) E(Y_j Y_l) − E(Y_i Y_l) E(Y_j Y_k),     (12.50)

for i, j, k, l in {1, ..., m}, where m is the dimension of the random vector y = (Y_1, ..., Y_m)^T, are analysed via the Tucker decomposition or high-order singular value decomposition (Tucker 1966; Lathauwer et al. 2000). A generalised mutual information was then constructed and minimised. To overcome the handicaps of serial correlation, sample size and high dimensionality, statistical tests were used to obtain robust non-Gaussian subspaces, which are taken to represent independent subspaces.
Pires and Hannachi (2017) applied the methodology to global sea surface temperature (SST) anomalies and identified a number of independent components and independent subspaces. In particular, the second component IC2 was found to project onto the Atlantic Niño and the Atlantic Multidecadal Oscillation (AMO) and was significantly separated from the remaining four components. These latter components were not entirely independent and are dominated primarily by the dyad (IC1 and IC5). The pattern IC1 represents El-Niño conditions combined with negative Pacific Decadal Oscillation (PDO) and the positive phase of the North Pacific Gyre Oscillation (NPGO). The pattern IC5 has a positive sign when SST anomalies in the South Pacific have a zonal dipole. The patterns associated with IC1 and IC5 share overlapping projection onto the NPGO.
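The cumulant tensors of Eq. (12.50) are straightforward to compute by brute force for moderate m. The sketch below assumes centred data and uses plain einsum contractions (memory grows as m^3 and m^4, so it is only meant for low-dimensional subspaces); function names are illustrative.

import numpy as np

def coskewness(Y):
    """Third-order cumulant tensor S_ijk = E[Y_i Y_j Y_k] for centred data Y (n x m)."""
    Yc = Y - Y.mean(axis=0)
    return np.einsum('ti,tj,tk->ijk', Yc, Yc, Yc) / Yc.shape[0]

def cokurtosis(Y):
    """Fourth-order cumulant tensor of Eq. (12.50) for centred data Y (n x m)."""
    Yc = Y - Y.mean(axis=0)
    n = Yc.shape[0]
    M4 = np.einsum('ti,tj,tk,tl->ijkl', Yc, Yc, Yc, Yc) / n   # raw fourth moments
    C = Yc.T @ Yc / n                                          # covariance E[Y_i Y_j]
    return (M4
            - np.einsum('ij,kl->ijkl', C, C)
            - np.einsum('ik,jl->ijkl', C, C)
            - np.einsum('il,jk->ijkl', C, C))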

Chapter 13

Kernel EOFs

Abstract This chapter describes a different way to obtain nonlinear EOFs via kernel EOFs based on kernel methods. The kernel EOF method is based on mapping the data onto a feature space and helps delineate complex structures. The chapter discusses various types of transformations to obtain kernel EOFs, with particular focus on the Gaussian kernel and its application to data from models and reanalyses. Keywords Discriminant analysis · Feature space · Kernel EOFs · Kernel trick · Reproducing kernels · Gaussian kernel · Spectral clustering · Modularity clustering · Quasi-geostrophic model · Flow tendency · Bimodality · Zonal flow · Blocking

13.1 Background

It has been suggested that the large scale atmospheric flow lies on a nonlinear manifold due to the nonlinearity involved in the dynamics of the system. Nevertheless, it is always possible to embed this manifold in a high-dimensional linear space. The system may have two or more substructures, e.g. attractors, that one would like to identify and separate. This problem belongs to the field of pattern recognition. Linear spaces have the characteristic that “linear” patterns can be, in general, identified efficiently using, for example, linear discriminant analysis or LDA (e.g. McLachlan 2004). In general, patterns are characterised by an equation of the form:

f(x) = 0.     (13.1)

For linear patterns, the function f(.) is normally linear up to a constant. Figure 13.1 illustrates a simple example of discrimination where patterns are obtained as the solution of f(x) = <w, x> + b = 0, where <.,.> denotes a scalar product in the linear space. Given, however, the complex and nonlinear nature involved in weather and climate, one expects in general the presence of different forms of nonlinearity and



Fig. 13.1 Illustration of linear discriminant analysis in linear spaces permitting a separation between two sets or groups shown, respectively, by crosses and circles

where patterns or coherent structures in the input space are not easily separable. The possibility then arises of attempting to embed the data of the system into another space we label the “feature space”, where complex relationships can be simplified and discrimination becomes easier and more efficient, e.g. linear, through the application of a nonlinear transformation φ(.). Figure 13.2 shows an example sketch of the situation where the (nonlinear) manifold separating two groups in the input space becomes linear in the feature space. Consider, as an example, and for the sake of illustration, the case where the input space contains data that are “polynomially” nonlinear. Then, it is possible that a transformation into the feature space involving separately all monomials constituting the nonlinear polynomial would lead to a simplification of the initial complex relationship. For example, if the input space (x_1, x_2) is two-dimensional and contains quadratic nonlinearities, then the five-dimensional space obtained by considering all monomials of degree smaller than 3, i.e. (x_1, x_2, x_1^2, x_2^2, x_1 x_2), would dismantle the initial complex relationships. Figure 13.2 could be an (ideal) example of polynomial nonlinearity where the feature becomes a linear combination of the coordinates and linear discriminant analysis, for example, could be applied. In general, however, the separation will not be that trivial, and the hypersurface separating the different groups could be nonlinear, but the separation remains feasible. An illustration of this case is sketched in Fig. 13.3. This could be the case, for instance, when the nonlinearity is not polynomial, as will be detailed later. It is therefore the objective


Fig. 13.2 Illustration of the effect of the nonlinear transformation φ(.) from the input space (left) into the feature space (right) for a trivial separation between groups/clusters in the feature space. Adapted from Hannachi and Iqbal (2019). ©American Meteorological Society. Used with permission


Fig. 13.3 As in Fig. 13.2 but for a curved separation between groups or clusters in the feature space

of the kernel EOF method (or kernel PCA) to find an embedding or a transformation that allows such easier discrimination or pattern identification to be revealed. The kernel method can also be extended to include other related approaches such as principal oscillation patterns (POPs), see Chap. 6, or maximum covariance analysis (MCA), see Sect. 14.3. Kernel PCA is not very well known in atmospheric science; it was first applied in the field by Richman and Adrianto (2010). These authors applied kernel EOFs to


classify sea level pressure over North America and Europe. They identified two clusters for each domain in January and also in July. Their study suggests that kernel PCA captures the essence of data more accurately compared to conventional PCA.

13.2 Kernel EOFs

13.2.1 Formulation of Kernel EOFs

Let x_t, t = 1, ..., n, be a sample of a p-dimensional time series with zero mean and covariance matrix S. The kernel EOF method (Schölkopf et al. 1998) is a type of nonlinear EOF analysis. It is based on an EOF analysis in a different space, the feature space, obtained using a (nonlinear) transformation φ(.) from the p-dimensional input data space into a generally higher dimensional space F. This transformation associates to every element x from the input data space an element ξ = φ(x) from the feature space. Hence, if φ(x_t) = ξ_t, t = 1, ..., n, the kernel EOFs are obtained as the EOFs of this newly transformed data. The covariance matrix in the feature space F is computed in the usual way and is given by

S = (1/n) Σ_{t=1}^{n} ξ_t ξ_t^T = (1/n) Σ_{t=1}^{n} φ(x_t) φ(x_t)^T.     (13.2)

Now, as for the covariance matrix S in the input space, an eigenvector v of S with eigenvalue λ satisfies

n λ v = Σ_{t=1}^{n} φ(x_t) ( φ(x_t)^T v ),     (13.3)

and therefore, any eigenvector must lie in the subspace spanned by φ(x_t), t = 1, ..., n. Denoting by α_t = φ(x_t)^T v, and K = (K_ij), where K_ij = ξ_i^T ξ_j = φ(x_i)^T φ(x_j), an eigenvector v of S, with non-zero eigenvalue λ, is found by solving the following eigenvalue problem:

K α = n λ α.     (13.4)

In fact, letting α = (α_1, ..., α_n)^T and assuming it satisfies (13.4), and letting v = Σ_{s=1}^{n} α_s φ(x_s), it is straightforward to see that

S v = (1/n) Σ_{t=1}^{n} ξ_t ξ_t^T ( Σ_{s=1}^{n} ξ_s α_s ) = (1/n) Σ_{t=1}^{n} ξ_t ( Σ_{s=1}^{n} K_ts α_s ),

which is precisely (1/n) Σ_{t=1}^{n} n λ ξ_t α_t = λ v.


Eqs. (13.7)–(13.9) then yield the expansion

K(x, y) = Σ_{k=1}^{∞} λ_k φ_k(x) φ_k(y),     (13.10)

where λ_k ≥ 0 and φ_k(.), k = 1, 2, ..., are, respectively, the eigenvalues and corresponding eigenfunctions of the integral operator¹ K(.) given in Eq. (13.7) and satisfying the eigenvalue problem (13.9). Such kernels are known as reproducing kernels. Note that the eigenfunctions are orthonormal with respect to the scalar product (13.8).
Remark The above result given in Eq. (13.10) is also known in the literature of functional analysis as the spectral decomposition theorem.
Combining Eqs. (13.6) and (13.10), it is possible to define the mapping φ(.) by

φ(x) = ( √λ_1 φ_1(x), √λ_2 φ_2(x), ... )^T.     (13.11)

We can further simplify the problem of generating kernels by taking

K(x, y) = K_1^T(x) K_1(y),     (13.12)

where K_1 is any arbitrary integral operator. For example, if K_1(.) is self-adjoint, then K_1^2(.) is reproducing. There are two main classes of reproducing kernels: those with finite-dimensional feature spaces, such as polynomial kernels, and those with infinite-dimensional feature spaces, involving non-polynomial kernels. A number of examples are given below for both types of kernels. These different kernels have different performances, which are discussed below in connection with an example of concentric clusters, see Sect. 13.2.3.

Examples of Kernels
• Consider the case of the three-dimensional input space, i.e. p = 3, and K(x, y) = (x^T y)^2; then the transformation φ(.) is given by φ(x) = ( x_1^2, x_2^2, x_3^2, √2 x_1 x_2, √2 x_1 x_3, √2 x_2 x_3 )^T. The dimension of φ(x) is p(p + 1)/2 = 6.
• The previous example can be extended to a polynomial of degree q by considering K(x, y) = (x^T y)^q. In this case the dimension of φ(x) (see above) is of the order O(p^q).

• The Gaussian kernel: K(x, y) = exp( −‖x − y‖^2 / (2σ^2) ).
• Other examples include K(x, y) = exp( x^T y / (2a^2) ) and K(x, y) = tanh( α x^T y + β ).

¹ Note that the convergence in (13.10) is in the space of square integrable functions, i.e. lim_{k→∞} ∫∫ | K(x, y) − Σ_{i=1}^{k} λ_i φ_i(x) φ_i(y) |^2 dx dy = 0.

In these examples the vectors φ(x) are infinite-dimensional. One of the main benefits of using the kernel transformation is that the kernel K(.) avoids computing the large (and possibly infinite) covariance matrix S, see Eq. (13.2), and permits instead the computation of the n × n matrix K = (K_ij) and the associated eigenvalues/eigenvectors λ_k and α_k, k = 1, ..., n. The kth extracted PC of a target point x from the input data space is then obtained using the expression:

v_k^T φ(x) = Σ_{t=1}^{n} α_{k,t} K(x_t, x).     (13.13)

This means that the computation is not performed in the very high-dimensional feature space, but rather in the lower n-dimensional space spanned by the images of xt , t = 1, . . . , n. Remark One sees that in ordinary PCA, one obtains min(n, p) patterns, whereas in kernel PCA one can obtain up to max(n, p) patterns.

13.2.2 Practical Details of Kernel EOF Computation

The computation of kernel EOFs/PCs within the feature space is based on the assumption that the transformed data φ(x_1), ..., φ(x_n) have zero mean, i.e.

(1/n) Σ_{t=1}^{n} φ(x_t) = 0,     (13.14)

so that S represents genuinely a covariance matrix. In general, however, the mapping φ(.) is rarely explicitly accessible, and therefore the zero-mean property may not be verified directly on the images within the feature space. Since we are using the kernel K(.) to form the kernel matrix K and compute the kernel EOFs, we go back to the original expression of the kernel matrix K = (K_ij), see Eq. (13.6), i.e. K_ij = φ(x_i)^T φ(x_j), and use φ(x_t) − φ̄, where φ̄ = (1/n) Σ_{t=1}^{n} φ(x_t). The expression of the centred matrix becomes

K̃_ij = ( φ(x_i) − φ̄ )^T ( φ(x_j) − φ̄ ).     (13.15)

This yields the (centred) Gram matrix:


K̃ = K − (1/n) 1_{n×n} K − (1/n) K 1_{n×n} + (1/n^2) 1_{n×n} K 1_{n×n},     (13.16)

where 1_{n×n} is the n × n matrix containing only ones.
Exercise Derive the above expression of the centred Gram matrix.
Indication Expanding (13.15) yields K̃_ij = K_ij − (1/n) Σ_{t=1}^{n} K_tj − (1/n) Σ_{t=1}^{n} K_it + (1/n^2) Σ_{t,s=1}^{n} K_ts. In matrix form this expression is identical to the one given in Eq. (13.16).
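A minimal numerical sketch of the above computation is given below, assuming a Gaussian kernel; the function names and the normalisation of the eigenvectors (so that the feature-space EOFs have unit norm) are illustrative choices rather than a reference implementation.

import numpy as np

def gaussian_kernel(X, Y, sigma2):
    """Gaussian kernel matrix K_ij = exp(-||x_i - y_j||^2 / (2 sigma2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma2))

def kernel_eofs(X, sigma2, n_comp=2):
    """Kernel EOFs via the centred Gram matrix, Eqs. (13.4) and (13.16)."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma2)
    one = np.ones((n, n)) / n
    Kc = K - one @ K - K @ one + one @ K @ one      # double centring, Eq. (13.16)
    evals, evecs = np.linalg.eigh(Kc)               # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]
    alphas = evecs[:, :n_comp]
    # scale alphas so that the feature-space EOFs v_k = sum_t alpha_kt phi~(x_t) have unit norm
    alphas = alphas / np.sqrt(np.maximum(evals[:n_comp], 1e-12))
    kpcs = Kc @ alphas                               # kernel PCs of the sample, cf. Eq. (13.13)
    return alphas, evals[:n_comp] / n, kpcs          # coefficients, eigenvalues of S, kernel PCs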

13.2.3 Illustration with Concentric Clusters

We mentioned above that different kernels have different performances. Consider, for example, polynomial kernels. One main weakness of these kernels is the lack of prior knowledge of the dimension of the feature space. And trying different dimensions is not practical, as the dimension increases as a power law of the degree of the polynomial. Another weakness is that polynomial kernels are not localised, which can be problematic given the importance of locality and neighbourhood in nonlinear dynamics. One way to overcome some of these weaknesses is to use kernels with localised structures, such as kernels with compact support. A typical localised kernel is the Gaussian kernel. Despite not having truly compact support, the Gaussian kernel is well localised because of its exponential decay, a nice and very useful feature. Furthermore, compared to other kernels (e.g. the tangent hyperbolic used in neural networks), the Gaussian kernel has only one single parameter σ. Another point to mention is that polynomial kernels are normally obtained by constructing polynomials of scalar products x^T y, which is also the case for the tangent hyperbolic and similar functions like cumulative distribution functions used in neural networks. Within this context, another advantage of the Gaussian kernel over the above kernels is that it uses distances ‖x − y‖^2, which translate into (local) tendencies when applied to multivariate time series in weather and climate, as is illustrated later. The Gaussian kernel is used nearly everywhere, such as in PDF estimation, neural networks, spectral clustering, etc. Note also that there are nonlinear structures that cannot be resolved using, for example, polynomial kernels, as shown next.
Consider the example of three concentric clusters. Two clusters are distributed on concentric spheres of radii 50 and 30, respectively, and the third one is a spherical cluster of radius 10. Figure 13.4a shows the data projected onto the first two coordinates. Note that the outer two clusters are distributed near the surface of the associated spheres. The conventional PC analysis (Fig. 13.4b) does not help as no projective technique can dismantle these clusters. The kernel PC analysis with a Gaussian kernel, however, is able to discriminate these clusters


Fig. 13.4 First two coordinates of a scatter plot of three concentric clusters (a), the same scatter projected onto the leading two PCs (b), and the leading two (Gaussian) kernel PCs (c), and the data PDF within the leading two kernel PCs (d). Adapted from Hannachi and Iqbal (2019). ©American Meteorological Society. Used with permission

(Fig. 13.4c). Figure 13.4c is obtained with σ 2 = 2.3, but the structure is quite robust to a wide range of the smoothing parameters σ . We note here that non-local kernels cannot discriminate between these structures, pointing to the importance of the local character of the Gaussian kernel in this regard. The kernel PDF of the data within the space spanned by kernel PC1 and kernel PC2 is shown in Fig. 13.4d. Note, in particular, the curved shape of the two outer clusters (Figs. 13.4c,d), reflecting the nonlinearity of the corresponding manifold. To compare the above result with the performance of polynomial kernels, we now apply the same procedure to the above concentric clusters using two polynomial kernels of respective degrees 4 and 9. Figure 13.5 shows the obtained results. The polynomial kernels fail drastically to identify the clusters. Instead, it seems that the polynomial kernels attempt to project the data onto the outer sphere resulting in the clusters being confounded. Higher polynomial degrees have also been applied with no improvement.
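A usage sketch of the concentric-cluster experiment is given below, reusing the hypothetical kernel_eofs helper from Sect. 13.2.2. The cluster sizes, noise level and bandwidth are illustrative; the value σ² = 2.3 quoted in the text applies to the authors' own data scaling, so the bandwidth generally needs tuning.

import numpy as np

rng = np.random.default_rng(2)

def sphere_shell(n, radius, noise=1.0):
    """Points scattered near the surface of a sphere of the given radius."""
    v = rng.standard_normal((n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return radius * v + noise * rng.standard_normal((n, 3))

X = np.vstack([sphere_shell(200, 50.0),
               sphere_shell(200, 30.0),
               5.0 * rng.standard_normal((200, 3))])   # inner spherical cluster

X = (X - X.mean(axis=0)) / X.std(axis=0)               # simple standardisation
alphas, lams, kpcs = kernel_eofs(X, sigma2=1.0, n_comp=2)
# a scatter plot of kpcs[:, 0] vs kpcs[:, 1] separates the three clusters,
# in the spirit of Fig. 13.4c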


Fig. 13.5 Same as Fig. 13.4c, but with polynomial kernels of degree 4 (a) and 9 (b). Adapted from Hannachi and Iqbal (2019). ©American Meteorological Society. Used with permission

13.3 Relation to Other Approaches

13.3.1 Spectral Clustering

The kernel EOF analysis is closely related to what is known as spectral clustering encountered, for example, in pattern recognition. Spectral clustering takes roots from the spectral analysis of linear operators through the construction of a set of orthonormal bases used to decompose the spatio-temporal field at hand. This basis is provided precisely by a set of orthonormal eigenfunctions of the Laplace–Beltrami differential operator, using the diffusion map algorithm (e.g. Coifman and Lafon 2006; Belkin and Niyogi 2003; Nadler et al. 2006). Spectral clustering (e.g. Shi and Malik 2000) is based on using a similarity matrix S = (S_ij), where S_ij is the similarity, such as spatial correlation, between states x_i and x_j. There are various versions of spectral clustering, such as the normalised and non-normalised ones (Shi and Malik 2000). The normalised version of spectral clustering considers the (normalised) Laplacian matrix:

L = I − D^{−1/2} S D^{−1/2},     (13.17)

where D = diag(d_1, ..., d_n) with d_k = Σ_{j=1}^{n} S_kj. This matrix is known as the degree matrix in graph theory (see below). The wisdom here is that when the clusters are well separated (or independent), then L will have, after some rearrangement, a block diagonal form, i.e.

L = diag(L_1, ..., L_k).     (13.18)

Furthermore, it can be shown that L has k zero eigenvalues (Luxburg 2007).


Remark The non-normalised Laplacian corresponds to the matrix M = D − S instead of L.
The eigenvectors u_1, ..., u_k associated with the k smallest eigenvalues of L are

U = (u_1, ..., u_k) = D^{1/2} E,     (13.19)

with E the n × k block matrix E = diag(e_1, ..., e_k), where e_1, ..., e_k are vectors of ones with different lengths. The procedure of cluster identification (Shi and Malik 2000) then consists of using any clustering procedure, e.g. k-means, applied to the n rows of U, grouping them into k clusters. Now, by considering an isotropic similarity matrix S = (S_ij), e.g. based on the Gaussian kernel, the similarity matrix S is exactly the kernel matrix K. The eigenvalue problem associated with the (normalised) matrix L in (13.17), or similarly the non-normalised matrix M, then becomes equivalent to the eigenvalue problem of the centred matrix K̃.

13.3.2 Modularity Clustering The problem of cluster identification, as discussed above in spectral clustering, can also be viewed from a graph theory perspective. Graph theory is a theoretical framework used to study and investigate complex networks. Here, each data point xi in state space is considered as a vertex of some network, and edges between pairs of vertices become substitutes of similarities obtained precisely from the adjacency matrix A = (aij ) (see Chap. 3, Sect. 9.4). Each vertex i has a degree ki , defined as the number of vertices to which it is connected (number of edges linked to it) or the  number of adjacent vertices. It is given by ki = j aij .

306

13 Kernel EOFs

The idea of community or group identification was suggested by Newman and Girvan (2004) and Newman (2006). The procedure starts with a division of the network into two groups determined by a vector s containing only +1 and −1 depending on whether the vertex belongs to groups 1 and 2, respectively. The number of edges aij between vertices i and j is then compared to that ki kj obtained when edgesare placed randomly, 2m , where m is the total number of edges (m = 0.5 k ). Newman and Girvan (2004) defined the modularity  i   ki kj Q = ij aij − 2m si sj , that is,   1 T kk s, Q = s Bs = s A − 2m T

T

1 where B = A − 2m kkT is the modularity matrix. The modularity is a measure reflecting the extent to which nodes in a graph are connected to those of their own groups, reflecting thus the presence of clusters known as communities. The objective then is to maximise the modularity. The symmetric modularity matrix has the vector 1 (containing only ones) with associated zero eigenvalue, as in the Laplacian matrix in spectral  clustering. By expanding the vector s in terms of the eigenvectors uj of B, s = (sT uj )uj , the modularity reduces to

Q=

 (uTj s)2 λj

with λj , being the eigenvalues of B. It is clear that maximisation of Q requires s to be proportional to the leading eigenvector u1 yielding s = sign(u1 ), providing, hence, a clustering algorithm into two groups, given λ1 > 0. It is immediate to see that if the eigenvalues are non-positive, λj ≤ 0, the leading eigenvector is 1, meaning that the data cannot be clustered. Newman (2006) extended this to divide the network into more than two groups, by introducing the contribution Q due to dividing a group G (with nG elements or nodes) into two subgroups: Q = sT B(G) s, (G)

with the elements of the new nG × nG modularity matrix B(G) , given by Bij =  Bij − δij k in G Bik , with δij being the Kronecker delta. The procedure is then repeated by maximising the modularity using Q.

13.4 Pre-images in Kernel PCA The construction of the feature space and the computation of the EOF patterns and associated PCs within it are a big leap in the analysis of the system producing the data. What we need next is, of course, to be able to examine the associated

13.4 Pre-images in Kernel PCA

307

patterns in the input space. In many cases, therefore, one needs to be able to map back, i.e. perform an “inverse mapping”, onto the input space. The inverse mapping terminology is taken here in a rather broad sense not literally as the mapping may not be one-to-one. Let us first examine the case of the standard (linear) EOF analysis. In standard PCA the data xt is expanded as xt =

p  p    xTt .uk uk = ctk uk , k=1

(13.20)

k=1

where uk is the kth EOF and ctk is the kth PC of xt . Since this is basically a linear projection, given a point x in the EOF space for which only the l leading coordinates β1 , . . . , βl , (l ≤ p) are observed, the pre-image is simply obtained as x∗ =

l 

βk uk .

(13.21)

k=1

The above expression minimises x − x∗ 2 , i.e. the quadratic distance to the exact point. Of course, if all the EOFs are used, then the distance is zero and the pre-image is exact (see Chap. 3). Now, because the mapping may not be invertible,2 the reconstruction of the patterns back in the p-dimensional input data space can only be achieved approximately (or numerically). Let us assume again a nonlinear transformation φ(.) mapping the input (physical) space onto the (high-dimensional) feature space F, where the covariance matrix is obtained using Eq. (13.2). Let vk be the kth EOF, within the feature space, with associated eigenvalue λk , i.e. Svk = λk vk .

(13.22)

Like for the standard (linear) EOF analysis, and as pointed out above, the EOFs are combination of the data; that is, they lie within the space spanned by φ(x1 ), . . . , φ(xn ), e.g. vk =

n 

akt φ(xt ).

(13.23)

t=1

It can be seen, after inserting (13.23) back into (13.22), see also Eq. (13.3), that the vector ak = (ak1 , . . . , akn )T is an eigenvector of the matrix K = (Kij ), where Kij = K(xi , xj ) = φ(xi )T φ(xj ), i.e. Kak = nλk ak .

2 Even

if it is, it would be prohibitive to compute it.

(13.24)

308

13 Kernel EOFs

Now, given a point x in the input space, the kernel PC of x in the feature space is the usual PC of φ(x) within this feature space. Hence, the kth kernel PC is given by βk (x) = φ(x)T vk =

n 

akt K(x, xt ).

(13.25)

t=1

Remark Denoting by V = (v1 , . . . , vn ) and  = (φ(x1 ), . . . , φ(xn )), Eq. (13.23) is written as V = A, where A = (a1 , . . . , an ) and represents the matrix of the eigenvectors of K. Let us now assume that we have a pattern in the feature space: w=

m 

βk vk ,

(13.26)

k=1

where β1 , . . . , βm are scalar coefficients, and we would like to find a point x from the input space that maps onto w in Eq. (13.26). Following Schölkopf et al. (1999), one attempts to find the input x from the input space such that its image φ(x) approximates w through maximising the ratio r:

T 2 w φ(x) . r= φ(x)T φ(x)

(13.27)

Precisely, the problem can be solved approximately through a least square minimisation:  m  ∗ x = argmin J = φ(x) − βk vk 2 . (13.28) x k=1 To solve Eq. (13.28), one makes use of Eq. (13.23) and expresses w as w=

 m n   t=1

βk akt φ(xt ) =

k=1

n 

αt φ(xt )

(13.29)

t=1

 (αt = m k=1 βk akt ), which is then inserted into (13.28). Using the property of the kernel (kernel trick), one gets φ(x) − w 2 = K(x, x) − 2

n  t=1

where c is a constant independent on x.

αt K(x, xt ) + c,

(13.30)

13.5 Application to An Atmospheric Model and Reanalyses

309

If the kernel is isotropic, i.e. K(x, y) = H ( x−y 2 ), then the gradient of the error squared (13.30) is easy to compute, and the necessary condition of the minimum of (13.30) is given by ∇x J =

n 

  αt H  x − xt 2 (x − xt ) = 0,

(13.31)

t=1

where H  (u) = dH /du. The optimum then satisfies    1  2

x − x xt . α H t t  2 t=1 αt H x − xt t=1 n

x = n

(13.32)

The above equation can be solved using the fixed point algorithm via the iterative scheme:

(m) n dH − xt 2 xt t=1 αt du x (m+1) x = n

. dH (m) − x 2 t t=1 αt du x

For example, in the case of the Gaussian kernel K(x, y) = exp − x − y 2 /2σ 2 , we get the iterative scheme:    1 (m) 2 2 xt , α exp − z − x /2σ t t (m) − x 2 /2σ 2 ) t t=1 αt exp(− z t=1 (13.33) which can be used to find an optimal solution x∗ by taking x∗ ≈ z(m) for large enough m. Schölkopf et al. (1998, 1999) show that kernel PCA outperforms ordinary PCA. This can be understood since kernel PCA can include higher order moments, unlike PCA where only second-order moments are used. Note that PCA corresponds to kernel PCA with φ(.) being the identity. n

z(m+1) = n

13.5 Application to An Atmospheric Model and Reanalyses 13.5.1 Application to a Simplified Atmospheric Model The example discussed here consists of a 3-level quasi-geostrophic model of the atmosphere. The model describes the evolution of the potential vorticity qi at the ith level, where levels i = 1, 2 and 3 represent, respectively, 200-, 500- and 800hPa surfaces. The potential vorticity equations are given by

310

13 Kernel EOFs ∂q1 ∂t ∂q2 ∂t ∂q3 ∂t

= −J (ψ1 , q1 ) + D1 (ψ1 , ψ2 ) + S1 = −J (ψ2 , q2 ) + D2 (ψ1 , ψ2 , ψ3 ) + S2 = −J (ψ3 , q3 ) + D3 (ψ2 , ψ3 ) + S3 ,

(13.34)

where the potential vorticities are given by q1 = ∇ 2 ψ1 − R1−2 (ψ1 − ψ2 ) + f q2 = ∇ 2 ψ2 + R1−2 (ψ1 − ψ2 ) − R2−2 (ψ2 − ψ3 ) + f q3 = ∇ 2 ψ3 + R2−2 (ψ2 − ψ3 ) + f (1 + Hh0 ).

(13.35)

In the above equations ψi , i = 1, 2, 3, represents the streamfunction at level i, R1 = 700 km and R2 = 450 km and represent Rossby radii of deformation, f = 2 sin φ and represents the Coriolis parameter,  is the Earth’s rotation rate, φ is the latitude and H0 is height scale fixed to 9km. The terms Di , i = 1, 2, 3, represent the dissipation rates and include contributions from temperature relaxation, the Ekman dissipation and horizontal diffusion. The latter is a hyperdiffusion with e-folding time of 1.5 days. The term J () represents the nonlinear ∂q ∂ψ ∂q 2 Jacobian operator; J (ψ, q) = ∂ψ ∂x ∂y − ∂y ∂x , and ∇ is the horizontal Laplacian on the sphere. The forcing terms Si , i = 1, 2, 3, are calculated in such a way that the January climatology of the National Center for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) streamfunction fields at 200 hPa, 500 hPa and 800 hPa levels is stationary solution of system, Eq. (13.34). The term h() in Eq. (13.35) represents the real topography of the Earth in the northern hemisphere (NH). The model is spectral with a triangular truncation T21 resolution, i.e. 32 × 64 lat × lon grid resolution. The model is symmetrical with respect to the equator, leading to slightly more than 3000 grid points or 693 spherical harmonics (or spectral degrees of freedom). The model was shown by many authors to simulate faithfully the main dynamical processes in the extratropics, see Hannachi and Iqbal (2019) for more details and references. The point here is to show that the model reveals nonlinearity when analysed using kernel EOFs and is therefore consistent with conceptual low-order chaotic models such as the case of the Lorenz (1963) model (Fig. 13.6). The model run is described in Hannachi and Iqbal (2019) and consists of one million-day trajectory. The averaged flow tendencies are computed within the PC space and compared to those obtained from the kernel PC space. Mean flow tendencies have been applied in a number of studies using the leading modes of variability to reveal possible signature of nonlinearity, see e.g. Hannachi (1997), Branstator and Berner (2005) and Franzke et al. (2007). For example, using a simple (toy) stochastic climate model, Franzke et al. (2007) found that the interaction between the resolved planetary waves and the unresolved waves is the main responsible for the nonlinearity. Figure 13.7a shows an example of the mean tendencies of the mid-level (500-hPa) streamfunction within the PC1-PC5 state space. The flow tendencies (Fig. 13.7a) reveal clear nonlinearities, which can be identified by examining both the tendencies and their amplitudes. Note that

13.5 Application to An Atmospheric Model and Reanalyses

a) Trajectory and PDF

311

b) Tendencies and trajectory 4

3

0.35

3 2

0.3

2

0.25

1 0.2

z

z

1

0.4

0

0 0.15

-1

-1

0.1

-2

-2

0.05

-3 -3 -2

-1

0

1 x

2

-2

0 x

2

0

Fig. 13.6 PDF of a long simulation of the Lorenz (1963) model shown by shaded and solid contours within the (x, z) plane (a), and the flow tendencies plotted in terms of magnitude (shaded) and direction (normalised vectors) within the same (x, z) plane (b). A chunk of the model trajectory is also shown in both panels along with the fixed points. Note that the variables are scaled by 10, and the value z0 = 25.06 of the fixed point is subtracted from z. Adapted from Hannachi and Iqbal (2019). ©American Meteorological Society. Used with permission

in linear dynamics the tendencies are antisymmetric with respect to the origin, and the tendency amplitudes are normally elliptical, as shown in Fig. 13.7b. The linear model is fitted to the trajectory within the two-dimensional PC space using a first-order autoregressive model, as explained in Chap. 6. The departure of the linear tendencies from the total tendencies in Fig. 13.7c reveals two singular (or fixed) points representing (quasi-)stationary states. The PDF of the system trajectory is shown in Fig. 13.7d and is clearly unimodal, which is not consistent with the conceptual low-order chaotic models such as Fig. 13.6. The same procedure can be applied to the trajectory within the leading kernel PCs. Figure 13.8a shows the departure of the tendencies from the linear component within the kernel PC1/PC4 space and reveals again two fixed points. Figure 13.8b shows the PDF of the mid-level streamfunction within kernel PC1/PC4 state space. In agreement with low-order conceptual models, e.g. Fig. 13.6, the figure now reveals strong bimodality, where the modes correspond precisely to regions of low tendencies. Figure 13.9 displays the two circulation flows corresponding to the PDF modes of Fig. 13.8 showing the anomalies (top) and the total (bottom) flows. The first stationary state shows a low over the North Pacific associated with a dipole over the North Atlantic reflecting the negative NAO phase (Woollings et al. 2010). The second anomalous stationary solution represents approximately the opposite phase, with a high pressure over the North Pacific associated with an approximate positive

312

13 Kernel EOFs

a)

3

b) 0.4

0.4

0.3

0.3

2

0

0.2

PC4

PC4

1 0.2

-1 0.1

0.1

-2 -3

0

-3

-2

-1

c)

0 PC1

1

2

3

-3

-2

-1

d)

3 2

0 PC1

1

2

3

0

0.2

0.2

0.15

0.15

0

PC4

PC4

1 0.1

0.1

-1 0.05

0.05

-2 -3

0

0

-3

-2

-1

0 PC1

1

2

3

-3

-2

-1

0 PC1

1

2

3

Fig. 13.7 Total flow tendency of the mid-level streamfunction within the conventional PC1/PC5 state space. (b) Linear tendency based on a first-order autoregressive model fitted to the same data. (c) Difference between the two tendencies of (a) and (b) showing the departure of the total tendencies from the linear part. (d) Kernel PDF of the same data within the same two-dimensional state space. Adapted from Hannachi and Iqbal (2019)

a)

b)

2

0.12

0.25

2

0.1

1

0.2

1

0.08

0

0.15

0

0.06

-1

0.1

-1

0.04

-2

0.05

-2

0.02

-2

0

KPC1

2

KPC4

KPC4

0.3

-2

0

2

KPC1

Fig. 13.8 As in Fig. 13.7c(a) and 13.7d(b), but for the kernel PC1/PC4. Adapted from Hannachi and Iqbal (2019). ©American Meteorological Society. Used with permission

NAO phase. In both cases the anomalies over the North Atlantic are shifted slightly poleward compared to the NAO counterparts.

13.5 Application to An Atmospheric Model and Reanalyses

a)

b)

c)

d)

313

Fig. 13.9 Anomalies (a,b) and total (c,d) flows of mid-level streamfunction field obtained by compositing over states within the neighbourhood of the modes of the bimodal PDF. Contour interval 29.8 × 108 m2 /s (top) and 29.8 × 106 m2 /s (bottom). Adapted from Hannachi and Iqbal (2019). ©American Meteorological Society. Used with permission

The total flow of the stationary solutions, obtained by adding the climatology to the anomalous stationary states, is shown in the bottom panels of Fig. 13.9. The first solution shows a ridge over the western coast of North America associated with a diffluent flow over the North Atlantic with a strong ridge over the eastern North Atlantic. This latter flow is reminiscent of a blocked flow over the North Atlantic. Note the stronger North Atlantic ridge compared to that of the western

314

13 Kernel EOFs

North American continent. The second stationary state (Fig. 13.9) shows a clear zonal flow over both basins.

13.5.2 Application to Reanalyses

Probability density

a) SLP anomalies PDF B

0.1

A

0.05 0 5 0

KPC7

-5

-2

0

KPC1

2

Prob. density Difference

In the next example, kernel PCs are applied to reanalyses. The data used in this example consist of sea level pressure (SLP) anomalies from the Japanese Reanalyses, JRA-55 (Harada et al. 2016; Kobayashi et al. 2015). The anomalies are obtained as departure of the mean daily annual cycle from the SLP field and keeping unfiltered winter (December–January–February, DJF) daily anomalies over the northern hemisphere. The kernel PCs of daily SLP anomalies are computed, and the PDF is estimated. Figure 13.10a shows the daily PDF of SLP anomalies over the NH poleward of 27◦ N using KPC1/KPC7. Strong bimodality stands out from this PDF. To characterise the flow corresponding to the two modes, a composite analysis is performed by compositing over the points within the neighbourhood of the two modes A and B. The left mode (Fig. 13.11a) shows a polar high stretching south over Greenland accompanied with a low pressure system over the midlatitude North Atlantic stretching from eastern North America to most of Europe and the Mediterranean. This circulation regime projects strongly onto the negative NAO. The second mode (Fig. 13.11b) shows a polar low with high pressure over midlatitude North Atlantic, with a small high pressure over the northern North West Pacific, and projects onto positive NAO. The regimes are not exactly symmetrical of each other; regime A is stronger than regime B. Hannachi and Iqbal (2019) also examined the hemispheric 500-hPa geopotential height. Their two-dimensional PDFs (not shown) reveal again strong bimodality associated, respectively, with polar low and polar high. The mode associated with the polar high is stronger however.

b) PDF difference 0.05 0 -0.05 2

0

-2

KPC7

-2

0

2

KPC1

Fig. 13.10 (a) Kernel PDF of the daily winter JRA-55 SLP anomalies within the kernel PC1/PC7 state space. (b) Difference between the PDFs of winter daily SLP anomalies of the first and second halves of the JRA-55 record. Adapted from Hannachi and Iqbal (2019)

13.5 Application to An Atmospheric Model and Reanalyses a) SLP anomaly composite (mode A)

18

315

b) SLP anomaly composite (mode B)

16

6 4

14 2

12

0

10 8

-2

6 -4 4 -6

2 0

-8

-2 -10 -4 -6

-12

-8

-14

Fig. 13.11 SLP anomaly composites over states close to the left (a) and right (b) modes of the PDF shown in Fig. 13.10. Units hPa. Adapted from Hannachi and Iqbal (2019). Table 13.1 Correlation coefficients between the 10 leading KPCs and PCs for JRA-55 SLP anomalies. The R-square of the regression between each individual KPCs and the leading 10 PCs is also shown. Correlations larger than 0.2 are shown in bold faces PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 R-square

KPC1 0.928 0.031 −0.020 −0.035 0.097 0.011 0.027 −0.018 0.036 −0.003 0.88

KPC2 0.008 −0.904 −0.002 −0.172 0.168 0.079 −0.009 −0.027 0.002 −0.003 0.88

KPC3 −0.031 0.070 0.561 −0.287 0.359 −0.074 −0.104 −0.079 −0.129 −0.040 0.57

KPC4 −0.032 −0.037 −0.464 0.400 0.384 0.071 −0.021 0.072 0.155 0.072 0.57

KPC5 −0.090 0.251 −0.271 −0.453 0.554 0.191 0.261 −0.033 0.052 0.042 0.77

KPC6 −0.017 0.006 0.471 0.543 0.365 0.134 0.178 0.146 0.079 0.097 0.74

KPC7 0.009 −0.015 −0.029 0.210 −0.043 0.499 −0.091 −0.644 −0.108 −0.060 0.73

KPC8 −0.017 −0.054 0.146 −0.108 −0.303 0.291 0.615 0.120 0.293 0.163 0.72

KPC9 −0.015 0.063 0.117 −0.204 −0.050 0.386 −0.469 0.032 0.485 0.115 0.68

KPC10 −0.030 0.003 0.037 0.012 0.084 −0.489 0.092 −0.537 0.431 −0.052 0.73

The above results derived from the quasi-geostrophic model, and also reanalyses clearly show that the leading KPCs, like the leading PCs, reflect large scale structure and hence can explain substantial amount of variance. This explained variance is already there in the feature space but is not clear in the original space. Table 13.1 shows the correlations between the leading 10 KPCs and leading 10 PCs of the sea level pressure anomalies, along with the R-square obtained from multiple regression between each KPC and the leading 10 PCs. It is clear that these KPCs are large scale and also explain substantial amount of variance. The kernel PCs can also be used to check for any change in the signal over the reanalysis period. An example of this change is shown in Fig. 13.10b, which represents the change in the PDF between the first and last halves of the data. A clear regime shift is observed with a large decrease (increase) of the frequency of

316

13 Kernel EOFs

the polar high (low) between the two periods. This change projects onto the +NAO (and +AO) and is consistent with an increase in the probability of the zonal wind speed over the midlatitudes.

13.6 Other Extensions of Kernel EOFs 13.6.1 Extended Kernel EOFs Direct Formulation Extended EOFs seek propagating patterns from a multivariate times series xt , t = 1, . . . n (Chap. 7). By transforming the data into the feature space, there is the possibility that extended EOFs could yield a better signal. We let φ(xt ), t = 1, . . . n, be the transformed data and, as in standard extended EOFs, Z denotes the “extended” data matrix in the feature space: ⎛

⎞ ⎛ ⎞ ϕ T1 . . . φ(xM )T ⎜ ⎟ ⎜ ⎟ .. .. .. Z=⎝ ⎠=⎝ ⎠, . . . T T T ϕ n−M+1 φ(xn−M+1 ) . . . φ(xn ) φ(xt )T .. .

(13.36)



where M is the embedding dimension and ϕ Tk = φ(xk )T , . . . , φ(xk+M−1 )T , k = 1, . . . n − M + 1. The covariance matrix of (13.36) is S=

n−M+1 1  1 T Z Z= ϕ s ϕ Ts . n n

(13.37)

s=1

By applying the same argument as for kernel EOFs, any eigenvector of S is a linear combination of ϕ t , t = 1, . . . , n − M + 1, i.e. v=

n−M+1 

αt ϕ t .

(13.38)

t=1

Using (13.37) and (13.38), the eigenvector v satisfies nλv =

n−M+1  s=1

 ϕ s ϕ Ts

n−M+1  t=1

αt ϕ t

=

n−M+1  s=1

n−M+1 M−1   t=1

 Ks+k,t+k αt ϕ s .

k=0

(13.39) Keeping in mind the expression of v, see Eq. (13.38), Eq. (13.39) yields the following eigenvalue problem for α = (α1 , . . . , αn−M+1 ):

13.6 Other Extensions of Kernel EOFs

317

Kα = nλα,

(13.40)

 where K = (Kij ) with Kij = M−1 k=0 Ki+k,j +k . The vector α represents the (kernel) extended PC in the feature space. The reconstruction within the feature space can be done in a similar fashion to the standard extended EOFs (Chap. 7, Sect. 7.5.3), and the transformation back to the input space can again be used as in Sect. 13.4, but this is not expanded further here.

Alternative Formulations An easier alternative to applying extended EOFs in the feature space is to consider the kernel PCs within the latter space as given by the eigenvectors of Eq. (13.4). The next step consists of selecting, say, the leading N kernel PCs and applying extended EOFs. We can use the extended PCs and apply a transformation back to the kernel EOF space, using Eqs. (7.39) and (7.40) from Chap. 7. The reconstructed pattern will then be a combination of a number of kernel EOFs, see Eq. (13.26). The transformation back to the input space is then performed using the fixed point algorithm as outlined in Eq. (13.33). An alternative inverse approach can be followed here. Instead of constructing delay coordinates (extended data matrix) in the feature space, these coordinates can be constructed straightforward in the physical space, as in extended EOFs (Chap. 7), and then either kernel EOFs or its analogue, the Laplacian matrix of spectral clustering (Eq. 13.17) can be used. This latter approach was followed by Giannakis and Majda (2012, 2013) who labelled it “nonlinear Laplacian spectral analysis” (NLSA). They applied it to the upper ocean temperature in the North Pacific sector using a long control run of a climate model. They identified, in addition to the annual and decadal modes, a family of intermittent processes associated with the Kuroshio current and subtropical/subpolar gyres. Because these latter processes explain little variance, Giannakis and Majda (2013) suggest that the NLSA method can uncover processes that conventional singular spectrum analysis cannot capture.

13.6.2 Kernel POPs As described in Chap. 6, the POP analysis is based on a linear model, the first order autoregressive model AR(1). Although POPs were quite successful in various climate applications, the possibility remains that, as we discussed in the previous sections, the existence of nonlinearity can hinder the validity of the linear model. The POP, or AR(1), model can be defended when nonlinearity is weak. If, however, we think or we have evidence that nonlinearity is important and cannot be neglected, then one solution is to use the kernel transformation and get kernel POPs.

318

13 Kernel EOFs

Since the formulation of POPs involves inverting the covariance matrix, and to avoid this complication in the feature space, a simple way is to apply POPs using kernel PCs by selecting, say, the leading N KPCs. The computation, as it turns out, becomes greatly simplified as the KPCs are uncorrelated. Like for kernel extended EOFs, patterns obtained from the POP analysis are expressed as a combination of kernel EOFs. The transformation back to the input space can be obtained using again the fixed point algorithm.

Chapter 14

Functional and Regularised EOFs

Abstract Weather and climate data are in general discrete and result from sampling a continuous system. This chapter attempts to take this into account when computing EOFs. The first part of the chapter describes methods to construct EOFs/PCs of profiles with application to oceanography. The second part of the chapter describes regularised EOFs with application to reanalysis data. Keywords Functional PCs · Ocean salinity · Mixed layer · Regularised EOFs · Generalized eigenvalue problem · Lagrangian · Smoothing parameter · Cross-validation · Siberian high · AO

14.1 Functional EOFs Functional EOF/PC analysis (e.g. Ramsay and Silverman 2006; Jolliffe and Cadima 2016) is concerned with EOF/PC analysis applied to data which consist of curves or surfacsses. In atmospheric science, for example, the observations consist of spatio-temporal fields that represent a discrete sampling of continuous variables, such as pressure or temperature at a set of finite grid points. In a number of cases methods of coupled patterns (Chap. 15) can be applied to single fields simply by assuming that the two fields are identical. For instance, functional and smooth EOFs correspond to smooth maximum covariance analysis (MCA), as given by Eq. (15.30) in Chap. 15, when the left and right fields are identical. Precisely, we suppose that we are given a sample of n curves that constitutethe coordinates of a vector curve x(t) = (x1 (t), . . . , xn (t))T , with zero mean, i.e. nk=1 xk (t) = 0 for all values of t. The covariance function is given by S(s, t) =

1 xT (t)x(s), n−1

(14.1)

where t and s are in the domain of the definition of the curves. The question   then is to find smooth functions (EOFs) a(t) maximising < a, Sa >= S(s, t)a(s)a(t)dsdt subject to a normalisation constraint condition of the type © Springer Nature Switzerland AG 2021 A. Hannachi, Patterns Identification and Data Mining in Weather and Climate, Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_14

319

320

14 Functional and Regularised EOFs

< a, a > +α < D 2 a, D 2 a > −1 = 0. The solution to this problem is then given by the following integro-differential equation: 

  S (t, s) a(s)ds = μ 1 + αD 4 a(t).

(14.2)

Consider first the case of functional EOFs, that is when α = 0, which yields a homogeneous Fredholm equation of the second kind. We suppose that the curves can be expanded in terms of a number of basis functions φ1 (), . . . , φp () so that p xi (t) = k=1 λi,k φk (t), i = 1, . . . , n. In vector form this can be written as x(t) = φ(t), where we let  = (λij ). The covariance function becomes S(t, s) =

1 φ T (t)T φ(s). n−1

(14.3)

Assuming EOF a(t) is expanded using the basis functions as that the functional a(t) = ak φk (t) = φ T (t)a, the above integro-differential equation yields the following system: φ T (t)T a = μφ T (t)a,

(14.4)

where  = (n − 1)−1 < φ(s), φ T (s) >. This equality has to be satisfied for all t in its domain of definition. Hence the solution a is given by the solution to the eigenvalue problem: T a = μa.

(14.5)

Exercise 1. Derive the above eigenvalue problem satisfied by the vector coefficients a = (a1 , . . . , ap )T . 2. Now we can formulate the problem by going back to the original problem as we did above. Write the generalised eigenvalue problem using V = (< φi , Sφj >) and A = (< φi , φj >). 3. Are the two equations similar? Explain your answer. Given that the covariance-like matrix  is symmetric and semi-definite positive, one can compute its square root, or alternatively, one can use its Cholesky decomposition to transform the previous equation into a symmetric eigenvalue problem by multiplying both sides by the square root of . Exercise 1. 2. 3.

Show that  is symmetric.  T

1 Show that aT b = n−1 a φ(t) bT φ(t) dt. that  is semi-definite positive. Hint – ij =< φi (t), φj (s) >= Deduce  φi (t)φj (s)dtds = j i .

14.3 An Example of Functional PCs from Oceanography

321

14.2 Functional PCs and Discrete Sampling The above section presents functional EOFs applied to a finite number of curves x1 (t), . . . , xn (t), for t varying in a specified interval. The parameter may represent time or a conventional or curvilinear coordinate, e.g. height. In practice, however, continuous curves or profiles are not commonly observed, but can be obtained from a set of discrete values at a regular grid.1 To construct continuous curves or profiles from these samples a linear combination of a number of basis functions can be used as outlined above. Examples of basis functions commonly used include radial basis functions and splines (Appendix A). The profile xi (t) is projected onto the basis φk (t), k = 1, . . . K, as xi (t) =

K 

λi,k φk (t).

(14.6)

k=1

The functional PCs are then given by solving the eigenvalue problem (Eq. (14.5)). The problem is normally solved in two steps. First, the coefficients λi,k , i = 1, . . . n, k = 1, . . . K, are obtained from Eq. (14.6) using for example least squares estimation. The matrix  = (λi,k ) is then used as data matrix, and a SVD procedure can be applied to get the eigenvectors (functional PCs) of the covariance matrix of .

14.3 An Example of Functional PCs from Oceanography An appropriate area for functional PC analysis is when we have measurements of vertical profiles in the atmosphere or the ocean. Below we discuss precisely vertical profiles of ocean temperature and salinity, which were studied in detail by Pauthenet (2018). Temperature and salinity are two thermodynamic variables of great importance in ocean studies as they control the stability and thermohaline circulation of the ocean. They can also be used to identify frontal region in the ocean as a function of depth. The three-dimensional structure of the ocean is quite complex to analyse, e.g. via conventional EOFs. A simpler way to analyse for example the frontal structure of the ocean is to use the vertical profiles of salinity and temperature and draw a summary of the three-dimensional ocean structure. Pauthenet (2018) investigated the thermohaline structure of the southern ocean using functional PCs of vertical profiles of temperature and salinity using various data products.

1 When

the sampling is not regular an interpolation can be applied to obtain regular sampling.

322

14 Functional and Regularised EOFs

Given vertical profiles of temperature T and salinity S of the ocean at a given grid point in the ocean surface, defined at discrete depths, a continuous profile is obtained using Eq. (14.6) with the index i representing the location and plays the role of time in conventional EOFs. The n × K coefficient matrices T and S for temperature T and salinity S, respectively, are used to compute the functional PCs as the solution of the eigenvalue problem: T Mu = λu,

(14.7)

where  = (T , S ), and represents the n × (2K) coefficient matrix, and M is a (2K) × (2K) diagonal weighting (or scaling) matrix reflecting the different units (and variances) of variables. The eigenvectors can be used to construct vertical modes of salinity and temperature after being properly weighted, i.e. −1/2 M−1/2 u, and can also be used to filter the data using a few leading functional PCs, e.g. M−1/2 −1/2 u. Pauthenet used B-splines as basis functions. Figure 14.1 shows an example of two vertical profiles of T and S along with the leading 20 used B-splines used for fitting. The (non-equidistant) measurement locations are also shown. Note that the vertical profiles are interpolated into a regular vertical grid prior to B-spline fitting. Figure 14.2 shows an example of the leading three functional PCs obtained from Monthly Isopycnal and Mixed-layer Ocean Climatology (MIMOC), see Schmidtko et al. (2013). The figure reflects broadly known water masses distribution as indicated in Fig. 14.2e (Talley 2008). For example, functional PC2 and PC3 reflect the low- and high-salinity intermediate water mass while functional PC1 reflects more the wind-driven gyres at upper levels.

Fig. 14.1 An example of two vertical profiles of temperature (left), salinity (middle) constructed using 20 B-splines (right). The dots represent the measurements. Adapted from Pauthenet (2018)

14.3 An Example of Functional PCs from Oceanography

323

Fig. 14.2 MIMOC global ocean temperature at 340 dbar (a), salinity at 750 dbar (c) climatology, along with the spatial distribution of functional PC1 (b), PC2 (d) and PC3 (f). Panel (e) shows the low- and high-salinity intermediate water mass (Courtesy of Talley 2008). Adapted from Pauthenet (2018)

The functional PCs of vertical profiles can provide invaluable information on the ocean fronts. An example of the vertical profiles or PCs of the southern ocean is shown in Fig. 14.3 along with their explained variance. An example is given in Fig. 14.4 showing the spatial distribution of the leading four functional PCs in the Southern ocean. The functional PCs reflect well the oceanic fronts, e.g. the polar front. The leading functional PC correlates well with temperature at 250 m and salinity at 25 m and 1355 m, whereas functional PC2 correlates better with salinity at 610 m depth. Note, in particular, the low salinity near Antarctica. Salinity fronts shift northward with depth (Fig. 14.4). The other two functional PCs are associated with subantarctic mode water and southern ocean fronts.

324

14 Functional and Regularised EOFs

Fig. 14.3 The leading four vertical mode or PCs of temperature and salinity along with their percentage explained variance. Individual explained variance is also shown for T and S separately. The mean profile (solid) is shown along with the envelope of the corresponding PCs. (a) PC1 (72.52%). (b) PC2 (19.89%). (c) PC3 (3.43%). (d) PC4 (1.35%). Adapted from Pauthenet et al. (2017). ©American Meteorological Society. Used with permission

14.4 Regularised EOFs 14.4.1 General Setting We consider here the more general problem of regularised (or smooth) EOFs as discussed in Hannachi (2016). As for the smoothed MCA (see Chap. 15), we let

V = (Vij ) = (< φi , Sφj >), and  = ij = < D 2 φi , D 2 φj > , the vector a is then obtained by solving the generalised eigenvalue problem: Va = μ ( + α) a Remark Note that we used the original maximisation problem and not the previous integro-differential equation. It can be seen that the associated eigenvalues are real non-negative, and the eigenvectors are real. The only parameter required to solve this problem is the

14.4 Regularised EOFs

325

Fig. 14.4 Spatial structure of the leading four functional PCs of the vertical profiles of the 2007 annual mean temperature and salinity. Data are taken from the Southern Ocean State Estimate (SOSE, Mazloff et al. 2010). (a) PC1 (72.52%). (b) PC2 (19.89%). (c) PC3 (3.43%). (d) PC4 (1.35%). Adapted from Pauthenet et al. (2017). ©American Meteorological Society. Used with permission

smoothing parameter α. A number of methods can be used to get the regularisation or smoothing parameter. In general, smoothing parameters can be obtained, for example, from experience or using cross-validation. This is discussed below. Exercise Show that the operator V is symmetric semi-definite positive.    Hint Vij = n1 φ( t)xT (t)x(s)φj (s)dtds = φi (t)xT (t)dt φj (s)x(s)ds =    Vj i . For the positivity, consider aT Va = ij ai aj Vij = ij ai aj φi (t)xT (t)dt     φj x(s)ds, which equals = i,j,k ai aj  φi (t)xk (t)dt φj (s)xk (s)ds      k ij ai φi (t)xk (t)dt aj φj (s)xk (s)ds . This last expression is simply 2    ≥ 0. k i ai φi (t)xk (t)dt The procedure based on radial basis functions (RBFs, Appendix A) mentioned previously, see also Chap. 15, can also be used here in a similar fashion. The nice thing about these functions is that they can be used to smooth curves. Suppose, for

326

14 Functional and Regularised EOFs

example, that the curves x(t) = (x1 (t), . . . , xn (t)) are observed at discrete times t = tk , k = 1, . . . , p. The interpolation, or smoothing, using radial basis functions is similar to, but simpler than the B-spline smoothing and is given by xi (t) =

p 

λi,k φ(|t − tk |).

(14.8)

k=1

Perhaps one main advantage of using this kind of smoothing, compared to splines, is that it involves one single radial function, which can be chosen from a list of functions given in the Appendix A. The coefficients of the smoothed curves can be easily computed by solving a linear problem, as shown in Appendix A. The covariance matrix can also be computed easily in terms of the matrix  = λij , p and the radial function φ. The smoothed EOF curves a(t) = k=1 uk φ(|t − tk |) are then sought by solving the above generalised eigenvalue problem to get u = (u1 , . . . , up ). Note that the function φ(|t − tk |) now plays the role of the basis function φk (t) used previously.

14.4.2 Case of Spatial Fields Here we suppose that we observe a sample of space–time field F (x, tk ) at times tk , k = 1, . . . , n and x represents a spatial position. The covariance function of the field is given by 1  S (x, y) = F (x, tk ) F (y, tk ) . n−1 n

(14.9)

k=1

The objective of smooth EOFs (Hannachi 2016) is to find the “EOFs” of the covariance matrix (14.9). Denoting by  the spatial domain, which may represent the entire globe or a part of it, an EOF is a continuous pattern u(x) maximising:   u(x)S (x, y) u(y)dxdy  

   subject to  u(x)2 + α(∇ 2 u(x))2 dx = 1. This yields a similar integrodifferential equation to (14.2), namely 

 S (x, y) u(x)dx = 

  S (x, y) u(x)d = μ 1 + α∇ 4 u(y).

(14.10)



In spherical coordinates for example d = R 2 cos(φ)dφdλ, where R is the Earth’s radius, φ is the latitude and λ is the longitude. Here we suppose that the smoothing

14.5 Numerical Solution of the Full Regularised EOF Problem

327

parameter α is given, and we concentrate more on the application to observed fields. The choice of this parameter is discussed later using the Lagrangian function. There are two ways to solve the above integro-differential equation when we have a finite sample of data, namely the RBF and the direct numerical solutions. The RBF solution is briefly discussed below, and the full numerical solution is discussed in detail in Sect. 14.5.

The Example of the RBF Solution This is exactly similar to the approximation used in Sects. 14.1 and 14.2 above. We use two-dimensional RBFs φi (x) = φ( x − xi ) and expand u(x) in terms of  φi (x), e.g. u(x) = uk φk (x) = φ T (x)u. For example, for the case when  α = 0, the sample F(x) = (F1 (x), . . . , Fn (x))T is similarly expanded as Ft (x) = λt,k φk (x), i.e. F(x) = φ(x) and from S(x, y) = (n − 1)−1 FT (x)F(y) we get exactly a similar problem to that of Sect. 14.1, i.e. T u = μu, where  eigenvalue T  = S 2 φ(x)φ (x)dx. The set S 2 represents the spherical Earth. The advantage of this is that we can use spherical RBFs, which are specific to the sphere, see e.g. Hubbert and Baxter (2001).

14.5 Numerical Solution of the Full Regularised EOF Problem To integrate Eq. (14.10) one starts by defining the sampled space–time field through a (centred) data matrix X = (x1 , . . . , xd ), where xk = (x1k , . . . , xnk )T is the time series of the field at the kth grid point. The associated sample covariance matrix is designated by S. The pattern u = (u1 , . . . , ud )T satisfying the discretised version of Eq. (14.10) is given by the solution to the following generalised eigenvalue problem (see also Eq. (15.34)):   Su = μ Id + αD4 u.

(14.11)

In Eq. (14.11) D4 is the square of the (discretised) Laplacian operator ∇ 2 , which is self-adjoint (see, e.g. Roach 1970). The Laplacian in spherical coordinates takes the 2 ∂2f tan ϕ ∂f 1 form: ∇ 2 f (ϕ, λ) = R12 ∂∂ϕf2 + R 2 cos 2 ϕ ∂λ2 − R 2 ∂ϕ , where R is the Earth’s radius. The discretised Laplacian yields a matrix whose elements vary with the latitude ϕ. Let u(ϕ, λ) be a function on the sphere. At grid point (ϕk , λl ) the Laplacian ∇ 2 u can be explicitly computed and yields

328

14 Functional and Regularised EOFs





tan ϕk 1 2 2 uk,l u − (δϕ) + (δλ)2 cos 2 ϕ − δϕ (δϕ)2 k−1,l k 2

1 1 1 + δϕ δϕ − tan ϕk uk+1,l + (δλ)2 cos2 ϕ uk,l+1 + uk,l−1 k

R 2 (D 2 u)k,l =

(14.12) for k = 1, . . . , p, and l = 1, . . . , q and where uk,l = u (ϕk , λl ), and ϕk , k = 1, . . . , p, and λl , l = 1, . . . , q, are the discretised latitude and longitude coordinates respectively.2 Furthermore, the matrix S in Eq. (14.11) is given by a Hadamard product, S = S  , where  = φ1, with φ being the pq × 1 column vector containing q replicates of the vector (cos ϕ1 , . . . cos ϕp )T , i.e. φ = T

cos ϕ1 , . . . cos ϕp , . . . cos ϕ1 , . . . cos ϕp and 1 is the 1 × pq vector of ones. Remarks The above Hadamard product represents the area weighting used in conventional EOF analysis accounting for the poleward converging meridians. Another point worth mentioning is that the latitudinal discretisation in Eq. (14.12) is uniform. A non-uniform discretisation, e.g. Gaussian grid used in spectral methods, can be easily incorporated in Eq. (14.12). Because Eq. (14.10) is a integro-differential operator its integration requires boundary conditions to compute the matrix D4 in Eq. (14.11). Three basic types of boundary conditions are discussed here, as in Hannachi (2016). We consider first the case of hemispheric fields for which one can take ϕ0 = π/2, ϕk+1 = ϕk − δϕ and λq = λ1 (periodicity). One can also take u0,l = 0, and up+1,l = up,l , l = 1, . . . , q, plus the periodic boundary condition uk,q+1 = uk,1 . Letting u = T  uT1 , uT2 , . . . , uTq , which represents the spatial pattern measured at the grid points

(ϕk , λl ), k = 1, . . . , p, and l = 1, . . . , q, where uTl = u1,l , u2,l , . . . , up,l . A little algebra yields ⎛

A ⎜C ⎜ ⎜ ⎜O 2 2 R D u = Au = ⎜ ⎜ .. ⎜ . ⎜ ⎝O C

⎞⎛ ⎞ u1 C ⎜ ⎟ O⎟ ⎟ ⎜ u2 ⎟ ⎟⎜ ⎟ O ⎟ ⎜ u3 ⎟ ⎜ ⎟ .. ⎟ ⎜ .. ⎟ ⎟ . ⎟⎜ . ⎟ ⎟⎜ ⎟ O O O . . . A C ⎠ ⎝ uq−1 ⎠ O O O ... C A uq

C A C .. .

O C A .. .

O O C .. .

... ... ... .. .

O O O .. .

(14.13)

where C and A are p × p matrices given, respectively, by

C = Diag c1 , c2 , . . . , cp and

2 Note that here q and p represent respectively the resolutions in the zonal and meridional directions respectively, so the total number of grid points is pq.

14.5 Numerical Solution of the Full Regularised EOF Problem



a1 b1 ⎜ (δφ)−2 a ⎜ 2 ⎜ ⎜ 0 (δφ)−2 A=⎜ .. ⎜ .. . ⎜ . ⎜ ⎝ 0 0 0 0 where ak = − (δλ cos ϕk )

−2



2 (δϕ)2

+

0 b2 a3 .. .

... ... ... .. .

0 0 0 .. .

329

0 0 0 .. .



0 0 0 .. .

0 . . . (δφ)−2 ap−1 bp−1 0 . . . 0 (δφ)−2 ap + bp

2 (δλ)2 cos2 ϕk



tan ϕk δϕ

 , bk =

1 (δϕ)2



⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠

tan ϕk δϕ ,

and ck =

. The eigenvalue problem (14.11) yields (Hannachi 2016):

    α Su = μ Ipq + αD4 u = μ Ipq + 4 A2 u, R

(14.14)

where Ipq is the pq × pq identity matrix. For a given smoothing parameter α Eq. (14.14) is a generalised eigenvalue problem. Exercise The objective of this exercise is to derive Eq. (14.13). 1.  Denote by v thevector R 2 D 2 u, and write v in a form similar to u, i.e. vT = vT1 , vT2 , . . . , vTq where vTl = (v1l , v2l , . . . , vpl ), for l = 1, . . . q. Write Eq. (14.12) for v1l , v2l and vpl for l = 1, 2 and q. 2. Show in particular that v1 = Au1 + Cu2 + Cuq , v2 = Au2 + Cu1 + Cu3 and vq = Auq + Cuq−1 + Cu1 . 3. Derive Eq. (14.13). (Hint. 1. v11 = δϕ1 2 u01 + a1 u11 + b1 u21 + c1 (u10 + u12 ) and v21 = δϕ1 2 u11 + a2 u21 + b2 u31 + c2 (u20 + u22 ), and similarly vp1 = δϕ1 2 up−1,1 + ap up1 + bp up+1,1 + cp (up0 + up2 ), and recall that u01 = 0, u10 = u1q , u20 = u2q , etc.). Note that in Eq. (14.14) it is assumed that u0l = 0, for l = 1, . . . q, meaning that the pattern is zero at the pole. As it is explained in Hannachi (2016) this may be reasonable for the wind field but not for other variables such as pressure. Another type of boundary condition is to consider u0l = u1,l (for l = 1, . . . , q), which is reasonable because of the poleward convergence of the meridians. This yields R 2 D2 u = Bu,

(14.15)

where B is similar to A, see Eq. (14.13), in which A = (aij ) is replaced by B = (bij ) where b11 = δϕ1 2 + a1 and bij = aij for all other i and j .

330

14 Functional and Regularised EOFs

Exercise Derive Eq. (14.15) for the case when u0l = u1l for l = 1, . . . q Hint Follow the previous exercise. The third type of boundary conditions represents the non-periodic conditions, such as the case of a local region on the globe. This means uk0 = uk1 and uk,q+1 = uk,q for k = 1, . . . p in the zonal direction, and one keeps the same condition in the meridional direction, u0l = u1l and up+1,l = upl for l = 1, . . . q. Using the expression of matrix B given in Eq. (14.15), Eq. (14.13) yields a bloc tridiagonal matrix as ⎛

B+C ⎜ C ⎜ ⎜ ⎜ O 2 2 R D u = Cu = ⎜ ⎜ .. ⎜ . ⎜ ⎝ O O

C B C .. .

O C B .. .

O O C .. .

... ... ... .. .

O O O .. .

O O O .. .

⎞⎛

u1 u2 u3 .. .



⎟⎜ ⎟ ⎟⎜ ⎟ ⎟⎜ ⎟ ⎟⎜ ⎟ ⎟⎜ ⎟ ⎟⎜ ⎟ ⎟⎜ ⎟ ⎟⎜ ⎟ O O O . . . B C ⎠ ⎝ uq−1 ⎠ O O O ... C B + C uq

(14.16)

which goes into Eq. (14.11). Exercise Derive Eq. (14.16). Hint Follow the above exercises. In particular, we get v1 = (B + C)u1 + Cu2 and vq = Cuq−1 + (B + C)uq Choice of the Smoothing Parameter Generally, the smoothing parameters in regularisation problems can be obtained from experience using for example trial and error or using cross-validation (CV), as outlined previously, by minimising the total misfit obtained by removing each time onedata point from  the sample (leave-one-out approach). For a given α we let (k) (k) (k) U = u1 , . . . , um , which represents the set of the leading m (m ≤ pq = d) smooth EOFs obtained by discarding the kth observation (where the observations can be assumed to be independent). Note that U(k) is a function of α. The residuals obtained from approximating a data vector x using U(k) is ε (k) (α) = x −

m 

βi u(k) i ,

i=1 (k)

where βi , i = 1, . . . , m are obtained from the system of equations < x, ul >= m m  −1  (k) (k) (k) < x, uj >. Here j =1 βj < uj , ul >, l = 1, . . . , m, i.e. βi = j =1 G ij (k)

(k)

G is the scalar product matrix with elements [G]ij =< ui , uj >. The optimal

 value of α is then the one that minimises the CV, where CV = nk=1 tr  (k) , with

14.6 Application of Regularised EOFs to SLP Anomalies

331

 (k) being the covariance matrix3 of ε (k) . SVD can be used to efficiently extract the first few leading smooth EOFs then optimise the total sum of squared residuals. We can also use instead the explained variance as follows. If one designates by  (k) σ(k) = m i=1 μi the total variance explained by the leading m EOFs  when the kth observation is removed, then the best α is the one that maximises nk=1 σ(k) . Any descent algorithm can be used to optimise this one-dimensional problem. It turns out, however, and as pointed out by Hannachi (2016), that crossvalidation does not work in this particular case simply because EOFs minimise precisely the residual variance, see Chap. 3. Hannachi (2016) used the Lagrangian L of the original regularised EOF problem:   max   u(x)S (x, y) u(y)dxdy   subject to  u(x)2 + α(∇ 2 u(x))2 dx = 1,

(14.17)

that is     

2 L= ∇ 2 u(x) dx . u(x)S (x, y) u(y)dxdy − μ 1 − [u(x)]2 dx − α  





14.6 Application of Regularised EOFs to SLP Anomalies Regularised EOF analysis was applied by Hannachi (2016) to the 2◦ × 2◦ NCEP/NCAR SLP field. Monthly SLP anomalies for the winter (DJF) months for the period Jan 1948–Dec 2015 were used. The resolution was then reduced to obtain sparser data with respective resolutions 5◦ × 5◦ , 5◦ × 10◦ and 10◦ × 10◦ latitude– longitude grid in addition to the original data. Figure 14.5 shows the Lagrangian

Fig. 14.5 Lagrangian L of the optimisation problem eq (14.17) versus the smoothing parameter α, based on the leading smooth EOF of the extratropical NH SLP anomalies for 2.5◦ × 2.5◦ (a), 5◦ × 5◦ (b), 5◦ × 10◦ (c) and 10◦ × 10◦ (d) latitude–longitude grid. Adapted from Hannachi (2016)

3 Note

that ε (k) is a field, i.e. a residual data matrix.

332

14 Functional and Regularised EOFs

Fig. 14.6 Eigenvalues of the generalised eigenvalue problem Eq. (14.11) for the winter SLP anomalies over the northern hemisphere for the regularised problem (filled circles) and the conventional (α = 0) problem (open circle). The eigenvalues are scaled by the total variance of the SLP anomalies and transformed to a percentage, so for example the open circles provide the percentage of explained variance of the EOFs. Adapted from Hannachi (2016)

L computed based on the leading smooth or regularised EOF as a function of the smoothing parameter α. The figure shows indeed the existence of an optimum value of the parameter, which increases with decreasing resolution. Figure 14.6 also shows the percentage of explained variance for both the conventional and regularised EOFs for the 10◦ × 10◦ grid downgraded SLP anomalies. Interestingly, the leading explained variance of the leading smooth EOF is about one and a half times larger than that of the conventional EOF. The leading two conventional and smooth EOFs are shown in Fig. 14.7. The leading EOF is accepted in the literature as the Arctic Oscillation (AO). Figure 14.7 shows that EOF1 has a strong component of the NAO reflected by the asymmetry between the North Atlantic and the North Pacific centres of action. This is related to the mixing property, a characteristic feature of EOFs. The smoothing shows that the leading pattern (Fig. 14.7b) is quasi-symmetric and represents more the AO pattern, which explains the large explained variance compared to EOF1. If the smoothing parameter is increased the pattern becomes more symmetric (not shown). The second patterns (EOF2 and smooth EOF2) are quite similar. Smooth EOFs have also been applied to Hadley Centre SST anomalies, HadISST. But one of the clear applications was in connection to trend EOFs (Sect. 16.4). When the data resolution is degraded it is found that the TEOF method missed the second

14.6 Application of Regularised EOFs to SLP Anomalies

333

Fig. 14.7 Leading two conventional (a,c) and regularised (b,d) EOFs of the northern hemisphere SLP anomalies based on the 10◦ × 10◦ latitude–longitude grid. The smoothing parameter is based on the optimal value obtained from the Lagrangian (Fig. 14.5d). Adapted from Hannachi (2016)

trend pattern, namely the Scandinavian pattern. Figure 14.8 shows the eigenspectrum associated with the inverse rank matrix, see Eq. (16.21), corresponding to the 5◦ × 5◦ downgraded resolution of the winter SLP anomalies. It is clear that when the trend EOF method is applied with the regularisation procedure, i.e. applying Eq. (14.11) to the data matrix Z of Eq. (16.21), the eigenvalue of the second TEOF is raised off the “noise” floor (Fig. 14.8b), and the second trend pattern regained. The leading two smooth trend PCs are shown in Fig. 14.8a,b. The leading two smooth EOFs along with the associated smooth trend patterns are shown in Fig. 14.9, which can be compared to Fig. 16.8 in Chap. 16 (see also Hannachi 2016).

334

14 Functional and Regularised EOFs

Fig. 14.8 Eigenspectrum, given in terms of percentage of explained variance, of the covariance (or correlation) matrix of the inverse ranks of the SLP anomalies with a reduced resolution of 5◦ × 5◦ grid for the non-regularised (a) and regularised (b) cases, along with the regularised first (c) and second (d) trend PCs associated with the leading two eigenvalues shown in (b). The optimal smoothing parameter, α = 60, is used in (b). Adapted from Hannachi (2016)

14.6 Application of Regularised EOFs to SLP Anomalies

335

Fig. 14.9 Leading two regularised trend EOFs (a,b) of the SLP anomalies, corresponding to the leading two eigenvalues of Fig. 14.8b, and the associated trend patterns (c,d). Contour interval in (c,d) 1 hPa. Adapted from Hannachi (2016)

Chapter 15

Methods for Coupled Patterns

Abstract Previous chapters focussed mostly on single fields, such as EOFs of seal level pressure. This chapter is an extension of previous methods. It describes different methods that mostly deal with two fields to identify coupled patterns that covary coherently. The chapter discusses both the conventional and regularised problems. It also explores the predictive power of coupled pattern analysis. Keywords Canonical correlation analysis · Regularised CCA · Canonical covariance analysis · Redundancy analysis · Principal predictors · Functional CCA · Multivariate regression

15.1 Introduction We have presented in the previous chapters methods aiming at extracting various spatial patterns and associated time series satisfying specific properties. Those techniques were presented in the context of a single space–time field.1 The focus of this chapter is on presenting alternative methods that deal with finding coupled patterns of variability from two or more space time fields. PCA was first proposed by Pearson (1902), and a few years later Spearman (1904a,b) introduced techniques to deal with more than one set of variables, and in 1936, Hotelling formulated mathematically the problem, which has become known as canonical correlation analysis (CCA) and was mainly developed in social science. A closely related method is the maximum covariance analysis (MCA). Both MCA and CCA use both data sets in a symmetric way in the sense that we are not seeking to explain one data set as a function of the other. Methods that fit in the regression framework exist and are based on considering one data set as predictor and the other as predictand. Three main methods attempt to achieve this, namely, redundancy analysis (RDA),

1 It

is possible to apply these techniques to two combined fields, e.g. SLP and SST, by combining them into a single space–time field. In this way the method does not explicitly take into account the co-variability of both fields.

© Springer Nature Switzerland AG 2021 A. Hannachi, Patterns Identification and Data Mining in Weather and Climate, Springer Atmospheric Sciences, https://doi.org/10.1007/978-3-030-67073-3_15

337

338

15 Methods for Coupled Patterns

principal predictor analysis (PPA) and principal regression analysis (PRA). RDA (von Storch and Zwiers 1999; Wang and Zwiers 1999) aims at selecting predictors that maximise the explained variance. PPA (Thacker 1999) seeks to select predictors that maximise the sum of squared correlations. PRA (Yu et al. 1997), as its name indicates, fits regression models between the principal components of the predictor data and each of the predictand elements individually. Tippett et al. (2008) discuss the connection between the different methods of finding coupled patterns and multivariate regression through a SVD of the regression matrix. In atmospheric science, it seems that the first idea to combine two or more sets of variables in an EOF analysis was mentioned in Lorenz (1956) and was first applied by Glahn (1962) and few others in statistical prediction of weather, see e.g. Kutzbach (1967) and Bretherton et al. (1992). This combined EOF/PC analysis is obtained by applying a standard EOF analysis to the combined space–time field T

zt = xTt , yTt of xt and yt , t = 1, . . . n. The grand covariance matrix of zt , t = 1, . . . n, is given by 1 (zt − z) (zt − z)T , n n

Sx,y =

(15.1)

t=1

 where z = n−1 nt=1 zt is the mean of the combined field. However, because of the scaling problem, and in order for the combined field to be consistent the individual fields are normally scaled by their respective variances. Hence if Sxx and Syy are the respective covariance matrices of xt and yt , t = 1, . . . n, and Dx = Diag(Sxx ) and −1/2 −1/2 Dy = Diag(Syy ), the scaled variables become xt Dx and yt Dy , respectively. The grand covariance matrix of the combined scaled fields is 

−1/2

Dx O

 O

−1/2 Dy

Sx,y

−1/2

Dx O

O

−1/2 Dy

.

The leading combined EOFs/PCs provide patterns that maximise variance, or correlation to be more precise, of the combined fields. The obtained leading EOFs of each field, however, do not necessarily reflect particular association, e.g. covariability, between the fields. Canonical correlation analysis (CCA) and canonical covariance analysis (CCOVA) achieve this by maximising correlation and covariance between the fields.

15.2 Canonical Correlation Analysis

339

15.2 Canonical Correlation Analysis 15.2.1 Background Canonical correlation analysis (CCA) dates back to Hotelling (1936a) and attempts

to find relationships between two space–time fields. Let xt = xt1 , . . . , xtp1 , and yt = yt1 , . . . , ytp2 , t = 1, 2, . . ., be two multidimensional stationary time series with respective dimensions p1 and p2 . The objective of CCA is to find a pair of patterns a1 and b1 such that the time series obtained by projecting xt and yt onto a1 (1) and b1 , respectively, have maximum correlation. In the following we let at = aT1 xt (1) and bt = bT1 yt , t = 1, 2, . . ., be such time series. Definition   The patterns a1 and b1 maximising corr at(1) , bt(1) are the leading canonical (1)

correlation patterns, and the associated time series at the canonical variates.

(1)

and bt , t = 1, 2, . . ., are

15.2.2 Formulation of CCA We suppose in the sequel that both time series have zero mean. Let  xx and  yy be the respective covariance matrices of xt and yt . We also let  xy be the

crosscovariance matrix between xt and yt , and similarly for  yx , i.e.  xy = E xt yTt =  Tyx . The objective is to find a1 and b1 such that   aT1  xy b1 (1) (1) =

ρ = corr at , bt 1

1 aT1  xx a1 2 bT1  yy b1 2

(15.2)

is maximum. Note that if a1 and b1 maximise (15.2) so are also αa1 and βb1 . To (1) (1) overcome this indeterminacy, we suppose that the time series at and bt , t = 1, 2, . . ., are scaled to have unit variance, i.e. aT1  xx a1 = bT1  yy b1 = 1.

(15.3)

Another way to look at (15.3) is by noting that (15.2) is independent of the scaling of a1 and b1 , and therefore maximising (15.2) is also equivalent to maximising (15.2) subject to (15.3). The CCA problem (15.2) is then written as   maxa,b ρ = aT  xy b s.t aT  xx a = bT  yy b = 1.

(15.4)

340

15 Methods for Coupled Patterns

The solution is then provided by the following theorem. Theorem We suppose that  xx and  yy are of full rank. Then the canonical correlation patterns are given by the solution to the eigenvalue problems: −1 Mx a =  −1 xx  xy  yy  yx a = λa −1 −1 My b =  yy  yx  xx  xy b = λb .

(15.5)

Proof By introducing Lagrange multipliers α and β, Eq. (15.4) reduces to     max aT  xy b − α aT  xx a − 1 − β bT  yy b − 1 . a,b

After differentiating the above equation with respect to a and b, one gets  xy b = 2α xx a and  yx a = 2β yy b, which after combination yields (15.5). Note also that one could obtain the same result without Lagrange multipliers (left as an exercise). We notice here that α and β are necessarily equal. In fact, multiplying by aT and bT the respective previous equalities (obtained after differentiation), one gets, keeping in mind Eq. (15.3), 2α = 2β = aT  xy b = λ. Hence the obtained Eqs. (15.5) can be written as a single (generalised) eigenvalue problem: 

Op1 ,p1  xy  yx Op2 ,p2

     a a  xx Op1 ,p2 =λ . Op2 ,p1  yy b b

(15.6)

Remarks 1. The matrices involved in (15.5) have the same spectrum. In fact, we note first that the matrices Mx and My have the same rank, which is that of  xy . From the 1

2 SVD of  xx , one can easily compute a square root2  xx of  xx and similarly for  yy . The matrix Mx becomes then

−1

1

2 Mx =  xx2 AAT  xx ,

(15.7)

1

1

2  xx = UUT , where U is orthogonal, then one can define this square root by  xx = U 2 . In  1 T 1 2 2 this case we have  xx =  xx  xx . Note that this square root is not symmetric. A symmetric

2 If

1

1

1

1

2 2 2 = U 2 UT , in which case  xx =  xx  xx , and hence the square square root can be defined by  xx root of  xx is not unique.

15.2 Canonical Correlation Analysis −1

341

−1

where A =  xx2  xy  yy2 . Hence, the eigenvalues of Mx are identical to the −1

1

2 eigenvalues of AAT (see Appendix D). Similarly we have My =  yy2 AAT  yy , and this completes the proof. 2. The matrices Mx and My are positive semi-definite (why?). 1

1

2 2 3. Taking u =  xx a and v =  xx b, then (15.5) yields AAT u = λu and AAT v = λv, i.e. u and v are, respectively, the left and right singular vectors of A.

−1

−1

In conclusion, using the SVD of A, i.e.  xx2  xy  yy2 = UVT , where  = Diag (λ1 , . . . , λm ), and m is the rank of  xy , and where the eigenvalues have been arranged in decreasing order, the canonical correlation patterns are given by −1

−1

ak =  xx2 uk and bk =  yy2 vk ,

(15.8)

where uk and vk , k = 1, . . . , m, are the columns of U and V, respectively. Note that the canonical correlation patterns ak are not orthogonal, but their transforms uk are, and similarly for bk , k = 1, . . . , m. The eigenvalues λ1 ≥ λ2 ≥ . . . ≥ λm are the canonical correlations. Note that the eigenvalues of Mx and My are λ2k , k = 1, . . . , m.

15.2.3 Computational Aspect Estimation Using Sample Covariance Matrix Given a finite sample of two multivariate time series xt , and yt , t = 1, . . . , n, representing

two space–time fields, one can form the data matrices X = (xti ) and Y = ytj , t = 1, . . . n, i = 1, . . . p1 , and j = 1, . . . , p2 . We suppose that the respective data matrices were centred by removing the respective time averages, i.e. applying the centring operator In − n1 1n 1Tn to both matrices. The respective covariance and cross-covariance matrices are then computed Sxx = n1 XT X, Syy = n1 YT Y, Sxy = n1 XT Y.

(15.9)

  −1/2 −1/2 The pairs of canonical correlation patterns are then given by Sxx uk , Syy vk , where uk and vk , k = 1, . . . , m, are, respectively, the left and right singular vectors

−1/2 −1/2 of Sxx Syy Syy , and m = rank Sxy . Remarks The CCA problem, Eq. (15.5), is based on the orthogonal projection onto

−1 T the respective column space of the data matrices X and Y, i.e. X XT X X and similarly for Y. So the CCA problem simply consists of the spectral analysis of −1 T T −1 T

X y Y Y Y . X XT X

342

15 Methods for Coupled Patterns

−1 T Exercise Show that CCA indeed consists of the spectral analysis of X XT X X

T −1 T y Y Y Y . Hint Multiply both sides of the first equation of (15.5) by X.

Estimation Using EOFs The various operations outlined above can be expensive in terms of CPU time, particularly the matrix inversion in the case of a large number of variables. This is not the only problem that can occur in CCA. For example, it may happen that either Sxx or Syy or both can be rank deficient in which case the matrix inversion breaks down.3 The most common way to address these problems is to project both fields onto their respective leading EOFs and then apply CCA to the obtained PCs. The use of PCs can be regarded as a filtering operation and hence can reduce the effect of sampling fluctuation. This approach was first proposed by Barnett and Preisendorfer (1987) and also Bretherton et al. (1992) and is often used in climate research. When using the PCs, the covariance matrices Sxx and Syy become the identity matrices with respective orders corresponding to the number of PCs retained for each field. Hence the canonical correlation patterns simply reduce to the left and right singular vectors of the cross-covariance Sxy between PCs. For example, if T

α = α1 , . . . , αq is the left eigenvector of this cross-covariance, where q is the number of retained EOFs of the left field xt , t = 1, . . . , n, then the corresponding canonical correlation pattern is given by a=

q 

αk ek ,

(15.10)

k=1

where ek , k = 1, . . . , q are the q leading EOFs of the left field. The canonical correlation pattern for the right field is obtained in a similar way. Note that since the PCs are in general scaled to unit variance, in Eq. (15.10) the EOFs have to be scaled so they have the same units as the original fields. To test the significance of the canonical correlations, Bartlett (1939) proposed to test the null hypothesis that the leading, say r, canonical correlations are non-zero, i.e. H0 : λr+1 = λr+2 = . . . = λm = 0 using the statistic:   m   ! 1 1 − λˆ 2k . T = − n − (p1 + p2 + 3) Log 2 k=r+1

3 Generalised

inverse can be used as an alternative, see e.g. Khatri (1976), but as pointed out by Bretherton et al. (1992) the results in this case will be difficult to interpret.

15.2 Canonical Correlation Analysis

343

Under H0 and joint multinormality, one gets the asymptotic chi-square approximation, i.e. 2 , T ∼ χ(p 1 −r)(p2 −r)

see also Mardia et al. (1979). Alternatively, Monte Carlo simulations can be used to test the significance of the sample canonical correlations λˆ 1 , . . . λˆ m , see e.g. von Storch and Zwiers (1999) and references therein. Remark 1. It is possible to extend CCA to a set of more than two variables, see Kettenring (1971) for details. 2. Principal prediction patterns (PPPs) correspond to CCA applied to xt and yt = xt+τ , t = 1, . . . , n − τ . See, e.g. Dorn and von Storch (1999) and von Storch and Zwiers (1999) for examples. 3. The equations of CCA can also be derived using regression arguments. We first let xt = aT xt and yt = bT yt . Then to find a and b we minimize the error variance obtained from regressing yt onto xt , t = 1, . . . , n. Writing yt = αxt + εt , 2

[cov(xt ,yt )] t ,yt ) it is easy to show that α = cov(x var(xt ) and var (εt ) = var (yt )− var(xt ) ; hence, the sample variance estimate of the noise term is

2

T a Sxy b . σˆ = b Syy b − T a Sxx a 2

T

The patterns a and b minimising yielding

σˆ 2 bT Syy b

are then obtained after differentiation

Sxy b = λSxx a and Syx a = λSyy b , which is equivalent to (15.5).

15.2.4 Regularised CCA In several instances the data can have multi-colinearity, which occurs when there are near-linear relationships between the variables. When this happens, problems can occur when looking to invert the covariance matrix of X and/or Y. One common solution to this problem is to use regularised CCA (RCCA). This is very similar to

344

15 Methods for Coupled Patterns

ridge regression (Hoerl and Kennard 1970). In ridge regression4 the parameters of the model y = Xβ + ε are obtained by adding a regularising term in the residual sum of squares, RSS = (y − Xβ)T (y − Xβ) + λβ T β, leading to the estimate βˆ = (XT X + λI)−1 Xy. In regularised CCA the covariance matrices are similarly replaced by (XT X + λI)−1 and (XT X + λI)−1 . Therefore, as for CCA, RCCA consists of a spectral analysis of X(XT X + λ1 I)−1 XT Y(YT Y + λ2 I)−1 YT . In functional CCA, discussed later, regularisation (or smoothing) is required to get meaningful solution. The choice of the regularisation parameters is discussed later.

15.2.5 Use of Correlation Matrices An alternative to using the covariance and cross-covariance matrices is to use correlation and cross-correlation matrices Rxx , Ryy and Rxy . CCA obtained using correlation matrices is the same as covariance-based CCA but applied to the scaled fields. The equations remain the same, but the results will in general be different. The difference between the two can be compared to the difference between covariance-based or correlation-based EOFs. Note that CCA is invariant to scale changes in the variables.

15.3 Canonical Covariance Analysis Canonical covariance analysis (CCOVA) is similar to CCA except that canonical variates are constrained to have maximum covariance instead of maximum correlation. The method has been introduced into climate research by Wallace et al. (1992) and Bretherton et al. (1992) under the name of SVD analysis. Some authors, see e.g. von Storch and Zwiers (1999), have proposed to rename it as “maximum covariance analysis” (MCA). Using the same notation as before, the covariance between the respective canonical variates (or expansion coefficient), at = aT xt , and bt = bT yt , t = 1, . . . , n, obtained from the sample is cov(at , bt ) = aT Sxy b. The CCOVA, or MCA, problem is then formulated as follows. Find the patterns a and b maximising cov(at ,bt ) a b , or similarly

4 Ridge

regression is closely related to Tikhonov regularisation in Hilbert spaces. Tikhonov regularisation consists of finding an approximate solution to an “ill-posed” problem, Af = u, by solving instead a “regularised” problem, (A + λI) = u. This yields the approximate solution ˆf = (A∗ A + λI)−1 A∗ u, which is obtained using “penalised least squares” by minimising Af − u 2 + λ f 2 . The matrix A∗ is the adjoint of A.

15.3 Canonical Covariance Analysis

  maxu,v γ = uT Sxy v s.t uT u = vT v = 1.

345

(15.11)

The solution to Eq. (15.11) is provided by Sxy v = λu and uT Sxy = λvT .

(15.12)

Hence u and v correspond, respectively, to the leading eigenvectors of Sxy Syx and Syx Sxy , or equivalently to the left and right singular vectors of the cross-covariance matrix Sxy . In summary, from the cross-covariance data matrix Sxy = n1 XT Y where X and Y are the centred data matrices of the respective fields, the set of canonical covariances are provided by the singular values γ1 , . . . , γm , of Sxy arranged in decreasing order. The set of canonical covariance pattern pairs is provided by the associated left and right singular vectors U = (u1 , . . . , um ) and V = (v1 , . . . , vm ), respectively, where m is the rank of Sxy . Unlike CCA patterns, the canonical covariance pattern pairs are orthogonal to each other by construction. The CCOVA can be easily computed in Matlab. If X(n, p) and Y (n, q) are the two data matrices of both fields, the SVD of XT Y gives the left and right singular vectors along with the singular values, which can be arranged in decreasing order, as for EOFs (Chap. 3): >> [u s v] = svds (X’ * Y, 10, ‘L’); A typical example of regional climate and related teleconnection is the Mediterranean evaporation, which has been discussed, e.g. in Zveryaev and Hannachi (2012, 2016). For example, Zveryaev and Hannachi (2012) identified different teleconnection patterns associated with Mediterranean evaporation depending on the season. East Atlantic pattern (e.g. Woollings et al. 2010) was found in the winter season, whereas in the summer season a teleconnection of tropical origin was found. It has been suggested that the heating associated with the Asian summer monsoon (ASM) can initiate Rossby wave that can force climate variability over the east Mediterranean (Rodwell and Hoskins 1996). Correlation between all India rainfall, an index measuring the strength of the Indian summer monsoon, and summer MEVA reveals an east–west dipolar structure. A more general procedure is to apply SVD analysis between MEVA and outgoing long wave radiation (OLR) over the Asian summer monsoon region. Figure 15.1 shows the leading left and right singular vectors of the covariance matrix between September MEVA and August Asian summer monsoon OLR. The figure shows clear teleconnection between the Indian monsoon and MEVA with a lead–lag of 1 month. Stronger (than normal) Indian monsoon yields weaker (resp., stronger) than normal evaporation over the eastern (resp., western) Mediterranean. The time series associated with the left and right singular vectors (or left and right singular PCs) have maximum covariance. Figure 15.2 shows these time series, scaled to have unit standard deviation. The correlation between the time series is 0.64.

346

15 Methods for Coupled Patterns 9 7

Left singular vector (Sep Med. Evap.)

5 3 1 -1 -3 -5 -7 -9 -11

°

30 N °

0

-13 -15

Right singular vector (Aug. Asian Summer Monsoon OLR) 8

°

30 N

7 6 5 4 3 2 1 0 -1 -2 -3

°

0

-4 -5 -6 -7

°

60 E

Fig. 15.1 Left (top) and right (bottom) leading singular vectors of the covariance matrix between September Mediterranean evaporation and August Asian monsoon outgoing long wave radiation

Remark Both CCA and CCOVA can be used as predictive tools. In fact, if, for example, the right field yt lags the left field xt , t = 1, . . . , n, with a lag τ , then both these techniques can be used to predict yt from xt as in the case of principal predictive patterns (PPPs), see e.g. Dorn and von Storch (1999). In this case, the leading pair of canonical correlation patterns, for example, yield the corresponding time series that are most cross-correlated at lag τ , see also Barnston and Ropelewski (1992).

15.4 Redundancy Analysis

2

347

Scaled left/right singular PCs of Aug. OLR/Sep. MEVA singular vector

1

0

-1

-2

-3

1958

1969

1979

1989

1999

Fig. 15.2 Leading time series associated with the leading singular vectors of Fig. 15.1. Red: left singular vector (MEVA), blue: right singular vector (OLR)

15.4 Redundancy Analysis 15.4.1 Redundancy Index We have seen in Sect. 15.2.3 (see the remarks in that section) that CCA can be obtained from a regression perspective by minimising the error variance. This error variance, however, is not the only way to express the degree of dependence between two variables. Consider two multivariate time series xt and yt , t = 1, 2, . . . (with zero mean for simplicity). The regression of yt on xt is written as yt = xt + εt ,

(15.13)

where  =  yx  −1 xx is the regression matrix obtained by multiplying (15.13) by xt and then taking expectation. The error covariance matrix can also be obtained from (15.13) and is given by  εε =  yx  −1 xx  xy . The error variance is then given by tr ( εε ). An alternative way to measure the dependence or the redundancy of one variable relative to another is through the redundancy index (Stewart and Love 1968):



tr  yy −  εε tr  yx  −1 xx  xy



R (yt , xt ) = = . tr  yy tr  yy 2

(15.14)

348

15 Methods for Coupled Patterns

Hence the redundancy index represents the proportion of the variance in yt explained by xt . Remark • The redundancy index is invariant under nonsingular transformation of the independent variable xt and orthogonal transformation of the dependent variable yt . • The redundancy index represents the fraction of variance explained by the regression. In one dimension the redundancy index represents the R-square.

15.4.2 Redundancy Analysis Redundancy analysis was introduced first by van den Wollenberg (1977) and extended later by Johansson (1981) and put in a unified frame by Tyler (1982). It aims at finding pattern transformations (matrices) P and Q such that R 2 (Qyt , Pxt ) is maximised. To simplify we will reduce the search to one single

the calculation, Now from (15.14) this redundancy pattern p such that R 2 y t , pT xt is maximised.

−1 tr pT  xx p  yx ppT  xy

, which, after simplification using a index takes the form: tr ( yy ) little algebra, yields the redundancy analysis problem: 

   pT   p xy yx T max R yt , p xt = . p pT  xx p 2

(15.15)

The solution to Eq. (15.15) is obtained by solving the eigenvalue problem:  −1 xx  xy  yx p = λp.

(15.16)

A better way to solve (15.16) is to transform it to a symmetric eigenvalue problem by multiplying (15.16) through by  yx . In fact, letting q = η yx p, where η is a number to be found later, Eq. (15.16) becomes Aq =  yx  −1 xx  xy q = λq.

(15.17)

The qs are then the orthogonal eigenvectors of the symmetric positive semi-definite matrix A =  yx  −1 xx  xy . Furthermore,

from (15.16) the redundancy (actually minus redundancy) becomes R 2 yt , pT xt = λ. van den Wollenberg (1977) solved (15.16) and its equivalent, where the roles of x and y are exchanged, i.e. the eigenvalue problem of  −1 yy  yx  xy . As pointed out by Johansson (1981) and Tyler (1982), the transformations of xt and yt are not related. Johansson (1981) suggested using successive linear transformations of yt , i.e. bT yt , such that bT  yx b is maximised with bs being orthogonal. These vectors are in fact

15.5 Application: Optimal Lag Between Two Fields and Other Extensions

349

the q vectors defined above, which are to be unitary and orthogonal. Since one wishes the vectors qk = η yx pk to be unitary, i.e. η2 pTk  xy  yx pk = 1, we must have ηk2 = λ−1 k where the λk s are the eigenvalues of the matrix A (see Eq. (15.17)). Here we have taken pTk  xx pk = 1, which does not change the optimisation −1/2 problem (15.15). Therefore the q vectors are given by qk = λk  yx pk , and −1/2 similarly pk = λk  −1 xx  xy qk . Exercise Given that q = λ−1/2  yx p, show that p = λ−1/2  −1 xx  xy q. (Hint. Use Eq. (15.16).) Remark It can be seen from (15.16) that if xt = yt one gets the singular vectors of the (common) covariance matrix of xt and yt . Hence EOFs represent a particular case of redundancy analysis.

15.5 Application: Optimal Lag Between Two Fields and Other Extensions 15.5.1 Application of CCA 1. Given two zero-mean fields xt and yt , t = 1, . . . n, one wishes to find the optimal lag τ between the two fields along with the associated patterns. If one chooses the correlation between the two time series aT xt and bT yt+τ as a measure of association, then the problem becomes equivalent to finding patterns a and b satisfying 

2 

T a  xy (τ )b

, max φ (a, b) = T a,b a  xx a bT  yy b

(15.18)

where  xy (τ ) is the lagged covariance matrix, i.e.  xy (τ ) = E (xt yt+τ ). The problem becomes similar to (15.4) and (15.5) except that  xy is now replaced by  xy (τ ) and  yx by  yx (−τ ) =  Txy (τ ). By using the following SVD decomposition: −1

−1

 =  xx2  xy (τ ) yy2 = UVT , we get max [φ (a, b)] = λ21 , where λ1 is the largest singular value of . Exercise Derive the above result, i.e. max [φ (a, b)] = λ21 . In conclusion a simple way to find the best τ is to plot λ21 (τ ) versus τ and find the maximum, if it exists. This can be achieved by differentiating this univariate function and looking for its zeros. The associated patterns are then given by a =

350 −1/2

15 Methods for Coupled Patterns −1/2

 xx u1 and b =  yy v1 , where u1 and v1 are the leading left and right singular vectors of , respectively. 2. Note that one could also have taken  xy (τ ) +  yx (−τ ) instead of  xy (τ ) in (15.18) so the problem becomes symmetric. This extension simply means considering the case y leading x by (−τ ) beside x leading y by τ , which are the same.  3. Another extension is to look for patterns that maximise ρxy (τ )dτ . This is like in OPP (chap. 8), which looks for patterns maximising the persistence time for a single field, but applied to coupled patterns. In this case the numerator in (15.18)   2 M is replaced by aT for some lag M. The matrix involved τ =−M  xy (τ ) b here is also symmetric. Remark If in the previous extension one considers the case where yt = xt , then one obviously recovers the OPP.

15.5.2 Application of Redundancy The previous extension can also be accomplished using the redundancy index (15.14). For example, Eq. (15.16) applied to xt and yt+τ yields  −1 xx  xy (τ ) yx (−τ )u = λu.

(15.19)

If one takes yt = xt+τ as a particular case, the redundancy problem (15.19) yields the generalised eigenvalue problem:  xx (τ ) xx (−τ )u = λ xx u. Note also that the matrix involved in (15.8) and (15.19), e.g.  xy (τ ) yx (−τ ), is also symmetric positive semi-definite. Therefore, to find the best τ (see above), one can −1/2 plot λ2 (τ ) versus τ , where λ(τ ) is the leading singular value of  xx  xy (τ ), and choose the lag associated with the maximum (if there is one). Exercise Derive the leading solution of (15.19) and show that it corresponds to the −1/2 leading singular value of  xx  xy (τ ).

15.6 Principal Predictors As it can be seen from (15.5), the time series xt and yt play exactly symmetric roles, and so in CCA both the time series are treated equally. In redundancy analysis, however, the first (or left) field xt plays the role of the predictor variable,

15.6 Principal Predictors

351

whereas the second field represents the predictand or response variable. Like redundancy analysis, principal predictors (Thacker 1999) are based on finding a linear combination of the predictor variables that efficiently describe collectively the response variable. Unlike redundancy analysis, however, in principal predictors the newly derived variables are required to be uncorrelated. Principal predictors can be used to predict one field from another as presented by Thacker (1999). A principal predictor is therefore required to be maximally correlated with all the response variables. This can be achieved by maximising the sum of the squared

correlations with these response variables. Let yt = yt1 , . . . , ytp be the response field, where p is the dimension of the problem or the number of variables of the response field, and let a be a principal predictor. The squared correlation between xt = aT xt and the kth response variable ytk , t = 1, 2, . . . n, is  rk2

=

2

cov {ytk }t=1,...n , {aT xt }t=1,...n 2 aT S a σyk xx

,

(15.20)

  2 = S where σyk yy kk and represents the variance of the kth response variable ytk , t = 1, . . . n. The numerator of (15.20) is aT sk sTk a, where sk is the kth column of the cross-covariance matrix Sxy . Letting Dyy = Diag Syy , we then have p  1 s sT = Sxy D−1 yy Syx . 2 k k σ k=1 yk

Exercise Derive the above identity. (Hint. Take Dyy to be the identity matrix [for simplicity] and compute p the (i, j )th element of Sxy Syx and compare it to the corresponding element of k=1 sk sTk .) p The maximisation of k=1 rk2 yields  max aT a



Sxy D−1 yy Syx aT Sxx a

 a .

(15.21)

The principal predictors are therefore given by the solution to the generalised eigenvalue problem: Sxy D−1 yy Syx a = λSxx a.

(15.22)

If μk and uk represent, respectively, the kth eigenvalue and associated left singular −1/2 −1/2 vector of Sxx Sxy Dyy , then the kth eigenvalue λk of (15.22) is μ2k and the kth −1/2 principal predictor is ak = Sxx uk . Furthermore, since aTk Sxx al = uk ul = δkl , the new variables aTk xt are standardised and uncorrelated.


Remark The eigenvalue problem associated with EOFs using the correlation matrix, e.g. Eqs. (2.21) or (3.25), can be written as D^{-1/2} S_xx D^{-1/2} u = λ u, where D = Diag(S_xx). After the transformation u = D^{1/2} v it can also be written as D^{-1} S_xx v = λ v. This latter eigenvalue problem is identical to (15.22) when x_t = y_t. Therefore principal predictors reduce, when the two fields are identical, to a simple linear (diagonal) transformation of the correlation-based EOFs.
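As a concrete illustration of Eqs. (15.20)–(15.22), the following MATLAB sketch computes principal predictors through the SVD route described above. This is a minimal sketch, assuming the data sit in centred matrices X (n × p) and Y (n × q) with variables in columns; the synthetic data and variable names are illustrative only.

% Minimal sketch (not from the book): principal predictors via the SVD route of (15.22).
X  = detrend(randn(200, 10), 'constant');   % synthetic, centred predictor data (n-by-p)
Y  = detrend(randn(200, 6),  'constant');   % synthetic, centred predictand data (n-by-q)
n  = size(X, 1);
Sxx = (X' * X) / (n - 1);                   % predictor covariance matrix
Sxy = (X' * Y) / (n - 1);                   % cross-covariance matrix
Dyy = diag(diag((Y' * Y) / (n - 1)));       % Diag(Syy)
M   = sqrtm(Sxx) \ (Sxy / sqrtm(Dyy));      % Sxx^(-1/2) * Sxy * Dyy^(-1/2)
[U, S, ~] = svd(M, 'econ');
A      = sqrtm(Sxx) \ U;                    % principal predictor weights a_k = Sxx^(-1/2) u_k
lambda = diag(S).^2;                        % eigenvalues of (15.22): lambda_k = mu_k^2
PPs    = X * A;                             % principal predictor time series, mutually uncorrelated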

15.7 Extension: Functional Smooth CCA

15.7.1 Introduction

Conventional CCA and related methods are basically formulated to deal with multivariate observations of classical statistics, where the data are sampled at discrete, often regular, time intervals. In a number of cases in various branches of science, however, the data can be observed/monitored continuously. In medicine, for example, we have EEG records. In meteorology, barometric pressure records at a given location provide a good example; in fact we can have barometric pressure records at various locations. Similarly, we can have a continuous function of space, e.g. continuous surface temperature observed at different times. Most space–time atmospheric fields belong to this category when the coverage is dense enough. This also applies to slowly varying fields in space and time where, likewise, the obtained patterns are expected to be smooth; see e.g. Ramsay and Silverman (2006) for an introduction to the subject and for further examples. When time series are viewed as realisations of conceptual stochastic processes, one could attempt to look for smooth time series. The point is that there is no universal definition of such a process, although some definitions of smoothness have been proposed. For example, if the first difference of a random field is zero-mean multivariate normal, then the field can be considered as smooth (Pawitan 2001, chap. 1). Leurgans et al. (1993) point out that in the context of continuous CCA smoothing is particularly essential, and that without smoothing every possible function can become a canonical variate with perfect canonical correlation.

15.7.2 Functional Non-smooth CCA and Indeterminacy

In functional or continuous CCA one assumes that the discrete space–time fields x_k and y_k, k = 1, …, n, are replaced by continuous curves x_k(t) and y_k(t), k = 1, …, n, where t is now a continuous parameter in some finite interval T. For simplicity we suppose that the curves have been centred to have zero mean,⁵ i.e. Σ_{k=1}^n x_k(t) = Σ_{k=1}^n y_k(t) = 0, for all values of t within the above interval. Linear combinations of x_k(t) using, for example, a curve or continuous function a(t) now take the form of an integral, i.e.

⟨a, x_k⟩ = ∫_T a(t) x_k(t) dt.

In the following we suppose that x(t) and y(t) are two random functions and that x_k(t) and y_k(t), k = 1, …, n, are two finite sample realisations (of length n) drawn from x(t) and y(t), respectively.⁶ The covariance between ⟨a, x⟩ and ⟨b, y⟩ is given by

E[⟨a, x⟩⟨b, y⟩] = ∫_T ∫_T a(t) E[x(t)y(s)] b(s) dt ds = ∫_T ∫_T a(t) S_xy(t, s) b(s) dt ds,   (15.23)

where S_xy(t, s) = E[x(t)y(s)] represents the cross-covariance between x(t) and y(s). The sample estimate of this covariance is defined in the usual way by (1/(n−1)) Σ_{k=1}^n ∫_T ∫_T a(t) x_k(t) y_k(s) b(s) dt ds = ∫_T ∫_T a(t) Ŝ_xy(t, s) b(s) dt ds. Similar expressions can be obtained for the remaining covariances, i.e. Ŝ_yx(t, s), Ŝ_xx(t, s) and Ŝ_yy(t, s).

Remarks
• The functions S_xx(t, s) and S_yy(t, s) are symmetric functions, and similarly for their sample estimates.
• S_xy(t, s) = S_yx(s, t).
• Note that, by comparison with standard covariance statistics, the index k plays the role of “time” or sample (realisation), as pointed out earlier, whereas the variables t and s in the above integral mimic “space” or variables.

In a similar manner to the standard discrete case, the objective of functional CCA is to find functions a(t) and b(t) such that the correlation between the linear combinations ⟨a, x_k⟩ and ⟨b, y_k⟩ is maximised. The optimisation problem applied to the population yields

max_{a,b} ∫_T ∫_T a(t) S_xy(t, s) b(s) dt ds   (15.24)

⁵ This is like removing the ensemble mean of each field from each curve. Note that t here plays the role of the variables in the discrete case and the index k refers to observations or realisations.
⁶ In the standard notation of stochastic processes x(t) may be better noted as x(t, ω), where ω refers to the random part. That is, for fixed ω, i.e. ω = ω₀ (a realisation), we get a usual function x(t, ω₀) of t, and for fixed t, i.e. t = t₀, we get a random variable x(t₀, ω).


subject to the condition

∫_T ∫_T a(t) S_xx(t, s) a(s) dt ds = ∫_T ∫_T b(t) S_yy(t, s) b(s) dt ds = 1.   (15.25)

The system of equations (15.24)–(15.25) is equivalent to maximising

[∫_T ∫_T a(t) S_xy(t, s) b(s) dt ds]² / ( [∫_T ∫_T a(t) S_xx(t, s) a(s) dt ds] [∫_T ∫_T b(t) S_yy(t, s) b(s) dt ds] ).

If one is interested in functional canonical (or maximum) covariance analysis, then we obtain the same equations except that the covariances of the individual variables are reduced to unity, i.e. S_xx = S_yy = 1. As for conventional CCA, the optimisation problem given by Eqs. (15.24)–(15.25) yields the eigenvalue problem

∫_T S_xy(t, s) b(s) ds = μ ∫_T S_xx(t, s) a(s) ds
∫_T S_xy(t, s) a(t) dt = μ ∫_T S_yy(t, s) b(t) dt,   (15.26)

which can be written as a single generalised eigenvalue problem (as for the discrete case, Sect. 15.2); see next section for a proof outline of a similar result. When the above result is applied to the sample curves, it turns out that there are always functions a(t) and b(t) that guarantee perfect correlation between the corresponding linear combinations < a, xk > and < b, yk >. Furthermore, Leurgans et al. (1993) show that any linear combination of xk can be made perfectly correlated with the corresponding linear combination of yk , k = 1, . . . n. This points to a conceptual problem in estimating functional CCA from a finite sample using Eqs. (15.24)–(15.25). This problem is overcome by introducing some sort of smoothing in the estimation as is shown next.

15.7.3 Smooth CCA/MCA

Canonical Correlation

When one deals with continuous or functional data, smoothing becomes a useful tool to gain some insights into the data and can also ease the interpretation of the results. An example of a widely known nonlinear smooth surface fitting of a scatter of data points is the spline. For a given scatter of data points, a smoothing spline attempts to minimise a penalised residual sum of squares, using a smoothing parameter that controls the balance between goodness of fit and smoothness. In general terms, the


smoothing takes the form of an integral of the squared second derivative of the smoothing function. This smoothing derives from the theory of elastic rods and is proportional to the energy of the rod when stressed; see Appendix A for more details. The smoothing procedure in CCA is similar to the idea of spline smoothing. To achieve smooth CCA, the constraints (15.25) are penalised by a smoothing condition taking the following form:

∫_T ∫_T a(t) S_xx(t, s) a(s) dt ds + α ∫_T [d²a(t)/dt²]² dt = ∫_T ∫_T b(t) S_yy(t, s) b(s) dt ds + α ∫_T [d²b(t)/dt²]² dt = 1,   (15.27)

where α is a smoothing parameter and is also unknown; see also Ramsay and Silverman (2006). To solve the optimisation problem (15.24) subject to the smoothing constraints (15.27), a few assumptions on the regularity of the functions involved are required. To ease things, one considers the notations ⟨a, b⟩₁ = ∫_T a(t) b(t) dt for the natural scalar product between smooth functions a(·) and b(·), and ⟨a, b⟩_S = ∫_T ∫_T a(t) S(t, s) b(s) ds dt as the “weighted” scalar product between a(·) and b(·). The functions involved, as well as their mth derivatives, m = 1, …, 4, are supposed to be square integrable over the interval T. It is also required that the functions and their first four derivatives satisfy periodic boundary conditions, i.e. if T = [τ₀, τ₁], then d^α a/dt^α (τ₀) = d^α a/dt^α (τ₁), α = 1, …, 4. With these conditions, we have the following result.

Theorem The solution to the problem (15.24), (15.27), i.e.

max ⟨a, b⟩_{Sxy}  s.t.  ⟨a, a⟩_{Sxx} + α ⟨D²a, D²a⟩₁ = 1 = ⟨b, b⟩_{Syy} + α ⟨D²b, D²b⟩₁,

where D^α a(t) = d^α a/dt^α (t) and similarly for D^α b, is necessarily given by the solution to the following eigenvalue problem:

∫_T S_xy(t, s) a(t) dt = μ [ ∫_T S_yy(t, s) b(t) dt + α D⁴ b(s) ]
∫_T S_xy(t, s) b(s) ds = μ [ ∫_T S_xx(t, s) a(s) ds + α D⁴ a(t) ].   (15.28)

Proof Outline 1 We present here an outline of the proof using arguments from the calculus of variations (e.g. Cupta 2004). The general approach used in the calculus of variations is to assume the solution to be known and then work out the conditions that it satisfies. Let a(t) and b(t) be the solutions to (15.24) and (15.27); then for any functions â(t) and b̂(t) defined on T and satisfying the above properties, the function g(ε₁, ε₂) = ⟨a + ε₁ â, b + ε₂ b̂⟩_{Sxy} is maximised when ε₁ = ε₂ = 0, subject, of course, to the constraint (15.27). In fact, letting


G₁(a, â, ε, S) = ⟨a + ε â, a + ε â⟩_S + α ⟨D²a + ε D²â, D²a + ε D²â⟩₁ − 1, the extended function to be maximised is given by

G(ε₁, ε₂) = g(ε₁, ε₂) − ½ μ₁ G₁(a, â, ε₁, S_xx) − ½ μ₂ G₁(b, b̂, ε₂, S_yy),

where μ₁ and μ₂ are Lagrange multipliers. The necessary conditions for the maximum of G(ε₁, ε₂), obtained using the gradient ∇G at ε₁ = ε₂ = 0, yield

⟨â, b⟩_{Sxy} − μ₁ [ ⟨a, â⟩_{Sxx} + α ⟨D²a, D²â⟩₁ ] = 0
⟨a, b̂⟩_{Sxy} − μ₂ [ ⟨b, b̂⟩_{Syy} + α ⟨D²b, D²b̂⟩₁ ] = 0,

and this is true for all â and b̂ satisfying the required properties. Now the periodic boundary conditions imply, using integration by parts,

∫ D²a D²â = ∫ â D⁴a,

where the integration is over T. This result is a direct consequence of the fact that the operator D² = d²/dt², in the space of functions satisfying the above properties, is self-adjoint. Therefore the first of the two above equations leads to

∫_T [ ∫_T S_xy(t, s) b(s) ds − μ₁ ( ∫_T S_xx(t, s) a(s) ds + α D⁴ a(t) ) ] â(t) dt = 0

for all functions â(·) with the required properties, and similarly for the second equation. Hence the functions a(t) and b(t) are solutions to the integro-differential equations

∫ S_xy(t, s) b(s) ds = μ₁ [ ∫ S_xx(t, s) a(s) ds + α D⁴ a(t) ]
∫ S_xy(t, s) a(t) dt = μ₂ [ ∫ S_yy(t, s) b(t) dt + α D⁴ b(s) ].

Furthermore, after multiplying these two integro-differential equations by a(t) and b(s), respectively, and then integrating, using the periodic boundary conditions, one gets μ₁ = μ₂.

Proof Outline 2 Another shortcut to the proof can be used as follows. One considers the extended functional G(·) with ε₁ = ε₂ = 0, regarded as a function of a(·) and b(·). After differentiation⁷ with respect to a(·) and b(·), the (necessary) optimality condition is

⁷ This is a formal differentiation, noted δa, and operates as in the usual case. Note that the differential δa is also a function of the same type.


∫∫ δa S_xy b − μ₁ [ ∫∫ a S_xx δa + α ∫ D²a D²(δa) ] + ∫∫ a S_xy δb − μ₂ [ ∫∫ b S_yy δb + α ∫ D²b D²(δb) ] = 0.

The first part of the equation yields in particular

∫∫ δa S_xy b − μ₁ [ ∫∫ a S_xx δa + α ∫ D²a D²(δa) ] = 0

and similarly for the second part. These equalities are satisfied for all perturbations δa and δb having the required properties. Expanding the integrals using the periodic boundary conditions yields (15.28).

Application

Ramsay and Silverman (2006), for example, discuss an approximation based on a numerical integration scheme. In practice one does not need to solve the integro-differential equations (15.28). The problem can be substantially simplified by expanding the functional and smooth canonical functions in terms of basis functions (Appendix A) φ(t) = (φ₁(t), …, φ_p(t))ᵀ and ψ(t) = (ψ₁(t), …, ψ_q(t))ᵀ as a(t) = Σ_{k=1}^p u_k φ_k(t) = uᵀ φ(t) and b(s) = Σ_{k=1}^q v_k ψ_k(s) = vᵀ ψ(s). This method was particularly explored by Ramsay and Silverman (2006). Basis functions include Fourier and radial basis functions (Appendix A). In the next section we consider the case of radial basis functions. Here I would like to go back to the original problem based on (15.24) and (15.27) in the above theorem. We also consider here the case of different spaces for x and y, hence φ and ψ. Let us define the matrices V = (v_ij), A = (a_ij), B = (b_ij), C = (c_ij) and D = (d_ij) given, respectively, by v_ij = ⟨φ_i, S_xy ψ_j⟩, a_ij = ⟨φ_i, S_xx φ_j⟩, b_ij = ⟨D²φ_i, D²φ_j⟩, c_ij = ⟨ψ_i, S_yy ψ_j⟩ and d_ij = ⟨D²ψ_i, D²ψ_j⟩, where the notation refers to the natural scalar product. Then the optimisation problem (15.24), (15.27) can be transformed (see exercises below) to yield the following generalised eigenvalue problem for the coefficients u = (u₁, …, u_p)ᵀ and v = (v₁, …, v_q)ᵀ:

[ Vᵀ  O ] [ u ]       [   O      C + αD ] [ u ]
[ O   V ] [ v ]  = μ  [ A + αB     O    ] [ v ].

There are various ways to determine the smoothing parameter α, and we discuss this in the next two sections. In the following exercises we attempt to derive the above system and propose a simple solution.


Exercise 1
1. Using the above notation associated with the expansion in terms of basis functions, show that the maximisation problem (15.24) and (15.27) (see also the above theorem) boils down to max_{u,v} uᵀ V v s.t. uᵀ(A + αB)u = 1 = vᵀ(C + αD)v.
2. Using Lagrange multipliers μ₁ and μ₂, transform the above system to an unconstrained maximisation problem. To simplify notation, take P = A + αB and Q = C + αD.
3. Show that μ₁ = μ₂.
4. Derive the generalised eigenvalue problem satisfied by the vector a = (uᵀ, vᵀ)ᵀ.
(Hint. 3. Use the fact that uᵀ P u = vᵀ Q v = 1 in association with the gradient, with respect to u and also v, of the function to be maximised.)

Exercise 2 Since P and Q are symmetric and also positive semi-definite, they have square roots, e.g. P = P^{1/2} P^{1/2ᵀ}.
1. Show that the above maximisation problem can be transformed to max αᵀ E β s.t. αᵀα = 1 = βᵀβ.
2. Give the expression of E.
3. Find the solution α and β to the above problem.
4. Derive the expression of u and v as functions of α and β.
(Hint. 2. E = P^{−1/2ᵀ} V Q^{−1/2}; 3. Use the SVD theorem.)
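The route suggested by Exercises 1 and 2 is easily followed numerically. The MATLAB sketch below is illustrative only: the basis-function matrices are random stand-ins rather than true inner products ⟨φ_i, S_xy ψ_j⟩, etc., and symmetric square roots are used so that the transpose in the hint is immaterial. It forms P = A + αB and Q = C + αD, builds E, and recovers the leading coefficient vectors u and v from its SVD.

% Minimal sketch (illustrative): basis-function form of smooth CCA via Exercise 2.
p = 12; q = 10; alpha = 0.1;
V = randn(p, q);                          % stand-in for the matrix of <phi_i, Sxy psi_j>
A = cov(randn(50, p)); C = cov(randn(50, q));
B = eye(p); D = eye(q);                   % stand-ins for the roughness penalty matrices
P = A + alpha * B;  Q = C + alpha * D;    % as in Exercise 1
Phalf = sqrtm(P);  Qhalf = sqrtm(Q);      % symmetric square roots
E = Phalf \ V / Qhalf;                    % P^(-1/2) * V * Q^(-1/2)
[Ue, ~, Ve] = svd(E);
u = Phalf \ Ue(:, 1);                     % leading coefficient vectors:
v = Qhalf \ Ve(:, 1);                     % a(t) = u'*phi(t), b(s) = v'*psi(s)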

Maximum Covariance

Smooth maximum covariance analysis (SMCA) is similar to SCCA except that the constraint conditions (15.27) are reduced to

∫ a² + α ∫ (D²a)² = ∫ b² + α ∫ (D²b)² = 1.   (15.29)

The optimisation problem (15.24) subject to (15.29) yields the following system of integro-differential equations:

∫ S_xy(t, s) b(s) ds = μ [1 + α D⁴] a(t)
∫ S_xy(t, s) a(t) dt = μ [1 + α D⁴] b(s).   (15.30)


Exercise Using (15.24) and (15.29), derive (15.30). (Hint. Use the remarks above.)

In those equations the smoothing parameter is also unknown, but in practice various values can be tested to get an idea of the range of appropriate values. The eigenvalue μ and the eigenfunctions a(·) and b(·) are strictly unknown. Note also that when α = 0, Eqs. (15.28) and (15.30) yield examples of (non-smooth) functional CCA and functional MCA. In particular, the functional MCA (15.30) when α = 0 is given by

∫ K(s, t) u(t) dt = μ u(s),   (15.31)

where

K(s, t) = [     0        S_xy(s, t) ]
          [ S_xy(t, s)       0      ]

and u(s) = (a(s), b(s))ᵀ. Equation (15.31) is a vector version of the homogeneous Fredholm integral equation of the second kind⁸ (see e.g. Roach 1970). Note that when the continuous variables are equal, i.e. x = y, Eq. (15.30) yields smooth/functional PCA, see next section. Contrary to conventional CCA/MCA, where the obtained solution is a finite set of vectors, here the solution is an infinite, in general denumerable, set of functions. In application, however, the integro-differential equations (15.28) or (15.30) have to be discretised, hence yielding a finite set of equations. In a similar manner to the case of canonical correlation, the previous integro-differential equations can be simplified by using similar basis functions. The obtained generalised eigenvalue problem is similar to the one shown above (see also Exercise 1 above) except that here A = ⟨φ, φᵀ⟩ and C = ⟨ψ, ψᵀ⟩. Note that in general terms the basis functions can be chosen to be identical, i.e. φ = ψ, here and in canonical correlations.

Exercise Consider the above Exercise 2 again. Derive the maximisation problem for this case and give the expression of E.

15.7.4 Application of SMCA to Space–Time Fields

We now suppose that we have two continuous space–time fields F(x, t_k) and G(y, t_k), observed at times t_k, k = 1, …, n, where x and y represent spatial locations. We also suppose that x and y vary in two spatial domains D_x and D_y,

⁸ If μ is fixed to 1, (15.31) becomes of the first kind.


respectively. The covariance function between the fields F and G (with zero mean) at x and y is given by

S(x, y) = (1/n) Σ_{k=1}^n F(x, t_k) G(y, t_k).   (15.32)

The objective is to find (continuous) spatial patterns (functions) u(x) and v(y) maximising the integrated covariance

∫_{D_x} ∫_{D_y} u(x) S(x, y) v(y) dx dy

subject to the smoothing constraint

∫_{D_x} [ u²(x) + (∇²u(x))² ] dx = 1 = ∫_{D_y} [ v²(y) + (∇²v(y))² ] dy,

where ∇²u = Δu is the Laplacian of u(·). In two dimensions, for example, where x = (x₁, x₂), the Laplacian of u(x) is ∇²u(x₁, x₂) = (∂²/∂x₁² + ∂²/∂x₂²) u(x₁, x₂). A similar constraint for v(y) is also required. The situation is exactly similar to (15.24) and (15.29) except that now one is dealing with spatial fields, and one gets the following integro-differential system:

∫_{D_x} S(x, y) u(x) dx = μ [1 + α∇⁴] v(y)
∫_{D_y} S(x, y) v(y) dy = μ [1 + α∇⁴] u(x).   (15.33)

To solve (15.33) one can apply, for example, the method of expansion in terms of radial basis functions (RBFs), see Appendix A. For global fields over the spherical earth, one can use spherical RBFs, and one ends up with a generalised eigenvalue problem similar to that presented in the application in Sect. 15.7.3. This will be discussed in more detail in the next chapter in connection with smooth EOFs. An alternative (easy) method is to discretise the left hand side of the system. In practice, the fields are provided by their respective (centred) data matrices X = (x_tk), t = 1, …, n, k = 1, …, p, and Y = (y_tj), t = 1, …, n, j = 1, …, q. The cross-covariance matrix is then approximated by

S_xy = (1/n) Xᵀ Y,

where the fields are supposed to have been centred. The objective is then to find patterns a = (a₁, …, a_p)ᵀ and b = (b₁, …, b_q)ᵀ satisfying




Sᵀ_xy a = μ (I_q + α D⁴) b
S_xy b = μ (I_p + α D⁴) a,   (15.34)

which can be written in the form of a single generalised eigenvalue problem. This is the “easy” solution and is also discussed further in the next chapter. Note that if one is interested in smooth CCA, the identity matrices I_p and I_q above are to be replaced by S_xx and S_yy, respectively. The Laplacian operator ∇² in (15.33) is approximated using a finite difference scheme. In one dimension, for example, if the real function a(x) is observed at discrete positions x₁, …, x_m, then

d²a/dx² (x_k) ≈ (1/(δx)²) [a(x_{k+1}) − 2a(x_k) + a(x_{k−1})] = (1/(δx)²) (a_{k+1} − 2a_k + a_{k−1}),

where δx = x_{k+1} − x_k is the length of the discretised mesh. Hence D² will be a tridiagonal matrix. With appropriate boundary conditions, taking for simplicity a(x₀) = a(x_{m+1}) = 0, one gets

D² = (1/(δx)²) ×
  [ −2   1   0   …   0 ]
  [  1  −2   1   …   0 ]
  [  …   …   …   …   0 ]
  [  0   …   1  −2   1 ]
  [  0   …   0   1  −2 ]
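A minimal MATLAB sketch of this “easy” discretised route is given below. It is illustrative only: it builds the one-dimensional second-difference matrix D², forms the biharmonic operator D⁴ = (D²)², and solves (15.34) as a single generalised eigenvalue problem. The synthetic data, the sizes and the choice of α are stand-ins, and the 1/(δx)² factor is absorbed into α.

% Minimal sketch (illustrative, not from the book): discretised SMCA, Eq. (15.34).
n = 300; p = 40; q = 30; alpha = 1e2;
X = detrend(randn(n, p), 'constant');                % centred data matrices
Y = detrend(randn(n, q), 'constant');
Sxy = (X' * Y) / n;                                  % cross-covariance, as above
D2  = @(m) diag(-2*ones(m,1)) + diag(ones(m-1,1),1) + diag(ones(m-1,1),-1);
D4p = D2(p)^2;  D4q = D2(q)^2;                       % discretised biharmonic operators
% Stacked form of (15.34):
% [0 Sxy; Sxy' 0]*[a; b] = mu * blkdiag(Ip + alpha*D4p, Iq + alpha*D4q)*[a; b]
L = [zeros(p), Sxy; Sxy', zeros(q)];
R = blkdiag(eye(p) + alpha*D4p, eye(q) + alpha*D4q);
[W, M] = eig(L, R);
[~, imax] = max(diag(M));                            % leading smooth MCA mode
a = W(1:p, imax);  b = W(p+1:end, imax);             % smooth coupled patterns

Because R is symmetric positive definite the generalised eigenvalue problem is well behaved; for large grids one would use sparse matrices and eigs instead of eig.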

Once a and b are found, the corresponding smooth functions can be obtained using, for example, radial basis functions as

a(x) = Σ_{k=1}^p a_k φ(|x − x_k|)  and  b(y) = Σ_{l=1}^q b_l φ(|y − y_l|),   (15.35)

where φ(·) is a radial basis function. One could also use other smoothing procedures such as splines or kernel smoothers. In two dimensions, the discretised Laplacian in the plane is approximated by

∇²u(k, l) = [u(k−1, l) − 2u(k, l) + u(k+1, l)]/(δx)² + [u(k, l−1) − 2u(k, l) + u(k, l+1)]/(δy)².

In spherical geometry, where λ and ϕ are the longitudinal and latitudinal coordinates, i.e. x = r cosϕ cosλ, y = r cosϕ sinλ and z = r sinϕ, the Laplacian takes the form

∇²u = (1/r²) ∂²u/∂ϕ² + (1/(r² cos²ϕ)) ∂²u/∂λ² − (tanϕ/r²) ∂u/∂ϕ,

which can be easily discretised. The discretised Laplacian in spherical geometry is given in Chap. 14. Note that in (15.34) the discretised Laplacian depends on the


variables. For example, x_t and y_t may be observed at different grid points and over different regions, e.g. hemispheric vs. regional, as in the example where hemispheric SLP and tropical SST fields are used. Consequently, in (15.34) the Laplacians with respect to x and y may be denoted, for example, by D₁ and D₂, respectively. The choice of the smoothing parameter α is slightly more complicated. The experimentalist can always choose α based on previous experience. Another, more efficient, approach is based on cross-validation, see e.g. chap. 7 of Ramsay and Silverman (2006). The cross-validation procedure, introduced originally in statistical estimation problems, can be extended in the same way to deal with SMCA. Letting z_tᵀ = (x_tᵀ, y_tᵀ), we decompose the field z_t using the eigenvectors u_j = (a_jᵀ, b_jᵀ)ᵀ obtained from the generalised eigenvalue problem (15.34) and then compute the “residuals”⁹ ε_t = z_t − Σ_{j=1}^m β_j u_j. If u_j^{(k)}, j = 1, …, m, are the eigenvectors obtained after removing the kth observation, and ε_{t,k} are the resulting residuals, then the cross-validation is computed as

C_v(α) = Σ_{k=1}^m tr [ (1/(n−1)) Σ_{t=1}^n ε_{t,k}(α) ε_{t,k}ᵀ(α) ] = Σ_{k=1}^m [ (1/(n−1)) Σ_{t=1}^n ε_{t,k}ᵀ ε_{t,k} ],

which can be minimised to yield the cross-validation estimate of the smoothing parameter α. As for the case of EOFs, one can instead use the eigenvalues to find α. If μ_j^{(k)}, j = 1, …, m, are the m leading eigenvalues of (15.34) when the kth observation is removed from the analysis, and σ^{(k)} = Σ_{j=1}^m μ_j^{(k)}, then the optimal value of α corresponds to the value that maximises Σ_{k=1}^n σ^{(k)}. In these methods, the smoothing parameter is either known a priori or estimated separately. Another interesting approach used in SMCA was presented by Salim et al. (2005). It attempts to find the maximum covariance patterns and the smoothing parameter simultaneously using a penalised likelihood approach. Their smoothing definition for the spatial patterns a and b is based on the statistical assumption of multivariate normal IID errors (Pawitan 2001, chap. 18). Assuming X and Y to be realisations of multivariate random variables, Salim et al. (2005) consider the regression model

X = v aᵀ + ε_X
Y = u bᵀ + ε_Y,   (15.36)

with ε_X ∼ N(0, σ² I_p) and ε_Y ∼ N(0, ζ² I_q), and attempt to maximise a profile likelihood. The procedure is iterative and aims at maximising the normalised covariance between Xa and Yb; see Salim et al. (2005) for the algorithm.

residuals here do not have the same meaning as those used to construct EOFs via minimisation. Here instead, these residuals are used as an approximation to compute the “misfit”.

15.8 Some Points on Coupled Patterns and Multivariate Regression

363

Smooth MCA can be used to derive a new set of EOFs, smooth EOFs. These represent, in fact, a particular case of smooth MCA where the spatial fields are identical. This is discussed further in the next chapter.

15.8 Some Points on Coupled Patterns and Multivariate Regression In multivariate regression one attempts to explain the variability of the predictand in terms of a linear model of the predictor variables: y = Ax + ε  .

(15.37)

The regression matrix A is given by A =< yxT >< xxT >−1 ,

(15.38)

where the bracket stands for the expectation operator, i.e. < . >= E(.). The least squares fitted value of the predictand is given by yˆ = Ax =< yxT >< xxT >−1 x.

(15.39)

Now, given a p×n data matrix X of predictors and a q ×n matrix Y of predictand or dependent variables, then the least squares regression matrix is given by  −1 . A = YXT XXT

(15.40)

√ Note that in (15.40) X and Y are assumed to be scaled by n − 1. In many instances one usually transforms one or both data sets using a linear transformation for various reasons purposes, such as reducing the number of variables etc., and one would like to identify the regression matrix of the transformed variables. If, in model (15.37), x and y are replaced, respectively, by x = Lx and y = My, where L and M are two matrices, then the model becomes y = A x + ε  .

(15.41)

Using the data matrices X = LX and Y = MY, the new regression matrix is A = Y X

T

  

−1  T −1 X X = M YXT LT LXXT LT .

(15.42)

364

15 Methods for Coupled Patterns

In the particular case where L is invertible, one gets A = MAL−1 .

(15.43)

Remarks When L is invertible, one gets a number of useful properties: • The predicted value of the transformed variables is directly linked to the predicted value of the transformed variables as yˆ = A x = Mˆy.

(15.44)

• We also get, via Eqs. (15.37) and (15.41), a relationship between the sum of squared errors: T ε ε  = (y − yˆ )T (y − yˆ ) = (y − yˆ )T MMT (y − yˆ ).

(15.45)

The last equality in (15.45) means, in particular, that both the squared error (y − yˆ )T (y − yˆ ) and the positive semi-definite quadratic functions of the error (y − yˆ )T MMT (y − yˆ ) are minimised. From the above remarks, it can be seen, by choosing particular forms for the matrix M, such as rows of the identity matrix, that the error covariance of each variable separately is also minimised. Consequently the full regression model also embeds the individual models for the predictand variables (Tippett et al. 2008). By using the SVD of the regression matrix A, the regression coefficients can be interpreted in terms of correlation, explained variance, standardised explained variance and covariance (Tippett et al. 2008). For example, by using a whitening transformation, i.e. using the scaled PCs of both variables, then (as in the unidimensional case) the regression matrix A simply becomes the correlation matrix between the (scaled) predictand and predictors. We now consider the SVD of the matrix A , i.e. A = USVT , then UT Y and VT X decompose the (pre-whitened) data into time series that are maximally correlated, and uncorrelated with subsequent ones. In terms of the original variables the weight vectors satisfy, respectively, QTx X = VT X and QTy Y = VT Y and are given by −1/2 −1/2



Q_x = (X Xᵀ)^{-1/2} V  and  Q_y = (Y Yᵀ)^{-1/2} U.   (15.46)

Associated with these weight vectors are the pattern vectors

P_x = (X Xᵀ)^{1/2} V  and  P_y = (Y Yᵀ)^{1/2} U.   (15.47)
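The weight and pattern vectors of Eqs. (15.46)–(15.47) are easily computed from data, as in the following illustrative MATLAB sketch (synthetic data; the variable names are mine). The last line checks numerically the decomposition A = P_y S Q_xᵀ quoted in the remark below.

% Minimal sketch (illustrative): weight/pattern vectors of Eqs. (15.46)-(15.47).
% X is p-by-n and Y is q-by-n, centred and scaled by sqrt(n-1) as in (15.40).
p = 6; q = 4; n = 150;
X = randn(p, n); X = (X - mean(X, 2)) / sqrt(n - 1);
Y = randn(q, n); Y = (Y - mean(Y, 2)) / sqrt(n - 1);
Cxx = X * X';  Cyy = Y * Y';
Xw = sqrtm(Cxx) \ X;   Yw = sqrtm(Cyy) \ Y;     % pre-whitened data
[U, S, V] = svd(Yw * Xw', 'econ');              % SVD of the whitened regression matrix A'
Qx = sqrtm(Cxx) \ V;   Qy = sqrtm(Cyy) \ U;     % weight vectors, Eq. (15.46)
Px = sqrtm(Cxx) * V;   Py = sqrtm(Cyy) * U;     % pattern vectors, Eq. (15.47)
A  = (Y * X') / (X * X');                       % regression matrix, Eq. (15.40)
norm(A - Py * S * Qx', 'fro')                   % ~ 0: A = Py * S * Qx'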


Remark The pattern vectors are similar to EOFs (associated with PCs) and satisfy P_x Q_xᵀ X = X, and similarly for Y. These equations are solved in a least squares sense, see e.g. Tippett et al. (2008). Note that the above condition leads to P_xᵀ Q_x = I, and similarly for the predictand variables. Hence the data are decomposed into patterns with maximally correlated time series, uncorrelated with subsequent predictor and predictand ones. The regression matrix is also decomposed as A = P_y S Q_xᵀ, and hence CCA diagonalises the regression (Tippett et al. 2008). As for the univariate case, when only the predictor variables are pre-whitened the regression matrix¹⁰ represents the explained variance of the corresponding (individual) regression. This is what RDA attempts to achieve. If in addition the predictand variables are scaled by the corresponding standard deviations, then the regression matrix is precisely what PPA is about. When the predictor variables are scaled by the covariance matrix, i.e. X′ = (X Xᵀ)⁻¹ X, then the regression matrix becomes A′ = Y Xᵀ, which represents the covariances between predictand and predictors, hence MCA. Tippett et al. (2008) applied different transformations (or filtering) to a statistical

Fig. 15.3 CCA, RDA and MCA within the α–β plane of scaled SVD

¹⁰ In fact, the absolute value of the elements of the matrix.


downscaling problem of precipitation over Brazil using a GCM. They found that CCA provided the best overall results based on correlation as a measure of skill.

Remark MCA, CCA and RDA can be brought together in a unified-like approach through the scaled SVD (Swenson 2015). If X = U_x S_x V_xᵀ is an SVD of the data matrix X, then the data are scaled as

X* = U_x S_x^{α−1} U_xᵀ X = U_x S_x^α V_xᵀ,

and similarly for Y. Scaled SVD is then obtained by applying the SVD to the cross-covariance matrix X* Y*ᵀ. It can be seen that MCA, CCA and RDA can be recovered from scaled SVD by using, respectively, α = β = 1, α = β = 0 and α = 0, β = 1 (Fig. 15.3). Swenson (2015) points out that other intermediate values of 0 ≤ α, β ≤ 1 can isolate coupled signals better. This is discussed in more detail in the next chapter.
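A minimal sketch of the scaled-SVD idea, assuming the scaling X* = U_x S_x^α V_xᵀ reconstructed above, is given below; the data, sizes and parameter values are illustrative only. Setting (α, β) to (1, 1), (0, 0) or (0, 1) then reproduces, respectively, MCA, CCA and RDA as particular cases.

% Minimal sketch (illustrative): scaled SVD with exponents alpha (for X) and beta (for Y).
% X is p-by-n and Y is q-by-n, both centred in time; alpha = beta = 1 ~ MCA,
% alpha = beta = 0 ~ CCA, alpha = 0 & beta = 1 ~ RDA.
p = 20; q = 15; n = 200; alpha = 0.5; beta = 0.5;
X = detrend(randn(p, n)', 'constant')';
Y = detrend(randn(q, n)', 'constant')';
[Ux, Sx, Vx] = svd(X, 'econ');
[Uy, Sy, Vy] = svd(Y, 'econ');
Xs = Ux * Sx^alpha * Vx';                 % X* = Ux * Sx^alpha * Vx'
Ys = Uy * Sy^beta  * Vy';                 % Y* = Uy * Sy^beta  * Vy'
[U, S, V] = svd(Xs * Ys', 'econ');        % SVD of the scaled cross-covariance
u1 = U(:, 1); v1 = V(:, 1);               % leading coupled patterns in the scaled space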

Chapter 16

Further Topics

Abstract This chapter describes a number of further methods that have been developed and applied to weather and climate. They include random projection, which deals with very large data size; trend EOFs, which finds trend patterns in gridded data; common EOFs, which identifies common patterns between several fields; and archetypal analysis, which finds extremes in gridded data. The chapter also discusses other methods that deal with nonlinearity. Keywords Random projection · Cyclo-stationary EOFs · Trend EOFs · NAO · Siberian high · Common EOF analysis · Aleutian low · Continuum power CCA · Kernel MCA · Kernel CCA · Archetypal analysis · Riemannian manifold · Simplex visualisation · El-Nino · La-Nina · Western boundary currents · Principal nonlinear dynamical modes · Nonlinear PCs

16.1 Introduction

The research in multivariate data analysis has led to further development in various topics in EOF analysis. Examples of such development include EOFs of large datasets or data containing quasi-stationary signals. Also, sometimes we seek to identify trends from gridded climate data without resorting to simple linear regression. Another example includes the case when we seek to compute, for instance, common EOFs from different groups of (similar) datasets. Computer power has witnessed lately an unprecedented explosion, which has impacted different branches of science. In atmospheric science climate modelling has increased in complexity, which has led to the possibility of running climate models at high resolution. Datasets with large spatial and/or temporal resolution are currently being produced by various weather and climate centres across the world. This has led to the need for ways to analyse these data efficiently. Projection methods can be used to address this problem. When the data contain quasi-stationary signals, results from the theory of cyclo-stationary processes can be applied, yielding cyclo-stationary EOFs. Trend EOF analysis is another method that can be used to identify trend patterns from spatio-temporal climate data. Also, when we have


several datasets, as is frequently encountered in climate simulations from CMIP (Climate Modeling Intercomparison Project), then common EOF analysis can be used to efficiently compare these data. This chapter discusses these and other new methods not so much applied in weather and climate, including smooth EOFs, kernel CCA and archetypal analysis.

16.2 EOFs and Random Projection

EOF analysis has proved to be an easy and cheap way to reduce the dimension of climate data, retaining only a small set of the leading modes of variability, usually explaining a substantial amount of variance. This is particularly the case when the size of the data matrix X is not too large, e.g. O(10³). Advances in high performance computing have led to the recent explosion in the volume of data from climate model simulations, which beg for analysis tools. In particular, dimensionality reduction is required in order to handle and analyse these climate simulations. There are various ways to reduce the dimension of a data matrix. Perhaps the most straightforward method is that based on “random projection” (RP). In simple terms, RP is based on some sort of “sampling”. Precisely, given an n × p data matrix X, RP is based on constructing a p × k matrix R (k < p), referred to as the random projection matrix, and then projecting X onto R, i.e.

P = X R.   (16.1)

By choosing k much smaller than p, the new n × k data matrix P becomes much smaller than X, and EOF analysis, or any other type of pattern identification method, can be applied to it much more efficiently. Note that the “rotation matrix” is approximately orthogonal because its vectors are drawn randomly. It can, however, be made exactly orthogonal, but this will be at the expense of saving memory and CPU time. Random projection takes its origin from the so-called Johnson and Lindenstrauss (1984) lemma (see also Dasgupta and Gupta 2003):

Johnson–Lindenstrauss Lemma Given an n × p data matrix X = (x₁, x₂, …, x_n)ᵀ, for any ε > 0 and integer k > O(log n / ε²), there exists a mapping f : R^p → R^k such that for any 1 ≤ i, j ≤ n we have

(1 − ε) ‖x_i − x_j‖² ≤ ‖f(x_i) − f(x_j)‖² ≤ (1 + ε) ‖x_i − x_j‖².   (16.2)

The message from the above lemma is that it is always possible to embed the data into a lower dimensional space such that the interpoint distance is conserved up to any desired accuracy. One way to construct such a mapping is to generate random vectors that make up the rows of the projection matrix R. Seitola et al. (2014) used


the standard normal distribution N(0, 1) and normalised the random row vectors of R to have unit length. Other distributions have also been used, see e.g. Achlioptas (2003) and Frankl and Maehara (1988). Refinements of the lower limit of the dimension k provided in the lemma have also been given by a number of authors. For example, the values k = 1 + 9(ε² − 2ε³/3)⁻¹ and k = 4(ε²/2 − ε³/3)⁻¹ log n were provided, respectively, by Frankl and Maehara (1988) and Dasgupta and Gupta (2003). Seitola et al. (2014) applied EOF analysis to the randomly projected data. They reduced the data volume down to 10% and 1% of the original volume, and recovered the spatial structures of the modes of variability and their associated PCs. Let the SVD of X be X = U S Vᵀ, where U and V contain the PCs and EOFs of the full data matrix X, respectively. When the spatial dimension p is to be reduced, one first obtains an approximation of the PCs U of X by taking the PCs of the reduced/projected matrix P, i.e. U ≈ U_pr, where P = U_pr S_pr V_prᵀ. The EOFs of X are then approximated by projecting the PCs of the projected matrix P onto the data matrix X, i.e.

V ≈ Xᵀ U_pr D_pr⁻¹.   (16.3)

When the time dimension is to be reduced the same procedure can be applied to XT using P = XT R where R is now n × k. Remark Note that random projection can be applied only to reduce one dimension but not both. This, however, is not a major obstacle since it is sufficient to reduce only one dimension. Seitola et al. (2014) applied the procedure to monthly surface temperature from a millennial Earth System Model simulation (Jungclaus 2008) using two cases with n × p = 4608 × 4607 and n × p = 4608 × 78336. They compared the results obtained using 10% and 1% reductions. An example of plot of EOF patterns of the original and reduced data is shown in Fig. 16.1. Figure 16.2 shows the (spatial) correlation between the EOFs of the original data and those approximated from the 10% (top) and 1% (bottom) reduction of the spatial dimension. Clearly, the leading EOFs are well reproduced even with a 1% reduction. Figure 16.3 compares the spectra of the PCs from the original data and those from 10% and 1% reduction, respectively. The main peaks associated with periods 1, 1/2, 1/3 and 1/4 yr are very well reproduced in the PCs of the projected data. They also obtained similar results with the reduction of the 4608 × 78336 data matrix. When the time dimension is reduced, Seitola et al. (2014) find similar results to those obtained by reducing the spatial dimension. The random projection was also extended to delay coordinates yielding randomised multichannel SSA by Seitola et al. (2015). They applied it to the twentieth century reanalysis data and two climate model (HadGEM2-ES and MPI-ESM-MR) simulations from the CMIP5 data archive. They found, in particular, that the 2–6 year timescale variability in the centre Pacific was well captured by HadGEM2.
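The random projection recipe of Eqs. (16.1)–(16.3) reduces to a few lines of MATLAB, as in the following illustrative sketch; the sizes and variable names are stand-ins, and D_pr is taken here to be the diagonal matrix of singular values S_pr of the projected data.

% Minimal sketch (illustrative) of EOFs via random projection, Eqs. (16.1)-(16.3).
% X is n-by-p (time by space), centred; k << p is the reduced spatial dimension.
n = 500; p = 8000; k = 400;
X = detrend(randn(n, p), 'constant');        % stand-in for a large centred data matrix
R = randn(p, k);                             % Gaussian random projection matrix
R = R ./ vecnorm(R, 2, 2);                   % normalise rows to unit length (Seitola et al. 2014)
P = X * R;                                   % projected data, Eq. (16.1)
[Upr, Spr, ~] = svd(P, 'econ');              % PCs of the (much smaller) projected matrix
m = 10;                                      % number of modes to approximate
V_approx  = X' * Upr(:, 1:m) / Spr(1:m, 1:m);% approximate EOFs of X, Eq. (16.3)
PC_approx = Upr(:, 1:m);                     % approximate (orthonormal) PCs of X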


150

Fig. 16.1 Ninth to twelfth EOF patterns obtained from the original model simulation (left), and from randomly projected data with 10% (middle) and 1% (right) reduction. Adapted from Seitola et al. (2014)

16.3 Cyclo-stationary EOFs 16.3.1 Background Conventional EOF analysis decomposes a space-time field into a finite sum of “stationary” patterns modulated by time series amplitudes. The stationarity of EOFs simply refers to the fact that these patterns do not change with time, but only their amplitude and sign do. It is argued, however, that weather and climate variables are not stationary but are characterised by specific temporal scales/cycles, such as the annual cycles, making them rather cyclo-stationary. The conventional way to deal with this is to either subtract a smoothed or modulated version of the cycle or to focus on a specific season, such as the winter season. The argument, however, is that the various cycles present in weather and climate variables are not entirely additive but do interact with the variability of other time scales, i.e. non-cyclic variations. This makes the variance as well as the spatial and temporal autocorrelation characteristics season-dependent. This suggests a wholistic analysis of weather and climate variables instead of, e.g. subtracting the various cycles, as it was discussed in cyclo-stationary POP (CSPOP) analysis, see Chap. 6. One way to achieve this wholistic analysis is through the application of cyclo-stationary EOF (CSEOF) analysis as it was suggested by, e.g. Kim et al. (1996) and Kim and North (1997).

16.3 Cyclo-stationary EOFs

371

20

19

18

17

16

15

14

13

12

11

10

9

8

6

7

4

5

3

2

1

ORIGINAL 1

1 2

0.9

3 4

0.8

5 6

0.7

7 8

0.6

9

RP10%

10 0.5

11 12

0.4

13 14

0.3

15 16

0.2

17 18

0.1

19 20

0

1

20

19

18

17

16

15

14

13

12

11

10

9

8

6

7

5

4

3

2

1

ORIGINAL 1

2 3

0.9

4 5

0.8

6 7

0.7

8 9

RP1%

0.6

10 11

0.5

12 13

0.4

14 15

0.3

16 17

0.2

18 19

0.1

20 0

Fig. 16.2 Spatial correlations between the original EOFs and RP with 10% (top) and 1% (bottom) reduction. Adapted from Seitola et al. (2014)

372

16 Further Topics c

b

12

11

10

9

8

7

6

5

4

3

2

1

a

10

3

2

1

1/2

1/3

1/4

10

3

2

1

1/2

1/3

1/4

Period in years

Period in years

10

3

2

1

1/2

1/3

1/4

Period in years

Fig. 16.3 Spectra of the 10 leading original (a), and RP with 10% (b) and 1% (c) reduction. Adapted from Seitola et al. (2014)

16.3.2 Theory of Cyclo-stationary EOFs In its simplest form cyclo-stationary EOF analysis bears some similarities with CSPOP analysis in the sense that the variables have two distinct time scales representing, respectively, the cycle and the nested time within the cycle, as in the theory of cyclo-stationary processes encountered in signal processing (Gardner and Franks 1975; Gardner 1994; Gardner et al. 2006). Perhaps the main difference between conventional EOFs and CSEOFs is that the loading patterns (or EOFs) in the latter method depend on time and are periodic. If T is the nested period, then the field X(x, t) is decomposed as: X(x, t) =



Ek (x, t)vk (t),

(16.4)

k

with the property Ek (x, t) = Ek (x, t + T ).

(16.5)

The cyclo-stationary loading vectors, i.e. the CSEOFs, are obtained as the solution of a Karhunen–Loéve equation (Loève 1978):  

K(x, t; x , t  )E(x , t  )dt  dx = λE(x, t),

(16.6)

16.3 Cyclo-stationary EOFs

373

where K(.) is the covariance function, i.e. K(x, t; x , t  ) = cov(X(x, t), X(x , t  )).

(16.7)

Cyclo-stationarity implies that the covariance function is doubly periodic, i.e. K(x, t + T ; x , t  + T ) = K(x, t; x , t  ).

(16.8)

Theoretically speaking the periodicity in t × t  , Eq. (16.8), implies that K(.) can be expanded using double Fourier series in t × t  , and the CSEOFs can then be computed in spectral space, which are then back-transformed into physical space. This is feasible in simple examples such as the unidimensional case of Kim et al. (1996) who constructed CSEOFs using Bloch’s wave functions encountered in solid state physics. The CSEOFs are defined by ψnm (t) = e2π int/N Unm (t),

(16.9)

where Unm (.) is periodic with period T , and can be expanded as: Unm (t) =



unmk e2π ikt/T .

(16.10)

k

The coefficients unmk are obtained by solving an eigenvalue problem involving the cyclic spectrum. The application to weather and climate fields is, however, very expensive in terms of memory and CPU time. An approximate solution was suggested by Kim and North (1997), based on the assumption of independence of the PCs. The method is based on the Fourier expansion of the field X(x, t): X(x, t) =

T −1

ak (x, t)e2π ikt/T .

(16.11)

k=0

The CSEOFs are then obtained as the eigenvectors of the covariance matrix of the extended data (χ (x, t)), t = 1, . . . , N, where: χ (x, t) = (a0 (x, t), a1 (x, t), . . . , aT −1 (x, t)) .

(16.12)

16.3.3 Application of CSEOFs CSEOFs have been applied to various atmospheric fields. Hamlington et al. (2014), for example, argue that CSEOFs are able to minimise mode mixing, a common problem in conventional EOFs. Hamlington et al. (2011, 2014) applied CSEOFs

374

16 Further Topics

to reconstructed sea level. They suggest, using a nested period of 1 year, that CSEOF analysis is able to extract the modulated annual cycle and the ENSO signals from the Archiving, Validation and Interpretation of Satellite Oceanographic (AVISO) altimetry data. Like many other methods, CSEOF analysis has been used in forecasting (Kim and North 1999; Lim and Kim 2006) and downscaling (Lim et al. 2010). Kim and Wu (1999) conducted a comparative study between CSEOF analysis and other methods based on EOFs and related techniques including extended EOFs, POPs and cyclo-stationary POPs. Their study suggests that CSEOFs is quite akin to extended EOFs where the lag is not unity, as in extended EOFs, but is equal to the nested period T . Precisely, the extended data take the form: χ (x, t) = (X(x, 1 + tT ), X(x, 2 + tT ), . . . , X(x, T + tT )) , and the associated covariance matrix takes the form: ⎛ ⎞ C11 . . . C1T ⎜ ⎟ C = ⎝ ... . . . ... ⎠ ,

(16.13)

(16.14)

CT 1 . . . CT T where Ckl , k, l = 1, . . . , T , is the spatial covariance matrix between X(x, k + tT ) and X(x , l + tT ), i.e. Ckl = cov (X(x, k + tT ), X(x, l + tT )) .

(16.15)

16.4 Trend EOFs 16.4.1 Motivation The original context of EOFs (Obukhov 1947; Fukuoka 1951; Lorenz 1956) was to achieve a decomposition of a continuous space-time field X(t, s), such as sea level pressure, where t and s denote, respectively, time and spatial location, as X(t, s) =

M 

ck (t)ak (s)

(16.16)

k=1

using an optimal set of basis functions of space ak () and expansion functions of time ck (). As it was discussed in Chap. 3 the EOF method has some useful properties, e.g. orthogonality in space and time. These properties yield, however, a number of difficulties, as it is discussed in Chap. 3, such as:

16.4 Trend EOFs

375

• Physical interpretability—this is caused mainly by the predictable geometric constraints and the mixing property of EOFs. • Truncation order—which comes about when we seek to reduce the dimensionality of the data. Ideally, the best truncation order is provided by the “elbow” of the spectrum of the covariance matrix. This, however, does not occur in practice as the spectrum most often looks rather “continuous”. • Second-order statistics and ellipticity—it is straightforward to see that the restriction to using the second-order statistics underlies indirect assumptions regarding the distribution of the data, namely the Gaussianity of the data1 since the probability distribution function (pdf) of a multivariate Gaussian random variable is completely specified by its mean and its covariance matrix. In fact, if the data come from such a distribution, then the EOFs are simply the directions of the principal axes of the distribution (Chatfield and Collins 1980; Jolliffe 2002). This interpretation extends to a large class of probability distributions, namely elliptically contoured distributions (Fang and Zhang 1990). A pdf is elliptical if there exist a linear transformation that transforms the distribution to be spherical, i.e. a function of the radial coordinate only. In brief, an elliptical pdf is one whose contour body, i.e. points having the same value of the probability density function, is an ellipsoid. Elliptically contoured distributions behave somewhat like Gaussian distributions, but allow for fat tails; the t-distribution is an example. So EOFs can be easily interpreted when the data are elliptically distributed. There are other remaining problems with the EOF method related, for example, to linearity, since the covariance or correlation are “linear” measures of association and nonlinearity is not taken into consideration. Consequently, information from high-order moments is excluded from the analysis. Furthermore, EOFs based on the covariance matrix are in general different from those obtained from the correlation matrix hence the choice between the two matrices remains arbitrary (Wilks 2011; Jolliffe 2002). All the above difficulties are detailed in Hannachi (2007). Ideally, of course, one would like to find a method that can address all those issues. Note also that as the measure of association based on conventional covariance or correlation is characterised by a non-invariance to monotonic transformation, it is desirable to have a measure of association that is invariant under such a transformation. Such a method may not exist in reality, and if it does it may not be useful in data reduction. Now, we know that trends exist in the atmosphere and can characterise weather and climate variables. We also know that, in general, EOFs do not capture trends as they are not conceived for that purpose,2 in addition to the mixing problem characterising EOFs. It turns out, in fact, that the method that addresses the difficulties listed above identifies trends, and was dubbed trend EOF (TEOF) method by Hannachi (2007). 1 In practice, of course, EOFs of any data, not necessarily Gaussian, can be computed. But the point

is that using only the covariance matrix is consistent with normality. are of course exceptions such as when the trend has the largest explained variance.

2 There

376

16 Further Topics

16.4.2 Trend EOFs The trend EOF method was introduced as a way to find trend patterns from gridded data through overcoming the drawbacks of EOFs by addressing somehow the difficulties listed above, see Hannachi (2007) for details. In essence, the method uses some concepts from rank correlation. it is based on the rank correlation between the time position of the sorted data. Precisely, Let x1 , x2 , . . . xp , where xk = (x1k , . . . xnk ), k = 1, . . . p, be the p variables, i.e. the p time series, or the rows forming our spatio-temporal field or data matrix X = [x 1 , . . . x n ]T = (xij ), i = 1, . . . n, j = 1, . . . p. For each k, 1 ≤ k ≤ p, we also designate by p1k , p2k , . . . , pnk the ranks of the corresponding kth time series x1k , . . . xnk . The matrix of rank correlations is obtained by constructing first the new variables: yk = gk (xk ) = (p1k , p2k , . . . , pnk )

(16.17)

for k = 1, 2, . . . , p, where gk () is the transformation mapping the time series onto its ranks. These ranks are given by yk = (p1k , p2k , . . . , pnk ) = (pk (1), pk (2), . . . , pk (n)) ,

(16.18)

where pk () is a permutation of {1, 2, . . . , n}. As an illustration, let us consider a simple time series with only five elements; x = (5, 0, 1, −3, 2). Then the new variable is given by the ranks of this time series, i.e. y = (5, 2, 3, 1, 4), which is a permutation p1 () of {1, 2, 3, 4, 5}. The original data are first sorted in increasing order. Then the position in time of each datum from the sorted series is taken. These (time) positions (from the sorted data) constitute our new data. So the newly transformed data z1 , z2 , . . . zp are composed of p time series, each of which is some permutation of {1, 2, . . . , n}, i.e. zk = (q1k , q2k , . . . , qnk ) = (qk (1), qk (2), . . . , qk (n))

(16.19)

for some permutation qk () of {1, 2, . . . , n}. It can be seen (see the appendix in Hannachi (2007)) that this permutation is precisely the reciprocal of the rank permutation pk (), i.e. qk = pk−1 = pk(n−2) , (m)

(16.20)

where pk = pk opk o . . . opk = pk (pk (. . . (pk )) . . .) is the mth iteration of the permutation pk (). As an illustration consider again the previous simple 5-element time series. The new transformed time series is obtained by sorting first x to yield (−3, 0, 1, 2, 5). Then, z consists of the (time) position of these sorted elements as given in the

16.4 Trend EOFs

377

original time series x, to yield z = (4, 2, 3, 5, 1), which is also a permutation q1 () of {1, 2, 3, 4, 5}. It can be easily checked that the permutations p1 () and q1 () are simply reciprocal of each other, i.e. p1 oq1 () is the identity over the set {1, 2, 3, 4, 5}. Now by looking for maximum correlation between the time positions of the sorted data we are attempting to find times when the different time series are increasing (or decreasing) altogether, i.e. coherently. The leading modes based on this new correlation (or covariance) are expected therefore to capture these slowly varying structures or trends. We should note obviously that if there is a very strong trend with a large explained variance, then most probably it will be captured by ordinary EOFs as the leading mode. However, this is the exception rather than the rule. In summary our new data matrix is now given by ⎛

q11   ⎜ q21 ⎜ Z = zT1 , zT2 , . . . , zTp = ⎜ . ⎝ ..

q12 . . . q22 . . . .. .

⎞ q1p q2p ⎟ ⎟ .. ⎟ . ⎠

(16.21)

qn1 qn2 . . . qnp and we are looking for correlations (or covariances) between (time) positions from the sorted data as: ρT (xk , xl ) = cov (zk , zl )

(16.22)

for k, l = 1, 2, . . . p. The trend EOFs are then obtained as the “EOFs/PCs” of the newly obtained covariance matrix, which is also identical to the correlation matrix (up to a multiplicative constant):  T = (ρT (xk , xl )) =

1 T T Z H HZ, n

(16.23)

where H = In − n1 1n 1Tn , and represents the centring operator, In being the n × n identity matrix, and 1n = (1, 1, . . . , 1)T and is the column vector of length n containing only ones. The application of TEOFs is then based on the SVD of the new data matrix HZ to obtain the trend EOFs/PCs. The Matlab code for trend EOFs is quite straightforward. Again if the data matrix is X(n, p12), the following code obtains the required field T X, which is submitted to SVD exactly as in the conventional EOFs of the data matrix X presented in Chap. 3: >> [X, TX] = sort (X); >> TX = scale (TX);

378

16 Further Topics

b) 4

2

2

2

0

-4

x3

4

-2

0 -2

0

200

-4

400

0

200

-4

400

2

2

2

-2

PC4

4

0

0 -2

0

200

-4

400

-2 0

g)

200

-4

400

2

TPC3

2

TPC2

2

0 -2

0

200

400

-4

200

400

i) 4

-4

0

h) 4

-2

400

0

4

0

200

f)

4

-4

0

e)

4

PC2

PC1

0 -2

d)

TPC1

c)

4

x2

x1

a)

0 -2

0

200

400

-4

0

200

400

Fig. 16.4 Time series of the first variables simulated from Eq. (16.24) (first row), PCs 1, 2 and 4 (second row) and the leading three trend PCs (third row). (a) Time series wt . (b) Time series xt . (c) Time series yt . (d) PC1. (e) PC2. (f) PC4. (g) Trend PC1. (h) Trend PC2. (i) Trend PC3. Adapted from Hannachi (2007)

16.4.3 Application of Trend EOFs Illustration with a Simple Example The TEOF method was illustrated with using simple examples by Hannachi (2007), as shown in Fig. 16.4. The first row in Fig. 16.4 shows an example of time series from the following 4-variable model containing a quadratic trend plus a periodic wave contaminated by an additive AR(1) noise. ⎧ wt = 1.8at + 2βb(t) + 1.6εt1 ⎪ ⎪ ⎨ xt = 1.8at + 1.8βb(t) + 2.4εt2 ⎪ y = 0.5at + 1.7βb(t) + 1.5εt3 ⎪ ⎩ t zt = 0.5at + 1.5βb(t) + 1.7εt4

(16.24)

16.4 Trend EOFs

379

b) Histogram

20

20

20

15 10

0

Frequency

25

5

15 10 5

0

0.5

0

1

15 10 5

0

Correlation

0.5

0

1

e) Histogram 20

20

5

Frequency

20

Frequency

25

10

15 10 5

0

0.5

Correlation

1

0

1

f) Histogram

25

15

0.5

Correlation

25

0

0

Correlation

d) Histogram

Frequency

c) Histogram

25

Frequency

Frequency

a) Histogram 25

15 10 5

0

0.5

Correlation

1

0

0

0.5

1

Correlation

Fig. 16.5 Histograms of the correlation coefficients between the quadratic trend, Eq. (16.24), and the correlation-based PCs 1 (a), 2 (b) and 4 (c), and the trend PCs 1 (d), 2 (e) and 3 (f). Adapted from Hannachi (2007)

where at is a quadratic trend proportional to t 2 , at ∝ t 2 , β = 2, and b(t) = cos 4t5 + sin 5t . The noises εtk , k = 1, . . . 4, are again AR(1) with respective lag1 autocorrelations 0.5, 0.6, 0.3 and 0.35. The second row in Fig. 16.4 shows PC1, PC2 and PC3 of the model whereas the last row shows the leading three “PCs” obtained from the new data matrix, Eq. (16.21). The trend is clearly captured by the first (trend) PC (Fig. 16.4g,h,i). To check the significance of this result a Monte-Carlo approach is used to calculate the correlation between the quadratic trend and the different PCs. Figure 16.5 shows the histogram of these correlations obtained using the leading (conventional) PCs (Fig. 16.5a,b,c) and the trend PCs (Fig. 16.4d,e,f). Clearly the trend is shared between all PCs. However, when the PCs from the new data matrix, Eq. (16.21), are used the figure is different (Fig. 16.5d,e,f), where now only the leading PC entirely captures the trend.

380

16 Further Topics

Fig. 16.6 Eigenspectra of the covariance (or correlation) matrix given in Eq. (16.23) of SLP anomalies

Fig. 16.7 As in Fig. 16.6 but for the sea level pressure anomalies

Application to Reanalysis Data The method was applied to reanalysis data using SLP and 1000-mb, 925-mb and 500-mb geopotential heights from NCAR/NCEP (Hannachi 2007). The application to the NH SLP anomaly field for the (DJFM) winter months shows a discontinuous spectrum of the new matrix (Eq. (16.21)) with two well separated eigenvalues and a noise floor (Fig. 16.6). By contrast, the spectrum of the original data matrix (Fig. 16.7) is “continuous”, i.e. with no noise floor. It can be seen that the third eigenvalue is in this noise floor. In fact, Fig. 16.8 shows the third EOF of the new data matrix where it is seen that clearly there is no coherent structure (or noise), i.e. no trend. The leading two trend EOFs, with eigenvalues above the noise floor (Fig. 16.6) are shown in Fig. 16.9. It is clear that the structure of these EOFs is different from the first trend EOF. Of course, to obtain the “physical” EOF pattern from those trend EOFs some form of back-transformation is applied from the space of the transformed data

16.4 Trend EOFs

381

Fig. 16.8 Third EOF of winter monthly (DJFM) NCEP/NCAR SLP anomalies based on Eq. (16.23)

Fig. 16.9 As in Fig. 16.8 but for the first (left) and second (right) trend EOFs of SLP anomalies

matrix (Eq. (16.21)) into the original (data) space. Hannachi (2007) applied a simple regression between the trend PC and the original field to obtain the trend pattern, but it is possible to apply more sophisticated methods for this back-transformation. The obtained patterns are quite similar to the same patterns obtained based on winter monthly (DJF) SLP anomalies (Hannachi 2007). These leading two patterns associated with the leading two eigenvalues are shown in Fig. 16.10, and reveal, respectively, the NAO pattern as the first trend (Fig. 16.10a), and the the Siberian high as the second trend (Fig. 16.10b). This latter is known to have a strong trend in


Fig. 16.10 The two leading trend modes obtained by projecting the winter SLP anomalies onto the first and second EOFs of Eq. (16.23) then regressing back onto the same anomalies. (a) Trend mode 1 (SLP). (b) Trend mode 2 (SLP). Adapted from Hannachi (2007)

the winter time (see, e.g., Panagiotopoulos et al. 2005), a trend that is not found in the mid- or upper troposphere. The application to geopotential height shows a somewhat different signature from that obtained from the SLP. At 500-mb, for example, there is a single eigenvalue separated from the noise floor. The pattern of the trend represents the NAO. However, the eigenspectra at 925-mb and 1000-mb yield two eigenvalues well


separated from the noise floor. The leading one is associated with the NAO as above, whereas the second one is associated with the Siberian high, a well-known surface feature (Panagiotopoulos et al. 2005). The method was also applied to identify the trend structures of global sea surface temperature by Barbosa and Andersen (2009), regional mean sea level (Losada et al. 2013), and latent heat fluxes over the equatorial and subtropical Pacific Ocean (Li et al. 2011a). The method was also applied to other fields involving diurnal and seasonal time scales by Fischer and Paterson (2014), who used the trend EOFs in combination with the linear trend model for diurnal and seasonal time scales of Vinnikov et al. (2004). The TEOF method was extended by Li et al. (2011b) to identify coherent structures between two fields, and applied to global ocean surface latent heat fluxes and SST anomalies.
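The regression-based back-transformation described above is straightforward to implement. The following is a minimal sketch, assuming numpy, an anomaly field `X` of shape (time, grid points) and a trend PC time series `pc` obtained from the transformed data matrix; the variable names and the standardisation of the PC are choices of this illustration, not the book's prescription.

```python
import numpy as np

def trend_pattern(X, pc):
    """Regress an anomaly field X (n_time x n_grid) onto a (trend) PC time
    series pc (n_time,) to obtain the associated spatial trend pattern."""
    pc = (pc - pc.mean()) / pc.std()      # standardise the PC
    Xa = X - X.mean(axis=0)               # make sure the field is in anomaly form
    # regression coefficient at each grid point = covariance with the PC
    return Xa.T @ pc / len(pc)

# usage (illustrative): pattern1 = trend_pattern(slp_anomalies, trend_pc1)
```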

16.5 Common EOF Analysis

16.5.1 Background

EOF (or PC) analysis deals with one data matrix, and as such it is also known as a one-sample method. The idea then arose to extend this to two or more samples, which is the objective of common EOF (or PC) analysis. The idea of using two-sample PCA goes back to Krzanowski (1979), who computed the angles between the leading EOFs of each group. Flury (1983) extended the two-sample EOF analysis by computing the eigenvectors of $\Sigma_1^{-1}\Sigma_2$, where $\Sigma_k$, $k = 1, 2$, is the covariance matrix of the kth group, to obtain, simultaneously, uncorrelated variables in the two groups. Common PC (or EOF) analysis was suggested back in the late 1980s with Flury (1984, 1988). Given a number of groups, or populations, common EOFs arise when we think that the covariance matrices of these groups may have the same EOFs but with different weights (or importance) in different groups. Sometimes this condition is relaxed in favour of only a subset of similar EOFs with different weights in different groups. This problem is quite common in weather and climate analysis. Comparing, for example, the large scale flow patterns from a number of climate model simulations, such as those from CMIP5, is a common problem in weather and climate research. The basic belief underlying these models is that they may have the same modes of variability, but with different prominence, e.g. explained variance, in different models. This can also be used to compare different reanalysis products from various weather centres, such as the National Center for Environmental Prediction (NCEP), the National Center for Atmospheric Research or the Japan Meteorological Agency. Another direction where common EOF analysis can be explored is the analysis of ensemble forecasts. This can help identify any systematic error in the forecasts. The problem of common EOF analysis consists of finding common EOFs of a set of M data matrices $X_1, \ldots, X_M$, where $X_k$, $k = 1, \ldots, M$, is the $n_k \times p$ data


matrix of the kth group. Note that data matrices can have different sample sizes. The problem of common EOFs emerges naturally in climate research, particularly in climate model evaluation. In CMIP5, for example, one important topic is often to seek a comparison between the modes of variability of large scale flow of different models. The way this is done in climate research literature is via a simple comparison between the (conventional) EOFs of the different climate model simulations. One of the main weaknesses of this approach is that the EOFs tend to be model-dependent, leading to difficulties of comparison. It would be more objective if the modes of variability are constructed on a common ground, and a natural way to do this is via common EOFs.

16.5.2 Formulation of Common EOFs

We assume here that we are given M different groups or populations with corresponding covariance matrices $\Sigma_k$, $k = 1, \ldots, M$. If the covariance matrices are equal, then the within-population EOFs are also the same for the M different populations. The more interesting case is when these covariance matrices are different. In this case an attempt by statisticians was to identify an orthogonal matrix A and M diagonal matrices $\Lambda_k$, $k = 1, \ldots, M$, satisfying the following hypothesis $H_c$:

$$H_c: \quad A^T \Sigma_k A = \Lambda_k, \qquad k = 1, \ldots, M. \qquad (16.25)$$

The column vectors of $A = [a_1, \ldots, a_p]$ are the common EOFs, and $U_k = X_k A$ are the common PCs.

Remark Note that in the case when the covariance matrices are different there is no unique way to define the set of within-population EOFs (or PCs), i.e. the common EOFs. Note, however, that unlike conventional EOFs, the diagonal elements of $\Lambda_k$, $k = 1, \ldots, M$, need not have the same order, and may not be monotonic simultaneously for the different groups.

The common PCA model (16.25) appeared in Flury (1988) as a particular model among five models with varying complexity. The first model (level-1) is based on the equality of all covariances $\Sigma_k$, $k = 1, \ldots, M$. The second level model assumes that the covariance matrices $\Sigma_k$, $k = 1, \ldots, M$, are proportional to $\Sigma_1$. The third model is precisely Eq. (16.25), whereas in the fourth model only a subset of EOFs are common eigenvectors, hence the partial common PC model. The last model has no restriction on the covariance matrices. Those models are described in Jolliffe (2002). Flury (1984) estimated the common EOFs from Eq. (16.25) using maximum likelihood based on the normality assumption $N(\mu_k, \Sigma_k)$, $k = 1, \ldots, M$, of the kth p-variate random vector generating the data matrix $X_k$, $k = 1, \ldots, M$. Letting


$S_k$, $k = 1, \ldots, M$, denote the sample (unbiased) estimate³ of the covariance matrix $\Sigma_k$, the likelihood function (Flury 1984) takes the form:

$$L(\Sigma_1, \ldots, \Sigma_M) = \alpha \prod_{k=1}^{M} \frac{1}{|\Sigma_k|^{n_k/2}} \exp\left[\mathrm{tr}\left(-\frac{n_k}{2}\Sigma_k^{-1}S_k\right)\right], \qquad (16.26)$$

where tr(·) is the trace function and α is a constant. Denoting $\Lambda_k = \mathrm{diag}(\lambda_{k1}, \ldots, \lambda_{kp})$, the maximisation of the logarithm of the likelihood L(·) in Eq. (16.26) yields the following system of equations:

$$a_k^T \left( \sum_{i=1}^{M} n_i \frac{\lambda_{ik} - \lambda_{il}}{\lambda_{ik}\lambda_{il}} S_i \right) a_l = 0, \quad 1 \le k < l \le p, \qquad \text{subject to} \quad A^T A = I_p. \qquad (16.27)$$

Flury (1984) computed the likelihood ratio LR to test the hypothesis $H_c$, see Eq. (16.25):

$$LR = -2\log\frac{L(\hat{\Sigma}_1, \ldots, \hat{\Sigma}_M)}{L(S_1, \ldots, S_M)} = \sum_{k=1}^{M} n_k \log\frac{|\hat{\Sigma}_k|}{|S_k|}, \qquad (16.28)$$

where $\hat{\Sigma}_k = \hat{A}\hat{\Lambda}_k\hat{A}^T$ represents the likelihood estimator of $\Sigma_k$, $k = 1, \ldots, M$. This ratio behaves, asymptotically, as a chi-square:

$$LR \sim \chi^2_{(M-1)(p-1)/2}. \qquad (16.29)$$

As it is mentioned above, the estimator maximising the likelihood Eq. (16.26) yields no guarantee on the simultaneous monotonic order of the eigenvalues of k , k = 1, . . . , M. This can be seen as a weakness of the method, particularly if the objective is to use the method for dimensionality reduction. Various attempts have been applied to overcome this drawback. For example, Krzanowski (1984) computed the EOFs of the pooled sample covariance matrix and the total sample covariance matrix followed by a comparison of their EOFs. In fact, Krzanowski (1984) computed the EOFs of S1 + . . . + SM , and compared them with those obtained using some kind of weighted sums of Sk , k = 1, . . . M, in order to assess whether the hypothesis Eq. (16.25) is true, see also the description in Jolliffe (2002). Schott (1988) developed an approximate method for testing the equality of the EOF subspaces from two groups, and extended it later to several groups (Schott 1991). The method is based on using the sum of the corresponding covariance matrices, which is somehow related to that of Krzanowski (1984). Other extensions of, and tests for common EOF analysis are given in Jolliffe (2002).

³ The sample covariance matrix follows a Wishart distribution, i.e. $n_k S_k \sim W_p(n_k, \Sigma_k)$.


All the methods described above, including those mentioned in Jolliffe (2002), share, however, the same drawback mentioned earlier, related to the lack of simultaneous monotonic change of the eigenvalues for all groups. A more precise method to deal with the problem was proposed later (Trendafilov 2010) by computing the common EOFs based on a stepwise procedure. The common EOFs are estimated sequentially one-by-one, hence allowing for a monotonic decrease (or increase) of the eigenvalues of $\Lambda_k$, $k = 1, \ldots, M$, in all groups simultaneously. The method is similar to the stepwise procedure applied in simplifying EOFs (Hannachi et al. 2006), and is based on projecting the gradient of the common EOF objective function onto the orthogonal complement of the space spanned by the common EOFs identified in the previous steps. The reformulation of the common EOF problem is again based on the likelihood Eq. (16.26), which takes a similar form, namely:

$$\min_{A} \sum_{k=1}^{M} n_k \log\left|\mathrm{diag}\left(A^T S_k A\right)\right| \quad \text{subject to} \quad A^T A = I_p. \qquad (16.30)$$

If the eigenvalues are to decrease monotonically in all groups simultaneously, then Eq. (16.30) can be transformed (Trendafilov 2010) to yield

$$\max_{a} \sum_{k=1}^{M} n_k \log\left(a^T S_k a\right) \quad \text{subject to} \quad a^T a = 1, \ \text{and} \ a^T Q_{j-1} = 0^T, \qquad (16.31)$$

where $Q_{j-1} = [a_1, a_2, \ldots, a_{j-1}]$ and $Q_0 = 0$. The first-order optimality condition of (16.31) is

$$\left( \sum_{k=1}^{M} \frac{n_k}{n}\, \frac{S_k}{a_1^T S_k a_1} - I_p \right) a_1 = 0, \qquad (16.32)$$

and for $j = 2, \ldots, p$:

$$\left( I_p - Q_{j-1}Q_{j-1}^T \right)\left( \sum_{k=1}^{M} \frac{n_k}{n}\, \frac{S_k}{a_j^T S_k a_j} - I_p \right) a_j = 0. \qquad (16.33)$$

Trendafilov (2010) solved Eqs. (16.32–16.33) using the standard power method (Golub and van Loan 1996), which is a special case of the more general gradient ascent method of quadratic forms on the unit sphere (e.g. Faddeev and Faddeeva 1963). The solution to Eqs. (16.32–16.33) is applied to the NCAR/NCEP monthly SLP anomalies over the period 1950–2015. The data are divided into 4 seasons to yield 4 datasets for monthly DJF, MAM, JJA and SON, respectively. The data used here are on a 5◦ × 5◦ grid north of 20◦ N. The objective is to illustrate the common EOFs from these data, that is to get the matrix of loadings that simultaneously diagonalise


Fig. 16.11 Explained variance of the common EOFs for the winter (dot), spring (circle), summer (asterisk), and fall (plus)

the covariance matrices of those 4 datasets. The algorithm is quite fast with four 1080 × 1080 covariance matrices, and takes a few minutes on a MacPro. The results are illustrated in Figs. 16.11 and 16.12. The obtained diagonal matrix (or spectrum) for each dataset is transformed into percentage of explained variance (Fig. 16.11). There is a clear jump in the spectrum between the first and the remaining modes for the four datasets. The leading common mode explains a little more variance for spring and summer compared to winter and autumn. The spatial patterns of the leading 4 common EOFs are shown in Fig. 16.12. The first common EOF (top left) projects well onto the Arctic oscillation, mainly over the polar region, with an opposite centre in the North Atlantic centred around 40N. Common EOF2 (top right) shows a kind of wavetrain with its main centre of action located near northern Russia. The third common EOF (bottom right) reflects more the NAO pattern, with around 10% explained variance. The NAO is weak in the spring and summer seasons, which explains the reduced explained variance. The fourth common EOF shows mainly the North Pacific centre, associated most likely with the Aleutian low variability. As pointed out earlier, the common EOF approach is more suited, for example, to the analysis of outputs from comparable GCMs, such as the case of CMIP models, where the objective is to evaluate and quantify what is common in those models in terms of modes of variability.
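As a rough illustration of the stepwise procedure of Eqs. (16.32)–(16.33), the sketch below applies a projected power-type iteration to a list of covariance matrices. This is only an illustration under the stated equations, not Trendafilov's (2010) exact algorithm; `S_list` (the $S_k$) and `n_list` (the $n_k$) are assumed inputs.

```python
import numpy as np

def common_eofs(S_list, n_list, n_modes, n_iter=500):
    """Sequential common EOFs via a projected power-type iteration
    applied to the conditions of Eqs. (16.32)-(16.33)."""
    p = S_list[0].shape[0]
    n = float(sum(n_list))
    Q = np.zeros((p, 0))                       # common EOFs found so far
    for _ in range(n_modes):
        a = np.random.randn(p)
        a /= np.linalg.norm(a)
        for _ in range(n_iter):
            # weighted matrix appearing in Eqs. (16.32)-(16.33), at current a
            M = sum((nk / n) * Sk / (a @ Sk @ a)
                    for Sk, nk in zip(S_list, n_list))
            b = M @ a
            b -= Q @ (Q.T @ b)                 # project out previous common EOFs
            a = b / np.linalg.norm(b)
        Q = np.column_stack([Q, a])
    return Q

# common PCs of group k (illustrative): U_k = X_k @ Q
```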


Fig. 16.12 The leading 4 common EOFs of the four datasets, namely (monthly) DJF, MAM, JJA, and SON NCEP/NCAR sea level pressure anomalies over the period 1950–2015

16.6 Continuum Power CCA

16.6.1 Background

Conventional CCA, MCA and RDA are standard linear methods that are used to isolate pairs of coupled patterns from two datasets. They are all based on SVD, and the patterns obtained from the different methods are linked through linear transformations. In fact, it is possible to view CCA, MCA and RDA within a unified frame where each of them becomes a particular case. This is obtained through what is known as the partial whitening transformation. Partial whitening, with degree α, aims at partially decorrelating the variables. This transformation is used in continuum power regression, which links partial least squares (PLS) regression, ordinary least squares regression and principal component regression (PCR), e.g. Stone and Brooks (1990). Swenson (2015) extended continuum power regression to CCA to get continuum power CCA (CPCCA).


16.6.2 Continuum Power CCA

Let X be an n × p (anomaly) data matrix (p is the number of variables and n is the sample size), and $C = n^{-1}X^TX$ the covariance matrix. CPCCA (Swenson 2015) is based on the partial whitening (with power α) of X, i.e.

$$X_* = A_{\alpha,x} X^T, \qquad (16.34)$$

where

$$A_{\alpha,x} = C^{-\frac{1-\alpha}{2}}. \qquad (16.35)$$

We suppose here that C is of full rank so that its non-integer power exists.⁴

Remark The standard whitening transformation corresponds to α = 0.

CPCCA patterns u and v are obtained via:

$$\max\left(u^T X^T Y v\right) \quad \text{s.t.} \quad u^T\left(X^TX\right)^{1-\alpha}u = v^T\left(Y^TY\right)^{1-\beta}v = 1. \qquad (16.36)$$

Remark It is straightforward to check that:
• α = β = 0 corresponds to conventional CCA.
• α = β = 1 corresponds to MCA.
• α = 0, β = 1 corresponds to RDA.

The above optimisation problem is equivalent to the following "MCA" problem:

$$\max\left(u^T X_* Y_*^T v\right) \quad \text{s.t.} \quad u^Tu = v^Tv = 1, \qquad (16.37)$$

where $X_* = A_{\alpha,x}X^T$ and $Y_* = A_{\beta,y}Y^T$. As for CCA, the CPCCA patterns (in the partially whitened space) are given by the SVD of $X_*Y_*^T$, i.e. $X_*Y_*^T = U_+ S V_+^T$, and the associated cross-covariance by the diagonal of S. The CPCCA time series are provided by projecting the partially whitened variables onto the singular vectors $U_+$ and $V_+$, yielding $T_x = X_*^T U_+$ and $T_y = Y_*^T V_+$. The CPCCA patterns within the original space are obtained by using the inverse of $A_{\alpha,x}$ and $A_{\beta,y}$, i.e.

$$U = \left(X^TX\right)^{\frac{1-\alpha}{2}} U_+ \quad \text{and} \quad V = \left(Y^TY\right)^{\frac{1-\beta}{2}} V_+. \qquad (16.38)$$

⁴ If $C = U\Lambda U^T$, with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$, then $C^\delta = U\Lambda^\delta U^T$ where $\Lambda^\delta = \mathrm{diag}(\lambda_1^\delta, \ldots, \lambda_p^\delta)$.


Fig. 16.13 Fraction of signal variance explained (FSVE) for the leading CPCCA mode in the partial whitening parameters (α,β) plane for different values of the signal amplitudes a and b (maximum shown by ‘*’) for a = b = 1(a), 0.75(b), 0.6(c). Also shown are the maxima of the cross-correlation, squared covariance fraction and the fraction of variance of Y explained by X, associated, respectively, with CCA, MCA and RDA. Adapted from Swenson (2015). ©American Meteorological Society. Used with permission

Remark The time series $T_x$ and $T_y$ are (cross-) uncorrelated, i.e. $T_x^T T_y = S$. The time series $T_x$, however, are not uncorrelated.

Partial whitening provides more flexibility through varying the parameters α and β. CPCCA is similar, in some way, to the partial phase transform, PHAT-β, encountered in signal processing (Donohue et al. 2007). Partial whitening can be shown to yield an increase in performance when applied to CCA regarding the S/N ratio. Figure 16.13 shows the performance of CPCCA as a function of α and β for an artificial example in which the "common component" is weighted by specific numbers a and b in the data matrices X and Y, respectively.
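The CPCCA computation of Eqs. (16.34)–(16.38) can be sketched as follows, with fractional matrix powers obtained from the eigendecomposition. The $1/n$ factor in C is dropped here (it only rescales the patterns), full-rank covariance matrices are assumed as in the text, and all names are illustrative.

```python
import numpy as np

def mat_power(C, expo):
    """Fractional power of a symmetric positive definite matrix."""
    w, V = np.linalg.eigh(C)
    return (V * w**expo) @ V.T

def cpcca(X, Y, alpha, beta, n_modes=3):
    """Continuum power CCA via the SVD of the partially whitened cross-covariance."""
    Cx, Cy = X.T @ X, Y.T @ Y
    Xs = mat_power(Cx, -(1.0 - alpha) / 2) @ X.T     # Eqs. (16.34)-(16.35)
    Ys = mat_power(Cy, -(1.0 - beta) / 2) @ Y.T
    U_, s, Vt_ = np.linalg.svd(Xs @ Ys.T)            # "MCA" problem, Eq. (16.37)
    Up, Vp = U_[:, :n_modes], Vt_[:n_modes].T
    U = mat_power(Cx, (1.0 - alpha) / 2) @ Up        # back-transformed patterns, Eq. (16.38)
    V = mat_power(Cy, (1.0 - beta) / 2) @ Vp
    Tx, Ty = Xs.T @ Up, Ys.T @ Vp                    # CPCCA time series
    return U, V, Tx, Ty, s[:n_modes]
```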

16.6.3 Determination of the Degree Parameter

Various methods can be used to determine the degree of partial whitening (or regularisation) parameter α. Perhaps an intuitive approach is to consider the simultaneous optimisation of $u^T X_* Y_*^T v$ with respect to u, v, α and β, where the optimum solution is given by

$$\{u_o, v_o, \alpha_o, \beta_o\} = \mathrm{argmax}\left(u^T X_* Y_*^T v\right) \quad \text{s.t.} \quad u^Tu = v^Tv = 1. \qquad (16.39)$$

It is possible to solve the above equation numerically. This was used by Salim et al. (2005) to estimate the smoothing parameter in regularised MCA in addition to the


spatial patterns. They applied the method to analysing the association between the Irish winter precipitation and sea surface temperature. They found a clear association between Irish precipitation anomalies, the El-Niño Southern Oscillation and the North Atlantic Oscillation. We note, however, that the calculation can be cumbersome. The other, and common, method is to use cross-validation (CV). The CV is feasible in practice but requires a relatively extended computation as it is based on a leave-one-out procedure. We explain the procedure here for the conventional CCA. The application to the CPCCA is similar. In CCA we seek the spectral analysis of $X\left(X^TX\right)^{-1}X^T Y\left(Y^TY\right)^{-1}Y^T$, whereas in regularised CCA, with regularisation parameter $\lambda = (\lambda_1, \lambda_2)$, we are interested in the spectral analysis of $X\left(X^TX + \lambda_1 I\right)^{-1}X^T Y\left(Y^TY + \lambda_2 I\right)^{-1}Y^T$, where I is the identity matrix. We designate by $X_{-i}$ and $Y_{-i}$ the data matrices derived from X and Y, respectively, by removing the ith rows $x_i$ and $y_i$ of X and Y, respectively. We also let $\rho_\lambda^{(-i)}$ be the leading canonical correlation from CCA of $X_{-i}$ and $Y_{-i}$ with corresponding patterns (eigenvectors) $u_\lambda^{(-i)}$ and $v_\lambda^{(-i)}$. The cross-validation score can be defined, in general, as a measure of the squared error of a test set evaluated for an eigenvector from a training set. The CV score is defined (Leurgans et al. 1993) by

$$CV(\lambda) = \mathrm{corr}\left( \left\{x_i u_\lambda^{(-i)}\right\}_{i=1,\ldots,n},\ \left\{y_i v_\lambda^{(-i)}\right\}_{i=1,\ldots,n} \right). \qquad (16.40)$$

Note that in the above equation we consider $x_i$ as a $1 \times p$ row vector. The cross-validated parameter $\hat{\lambda}$ is then given by

$$\hat{\lambda} = \underset{\lambda=(\lambda_1,\lambda_2)}{\mathrm{argmax}}\ CV(\lambda). \qquad (16.41)$$
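A possible leave-one-out implementation of the CV score, Eqs. (16.40)–(16.41), is sketched below. The routine `rcca_leading_pair`, which should return the leading regularised CCA pattern pair (u, v) for given (λ₁, λ₂), is a placeholder to be supplied by the user, and the candidate grid is purely illustrative.

```python
import numpy as np

def cv_score(X, Y, lam, rcca_leading_pair):
    """Leave-one-out CV score of Eq. (16.40) for lam = (lam1, lam2)."""
    n = X.shape[0]
    px, py = [], []
    for i in range(n):
        keep = np.arange(n) != i
        u, v = rcca_leading_pair(X[keep], Y[keep], *lam)  # train without row i
        px.append(X[i] @ u)                               # test projections
        py.append(Y[i] @ v)
    return np.corrcoef(px, py)[0, 1]

def select_lambda(X, Y, grid, rcca_leading_pair):
    """Cross-validated parameters, Eq. (16.41), over a candidate grid."""
    return max(grid, key=lambda lam: cv_score(X, Y, lam, rcca_leading_pair))

# usage (illustrative): best = select_lambda(X, Y, [(0.1, 0.1), (1, 1), (10, 10)], my_rcca)
```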

The third method to estimate the optimal parameter is related to ridge regression, in which a regularisation parameter, in the form of λI, is added to the covariance matrix of the predictor variables before computing the inverse. In ridge regression a transformation is applied using $T_{ridge} = \left[(1-\rho)X^TX + \rho\mu I\right]^{-1/2}$, with $\mu = \frac{1}{p}\left\|X^T\right\|_F^2$. An estimate of ρ is derived by Ledoit and Wolf (2004):

$$\rho_{LW} = \frac{\sum_{t=1}^{n}\left\|(n-1)\,x_t x_t^T - X^TX\right\|_F^2}{n^2\left\|X^TX - \mu I\right\|_F^2} \qquad (16.42)$$

with $\|\cdot\|_F$ being the Fröbenius norm ($\|C\|_F = \sqrt{\mathrm{tr}(CC^T)}$). For CPCCA, Swenson (2015) suggests the following estimator for the parameter α:

$$\hat{\alpha} = \underset{\alpha}{\mathrm{argmin}}\left\| \nu\left(X^TX\right)^{1-\alpha} - (1-\rho_{LW})X^TX - \rho_{LW}\mu I \right\|_F^2 \qquad (16.43)$$

with $\nu = \left\|X^T\right\|_F^2 \big/ \left\|\left(X^TX\right)^{\frac{1-\alpha}{2}}\right\|_F^2$.


16.7 Kernel MCA

16.7.1 Background

Given two data matrices X and Y, classical or standard MCA looks for patterns a and b such that Xa and Yb have maximum covariance. These patterns are given, respectively, by the left and right singular vectors of the cross-covariance⁵ matrix $B = X^TY$. These vectors satisfy $X^TYb = n\lambda a$ and $Y^TXa = n\lambda b$. In addition, the associated time series $x = Xa$ and $y = Yb$ satisfy, respectively:

$$XX^TYY^T x = n^2\lambda^2 x, \qquad YY^TXX^T y = n^2\lambda^2 y. \qquad (16.44)$$

Exercise Derive Eq. (16.44).

In practice, of course, we do not solve Eq. (16.44), but we apply the SVD algorithm to $X^TY$. The above derivation is useful for what follows.

16.7.2 Kernel MCA

Kernel MCA takes its roots from kernel EOFs, where a transformation φ(·) is used to map the input data space onto a feature space, and EOF analysis is then applied to the transformed data. In kernel MCA the X and Y feature spaces are spanned, respectively, by $\phi(x_1), \ldots, \phi(x_n)$ and $\phi(y_1), \ldots, \phi(y_n)$. The objective is similar to standard MCA but applied to the feature spaces. We designate by $\mathcal{X}$ and $\mathcal{Y}$ the matrices (or rather operators) defined, respectively, by

$$\mathcal{X} = \begin{pmatrix} \phi(x_1)^T \\ \vdots \\ \phi(x_n)^T \end{pmatrix} \quad \text{and} \quad \mathcal{Y} = \begin{pmatrix} \phi(y_1)^T \\ \vdots \\ \phi(y_n)^T \end{pmatrix}, \qquad (16.45)$$

and we seek "feature" patterns u and v from the feature space such that $\mathcal{X}u$ and $\mathcal{Y}v$ have maximum covariance. The cross-covariance matrix between $\phi(x_k)$ and $\phi(y_k)$, $k = 1, \ldots, n$, is

$$C = \frac{1}{n}\sum_{t=1}^{n}\phi(x_t)\phi(y_t)^T = \frac{1}{n}\mathcal{X}^T\mathcal{Y} \qquad (16.46)$$

⁵ X and Y are supposed to be centred.


and the left and right singular vectors satisfy

$$Cv = \lambda u, \qquad C^Tu = \lambda v. \qquad (16.47)$$

As in kernel EOF we see that u and v take, respectively, the following forms:

$$u = \sum_{t=1}^{n} a_t\phi(x_t) \quad \text{and} \quad v = \sum_{t=1}^{n} b_t\phi(y_t). \qquad (16.48)$$

Inserting (16.48) into (16.47), using (16.46), and denoting by $K_x$ and $K_y$ the matrices with respective elements $K_{ij}^x = \phi(x_i)^T\phi(x_j)$ and $K_{ij}^y = \phi(y_i)^T\phi(y_j)$, we get

$$K_xK_yb = n\lambda K_xa, \qquad K_yK_xa = n\lambda K_yb. \qquad (16.49)$$

One can solve (16.49) simply by considering the necessary condition, i.e.

$$K_yb = n\lambda a, \qquad K_xa = n\lambda b, \qquad (16.50)$$

which yields $K_xK_yb = n^2\lambda^2 b$ and similarly for a. Alternatively, we can still use Eq. (16.49) to obtain, for example:

$$K_yK_xK_yb = n\lambda\,K_yK_xa = n^2\lambda^2\,K_yb, \qquad (16.51)$$

which is an eigenvalue problem with respect to $K_yb$.

Remark
• Note that Eq. (16.51) can also be seen as a generalised eigenvalue problem.
• With isotropic kernels of the form $H(\|x - y\|^2)$, such as the Gaussian kernel, with $H(0) \neq 0$, $K_x$ and $K_y$ are, in general, invertible and Eq. (16.49) becomes straightforward.

16.7.3 An Alternative Way

One can use the (data) matrices within the feature space, as in the standard case (i.e. without transformation), and directly solve the system:

$$\frac{1}{n}\mathcal{X}^T\mathcal{Y}v = \lambda u, \qquad \frac{1}{n}\mathcal{Y}^T\mathcal{X}u = \lambda v, \qquad (16.52)$$


which, as in the standard MCA, leads to

$$\mathcal{X}\mathcal{X}^T\mathcal{Y}\mathcal{Y}^T\mathcal{X}u = n^2\lambda^2\,\mathcal{X}u, \qquad \mathcal{Y}\mathcal{Y}^T\mathcal{X}\mathcal{X}^T\mathcal{Y}v = n^2\lambda^2\,\mathcal{Y}v. \qquad (16.53)$$

Now, $x = \mathcal{X}u$ is a time series of length n, and similarly for $y = \mathcal{Y}v$. Also, we have $\mathcal{X}\mathcal{X}^T = K_x$ and $\mathcal{Y}\mathcal{Y}^T = K_y$, and then Eq. (16.53) becomes

$$K_xK_y x = n^2\lambda^2 x, \qquad K_yK_x y = n^2\lambda^2 y. \qquad (16.54)$$

So the time series x and y having maximum covariance are given, respectively, by the right and left eigenvectors of $K_xK_y$.

Remark Comparing Eqs. (16.51) and (16.54), one can see that $x = K_xa$ and $y = K_yb$, which can be verified keeping in mind that $u = \sum_{t=1}^{n}a_t\phi(x_t)$ and $v = \sum_{t=1}^{n}b_t\phi(y_t)$, in addition to the fact that $x = \mathcal{X}u$ and $y = \mathcal{Y}v$.

One finds either a and b (Eq. (16.51)) or x and y (Eq. (16.54)). We then construct the feature patterns u and v using Eq. (16.48). The corresponding patterns from the input spaces can be obtained by seeking x and y such that $u^T\phi(x)$ and $v^T\phi(y)$ are maximised. This leads to the maximisation problem:

$$\max \sum_{t=1}^{n}a_tK(x, x_t) \quad \text{and} \quad \max \sum_{t=1}^{n}b_tK(y, y_t). \qquad (16.55)$$

This is exactly like the pre-image for Kernel EOFs, and therefore the same fixed point algorithm can be used.
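A minimal numerical sketch of kernel MCA, following Eq. (16.54), is given below with a Gaussian kernel. The bandwidth values and the in-feature-space centring are choices of this illustration, not requirements of the method.

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Centred Gaussian Gram matrix of the rows of X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2.0 * sigma**2))
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H                          # centring in feature space

def kernel_mca(X, Y, sigma_x=1.0, sigma_y=1.0, n_modes=2):
    """Leading kernel MCA time series x from the eigenvectors of Kx Ky (Eq. 16.54).
    The y time series are obtained analogously from Ky Kx."""
    Kx, Ky = gaussian_gram(X, sigma_x), gaussian_gram(Y, sigma_y)
    vals, vecs = np.linalg.eig(Kx @ Ky)       # non-symmetric eigenproblem
    order = np.argsort(-vals.real)[:n_modes]
    return vecs[:, order].real, vals[order].real
```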

16.8 Kernel CCA and Its Regularisation

16.8.1 Primal and Dual CCA Formulation

As above, we let X and Y denote two n × p and n × q (anomaly) data matrices. The conventional CCA is written in the primal form as:

$$\rho = \max_{u,v} \frac{u^TX^TYv}{\sqrt{(u^TX^TXu)(v^TY^TYv)}}. \qquad (16.56)$$

By denoting $u = X^T\alpha$ and $v = Y^T\beta$, the above form can be cast in the dual form:

$$\rho = \max_{\alpha,\beta} \frac{\alpha^TK_xK_y\beta}{\sqrt{(\alpha^TK_x^2\alpha)(\beta^TK_y^2\beta)}}, \qquad (16.57)$$


where $K_x = XX^T$ and $K_y = YY^T$.

Exercise Show that the above problem is equivalent to

$$\max\ \alpha^TK_xK_y\beta \quad \text{s.t.} \quad \alpha^TK_x^2\alpha = \beta^TK_y^2\beta = 1. \qquad (16.58)$$

This system can be analysed using Lagrange multipliers, yielding a system of linear equations in α and β:

$$K_xK_y\beta - \lambda_1K_x\alpha = 0, \qquad K_yK_x\alpha - \lambda_2K_y\beta = 0. \qquad (16.59)$$

Verify that $\lambda_1 = \lambda_2$, which can be denoted by λ. Show that $\rho^2$ is indeed the maximum of the Raleigh quotient:

$$R = \frac{\begin{pmatrix} u^T & v^T \end{pmatrix}\begin{pmatrix} 0 & K_xK_y \\ K_yK_x & 0 \end{pmatrix}\begin{pmatrix} u \\ v \end{pmatrix}}{\begin{pmatrix} u^T & v^T \end{pmatrix}\begin{pmatrix} K_x^2 & 0 \\ 0 & K_y^2 \end{pmatrix}\begin{pmatrix} u \\ v \end{pmatrix}}. \qquad (16.60)$$

Remark Note that in the dual formulation, i.e. in the Raleigh quotient (16.60) and also in (16.57), the computation of the cross-correlation (or cross-covariance) matrix is avoided. This has implications when computing kernel CCA, as shown later.

Exercise Assume that $K_x$ and $K_y$ are invertible; show that we have λ = 1.

The conclusion from the above exercise is that when $K_x$ and $K_y$ are invertible perfect correlation can be obtained and the CCA problem becomes useless. This is a kind of "overfitting".

Remark In CCA this problem occurs whenever $K_x$ and $K_y$ are invertible. This means that rank(X) = n = rank(Y), i.e. n < q and n < p. This also means that the covariance matrices $X^TX$ and $Y^TY$ are singular.

The solution to this problem is regularisation, as discussed in Sect. 16.6 (see also Chap. 15), by adding $\lambda_1 I$ and $\lambda_2 I$ to the correlation matrices of X and Y, respectively, as in ridge regression. In ridge regression with a regression model $Y = XB + E$, the estimated matrix $\hat{B} = \left(X^TX\right)^{-1}X^TY$ is replaced by $\left(R + \lambda I\right)^{-1}X^TY$, with λ > 0, where R is the correlation matrix. The diagonal elements of R are increased by λ, and this is where the name ridge comes from.

Remark The standard CCA problem can be cast into a generalised eigenvalue problem as




$$\begin{pmatrix} O & C_{xy} \\ C_{yx} & O \end{pmatrix}\begin{pmatrix} u \\ v \end{pmatrix} = \rho^2\begin{pmatrix} C_{xx} & O \\ O & C_{yy} \end{pmatrix}\begin{pmatrix} u \\ v \end{pmatrix}$$

(see exercise above). The above form can be used to extend CCA to multiple datasets. For example, for three datasets one form of this generalisation is given by the following generalised eigenvalue problem:

$$\begin{pmatrix} O & C_{xy} & C_{xz} \\ C_{yx} & O & C_{yz} \\ C_{zx} & C_{zy} & O \end{pmatrix}\begin{pmatrix} u \\ v \\ w \end{pmatrix} = \rho^2\begin{pmatrix} C_{xx} & O & O \\ O & C_{yy} & O \\ O & O & C_{zz} \end{pmatrix}\begin{pmatrix} u \\ v \\ w \end{pmatrix}.$$
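As a sketch, the two-dataset form above can be solved directly with scipy's symmetric-definite eigensolver (the three-dataset extension is analogous). The small ridge `eps` added to the block-diagonal matrix is an assumption of this sketch for numerical stability, not part of the formulation above.

```python
import numpy as np
from scipy.linalg import eigh

def cca_generalised(X, Y, n_modes=2, eps=1e-10):
    """CCA patterns from the generalised eigenvalue problem above."""
    n, p = X.shape
    q = Y.shape[1]
    Cxx, Cyy, Cxy = X.T @ X / n, Y.T @ Y / n, X.T @ Y / n
    A = np.zeros((p + q, p + q))
    A[:p, p:], A[p:, :p] = Cxy, Cxy.T
    B = np.zeros((p + q, p + q))
    B[:p, :p] = Cxx + eps * np.eye(p)
    B[p:, p:] = Cyy + eps * np.eye(q)
    vals, vecs = eigh(A, B)                  # generalised symmetric eigenproblem
    idx = np.argsort(-vals)[:n_modes]
    return vecs[:p, idx], vecs[p:, idx], vals[idx]
```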

16.8.2 Regularised KCCA

In canonical covariance analysis no scaling (i.e. correlation) was used, and therefore no regularisation was required. As with conventional CCA we denote, respectively, the Gram matrices of X and Y by $K = (K_{ij})$ and $L = (L_{ij})$, with $K_{ij} = \phi(x_i)^T\phi(x_j)$ and $L_{ij} = \psi(y_i)^T\psi(y_j)$. Note that here we can use a different map for Y. The solution of KCCA looks for patterns $a = \sum_i\alpha_i\phi(x_i)$ and $b = \sum_i\beta_i\psi(y_i)$ that are maximally correlated. This leads to maximising the Lagrangian:

$$\mathcal{L} = \alpha^TKL\beta - \frac{\lambda}{2}\left(\alpha^TK^2\alpha - 1\right) - \frac{\lambda}{2}\left(\beta^TL^2\beta - 1\right) \qquad (16.61)$$

and also maximising the Raleigh quotient (in the dual form). The obtained system of equations is similar to Eq. (16.59). Again, if, for example, K is of full rank, which is typically the case in practice, then a naive application of KCCA leads to λ = 1. This shows the need to regularise the kernel, which leads to the regularised Lagrangian

$$\mathcal{L} = \alpha^TKL\beta - \frac{\lambda}{2}\left(\alpha^TK^2\alpha + \eta_1\alpha^T\alpha - 1\right) - \frac{\lambda}{2}\left(\beta^TL^2\beta + \eta_2\beta^T\beta - 1\right). \qquad (16.62)$$

The associated Raleigh quotient is similar to that shown in the exercise above, except that $K^2$ and $L^2$ are replaced by $K^2 + \eta_1 I$ and $L^2 + \eta_2 I$, respectively, with the associated generalised eigenvalue problem. Note that we can also take $\eta_1 = \eta_2 = \eta$.

Remarks
• The dual formulation allows us to use different kernels, e.g. φ(·) and ψ(·) for X and Y, respectively. For example, one can kernelize only one variable and leave the other without a kernel.
• The regularisation parameter η can be estimated using the cross-validation procedure.


16.8.3 Some Computational Issues

The solution to the regularised KCCA is given, e.g. for α, assuming that K is invertible, by the standard eigenvalue problem:

$$\left(K + \eta I\right)^{-1}L\left(L + \eta I\right)^{-1}K\alpha = \lambda^2\alpha. \qquad (16.63)$$

The above eigenvalue problem can be solved by standard Cholesky decomposition (Golub and van Loan 1996) when the sample size is not very large. When we have a large dataset an alternative is to use the incomplete Cholesky decomposition of kernel matrices (Bach and Jordan 2002). Unlike the standard decomposition, in the incomplete Cholesky decomposition all pivots below a selected threshold are skipped. This leads to a lower triangular matrix with only m non-zero columns if m is the number of pivots used, i.e. non-skipped. Another alternative to incomplete Cholesky decomposition is to use the partial Gram-Schmidt orthogonalization (Cristianini et al. 2001). This orthogonalization was applied with KCCA by Hardoon et al. (2004) to analyse semantic representation of web images.
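A small numerical sketch of the regularised KCCA solution, Eq. (16.63), is shown below. The (centred) Gram matrices K and L are assumed to have been computed beforehand (e.g. with a Gaussian kernel), and η is the regularisation parameter; for very large samples the (incomplete) Cholesky route mentioned above would be preferable.

```python
import numpy as np

def regularised_kcca(K, L, eta, n_modes=2):
    """Dual coefficients alpha from Eq. (16.63):
       (K + eta I)^{-1} L (L + eta I)^{-1} K alpha = lambda^2 alpha."""
    n = K.shape[0]
    I = np.eye(n)
    M = np.linalg.solve(K + eta * I, L) @ np.linalg.solve(L + eta * I, K)
    vals, vecs = np.linalg.eig(M)            # non-symmetric eigenproblem
    idx = np.argsort(-vals.real)[:n_modes]
    alphas = vecs[:, idx].real
    corrs = np.sqrt(np.clip(vals[idx].real, 0.0, None))
    return alphas, corrs
```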

16.9 Archetypal Analysis 16.9.1 Background Archetypal analysis (Cutler and Breiman 1994) is another method of multivariate data exploration. Given a multivariate time series xt = (xt1 , . . . , xtm )T , t = 1, . . . n, of m-dimensional variables we know that EOF analysis provides directions in this m-dimensional space that maximise variance, making them not directly interpretable in terms of the data values themselves. Clustering, for example, via k-means yields centroids or typical prototypes of the observations. In archetypal analysis (AA) the objective is to express the data in terms of a small number of “pure” types or archetypes. The data are then approximately expressed as weighted average of these archetypes. In addition, the archetypes themselves are also weighted average of the data, and are not necessarily observed. The main feature, however, is that the archetypes are in general extremal and this what distinguishes them from EOFs and other closely related methods. AA therefore attempts to combine the virtues of EOF analysis and clustering, in addition to dealing with extremes or corners of the data in its state space. The archetypes are obtained by estimating the convex hull or envelope of the data in state space. AA was applied mostly in pattern recognition, benchmarking and market research, physics (astronomy spectra), computer vision and neuro-imaging and biology, but not much in weather and climate research. In climate research only recently AA was applied, see Steinschneider and Lall (2015) and Hannachi and Trendafilov (2017).


16.9.2 Derivation of Archetypes

We let our n × m data matrix be denoted by $X = (x_1, \ldots, x_n)^T$. For a given number p, 1 ≤ p ≤ n, AA finds archetypes $z_1, \ldots, z_p$ that are mixtures, or convex combinations, of the data:

$$z_k = \sum_{j=1}^{n}\beta_{kj}x_j, \quad k = 1, \ldots, p, \qquad \text{s.t.} \quad \beta_{kj} \ge 0 \ \text{and} \ \sum_{j=1}^{n}\beta_{kj} = 1. \qquad (16.64)$$

The above equations make the patterns $z_1, \ldots, z_p$ data (or pure) types. The data, in turn, are also approximated by a similar weighted average of the archetypes. That is, each $x_t$, $t = 1, \ldots, n$, is approximated by a convex combination $\sum_{j=1}^{p}\alpha_{tj}z_j$, with $\alpha_{tj} \ge 0$ and $\sum_{j=1}^{p}\alpha_{tj} = 1$. The archetypes are therefore the solution of a convex least squares problem obtained by minimising a residual sum of squares (RSS):

$$\{z_1, \ldots, z_p\} = \underset{\alpha,\beta}{\mathrm{argmin}}\ \sum_t \left\| x_t - \sum_{k=1}^{p}\alpha_{tk}z_k \right\|_2^2 = \underset{\alpha,\beta}{\mathrm{argmin}}\ \sum_t \left\| x_t - \sum_{k=1}^{p}\sum_{j=1}^{n}\alpha_{tk}\beta_{kj}x_j \right\|_2^2$$
$$\text{s.t.} \quad \alpha_{tk} \ge 0, \ \sum_{k=1}^{p}\alpha_{tk} = 1, \ t = 1, \ldots, n, \quad \text{and} \quad \beta_{kj} \ge 0, \ \sum_{j=1}^{n}\beta_{kj} = 1, \ k = 1, \ldots, p, \qquad (16.65)$$

where $\|\cdot\|_2$ stands for the Euclidean norm. The above formulation of archetypes can be cast in terms of matrices. Letting $A^T = (\alpha_{ij})$ and $B^T = (\beta_{ij})$ (A in $\mathbb{R}^{p\times n}$, B in $\mathbb{R}^{n\times p}$), the above equation transforms into the following matrix optimisation problem:

$$\min_{A,B} R = \left\| X - A^TB^TX \right\|_F^2 \quad \text{s.t.} \quad A, B \ge 0, \ A^T\mathbf{1}_p = \mathbf{1}_n, \ \text{and} \ B^T\mathbf{1}_n = \mathbf{1}_p. \qquad (16.66)$$

In the above system A and B are row-stochastic matrices, $\mathbf{1}_x$ stands for the x-column vector of ones and $\|\cdot\|_F$ stands for the Fröbenius norm (Appendix D). The inferred archetypes are then convex combinations of the observations, which are given by $Z = (z_1, \ldots, z_p) = X^TB$, and they exist on the convex hull of the data $x_1, \ldots, x_n$. Furthermore, letting $A = (\alpha_1, \ldots, \alpha_n)$, then for each data point $x_t$, $t = 1, \ldots, n$, $Z\alpha_t$ represents its projection on the convex hull of the archetypes, as each $\alpha_t$ is a probability vector. For a given p, Cutler and Breiman (1994) show that the minimisers of the RSS R, Eq. (16.66), provide archetypes $Z = (z_1, \ldots, z_p)$ that are, theoretically, located on the boundary of the convex hull (or envelope) of the data. The convex hull of a given dataset is the smallest convex set containing the data. Archetypes therefore provide a typical representation of the "corners" or extremes of the observations. Figure 16.14 shows an illustration of a two-dimensional example of a set of points with its convex hull and its approximation using five archetypes. The sample mean $\bar{x} = \frac{1}{n}\sum x_t$ provides the unique archetype for p = 1, and for p = 2 the pattern $z_2 - z_1$ coincides with the leading EOF of the data. Unlike EOFs, archetypes are not required to be nested (Cutler and Breiman 1994; Bauckhage and Thurau 2009). However, like k-


Fig. 16.14 Two-dimensional illustration of a set of 25 points along with the convex hull (dashed), an approximate convex hull (solid) and 5 archetypes (yellow). The blue colour refers to points that contribute to the RSS. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission

means clustering (and unlike EOFs), AA is invariant to translation and scaling and to rotational ambiguity (Morup and Hansen 2012). In summary, AA combines the virtues of EOFs and clustering and, most importantly, deals with extremes in high dimensions.

Exercise Show that the mean $\bar{x}$ is the unique archetype for p = 1.

Hint Letting $X = (x_1, \ldots, x_n)^T = (y_1, \ldots, y_m)$, $\beta = (\beta_1, \ldots, \beta_n)^T$, and $\varepsilon^2 = \left\|X - \mathbf{1}_n\beta^TX\right\|_F^2 = \sum_{t=1}^{n}\sum_{k=1}^{m}\left(x_{tk} - \beta^Ty_k\right)^2$, and differentiating $\varepsilon^2$ with respect to β, one obtains the system $\sum_{k=1}^{m}\left(\bar{y}_k - \beta^Ty_k\right)x_{tk} = 0$, for $t = 1, \ldots, n$, that is $\left(x_i^Tx_j\right)\beta = \frac{1}{n}XX^T\mathbf{1}_n$. The only solution (satisfying the constraint) is $\beta = \frac{1}{n}\mathbf{1}_n$, that is $z = \beta^TX$.

16.9.3 Numerical Solution of Archetypes

There are mainly two algorithms to solve the archetype problem, which are discussed below. The first one is based on the alternating algorithm (Cutler and Breiman 1994), and the second one is based on an optimisation algorithm on Riemannian manifolds (Hannachi and Trendafilov 2017).

Alternating Algorithm To solve the above optimisation problem Cutler and Breiman (1994) used, starting from an initial set of archetypes, an alternating minimisation between finding the best A for a given set of archetypes, or equivalently B, and the best B for a given A. The algorithm has the following multi-steps:


(1) Determine A, for fixed Z, by solving a constrained least squares problem.
(2) Using the obtained A from step (1), solve, based on Eq. (16.66), for the archetypes, $ZA = X^T$, i.e. $Z = X^TA^T(AA^T)^{-1}$.
(3) Using the obtained archetypes from step (2), estimate the matrix B, again by solving a constrained least squares problem.
(4) Obtain an update of Z through $Z = X^TB$, then go to step (1) unless the residual sum of squares is smaller than a prescribed threshold.

Basically, each time one solves several convex least squares problems, as follows:

$$A_{l+1}^T = \underset{A \ge 0}{\mathrm{argmin}}\ \left\| X - A^TZ_l^T \right\|_F^2 + \lambda\left\| A^T\mathbf{1}_p - \mathbf{1}_n \right\|_2^2$$
and
$$B_{l+1}^T = \underset{B \ge 0}{\mathrm{argmin}}\ \left\| Z_l^T - B^TX \right\|_F^2 + \mu\left\| B^T\mathbf{1}_n - \mathbf{1}_p \right\|_2^2. \qquad (16.67)$$

After each iteration the archetypes are updated. For example, after finding $A_{l+1}$ from Eq. (16.67) and before solving the second equation, Z is updated using $X = A_{l+1}^TZ^T$, i.e. $Z^T = \left(A_{l+1}A_{l+1}^T\right)^{-1}A_{l+1}X$, which is then used in the second equation of (16.67). After optimising this second equation, Z is then updated using $Z = X^TB_{l+1}$. This algorithm has been widely used since it was proposed by Cutler and Breiman (1994).

Remark Both equations in (16.67) can be transformed into n + p individual convex least squares problems. For example, the first equation is equivalent to

$$\min\ \frac{1}{2}a_i^TZ^TZa_i - \left(x_i^TZ\right)a_i, \quad i = 1, \ldots, n, \quad \text{s.t.} \quad a_i \ge 0 \ \text{and} \ \mathbf{1}^Ta_i = 1,$$

and similarly for the second equation. The above optimisation problem is of the order O(n²), where n is the sample size. Bauckhage and Thurau (2009) reduced the complexity of the problem to O(n′²), with n′ < n, by using only the set of data points that are not exact, but approximate, convex combinations of the archetypes, referred to as the working set.
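The alternating scheme can be sketched with a standard non-negative least squares solver, enforcing the sum-to-one constraints through a heavily weighted row of ones. This trick, and the initialisation from random data points, are assumptions of this sketch rather than Cutler and Breiman's original implementation.

```python
import numpy as np
from scipy.optimize import nnls

def _stochastic_nnls(M, T, w=200.0):
    """Solve T ~ M @ C column-wise with C >= 0 and columns summing to one
    (sum-to-one imposed approximately via a heavily weighted extra row)."""
    Ma = np.vstack([M, w * np.ones((1, M.shape[1]))])
    C = np.zeros((M.shape[1], T.shape[1]))
    for j in range(T.shape[1]):
        C[:, j], _ = nnls(Ma, np.concatenate([T[:, j], [w]]))
    return C

def archetypes(X, p, n_iter=50):
    """Alternating archetypal analysis (Eqs. 16.66-16.67); X is n x m."""
    n = X.shape[0]
    Z = X[np.random.choice(n, p, replace=False)].T      # m x p initial archetypes
    for _ in range(n_iter):
        A = _stochastic_nnls(Z, X.T)                    # step 1: X^T ~ Z A
        Z = X.T @ A.T @ np.linalg.pinv(A @ A.T)         # step 2: Z = X^T A^T (A A^T)^-1
        B = _stochastic_nnls(X.T, Z)                    # step 3: Z ~ X^T B
        Z = X.T @ B                                     # step 4: Z = X^T B
    return Z, A, B
```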

Riemannian Manifold-Based Optimisation Hannachi and Trendafilov (2017) proposed a new non-alternating algorithm based on Riemannian manifold optimisation. Riemannian optimisation on a differential manifold $\mathcal{M}$, e.g. Smith (1994) and Absil et al. (2010), seeks solutions to the problem

$$\min_{x \in \mathcal{M}} f(x). \qquad (16.68)$$


Examples of differential manifolds include the (n − 1)-dimensional sphere $S^{n-1}$, a submanifold of $\mathbb{R}^n$. Of particular interest here is the set of matrices with unit-vector rows (or columns), i.e. $Ob(n, p) = \{X \in \mathbb{R}^{n\times p},\ \mathrm{ddiag}(X^TX) = I_p\}$, where ddiag(Y) is the double diagonal operator, which transforms a square matrix Y into a diagonal matrix with the same diagonal elements as Y. This manifold is known as the oblique manifold and is topologically equivalent to the Cartesian product of spheres, with a natural inner product given by

$$\langle X, Y \rangle = \mathrm{tr}\left(XY^T\right). \qquad (16.69)$$

The problem is transformed into an optimisation onto the oblique manifolds Ob(n, p) and Ob(p, n). Let $A \odot C$ denote the "Hadamard" or element-wise product of $A = (a_{ij})$ and $C = (c_{ij})$, i.e. $A \odot C = (a_{ij}c_{ij})$. Because the weight matrices A and B involved in the residual R, Eq. (16.66), are positive and satisfy the convex (or stochasticity) constraint, i.e. $A^T\mathbf{1}_p = \mathbf{1}_n$ and $B^T\mathbf{1}_n = \mathbf{1}_p$, these matrices can be written as (element-wise) squares of two matrices from Ob(n, p) and Ob(p, n), respectively, e.g. $A = E \odot E$ and $B = F \odot F$. For convenience, we will be using again the notation A and B instead of E and F. Therefore, the problem can be transformed by replacing A and B by $A \odot A$ and $B \odot B$, with A and B in Ob(p, n) and Ob(n, p), respectively, that is:

$$R = \left\| X - (A\odot A)^T(B\odot B)^TX \right\|_F^2 = \mathrm{tr}(Z) - 2\,\mathrm{tr}(ZW) + \mathrm{tr}(ZW^TW), \qquad (16.70)$$

where $Z = XX^T$ and $W = (A\odot A)^T(B\odot B)^T$.

Exercise Derive the following identities:

$$\begin{array}{l}
\frac{1}{2}\nabla\,\mathrm{tr}\left[C(Y\odot Y)^TD\right] = Y\odot(DC) \\[4pt]
\frac{1}{2}\nabla\,\mathrm{tr}\left[C(Y\odot Y)(Y\odot Y)^TD\right] = \left[(DC)^T(Y\odot Y)\right]\odot Y + \left[(DC)(Y\odot Y)\right]\odot Y \\[4pt]
\frac{1}{2}\nabla\,\mathrm{tr}\left[C(Y\odot Y)D(Y\odot Y)^T\right] = \left[C^T(Y\odot Y)D^T\right]\odot Y + \left[C(Y\odot Y)D\right]\odot Y
\end{array} \qquad (16.71)$$

Exercise Consider the hypersphere $S^{n-1}$ in $\mathbb{R}^n$, $S^{n-1} = \{x \in \mathbb{R}^n,\ \|x\| = 1\}$, and the oblique manifold $Ob(n, p) = \{X \in \mathbb{R}^{n\times p},\ \mathrm{ddiag}(X^TX) = I_p\}$. The tangent space at $x \in S^{n-1}$ is the set of all vectors orthogonal to x, i.e.

$$T_xS^{n-1} \equiv \{u \in \mathbb{R}^n,\ u^Tx = 0\}, \qquad (16.72)$$

and the orthogonal projection $u^*$ of any vector u onto $T_xS^{n-1}$ is

$$u^* = u - (x^Tu)x = (I - xx^T)u. \qquad (16.73)$$


Using the topological equivalence between Ob(n, p) and the Cartesian product of p hyperspheres $S^{n-1}$, i.e. $Ob(n, p) \sim S^{n-1} \times \cdots \times S^{n-1}$, derive the projection $U^*$ of any U from $\mathbb{R}^{n\times p}$ onto $T_XOb(n, p)$, namely

$$U^* = U - X\,\mathrm{ddiag}\left(X^TU\right). \qquad (16.74)$$

Let us denote by $A^{2.} = A\odot A$ and similarly for B. Then we have the following expression of the gradient of the costfunction R (see Appendix D):

$$\nabla_AR = 4\left[(B^{2.})^TZ(-I_n + W^T)\right]\odot A, \qquad \nabla_BR = 4\left[Z(-I_n + W^T)(A^{2.})^T\right]\odot B. \qquad (16.75)$$

Finally, the projection of the gradient of R, $\nabla_{A,B}R$, onto the tangent space of the oblique manifolds yields the final gradient $\mathrm{grad}_{A,B}R$, namely:

$$\mathrm{grad}_AR = \nabla_AR - A\,\mathrm{ddiag}\left(A^T\nabla_AR\right), \qquad \mathrm{grad}_BR = \nabla_BR - B\,\mathrm{ddiag}\left(B^T\nabla_BR\right). \qquad (16.76)$$
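The ddiag operator and the tangent-space projection of Eqs. (16.74) and (16.76) translate directly into a few lines of code, sketched below.

```python
import numpy as np

def ddiag(M):
    """Keep only the diagonal of a square matrix (the ddiag operator)."""
    return np.diag(np.diag(M))

def project_to_tangent(X, U):
    """Project U onto the tangent space of the oblique manifold at X,
    Eq. (16.74): U* = U - X ddiag(X^T U). Applied to the raw gradient,
    the same formula yields grad R from nabla R, Eq. (16.76)."""
    return U - X @ ddiag(X.T @ U)
```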

After minimising the costfunction R using the expression of the gradient, Eq. (16.76), the archetypes are then given by

$$Z = X^T(B\odot B) = X^TB^{2.} \qquad (16.77)$$

Exercise Derive the above expressions of the gradient.

Other developments have been obtained, including weighted and robust AA (Eugster and Leisch 2011), and probabilistic archetype analysis (Seth and Eugster 2015). In weighted AA a weight matrix W is applied to $X - A^TB^TX$ in Eq. (16.66) in order to reduce the sensitivity of AA to outliers. The problem is still equivalent to Eq. (16.66), though with X replaced by WX, i.e. $RSS = \|W(X - A^TB^TX)\|_F^2 = \|WX - A^TB^TWX\|_F^2$, and hence the same algorithm can be used.

Exercise Show that indeed $W(X - A^TZ) = WX - A^TB^TWX$, and hence is of the form $\tilde{X} - A^T\tilde{Z}$ with $\tilde{Z} = B^T\tilde{X}$.

Indication Proceed columnwise, i.e. use the columns of $X - A^TZ$. For example, the ith column of this matrix is of the form $y_i = x_i - \sum_{j=1}^{p}\sum_{k=1}^{n}a_{ij}b_{jk}x_k$, and, consequently, the ith column of $W(X - A^TZ)$ is $Wy_i = Wx_i - \sum_{j=1}^{p}\sum_{k=1}^{n}a_{ij}b_{jk}(Wx_k)$.

In probabilistic AA the procedure is based on analysing the convex hull in the parameter space instead of the sample space. That is, the archetypes and associated factors are expressed in terms of the parameters of the distribution of the data. For example, the original AA problem can be formulated in terms of a (simplex) latent variable model with normal observations. Eugster and Leisch (2011) extended the probabilistic formulation to other distributions from the exponential family.


Remark (Relation to Non-negative Matrix Factorisation) AA has some similarities to what is known as non-negative matrix factorisation (NMF), e.g. Lee and Seung (1999). NMF seeks to decompose a non-negative n × p matrix X into a product of two non-negative n × q and q × p matrices Z and H, i.e. $X \approx ZH$, through, e.g., the following optimisation problem:

$$\{Z, H\} = \underset{Z,H\ge 0}{\mathrm{argmin}}\ \left\| X - ZH \right\|_F^2. \qquad (16.78)$$

Lee and Seung (1999) minimised the cost function:

$$F(Z, H) = \sum_{i=1}^{n}\sum_{j=1}^{p}\left[x_{ij}\log(ZH)_{ij} - (ZH)_{ij}\right],$$

subject to the non-negativity constraint, and used a multiplicative updating rule. But other algorithms, e.g. alternating rules as in AA, can also be used. It is clear that the main difference between AA and NMF is the stochasticity of the matrices Z and H. In terms of patterns, AA yields archetypes whereas NMF yields characteristic parts (Bauckhage and Thurau 2009). To bring it closer to AA, NMF has been extended to convex NMF (C-NMF), see e.g. Ding et al. (2010), where the non-negativity of X is relaxed, and the non-negative matrix Z takes the form Z = XW, with W a non-negative matrix.

16.9.4 Archetypes and Simplex Visualisation

One of the nice and elegant features of simplexes is the two-dimensional visualisation of any m-simplex,⁶ i.e. the m-dimensional polytope that is the convex hull of its m + 1 vertices. This projection is well known in algebraic topology, sometimes referred to as the "skew orthogonal" projection,⁷ and shows all the vertices of a regular simplex on a circle where all vertex pairs are connected by edges. For example, the regular 3-simplex (tetrahedron) projects onto a square (Fig. 16.15). Any point $y = (y_1, \ldots, y_{m+1})^T$ in $\mathbb{R}^{m+1}$ can be projected onto the m-simplex. The projection of y onto the standard m-simplex is the closest point $t = (t_1, \ldots, t_{m+1}) \ge 0$, $\sum_i t_i = 1$, to y. The point t satisfies $t_i = \max(y_i + e, 0)$, $i = 1, \ldots, m + 1$. The number e can be obtained through a sorting algorithm of complexity O(n log n) (Michelot 1986; Malozemov and Pevnyi 1992; Causa and Raciti 2013). As pointed out above, $Z\alpha_i$, $i = 1, \ldots, n$, is the best approximation of $x_i$ on the convex hull of the archetypes Z, i.e. the (p − 1)-simplex with vertices the

⁶ Given m + 1 points (vertices) $c_1, \ldots, c_{m+1}$ in $\mathbb{R}^{m+1}$, the m-simplex is the set of points of the form $\sum_{i=1}^{m+1}\theta_ic_i$, with $\theta_i \ge 0$, $i = 1, \ldots, m + 1$, and $\sum_{i=1}^{m+1}\theta_i = 1$.
⁷ Also known as the Petrie polygon.


Fig. 16.15 An example of a 3-simplex or tetrahedron (a), its two-dimensional projection (b), and a two-dimensional projection of a 4-simplex (c)

p archetypes z1 , . . . , zp . The components of α i represent the “coordinates” with respect to these vertices, and can be visualised on a 2D plane using a skew orthogonal projection. Cutler and Breiman (1994) provide an illustration of the simplex visualisation with only three archetypes, known as ternary plot. Later authors extended the visualisation to more than three archetypes (Bauckhage and Thurau 2009; Eugster and Leisch 2013; Seth and Eugster 2015). The following example (Hannachi and Trendafilov 2017) shows an application of AA to a 5-cluster model located on the vertices of a 3D polytope. As described in Sect. 16.2, AA is not nested, and therefore various values of p, the number of archetypes, are used. For a given value of p the archetypes are obtained and the residual sum of squares are computed. Figure 16.16a shows a kind of scree plot of the (relative) RSS versus p. As Fig. 16.16a shows the most probable number of archetypes can be suggested from the scree plot. Figure 16.16b shows the costfunction R along with the gradient norm. The costfunction reaches its floor during the first few hundred iterations although the gradient norm continues to decrease with increasing iteration. Figure 16.16c shows the skew projection of the elements of the probability matrix A2. = A  A, namely the two-dimensional simplex projection where the archetypes hold the simplex vertices. The clusters are associated with some extent with the archetypes although the centroids are, as expected, different from the archetypes.
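The projection onto the standard simplex used for this visualisation, $t_i = \max(y_i + e, 0)$ with e found by sorting, can be sketched as follows (a standard sorting-based routine, offered here as an illustration of the idea rather than the exact algorithm of the cited references).

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto the standard simplex:
       t_i = max(y_i + e, 0), with e chosen by sorting so that sum(t) = 1."""
    u = np.sort(y)[::-1]                                  # decreasing order
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(y)) + 1) > 0)[0][-1]
    e = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(y + e, 0.0)
```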

16.9.5 An Application of AA to Climate

Hannachi and Trendafilov (2017) applied AA to sea surface temperature (SST) and the Asian summer monsoon. The SST anomaly data come from the Hadley Centre Sea Ice and Sea Surface Temperature⁸ (Rayner et al. 2003). The data are on a 1◦ ×

8 www.metoffice.gov.uk/hadobs/hadisst/.


Fig. 16.16 Scree plot (a) of a five Gaussian-clusters, costfunction and gradient norm of R (b) for 5 archetypes, and the skew projection (c) using the same 5 archetypes. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission

1◦ latitude–longitude grid from Jan 1870 to Dec 2014, over the region 45.5◦ S– 45.5◦ N. The scree plot (Fig. 16.17) shows a break- (or elbow-)like feature at p = 3 suggesting three archetypes. The three archetypes suggested by Fig. 16.17 are shown in Fig. 16.18. The first two archetypes show, respectively, El-Niño and La-Niña, the third archetype shows the western boundary currents, namely Kuroshio, Gulf Stream and Agulhas currents, in addition to the Brazil, East Australian and few other currents. It is known that western boundary currents are driven by major gyres, which transport warm tropical waters poleward along narrow, and sometimes deep, currents. These


Fig. 16.17 Scree plot of the SST anomalies showing the relative RSS versus the archetypes number. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission


Fig. 16.18 The obtained three archetypes of the SST anomalies showing El-Niño (a), La-Niña (b) and the western currents (c). Contour interval 0.2◦ C. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission


Fig. 16.19 Mixture weights of the three archetypes of SST anomalies, El-Niño (a), La-Niña (b), and the western boundary currents (c). Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission

currents are normally fast and are referred to as the western intensification (e.g. Stommel 1948, Munk 1950). This strongly suggests that these water boundary currents project on extreme events, which are located on the outer boundary in the system state space. It should be reminded here that the SST field is different from the surface currents, which better capture the boundary currents. Records of surface currents, however, are not long enough, in addition to the non-negligible uncertainties in these currents. The mixture weights of these archetypes are shown in Fig. 16.19. For El-Niño archetype (Fig. 16.19a) the contribution comes from various observations scattered over the observation period and most notably from the first half of the record. Those events correspond to prototype El-Niño’s, with largest weights taking place end of the nineteenth and early twentieth centuries and in the last few decades. For the La-Niña archetype (Fig. 16.19b) there is a decreasing contribution with time, with most weights located in the first half of the record, with particularly high contribution from the event of the year 1916–1917. One can also see contributions from La-Niña events of 1955 and 1975. It is interesting to note that these contributing weights are clustered (in time). Unlike the previous two archetypes



Fig. 16.20 Time series amplitudes of the leading three archetypes (a, b, c) of the SST anomalies. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission

the third, western current, archetype (Fig. 16.19c) is dominated by the last quarter of the observational period starting around the late 1970s. The time series of the archetypes, i.e. the columns of the stochastic matrix A2. show the “amplitudes” of the archetypes, somehow similar to the PCs, and are shown in Fig. 16.20. The time series of El-Niño shows slight weakening of the archetypes, although the events of early 80s and late 90s are clearly showing up. There is a decrease from the 90s to the end of the record. Prior to about 1945 the signal seemed quite stationary in terms of strength and frequency. The time series of La-Niña archetype shows a general decrease in the last 50 or so years. The signal was somehow “stationary” (with no clear trend) before about 1920. Unlike the previous El-Niño and La-Niña archetypes the third (or western current) archetype time series has an increasing trend starting immediately after an extended period of weak activity around 1910. The trend is not gradual, with the existence of a period with moderate activity around 1960s. The strongest activity occurs during the last two decades starting around late 1990s. Figure 16.21 shows the simplex projection of the data using three archetypes. The colours refer to the points that are closest to each of the three archetypes.


Fig. 16.21 Two-dimensional simplex projection of the SST anomalies using three archetypes. The 200 points that are closest to each of the three archetypes are coloured, respectively, red, blue and black and the remaining points are shown by light grey-shading. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission


Fig. 16.22 (a) Relative RSS of 8 subsamples of sizes ranging from 1500 to 100, by increments of 200 (continuous) as well as the curves for the whole sample (black dotted-dashed), and the original data less the top 1st (blue diamond), 2.5th (red square) and the 5th (black triangle) percentiles. (b) Projection of the three archetypes of the subsamples described in (a) onto the leading three EOFs of the SST anomalies (star), along with the same plots of the whole sample (filled circle), and the original data less the top 1st (diamond), 2.5th (square) and 5th (triangle) percentiles. Adapted from Hannachi and Trendafilov (2017). ©American Meteorological Society. Used with permission

Hannachi and Trendafilov (2017) also applied AA to the detrended SST anomalies. Their finding suggests only two main archetypes, namely El-Niño and La-Niña. This once again strongly suggests that the archetype associated with the western boundary currents is the pattern that mostly explains the trend in extremes. They also show that the method is quite robust to sample size and extremes (Fig. 16.22).


16.10 Other Nonlinear PC Methods

16.10.1 Principal Nonlinear Dynamical Modes

Unlike conventional (linear) PCA, nonlinear methods seek an approximation of the multivariate data in terms of a set of nonlinear manifolds (or curves) maximising a specific criterion. These nonlinear manifolds constitute an approximation of the d-dimensional data vector at each time, x(t), as:

$$x(t) = \sum_{k=1}^{q} f_k(p_k(t)) + \varepsilon_t, \qquad (16.79)$$

where $f_k(\cdot)$ is the kth trajectory, i.e. a map from the set of real numbers into the d-dimensional data space, $p_k(\cdot)$ is the associated time series, and $\varepsilon_t$ is a residual term. The conventional EOF method corresponds to linear maps in Eq. (16.79). An interesting nonlinear EOF analysis method, namely nonlinear dynamical mode (NDM) decomposition, was presented by Mukhin et al. (2015). In their decomposition Mukhin et al. (2015) used Eq. (16.79) with extra parameters and fitted the model to the multivariate time series $x_t$, $t = 1, \ldots, n$, with the nonlinear trajectory $f_k(\cdot)$ being the kth NDM. The NDMs are computed recursively by identifying one NDM at a time, then computing the next one from the residuals, etc., that is:

$$x_t = f(a, p_t) + \varepsilon_t, \qquad t = 1, \ldots, n, \qquad (16.80)$$

with $\varepsilon_t$ being multinormal with diagonal covariance matrix. The components $f(\cdot)$, a and $p_t$ are first estimated from the sample $x_t$, $t = 1, \ldots, n$; then the next components, similar to f, a, and $p_t$, $t = 1, \ldots, n$, are estimated from the residuals $\varepsilon_t$, etc. Each component of the NDM, e.g. $f(\cdot)$ in Eq. (16.80), is basically a combination of polynomials orthogonal with respect to the Gaussian probability density function, namely Hermite polynomials. As in most nonlinear PC methods, the data dimension is reduced via (linear) PCA, which simplifies the computation significantly. In vector form the principal components at time t, $y_t$, $t = 1, \ldots, n$, are expanded as:

$$y_t = \begin{pmatrix} Aw(p_t) \\ O \end{pmatrix} + \begin{pmatrix} \sigma\varepsilon_1 \\ D\varepsilon_2 \end{pmatrix}, \qquad (16.81)$$

where w is a m-dimensional vector containing the leading m Hermite polynomials (m being the number of PCs retained for the analysis), A = (aij ) and represents the coefficients of these polynomials whereas pt , t = 1, . . . n, is the hidden time series. Also, ε1 , and ε 2 are white noise vectors, σ is the (common) amplitude of each component of ε1 , and D = diag(σ1 , . . . , σD−m ) and contains the amplitudes


of $\varepsilon_2$. The last term on the right hand side of Eq. (16.81) represents the residuals, which are used in the next step to get the next hidden time series and associated NDM. Mukhin et al. (2015) used a Bayesian framework and maximised the likelihood function

$$L\left((a_{ij}), p_1, \ldots, p_n, \sigma, (\sigma_k)\right) = Pr\left(x_1, \ldots, x_n \,|\, (a_{ij}), p_1, \ldots, p_n, \sigma, (\sigma_k)\right)\, P_{pr}\left((a_{ij}), p_1, \ldots, p_n, \sigma, (\sigma_k)\right), \qquad (16.82)$$

where the last term is a prior distribution. The prior distribution of the latent variables p11 , . . . p1n , i.e. p1 (t), t = 1, . . . n, was taken to be multinormal based on the assumption of a first-order autoregressive model. They also assumed a multinormal distribution with diagonal covariance matrix for the prior distribution of the parameter vector a1 . One of the interesting properties of NDMs, compared to other methods such as kernel EOFs and Isomap method (Hannachi and Turner 2013b), is that the method provides simultaneously the NDMs and associated time series. Mukhin et al. (2015), see also Gavrilov et al. (2016), applied the NDM method to monthly National Oceanic and Atmospheric Administration (NOAA) optimal interpolation OI.v2 SST data for the period 1981–2015. Substantial explained variance was found to be captured by the leading few NDMs (Fig. 16.23). The leading NDM captures the annual cycle (Fig. 16.23a). An interesting feature they identified in the second NDM was a shift that occurred in 1997–1998 (Fig. 16.23b).

Fig. 16.23 The leading three time series p1t , p2t , p3t , t = 1, . . . n, of the leading three NDMs (a) and the trajectory of the system within the obtained three-dimensional space (b). Blue colour refers to the period 1981–1997 and the red refers to 1998–2015. Adapted from Mukhin et al. (2015)


Fig. 16.24 Spatial patterns associated with the leading two NDMs showing the difference between winter and summer averages of SST explained by NDM1 (a) and the difference between SST explained by NDM2 (b) averaged over the periods before and after 1998 and showing the opposite phase of the PDO. Adapted from Mukhin et al. (2015)

The second NDM captures also some parts of the Pacific decadal oscillation (PDO), North Tropical Atlantic (NTA) and the North Atlantic indices. The third NDM (Fig. 16.23a) captures parts of the PDO, NTA and the Indian Ocean dipole (IOD). The spatial patterns of the leading two NDMs are shown in Fig. 16.24. As expected, the nonlinear modes capture larger explained variance compared to those explained by conventional EOFs. For example, the leading three NDMs are found to explain around 85% of the total variance versus 80% from the leading three EOFs. The leading NDM by itself explains around 78% compared to 69% of the leading EOF. Gavrilov et al. (2016) computed the NDMs of SSTs from a 250-year preindustrial control run. They obtained several non-zero nonlinear modes. The leading mode came out as the seasonal cycle whereas the ENSO cycle was captured by a combination of the second and third NDMs. The combination of the fourth and the fifth nonlinear modes yielded a decadal mode. The time series of these modes are shown in Fig. 16.25. The leading five PCs of the SSTs are also shown for comparison. The effect of mixing, characterising EOFs and PCs, can be clearly seen in the figure. The time series of the nonlinear modes do not suffer from the mixing drawback.

16.10.2 Nonlinear PCs via Neural Networks An interesting method to obtain nonlinear PCs is to apply techniques taken from machine (or deep) learning using neural networks. These methods, described in the following chapter, use various nonlinear transformations to model complex architectures. Hsieh (2009) describes the application of neural networks to compute nonlinear PCs. An example is discussed in the last section of Chap. 17, presented below.


Fig. 16.25 Times series of the first five nonlinear dynamical modes (left) and the PCs (right) obtained from the SST simulation of a climate model. Adapted from Gavrilov et al. (2016)

Chapter 17

Machine Learning

Abstract This last chapter discusses a relatively new method applied in atmospheric and climate science: machine learning. Put simply, machine learning refers to the use of algorithms allowing the computer to learn from the data and use this learning to identify patterns or draw inferences from the data. The chapter describes briefly the flavour of machine learning and discusses three main methods used in weather and climate, namely neural networks, self-organising maps and random forests. These algorithms can be used for various purposes, including finding structures in the data and making prediction. Keywords Machine learning · Neural networks · Training sets · Back-propagation · Self-organising · Random forests · Decision trees

17.1 Background Unprecedented advances in technology and computer power have led to a remarkable increase in the amount of massive data generated from observational instruments or model simulations. This data volume is obtained in countless domains such as medicine (Keogh et al. 2001; Matsubara et al. 2014), finance (Zhu and Shasha 2002) and climate (e.g. Hsieh and Tang 1998; Scher and Messori 2019). This unequivocal increase in data volume brings a unique opportunity to scientists to explore and analyse, in a comprehensive way, the available amount of information to get valuable knowledge and gain wisdom. This kind of knowledge is usually achieved under the heading of machine learning or artificial intelligence. The insight behind this is to allow the computer to explore “all” the possibilities and figure out the optimum solution. There is no universal definition, however, of machine learning. For example, one of the earliest definitions is that of Samuel (1959), who defined machine learning as the field of study that enables computers to learn without explicit programming. More recently, one reads from Wikipedia: “machine learning is the study of computer algorithms that improve automatically through experience”. This is more like the definition of Mitchell (1998) who


presents machine learning as a well-posed learning problem based on some kind of performance measure to complete certain task that improves with experience. Machine learning algorithms construct mathematical models based on a sample, or training, dataset (e.g. Bishop 2006). Machine learning is a subset of artificial intelligence, which refers to “intelligence” exhibited by machines. Machine learning contains supervised (e.g. classification) and unsupervised (e.g. clustering) learning, in addition to artificial neural network, while deep learning (Chollet 2018) is a subset of neural networks (Scher 2020). Some computer scientists, however, consider that artificial intelligence and machine learning can be seen as two faces of the same coin (e.g. Bishop 2006), and in the rest of the chapter, and unless otherwise stated, I will refer to machine learning. Machine learning is used to solve many problems ranging from pattern recognition and feature extraction (e.g. clustering), dimensionality reduction, identification of relationships among variables and nonlinear modelling and time series forecasting. The algorithms used in machine learning include mostly (different types of) neural networks (NNs), self-organising maps (SOMs) and decision trees and random forests. Other algorithms exist, but the focus here is on these three types. A number of textbooks have been written on machine learning, e.g. Bishop (2006), Hastie et al. (2009), Goodfellow et al. (2016), Buduma (2017) and Haykin (1999, 2009). This chapter provides an introduction to and discusses the above few algorithms mostly used in machine learning, and for a more comprehensive literature the reader is referred to the above textbooks.

17.2 Neural Networks 17.2.1 Background and Rationale Neural networks (NNs) originate from an attempt by scientists to mimic the human brain during the process of learning and pattern recognition (McCulloch and Pitts 1943; Rosenblatt 1962; Widrow and Stearns 1985). The cornerstone of NNs is the so-called universal approximation theorem (Cybenko 1989; Hornik 1991). The theorem states that any regular multivariate real valued function f(x) can be approximated at any given precision by a NN with one hidden layer containing a finite number of neurons with the same activation function and one linear output neuron. That is, f(x) ≈ Σ_{k=1}^m α_k g(w_k^T x + b_k), with g(.) being a bounded function with specific properties, referred to as a sigmoid function, see below. NNs are supposed to be capable of performing various tasks, e.g. pattern recognition and regression (Watt et al. 2020). In addition, NNs can also be used for other purposes such as dimension reduction, time series prediction (Wan 1994; Zhang et al. 1997), classification and pdf estimation. And as put by Nielsen (2015), “Neural networks are one of the most beautiful programming paradigms ever invented”.


The idea of introducing NN as a computing machine was proposed by McCulloch and Pitts (1943), but it was Rosenblatt (1958) who proposed the perceptron as the first model for supervised learning (Haykin 2009). The perceptron is based on a single neuron and was used in a binary classification problem. The model of this perceptron is shown in Fig. 17.1 (upper panel). Given an input vector x = (x_1, . . . , x_m)^T, a set of weights w = (w_1, . . . , w_m)^T is used to form a linear combination w^T x = Σ_k w_k x_k, possibly with a bias term b, which is then fed to a sigmoid function g() to get an output o = g(w^T x + b). Two examples of sigmoid functions are shown in Fig. 17.1 (middle panel). There are basically two main types of statistical models, namely regression and classification. Using a fixed set of basis functions φ_j(), j = 1, . . . , M, these models can be written as y = Σ_{j=1}^M w_j φ_j(x) for linear regression and y = g(Σ_{j=1}^M w_j φ_j(x)) for classification, where g() is a nonlinear sigmoid function. To illustrate, for example, how supervised learning works, let us consider the case of a binary classification. We have a set of inputs x_1, . . . , x_n with corresponding classes (or targets) c_1, . . . , c_n that are 0 or 1. The objective is to predict the class c_o of a new observation x_o. A naive solution is to fit a linear regression c(x) = w^T x and use it to get the predicted class of x_o. This, however, gives unrealistic values of c_o and can yield wrong classification. The correct alternative is to consider the so-called logistic regression model by using the logistic function:

c(x) = g(w^T x) = 1 / (1 + exp(−w^T x)).

Remark Note, in particular, that the logistic function (and similar sigmoids) has nice properties. For example, it has a simple derivative, g′(x) = g(x)(1 − g(x)), and g(x) is approximately linear for small |x|. Now c(x) can be interpreted as a probability of x belonging to class 1, i.e. c(x) = Pr(c = 1|x; w). The logistic regression model shows the importance of using the sigmoid function to determine the class of a new observation. Note that this model works only for linearly separable classes. The application of the binary classification problem in NN becomes straightforward, given a training set (x_1, c_1), . . . , (x_n, c_n). For the Rosenblatt perceptron, this is obtained by finding the weights w_1, . . . , w_m (and possibly a bias b), minimising an objective function measuring the distance between the model output and the target. Since the target is binary, the costfunction consistent with the logistic regression is (1/2n) Σ_{i=1}^n f(g(w^T x_i), c_i), with the distance function given by f(y, z) = −z log y − (1 − z) log(1 − y). This convex function is known as the cross-entropy error (Bishop 2006) and is derived based on statistical arguments, see discussion in Sect. 17.2.5 below. The convexity is particularly useful in optimisation. Rosenblatt (1962) showed precisely that the perceptron algorithm converges and identifies the hyperplane between the two classes, that is, the perceptron convergence theorem. The single perceptron NN can be extended to include two or more neurons, which will form a layer, the hidden layer, yielding the single-layer perceptron (Fig. 17.1, bottom panel). This latter model can also be



Fig. 17.1 Basic model of a simple nonlinear neuron, or perceptron (top), two examples of sigmoid functions (middle) and a feedforward perceptron with one input, one output and one hidden layers, along with the weights W(1) and W(2) linking the input and output layers to the hidden layer (bottom)


extended to include more than one hidden layer, yielding the multilayer perceptron (Fig. 17.2). Now, recalling the universal approximation theorem, a network with one hidden layer can approximate any bounded function with arbitrary accuracy. Similarly, a network with two hidden layers can approximate any function with arbitrary accuracy.
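The binary classification discussion above can be made concrete with a short Python sketch. The following code (a minimal illustration; the toy data, learning rate and number of iterations are assumed values, not prescriptions from the text) trains a single-neuron logistic classifier by steepest descent on the cross-entropy error:

```python
import numpy as np

def sigmoid(x):
    # logistic function g(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic(X, c, lr=0.1, n_iter=2000):
    """Single-neuron (logistic regression) classifier.
    X: (n, m) inputs; c: (n,) binary targets (0/1)."""
    n, m = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])      # append a column of ones for the bias b
    w = np.zeros(m + 1)
    for _ in range(n_iter):
        y = sigmoid(Xb @ w)                   # network output
        grad = Xb.T @ (y - c) / n             # gradient of the cross-entropy error
        w -= lr * grad                        # steepest-descent step
    return w

# toy example with two linearly separable classes (assumed data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (50, 2)), rng.normal(1, 0.5, (50, 2))])
c = np.repeat([0, 1], 50)
w = fit_logistic(X, c)
pred = (sigmoid(np.hstack([X, np.ones((100, 1))]) @ w) > 0.5).astype(int)
print("training accuracy:", (pred == c).mean())
```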

17.2.2 General Structure of Neural Networks The progress in NNs is mainly triggered by the advent in computing power. Figure 17.2 shows a schematic of a general representation of a neural network model. The model shown in Fig. 17.2 represents an example of a feedforward NN because the flow is forward. The circles represent the neurons, or units, and the lines or arrows represent the weights. Each NN comprises the input and output layers as well as a finite number of hidden layers represented, respectively, by filled and open circles (Fig. 17.2). The output of the ith layer is used as input to the next (i + 1)th layer. The response of a unit is called activation. Figure 17.3 shows a diagram reflecting the relationship between the jth neuron of the (i + 1)th layer, denoted (j, i + 1), and the neurons of the previous layer. In general, the output (or activation) y_j^{(i+1)} of the unit (j, i + 1) is a nonlinear function, which may depend on the outputs y_k^{(i)} of the units (k, i), k = 1, . . . , m, as

y_j^{(i+1)} = g_{(i+1)}( Σ_{k=1}^m w_{kj}^i y_k^{(i)} + ε_j^{(i+1)} ),    (17.1)

where the weights (w_{kj}^i) represent interconnecting parameters between units, ε_j^{(i+1)} is the bias (or offset) and g_{(i+1)}(.) is the transfer function characterising the (i + 1)th layer (Rumelhart et al. 1994; Haykin 2009; Bishop 2006). Note that the transfer (or sigmoid) function g_{(1)}(.) of the input layer, which contains the input data, is simply the identity. The transfer functions are normally chosen from a specific family referred to as sigmoid or squashing functions. Sigmoid functions were considered in Chap. 12. The main property of these sigmoid functions is that they are positive and go to zero and one as the argument goes, respectively, to minus and plus infinity, i.e. g(x) ≥ 0, and g(x) → 0 as x → −∞ and g(x) → 1 as x → +∞. In most cases the sigmoid functions are monotonic increasing. The hyperbolic tangent function is often the most used sigmoid. Other transfer functions, such as the logistic and cumulative distribution functions, are also commonly used. The cumulative distribution function of the logistic distribution is

g(x) = (1 + exp(a^T x + b))^{−1} = (1/2)(1 − tanh((a^T x + b)/2)).


Fig. 17.2 Schematic representation of an example of multilayer feedforward neural network model with 5 layers

Fig. 17.3 Relation between a given neuron of a given layer and neurons from the previous layer


The following two functions are also used sometimes:
• threshold function, e.g. g(x) = 1_{x>0}, and
• piecewise linear, g(x) = (x + 1/2) 1_{]−1/2, 1/2[}(x) + 1_{[1/2, ∞[}(x).
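As a small illustration, these transfer functions can be written in a few lines of Python (a sketch; the functional forms follow the definitions given above):

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))      # smooth sigmoid with values in (0, 1)

def threshold(x):
    return (x > 0).astype(float)         # hard 0/1 step at x = 0

def piecewise_linear(x):
    # 0 for x <= -1/2, x + 1/2 on (-1/2, 1/2), 1 for x >= 1/2
    return np.clip(x + 0.5, 0.0, 1.0)

x = np.linspace(-3, 3, 7)
print(logistic(x), threshold(x), piecewise_linear(x), np.tanh(x), sep="\n")
```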

One of the main reasons behind using squashing functions is to reduce the effect of extreme input values on the performance of the network (Hill et al. 1994). Many of the sigmoid functions have other nice properties, such as having simple derivatives, see above for the example of the logistic function. This property is particularly useful during the optimisation process. For the output layer, the activation can also be a linear or threshold function. For example, in the case of classification, the activation of the output layer is a threshold function equaling one if the input belongs to its class and zero otherwise. This has the effect of identifying a classifier, i.e. a function g(.) from the space of all objects (inputs) into the set of K classes where the value of each point is either zero or one (Ripley 1994).

Remarks
• The parameter a in the sigmoid function determines the steepness of the response.
• In the case of scalar prediction of a time series, the output layer contains only one unit (or neuron).
• Note that there are also forward–backward as well as recurrent NNs. In recurrent NNs connections exist between output units.

NNs can have different designs depending on the tasks. For instance, to compare two multivariate time series x_k and z_k, k = 1, 2, . . . , n, where x_k is p-dimensional and z_k is q-dimensional, a three-layer NN can be used. The input layer will then have p units (neurons) and the output layer will have q units. Using a three-layer NN in which the transfer functions of the second and third layers g_(2)(.) and g_(3)(.) are, respectively, hyperbolic tangent and linear functions, with p and q units in the input and output layers, Cybenko (1989), Hornik et al. (1989) and Hornik (1991) showed that it is possible to approximate any continuous function from R^p to R^q if the second layer contains a sufficiently large number of neurons. The NN is trained by finding the optimal values of the weight and bias parameters which minimise the error or costfunction. This error function is a measure of the proximity of the NN output O from the desired target T and can be computed using the squared error ‖O − T‖², which is a function of the weights (w_ij). For example, when comparing two time series x = {x_t, t = 1, . . . , n} and z = {z_t, t = 1, . . . , n}, the costfunction takes the form:

J = < ‖z − O(x, θ)‖² >,

(17.2)

where θ represents the set of parameters, i.e. weights and biases, O is the output from the NN and “< >” is a time-average operator. The parameters are then required to satisfy


∇θ J = 0.

(17.3)

The minimisation is generally carried out using quasi-Newton or conjugate gradient algorithms, see Appendix E. These types of algorithms involve iteratively changing the weights by small steps in the direction of −∇_θ J. The other and mostly used alternative is to apply a backpropagation algorithm (Rumelhart et al. 1986; Hertz et al. 1991), which is similar to a backward integration of the adjoint operator of the linearised equations in numerical weather prediction. In the case of prediction via NNs, the predictor variables are used as inputs and the predictands as outputs. In this case the costfunction of a three-layer model, as in Fig. 17.1 (bottom panel), is

J = Σ_k (z_k − z_k^o)²,    (17.4)

where z_k = Σ_j w_{jk}^{(2)} y_j + b_k^{(2)}, y_j = tanh(Σ_i w_{ij}^{(1)} x_i + b_j^{(1)}), and z_k^o is the kth observed value (target). The NN model is normally trained, i.e. estimating its parameters, by using a first subset of the data (the training set) and then the second subset for forecasting (the testing subset). Remark Different architectures can yield different networks. Examples of special networks include convolutional (LeCun et al. 2015) and recurrent (e.g. Haykin 2009) networks. Convolutional networks act directly on matrices or tensors (for images) overcoming, hence, the difficulty resulting from transforming those structures into vectors, as is the case in conventional fully connected networks (Watt et al. 2020). Transforming, for example, images into vectors yields a loss of the spatial information. Not all networks are feedforward. There are networks that contain feedback connections. These are the recurrent NNs. Recurrent networks differ from conventional feedforward NN by the fact that they have at least one feedbackward loop. They can be seen as multiple copies of the same network and are used to analyse sequential data such as texts and time series. Another type of network is support vector machine (SVM) pioneered by Boser et al. (1992). SVM is basically a learning machine with a feedforward network having a single hidden layer of nonlinear units. It is used in pattern recognition and nonlinear regression, through a nonlinear mapping from the input space into a high-dimensional feature space (see Chap. 13). Within this feature space, the problem becomes linear and could be solved at least theoretically. For example, in binary classification the problem boils down to constructing a hyperplane maximally separating the two classes (Haykin 2009; Bishop 2006).


Fig. 17.4 An example of a linear perceptron showing a feedforward neural network with one input layer, one output layer and no hidden layers

17.2.3 Examples of Architectures The simplest NN architecture corresponds to a single-layer feedforward with a single or multiple outputs (Fig. 17.4), where the activation or output z_j(x) of the jth neuron, j = 1, . . . , m, is given by

z_j(x) = g(a_0 + Σ_i w_{ij} x_i),    (17.5)

where a_0 is the bias parameter. The term a_0 + Σ_i w_{ij} x_i is the net input to the jth neuron. Intuitively speaking, a_0 + Σ_i w_{ij} x_i represents the total level of voltage exciting the jth neuron, and z_j(x) represents the intensity of the resulting output (the activation level) of the neuron (Werbos 1990). The simplest case is obtained when g(.) is the identity and with one output. In this case we get

z(x) = a_0 + Σ_{i=1}^d w_i x_i,

known in classification as the discriminant function. In general, however, the function g() is chosen to be of sigmoid/squashing type, well suited for regression or classification problems. Other forms also exist, such as splines or RBFs (Appendix A), when the output takes the form z(x) = Σ_{i=1}^m w_i φ_i(x), as with support vector machines, mentioned above. The multilayer perceptron (MLP) is an extension of the single layer and


Fig. 17.5 Schematic representation of neural network model with multiple layers

includes one or more hidden layers (Fig. 17.5). An architecture with more than one hidden layer leads to the nested sigmoid scheme, see e.g. Poggio and Girosi (1990):

z_l(x) = g(a_0 + Σ_i w_{il}^{(1)} g(a_1 + Σ_k w_{ki}^{(2)} g( . . . g(a_p + Σ_α w_α x_α) . . . ))).    (17.6)

Note that each layer can have its own sigmoid although, in general, the same sigmoid is used for most layers. When the transfer function is an RBF φ(.) (Appendix A), such as the Gaussian probability density function, one obtains an RBF network, see e.g. chap. 5 of Haykin (2009), with the mapping:

z(x) = Σ_{i=1}^m w_i φ_i(x),    (17.7)

where φ_i(x) = φ(‖x − c_i‖/d_i). The parameter c_i is the centre of φ_i(.) and d_i its scaling factor. In this case the distance between the input x and the centres (or weights) is used instead of the standard dot product Σ_i w_{ij} x_i. Common choices of RBFs include the Cauchy distribution, multiquadratic and its inverse, the Gaussian function and thin-plate spline. RBF networks1 can model

1 There is also another related class of NNs, namely probabilistic NNs, derived from RBF networks,

most useful in classification problems. They are based on estimating the pdfs of the different classes (Goodfellow et al. 2016).


any shape of function, and therefore the number of hidden layers can be reduced substantially compared to MLP. RBF networks can also be trained extremely quickly compared to MLP. The learning process is achieved through the training set. If no training set is available, the learning is unsupervised, also referred to sometimes as self-organisation, which is presented later. Example (Autoregressive Model) A two-layer NN model connected by linear transfer functions with inputs being the lagged values of a time series, xt , . . . xt−d and whose output is the best prediction xˆt+1 of xt+1 from previous values reduces to a simple autoregressive model. The application section below discusses other examples including nonlinear principal component analysis.
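The autoregressive example can be checked directly: with linear transfer functions and lagged values as inputs, training the two-layer network by least squares is the same as fitting an AR model. A minimal sketch (illustrative only; the AR(2) coefficients and noise level are assumed for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)
# simulate an AR(2) process x_t = 0.6 x_{t-1} - 0.3 x_{t-2} + noise
n, phi1, phi2 = 2000, 0.6, -0.3
x = np.zeros(n)
for t in range(2, n):
    x[t] = phi1 * x[t-1] + phi2 * x[t-2] + rng.normal(scale=0.5)

d = 2                                                        # number of lagged inputs
X = np.column_stack([x[d-1-k:n-1-k] for k in range(d)])      # inputs (x_t, x_{t-1})
y = x[d:]                                                    # target x_{t+1}
# linear "network": one output neuron with identity transfer -> ordinary least squares
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated AR coefficients:", w)                       # close to (0.6, -0.3)
```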

17.2.4 Learning Procedure in NNs The process of changing the weights during the minimisation of the costfunction in NNs makes the training or learning algorithm. The most known minimisation algorithm in NNs is the backpropagation (e.g. Watt et al. 2020). It is based on taking small steps Δw_{ij}, controlled by a fixed learning rate η, in the direction of the steepest descent, i.e. following −∇J, as

w_{ij}^{new} = w_{ij}^{old} + Δw_{ij} = w_{ij}^{old} − η ∂J/∂w_{ij}.

This descent is controlled by the learning rate η. Consider, for example, the feedforward NN with two (input and output) layers and one hidden layer (Fig. 17.1). Let x_1, . . . , x_d be the actual values of the input units (in the input layer), which will propagate to the hidden layer. The response (or activation) value h_l at unit l of the hidden layer takes the form

h_l = g_h(Σ_{i=1}^d w_{il}^{(1)} x_i + ε_l^{(1)}),

where g_h(.) is the activation function of the hidden layer. These responses will then make the inputs to the output layer “o” (Fig. 17.1) so that the kth output o_k takes the form

o_k = g_o(Σ_l w_{lk}^{(2)} h_l + ε_k^{(2)}) = g_o(ε_k^{(2)} + Σ_l w_{lk}^{(2)} g_h(Σ_{i=1}^d w_{il}^{(1)} x_i + ε_l^{(1)})),

where go (.) is the activation function of the output layer. Various algorithms exist to minimise the objective (or loss) function. These include conjugate gradient and quasi-Newton algorithms (Appendix E), in addition


to other algorithms such as simulated annealing, see e.g. Hertz et al. (1991) and Hsieh and Tang (1998). Because of the nested sigmoid architecture, the conventional chain rule to compute the gradient can easily become confusing and erroneous particularly when the network grows complex. The most popular algorithm used for supervised learning is the (error) backpropagation algorithm (see e.g. Amari 1990). At its core, backpropagation is an efficient (and exact) way to compute the gradient of the costfunction in only one pass through the system. Backpropagation is the equivalent of the adjoint method, i.e. backward integration of the adjoint equations, in variational data assimilation (Daley 1991). Backpropagation proceeds as follows. Let y_i^α denote the activation of the ith unit from layer α (Fig. 17.6), with α = 1, . . . , L (the values 1 and L correspond, respectively, to the input and output layers). Let also x_i^α denote the input to the ith neuron of layer α + 1 prior to the sigmoid activation and w_{ij}^α the weights between layers α and α + 1 (Fig. 17.6); that is,

x_i^α = Σ_j w_{ji}^α y_j^α and y_i^{α+1} = g(x_i^α).    (17.8)

Here we have dropped the bias function for simplicity, and we consider only one sigmoid function g(.). If there are different sigmoids for different layers, then g(.) will be replaced by g_α(.). The costfunction then takes the form

J = ‖z° − y^L‖² = Σ_{k=1}^m (z_k° − y_k^L)²    (17.9)

for one input–output pair, and more generally,

J = Σ_{n=1}^C ‖z°_n − y^L_n‖²    (17.10)

for more than one input–output pair. Differentiating J, we get

∂J/∂w_{ji}^α = (∂J/∂x_i^α)(∂x_i^α/∂w_{ji}^α) = (∂J/∂x_i^α) y_j^α,    (17.11)

and using Eq. (17.8), we get ∂J/∂x_i^α = (∂J/∂y_i^{α+1}) g′(x_i^α), i.e.

∂J/∂w_{ji}^α = (∂J/∂y_i^{α+1}) g′(x_i^α) y_j^α.    (17.12)

Fig. 17.6 Illustration of a structure in a feedforward multilayer NN used to explain how backpropagation works (see text for details)

For α = L − 1, the partial derivative ∂J/∂y_i^{α+1} in Eq. (17.12) is easily computed for the output layer using Eq. (17.9) or Eq. (17.10). For the αth hidden layer, 1 ≤ α ≤ L − 2, we observe that y_j^{α+2} = g(x_j^{α+1}) = g(Σ_k w_{kj}^{α+1} y_k^{α+1}), i.e. the term y_i^{α+1} appears in x_j^{α+1} for all j. Therefore,

∂J/∂y_i^{α+1} = Σ_j (∂J/∂x_j^{α+1})(∂x_j^{α+1}/∂y_i^{α+1}) = Σ_j (∂J/∂x_j^{α+1}) w_{ij}^{α+1}.    (17.13)

The term ∂J/∂x_j^{α+1} is computed as in Eq. (17.12), i.e. ∂J/∂x_j^{α+1} = (∂J/∂y_j^{α+2}) g′(x_j^{α+1}), and Eq. (17.13) becomes

∂J/∂y_i^{α+1} = Σ_j (∂J/∂y_j^{α+2}) g′(x_j^{α+1}) w_{ij}^{α+1}    (17.14)

for 1 ≤ α ≤ L − 2. That is, the gradient is computed recursively starting from the output layer and going backward. Hence in the process of learning of a feedforward neural network, activation values are propagated forward as signals to the output layer. The error signal is then computed and propagated backward to enable computing the gradient of the costfunction. The weights (and consequently the costfunction) are then updated according to the rule given in the beginning of this section. This process is repeated until the costfunction falls below some tolerance level. To avoid overfitting, etc., it is desirable to reduce the size of the network by limiting the connections between neurons by fixing, for example, some weights to zero so that they do not add to the computational burden.
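The recursion (17.8)–(17.14) can be coded in a few lines. The following Python sketch (an illustration, not the book's code; network sizes and data are assumed, biases are omitted and the same sigmoid is used for all layers) propagates activations forward, back-propagates the error signal to obtain the gradient of the squared-error costfunction, and checks the result against a finite-difference estimate:

```python
import numpy as np

def g(x):  return np.tanh(x)            # single sigmoid used for all layers
def dg(x): return 1.0 - np.tanh(x)**2   # its derivative g'(x)

def forward(weights, x):
    """Return activations y^alpha and pre-activations x^alpha for each layer."""
    ys, xs = [x], []
    for W in weights:
        xs.append(ys[-1] @ W)            # x_i^alpha = sum_j w_ji^alpha y_j^alpha
        ys.append(g(xs[-1]))             # y_i^(alpha+1) = g(x_i^alpha)
    return ys, xs

def backprop(weights, x, z_target):
    ys, xs = forward(weights, x)
    grads = [None] * len(weights)
    delta_y = 2.0 * (ys[-1] - z_target)              # dJ/dy^L for J = ||z - y^L||^2
    for a in reversed(range(len(weights))):
        delta_x = delta_y * dg(xs[a])                # dJ/dx^alpha, cf. Eq. (17.12)
        grads[a] = np.outer(ys[a], delta_x)          # dJ/dw^alpha = y^alpha dJ/dx^alpha
        delta_y = weights[a] @ delta_x               # dJ/dy^alpha, cf. Eq. (17.13)-(17.14)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(3, 2))]   # two weight matrices
x, z = rng.normal(size=4), rng.normal(size=2)
grads = backprop(weights, x, z)

# finite-difference check of one weight
eps, (i, j) = 1e-6, (1, 0)
J0 = np.sum((z - forward(weights, x)[0][-1]) ** 2)
weights[0][i, j] += eps
J1 = np.sum((z - forward(weights, x)[0][-1]) ** 2)
print("backprop:", grads[0][i, j], "finite difference:", (J1 - J0) / eps)
```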


17.2.5 Costfunctions for Multiclass Classification The background section illustrates briefly how supervised learning works. We continue here that discussion on multiclass classification with a little more focus on the costfunction to be minimised. The binary (two-class) case is related precisely to the Bernoulli random variable (Appendix B). The Bernoulli distribution has two outcomes, e.g. 0 and 1, with respective probabilities q and p = 1 − q, and its pdf is p^x (1 − p)^{1−x} (Appendix B). Based on this probabilistic interpretation, the conditional distribution of the target z, given the input x parametrised by the weights w, is precisely a Bernoulli distribution with pdf y^z (1 − y)^{1−z}, where y is the output from the perceptron using the sigmoid (logistic) function. Now, given a training set (x_1, z_1), . . . , (x_n, z_n), the likelihood is precisely Π_{i=1}^n y_i^{z_i} (1 − y_i)^{1−z_i}. The weights are then obtained by maximising the likelihood or minimising minus the log-likelihood, i.e.

w = argmin_w [ − Σ_{i=1}^n (z_i log y_i + (1 − z_i) log(1 − y_i)) ].

Remark The inverse of the logistic function is known as the logit function. It is used in generalised linear models (McCullagh and Nelder 1989) in the logistic regression model, i.e. log(y/(1 − y)) = w^T x, in which y is the binary/Bernoulli response variable, i.e. the probability of belonging to one of the two classes. The logit is also known as the link function of this generalised linear model. The above illustration shows the equivalence between the single-layer perceptron and the logistic regression. The above Bernoulli distribution can be extended to K outcomes with respective probabilities p_1, . . . , p_K, Σ p_k = 1. Similarly, for a K-class classification, the likelihood for a given input x with targets (classes) z_1, . . . , z_K in {0, 1} and outputs y_k = Pr(z_k = 1|x, w), k = 1, . . . , K, is obtained as a generalisation of the two-class model to yield Π_{k=1}^K y_k^{z_k}(x, w). With a training set x_i, z_{ij}, i = 1 : n, j = 1 : K, the K-class cross-entropy takes the form − Σ_{ik} z_{ki} log y_k(x_i, w). The K-class logistic function can be obtained based on the Bayes theorem, i.e. Pr(k|x) ∝ Pr(x|k)Pr(k), where Pr(k|x) is the probability of class k, given x, and hence y_k can be written as y_k(x, w) = exp(a_k)/(Σ_j exp(a_j)), referred to as the softmax function (e.g. Bishop 2006). Remark As outlined in Sect. 17.2.1, for regression problems the activation function of the output unit is linear (in the weights), and the costfunction can simply be the quadratic norm of the error.
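A compact Python sketch of the softmax output and the K-class cross-entropy described above (array shapes and values are assumed for illustration):

```python
import numpy as np

def softmax(a):
    # y_k = exp(a_k) / sum_j exp(a_j), computed row-wise with a stability shift
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(Z, Y):
    # Z: (n, K) one-hot targets z_ki; Y: (n, K) predicted class probabilities
    return -np.sum(Z * np.log(Y + 1e-12))

A = np.array([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])   # net inputs a_k (assumed values)
Z = np.array([[1, 0, 0], [0, 0, 1]])                # one-hot targets
print(cross_entropy(Z, softmax(A)))
```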


17.3 Self-organising Maps 17.3.1 Background Broadly speaking, there are two main classes of training networks, namely supervised and unsupervised training. The former refers to the case in which a target output exists for each input pattern (or observation) and the network learns to produce the required outputs. When the pair of input and output patterns is not available, and we only have a set of input patterns, then one deals with unsupervised training. In this case, the network learns to identify the relevant information from the available training sample. Clustering is a typical example of unsupervised learning. A particularly interesting type of unsupervised learning is based on what is known as competitive learning, in which the output network neurons compete among themselves for activation resulting, through self-organisation, in only one activated neuron: the winning neuron or the best matching unit (BMU), at any time. The obtained network is referred to as self-organising map (SOM) (Kohonen 1982, 2001). SOM or a Kohonen network (Rojas 1996) has two layers, the input and output layers. In SOM, the neurons are positioned at the nodes of a usually low-dimensional (one- or two-dimensional) lattice. The positions of the neurons follow the principle of neighbourhood so that neurons dealing with closely related input patterns are kept close to each other following a meaningful coordinate system (Kohonen 1990; Haykin 1999, 2009). In this way, SOM maps the input patterns onto the spatial locations (or coordinates) of the neurons in the low-dimensional lattice. This kind of discrete maps is referred to as topographic map. SOM can be viewed as a nonlinear generalisation of principal component analysis (Ritter 1995; Haykin 1999, 2009). SOM is often used for clustering and dimensionality reduction (or mapping) and also prediction (Vesanto 1997).

17.3.2 SOM Algorithm Mapping Identification and Kohonen Network Each input pattern x = (x1 , . . . , xd )T from the usually high-dimensional input space is mapped onto i(x) in the output space, i.e. neuron lattice (Fig. 17.7, left panel). A particularly popular SOM model is the Kohonen network, which is a feedforward system in which the output layer is organised in rows and columns (for a 2D lattice). This is shown schematically in Fig. 17.7 (right panel).


Fig. 17.7 Schematic of the SOM mapping from the high-dimensional feature space into the low-dimensional space (left) and the Kohonen SOM network showing the input layer and the computational layer (or SOM grid) linked by a set of weights (right)

Training of SOM Each neuron in the low-dimensional (usually two-dimensional) SOM grid is associated with a d-dimensional weight vector, also known as prototype, codebook vector and synaptic weight. The training of SOM is done iteratively. In each step one sample vector x from the input data is chosen randomly, and the distance ‖x − w_j‖ between x and all the prototype (weight) vectors is computed. The best matching unit (BMU), or the winning neuron, is identified. So the BMU is the unit whose weight vector is the closest to input sample x. The weight vectors are then updated as detailed below, in such a way that the BMU gets closer to the input vector x and similarly for the BMU neighbours. Figure 17.8 (left panel) shows an example of updating units. So, both the weight vector of the BMU and its neighbours are updated, by moving them a little closer to the input data sample (Fig. 17.8, left). This yields a kind of stretching of the grid around the BMU as the associated weight vector moves towards the input data vector. In other words, during this training process, SOM constructs a kind of elastic net, which folds onto the input data clouds, approximating, hence, the data probability distribution function (Kohonen 2001). This is comparable somewhat to the non-metric multidimensional scaling (MDS) presented in Chap. 9 (Sect. 9.4). The goal in non-metric MDS is to preserve the monotonicity of interpoint distances, whereas the objective here is to preserve the proximity with no particular concern for monotonicity. The neighbours are defined based on a neighbourhood function. The neighbourhood function is a kernel-type function centred at the winner unit, such as a Gaussian or box (bubble) function, i.e. one around the BMU and zero elsewhere. Discrete neighbourhood functions can also be used, including rectangular (Fig. 17.8, middle) or hexagonal (Fig. 17.8, right) lattices. These can have different sizes, e.g. 0, 1 and 2, where the innermost polygon corresponds to the 0-neighbour (only the winner unit is considered), the second polygon to the 1-neighbour and the largest to the 2-neighbour. Let w_j = (w_{j1}, . . . , w_{jd})^T be the synaptic weight corresponding to the jth, j = 1, . . . , M, output neuron, with d being the input space dimension. The index i(x) of the neuron associated with the image of x (Fig. 17.7, right panel) is the one maximising the scalar product x^T w_j or, equivalently, minimising the Euclidean distance ‖x − w_j‖, i.e.


Fig. 17.8 Left: schematic of the updating process of the best matching unit and its neighbours. The solid and dashed lines represent, respectively, the topological relationships of the SOM before and after updating, the units are at the crossings of the lines and the input sample vector is shown by x. The neighbourhood is shown by the eight closest units in this example. Middle and right: rectangular and hexagonal lattices, respectively, and the units are shown by the circles. In all panels the BMU is shown by the blue filled circles

i(x) = argmin_{j=1,...,M} ‖x − w_j‖.    (17.15)

Equation (17.15) summarises the competitive process among the M output neurons in which i(x) is the winning neuron or BMU (Fig. 17.8, left). SOM has a number of components that set up the SOM algorithm. Next to the competitive process, there are two more processes, namely co-operative and adaptive processes, as detailed below. The co-operative process concerns the neighbourhood structure. The winning neuron determines a topological neighbourhood since a firing neuron tends to excite nearby neurons more than those further away. A topology is then defined in the SOM lattice, reflecting the lateral interaction among a set of excited neurons (Fig. 17.8). Let d_kl denote the lateral distance between neurons k and l on the SOM grid. This then allows to define the topological neighbourhood h_{j,i(x)}. As is mentioned above, various neighbourhood functions can be used. A typical choice for this topology or neighbourhood is given by the Gaussian function:

h_{j,i(x)} = exp(−d²_{j,i(x)} / (2σ²)),    (17.16)

where d_{j,i(x)} = ‖r_j − r_{i(x)}‖ denotes the lateral distance on the SOM grid between, respectively, the winning and excited neurons i(x) and j, and r_j and r_{i(x)} are the positions of neurons j and i(x) on the same grid. A characteristic feature of SOM is that the neighbourhood size shrinks with time (or iteration). A typical choice of this shrinking is given by


σ(n) = σ_0 exp(−n/τ_0),    (17.17)

where n is the iteration time step, σ0 is an initial value of σ and τ0 is a constant. Note also that, in some algorithms, the left hand side of Eq. (17.16) is multiplied by a factor α(n), which is a linearly and slowly decreasing function of the iteration n, such as α(n) = 0.95(1 − n/1000) when using 1000 iterations, in order to accelerate the convergence. Remark It is interesting to note that if the neighbourhood function is a delta function at the BMU, i.e. hj,i(x) = δi(x) (1 at the BMU and 0 elsewhere), then SOM simply reduces to the k-means clustering algorithm (Moody and Darken 1989). So, k-means is a simple particular case of SOM. The adaptive process regards the way of learning process leading to the selforganisation of the outputs, and the feature map is formed. The topographic neighbourhood mirrors the fact that the weights of the winning and also neighbouring neurons get adapted, though not by an equal amount. In practice, the weight update is given by

Δw_j = w_j(n + 1) − w_j(n) = η(n) h_{j,i(x)} (x − w_j(n)),

(17.18)

which is applied to all neurons in the lattice within the neighbourhood of the winning neuron (Fig. 17.8). The discrete time learning rate is given by η(n) = η_0 exp(−n/τ_1). The values η_0 = 0.1 and τ_1 = 1000 are typical examples that can be used in practice (Haykin 1999). Also, in Eq. (17.17), the value τ_0 = 1000/log(σ_0) can be adapted in practice (Haykin 1999). The neighbourhood of neurons can be defined by a radius within the 2D topological map of neurons. This neighbourhood decreases monotonically with iterations. A typical initial value of this neighbourhood radius could be of the order O(√N), e.g. √N/2, for a sample size N. The size of the topological map, that is, the number of neurons M, can be learned from experience, but typical values can be of the order O(√N). For example, for a two-dimensional SOM, the grid can have a total number of, say, 5√N neurons, e.g. (Vesanto and Alhoniemi 2000). Remark The SOM error can be computed based on the input data, weight vectors and neighbourhood function. For a fixed neighbourhood function, the SOM error function is E_SOM = Σ_{t=1}^N Σ_{j=1}^M h_{j,i(x_t)} ‖x_t − w_j‖², where h_{j,i(x_t)} represents the neighbourhood function centred at unit i(x_t), i.e. the BMU of input vector x_t. In order not to end in a metastable state, the adaptive process has two identifiable phases, namely the ordering or self-organising and the convergence phases. During the former phase, the topological ordering of the weight vectors takes place within around 1000 iterations of the SOM algorithm. In this context, the choice of η_0 = 0.1 and τ_1 = 1000 is satisfactory, and the previous choice of τ_0 in Eq. (17.17) can also be adopted. During the convergence phase, the feature map is fine tuned providing, hence, an accurate statistical quantification of the input patterns, which


takes a number of iterations, around 500 times the number of neurons in the network (Haykin 1999).

Summary of SOM Algorithm The different steps of the SOM algorithm can be summarised as follows:
• (1) Initialisation—Randomly choose initial values of the weights w_j(0), j = 1, . . . , M, associated with the M neurons in the lattice. Note that these weights can also be chosen from the input data x_t, t = 1, . . . , N, where N is the sample size.
• (2) Sampling—Draw a sample input vector x from the input space.
• (3) Neuron identification—Find the winning neuron i(x) based on Eq. (17.15).
• (4) Updating—Update the synaptic weights w_j, j = 1, . . . , M, following Eq. (17.18).
• (5) Continuation—Go to step 2 above until the feature map stops changing significantly.
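The sketch below is a bare-bones Python illustration of these steps for a two-dimensional lattice, with the Gaussian neighbourhood (17.16), the shrinking width (17.17) and the update rule (17.18); the grid size, learning-rate and decay constants are assumed values, not prescriptions.

```python
import numpy as np

def train_som(X, n_rows=6, n_cols=6, n_iter=2000, eta0=0.1, sigma0=3.0, tau1=1000.0):
    """X: (N, d) input data; returns (n_rows*n_cols, d) prototype (weight) vectors."""
    rng = np.random.default_rng(0)
    N, d = X.shape
    # lattice positions r_j and initial weights drawn from the data (step 1)
    r = np.array([(i, j) for i in range(n_rows) for j in range(n_cols)], dtype=float)
    W = X[rng.choice(N, n_rows * n_cols, replace=False)].copy()
    tau0 = n_iter / np.log(sigma0)
    for n in range(n_iter):
        x = X[rng.integers(N)]                          # step 2: sample an input vector
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))     # step 3: winning neuron, Eq. (17.15)
        sigma = sigma0 * np.exp(-n / tau0)              # shrinking neighbourhood, Eq. (17.17)
        eta = eta0 * np.exp(-n / tau1)                  # decaying learning rate
        d2 = ((r - r[bmu]) ** 2).sum(axis=1)            # squared lattice distances to the BMU
        h = np.exp(-d2 / (2.0 * sigma ** 2))            # Gaussian neighbourhood, Eq. (17.16)
        W += eta * h[:, None] * (x - W)                 # step 4: update, Eq. (17.18)
    return W

# toy data: 500 points in 3 dimensions (assumed)
X = np.random.default_rng(1).normal(size=(500, 3))
prototypes = train_som(X)
print(prototypes.shape)
```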

17.4 Random Forest In machine learning, supervised learning refers to applications in which the training set is composed of pairs of input data and their corresponding target output. Classification in which the target (or classes) is composed of finite number of discrete categories is a good example in which the classes of the training data input are known. When the target data are composed of continuous variables, one gets regression. Random forest (RF) is a supervised learning algorithm for classification and regression problems (Breiman 2001), based on what is known as decision trees, which are briefly described below.

17.4.1 Decision Trees What Are They? Decision trees are the building blocks of random forests. They aim, based on a training set, at predicting the output of any data from the input space. A decision tree is based on a sequence of binary selections and looks like a (reversed) tree (Bishop 2006). Precisely, the input sample is (sequentially) split into two or more homogeneous sets based on the main features of the input variables. The following simple example illustrates the basic concept. Consider the set {−5, −3, 1, 2, 6},


Fig. 17.9 Decision tree of the simple dataset shown in the top box

which is to be separated using the main features (or variables), namely sign (+/−) and parity (even/odd). Starting with the sign, the binary splitting yields the two sets {1, 2, 6} and {−3, −5}. The last set is homogeneous, i.e. a class of negative odds. The other set is not, and we use the parity feature to yield {1} and {2, 6}. This can be summarised by a decision tree shown in Fig. 17.9. Each decision tree is composed of:
• root node—containing the entire sample,
• splitting node,
• decision node—a subnode that splits into further subnodes,
• terminal node or leaf—a node that cannot split further,
• branch—a subtree of the whole tree and
• parent and child nodes.

In the above illustrative example, the root node is the whole sample. The set {1, 2, 6}, for example, is a decision node, whereas {2, 6} is a terminal node (or leaf). Also, the last subset, i.e. {2, 6}, is a child node of the parent node {1, 2, 6}. The splitting rule in the above example is quite simple, but for real problems more criteria are used depending on the type of problem at hand, namely regression or classification, which are discussed next.

Classification and Regression Trees As is mentioned above, decision trees attempt to partition the input (or predictor) space using a sequence of binary splitting. This splitting is chosen to optimise a splitting criterion, which depends on the nature of the predictor, e.g. discrete/categorical versus continuous, and the type of problem at hand, e.g. classification versus regression. In the remaining part of this chapter we assume that our training set is composed of n observations (x1 , y1 ), . . . , (xn , yn ), where, for each k, k = 1, . . . n, xk = (xk1 , . . . , xkd )T and contains the d variables (or


features), and y_k is the response variable. The splitting proceeds recursively with each unsplit node by looking for the “best” binary split. Case of Categorical Predictors We discuss here the case of categorical predictors. A given unsplit (parent) node, containing a subset x_1, . . . , x_m of the training set, is split into two child nodes: a left (L) node and a right (R) node. For regression, a common criterion F is the mean square residual or variance of the response variable. For this subsample in the parent node, the costfunction (variance) is

F = (1/m) Σ_{i=1}^m (y_i − ȳ)²,    (17.19)

where ȳ is the average response of this subset. Now, if F_L and F_R are the corresponding variances for the (left and right) child nodes, then the best split is based on the variable (or feature) yielding maximum variance reduction, i.e. max|F − F_LR|, or similarly minimising F_LR, i.e. min(F_LR), where F_LR = m^{−1}(m_L F_L + m_R F_R), with m_L and m_R being the sample sizes of the left and right (child) nodes, respectively. Note again that to obtain the best result, every possible split based on every variable (among the d variables) is considered. The splitting criterion for classification, with, say, K classes, is different. Let p_k designate the proportion of the kth class in a given (unsplit) node, i.e. p_k = m^{−1} Σ_{i=1}^m 1_{y_i=k}, where 1_A is the indicator of set A, i.e. 1 if A is true and 0 otherwise. Then the most commonly used criterion is the Gini index (e.g. Cutler et al. 2012):

F = Σ_{k=1}^K p_k (1 − p_k).    (17.20)

• Among the advantages of decision trees: it is a non-parametric method; it is easy to understand and implement; it can use any data type (categorical or continuous).
• Although decision trees can be applied to continuous variables, they are not ideal because of categorisation. But the main disadvantage is that of overfitting, related to the fact that the tree can grow unnecessarily—e.g. ending with one leaf for each single observation.
One way to overcome the main downside mentioned above, i.e. overfitting, is to apply what is known as pruning. Pruning consists of trimming off unnecessary branches starting from the leaf nodes in such a way that accuracy is not much reduced. This can be achieved, for example, by dividing the training set into two subsets, fitting the (trimmed) tree with one subset and validating it with the other subset. The best trimmed tree corresponds to optimising the accuracy of the


validation set. There is another more attractive way to overcome overfitting, namely random forest discussed next.
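To illustrate the splitting criteria of this subsection, the short Python sketch below evaluates, for a single feature, the variance reduction (regression) and the Gini impurity (classification) of a candidate binary split; it is illustrative only and ignores the search over thresholds and features that a real tree performs. The toy data reuse the small set {−5, −3, 1, 2, 6} from the decision-tree example, with assumed responses.

```python
import numpy as np

def variance_criterion(y_parent, y_left, y_right):
    # F_LR = (m_L F_L + m_R F_R) / m; the best split minimises F_LR (cf. Eq. 17.19)
    m, mL, mR = len(y_parent), len(y_left), len(y_right)
    return (mL * np.var(y_left) + mR * np.var(y_right)) / m

def gini(y):
    # Gini impurity sum_k p_k (1 - p_k) of a node with class labels y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1.0 - p))

# toy regression split on the sign of a single feature (assumed responses)
x = np.array([-5.0, -3.0, 1.0, 2.0, 6.0])
y = np.array([0.1, 0.3, 2.0, 2.2, 2.4])
left, right = y[x < 0], y[x >= 0]
print("parent variance:", np.var(y), "after split:", variance_criterion(y, left, right))
print("Gini of labels [0,0,1,1,1]:", gini(np.array([0, 0, 1, 1, 1])))
```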

17.4.2 Random Forest: Definition and Algorithm Random forest is an algorithm based on an ensemble of a large number of individual decision trees aimed to obtain better prediction skill. The random element in the random forest comes from the bootstrap sampling, which is based on two key rules:
• Tree building is based on random subsamples of the training set.
• Splitting nodes is based on random subsets of the variables (or features).
Note that no pruning is done in random forest. The final outcome is then obtained by averaging the predictions of the individual trees for continuous response (for regression) or by combined voting for categorical response (for classification). The algorithm for RF is given below:

Random Forest Algorithm
(1) Begin with the training set (x_1, y_1), . . . , (x_n, y_n), and decide a number I of bootstrap samples and a number N of predictor variables to be chosen among the original d variables.
(2) Set i = 1. Draw a bootstrap subsample B_i placed in a root node.
• Randomly select N predictors.
• Identify the best binary split based on the N selected variables, and split the parent node accordingly.
• Continue until the ith decision tree is obtained.
• If i < I, then set i = i + 1 and go to (2), otherwise go to (3).
(3) For a new data point x, the prediction is given by ŷ_x = I^{−1} Σ_{i=1}^I y_{x,i} for regression and ŷ_x = argmax_y Σ_{i=1}^I 1_{y_{x,i}=y} for classification, where as before y_{x,i} is the predicted response at x from the ith tree.
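A bare-bones version of this algorithm can be obtained by bagging standard regression trees. The sketch below (assuming scikit-learn is available; the number of trees, the choice of d/3 variables per split and the toy data are illustrative) draws bootstrap samples, restricts each split to a random subset of the variables through the max_features argument and averages the individual predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_random_forest(X, y, n_trees=200, n_features=None, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_features = n_features or max(1, d // 3)         # typical choice for regression
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)                    # bootstrap sample B_i
        tree = DecisionTreeRegressor(max_features=n_features,
                                     random_state=int(rng.integers(0, 1_000_000)))
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def predict_random_forest(forest, X):
    return np.mean([t.predict(X) for t in forest], axis=0)   # average over the I trees

# toy data (assumed): y depends nonlinearly on two of five predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=300)
forest = fit_random_forest(X, y)
print(predict_random_forest(forest, X[:3]))
```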

17.4.3 Out-of-Bag Data, Generalisation Error and Tuning Out-of-Bag Data and Generalisation Error A convenient and robust way to get an independent estimate of the prediction error is to use, for each tree in the random forest, the observations that do not get selected in the bootstrap; a tool similar to cross-validation. These data are referred to as out-


of-bag (oob) data. It is precisely these data that are used to estimate the performance of the algorithm, and this works as follows. For each input data (x_t, y_t) from the training set, find the set T_t of those (N_t) trees that did not include this observation. Then, the oob prediction at x_t is given by y_{oob,t}, i.e. N_t^{−1} Σ_{i∈T_t} y_{x_t,i} for regression and argmax_y Σ_{i∈T_t} 1_{y_{x_t,i}=y} for classification, where y_{x_t,i} is the prediction of x_t based on the ith tree of the random forest. These predictions are then used to calculate the oob error, ε_oob, given by (e.g. Cutler et al. 2012) ε_oob = n^{−1} Σ_{t=1}^n (y_t − y_{oob,t})² for regression or n^{−1} Σ_{t=1}^n 1_{y_t ≠ y_{oob,t}} for classification. Parameter Selection To get the best out of the random forest, experience shows that the number of trees in the forest can be chosen to be large enough (Breiman 2001), e.g. several hundreds. Typical values of the selected number of variables in the random forest depend on the type of the problem at hand. For classification, a standard value of N is √d, whereas for regression it is of the order d/3. Remarks Random forest algorithm has a number of advantages inherited from decision trees. The main advantages are accuracy, robustness (to outliers) in addition to being easy to use and reasonably fast. The algorithm can also handle missing values in the predictors and scale well with large samples (Hastie et al. 2009). The algorithm, however, has a few main disadvantages. Random forest tends to be difficult to interpret, in addition to being not very good at capturing relationships involving linear combinations of predictor variables (Cutler et al. 2012).
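The out-of-bag error and the √d rule for classification are also available off the shelf. Assuming scikit-learn, a short illustrative sketch (toy data and parameter values are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 9))
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)       # toy binary target (assumed)

rf = RandomForestClassifier(n_estimators=500,         # a large enough forest
                            max_features="sqrt",      # N = sqrt(d) predictors per split
                            oob_score=True,           # keep out-of-bag predictions
                            random_state=0)
rf.fit(X, y)
print("oob accuracy:", rf.oob_score_, " oob error:", 1.0 - rf.oob_score_)
```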

17.5 Application Machine learning has been applied in weather and climate since the late 90s, with an application of neural network to meteorological data analysis (e.g. Hsieh and Tang 1998). The interest in machine learning application in weather and climate has grown recently in the academia and also weather centres, see e.g. Scher (2020). The application spans a wide range of topics ranging from exploration to weather/climate prediction. This section discusses some examples, and for more examples and details, the reader is referred to the references provided.

17.5.1 Neural Network Application NN Nonlinear PCs Nonlinear EOF, or nonlinear principal component analysis, can take various forms. We have already presented some of the methods in the previous chapters, such as


Fig. 17.10 Schematic representation of the NN design for nonlinear PCA. Adapted from Hsieh (2001b)

independent PCA, PP and kernel EOFs. Another class of nonlinear PCA has also been presented, which originates from the field of artificial intelligence. These are based on neural networks (Oja 1982; Diamantaras and Kung 1996; Kramer 1991; Bishop 1995) and have also been applied to climate data (Hsieh 2001a,b; Hsieh and Tang 1998; Monahan 2000). As is discussed in Sect. 17.2, the neural network model is based on linking an input layer containing the input data to an output layer using some sort of “neurons” and various intermediate layers. The textbook of Hsieh (2009) provides a detailed account of the application of NN to nonlinear PC analysis, which is briefly discussed below. The NN model used by Hsieh and Tang (1998) and Monahan (2000) to construct nonlinear PCs contains five layers, three intermediate (or hidden), one input and one output layers. A schematic representation of the model is shown in Fig. 17.10. A nonlinear function maps the high-dimensional state vector x = x1 , . . . , xp onto a low-dimensional state vector u (one-dimensional in this case). Then, another

nonlinear transformation maps u onto the state space vector z = (z_1, . . . , z_p) from the original p-dimensional space. This mapping is achieved by minimising the costfunction J = < ‖x − z‖² >. These transformations (or weighting functions) are

f(x) = f_1(W_1 x + b_1), u = f_2(W_2 f(x) + b_2), g(u) = f_3(w_3 u + b_3), z = f_4(W_4 g(u) + b_4),

(17.21)

where f1 (.) and f3 (.) are m-dimensional functions. Also, W1 and W4 are m × p and p × m weight matrices, respectively. The objective of NN nonlinear PCA is to find the scalar function u(t) = F (x(t)) that minimises J . Note that if F (.) is linear, i.e.


F (x) = aT x, then u(t) corresponds to the conventional PC time series. Monahan (2000, 2001) and Hsieh (2001a,b) have applied this NN model to find nonlinear PCs from atmospheric data. They chose f1 (.) and f3 (.) to be hyperbolic tangent, whereas f2 (.) and f4 (.) correspond to the identity function (in one- and p-dimensional spaces, respectively). During the minimisation procedure, u was normalised by choosing

< u² > = 1, and the costfunction was taken to be < ‖x − z‖² > + (< u² > − 1)². Hsieh (2001a,b) penalised the costfunction by the norm of the matrix W_1 as

J = < ‖x − z‖² > + (< u² > − 1)² + λ Σ_{ij} (w_{ij}^{(1)})²,    (17.22)

in order to stabilise the algorithm through smoothing of J (see Appendix A). An account of the problems encountered in the application of NNs to meteorological data is given in Hsieh and Tang (1998). Hsieh and Tang (1998), and also Hsieh (2001a,b) applied the NN nonlinear PC analysis to the Lorenz model and used it to analyse and forecast tropical Pacific SST. In particular, Monahan (2000) found that the leading two nonlinear PCs of the Lorenz model capture about 99.5% of the total variance compared to 95% from conventional EOFs. Monahan et al. (2000) applied NN to compute the nonlinear PCs of the northern hemisphere SLP from a long integration of the Canadian climate model. They identified, in particular, a bimodal behaviour from the control simulation (Fig. 17.11); the corresponding SLP patterns are shown in Fig. 17.12. This bimodal behaviour disappeared in the climate change experiment (Fig. 17.13).
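A compact way to build the five-layer network of Fig. 17.10 is as a bottleneck autoencoder. The sketch below is a minimal illustration, assuming TensorFlow/Keras is available; the layer widths, placeholder data and training settings are assumed, and the normalisation and penalty terms of Eq. (17.22) are omitted. It uses tanh mapping layers, a one-dimensional linear bottleneck u and a linear reconstruction layer, trained to minimise <‖x − z‖²>:

```python
import numpy as np
from tensorflow.keras import layers, Model

p, m = 10, 4                                           # input dimension and hidden width (assumed)
x_in = layers.Input(shape=(p,))
h_enc = layers.Dense(m, activation="tanh")(x_in)       # f1: tanh hidden layer
u = layers.Dense(1, activation="linear")(h_enc)        # f2: linear bottleneck -> nonlinear PC u
h_dec = layers.Dense(m, activation="tanh")(u)          # f3: tanh hidden layer
x_out = layers.Dense(p, activation="linear")(h_dec)    # f4: linear reconstruction z

autoencoder = Model(x_in, x_out)
autoencoder.compile(optimizer="adam", loss="mse")      # minimises <||x - z||^2>

# X is an (n, p) matrix of standardised PCs or grid-point anomalies (placeholder data here)
X = np.random.default_rng(0).normal(size=(500, p)).astype("float32")
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)
u_series = Model(x_in, u).predict(X)                   # nonlinear PC time series u(t)
print(u_series.shape)
```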

Application to Weather Forecasting Weather and climate prediction is another topic that attracted the interest of climate scientists. Neural networks can in principle approximate any nonlinear function (Nielsen 2015; Goodfellow et al. 2016) and can be used to approximate the nonlinear relationships involved in weather forecasting. An example was presented by Scher (2018) to approximate a simple GCM using a convolutional neural network. This example was used as a proof of concept to show that it is possible to consider NN to learn the time evolution of atmospheric fields and hence provide a potential for weather prediction. Convolutional NN was also used to study precipitation predictability on regional scales and discharge extremes by Knighton et al. (2019). Another example was presented by Subashini et al. (2019) to forecast weather variables using data collected from the National Climatic Data Centre. They used a recurrent NN, based on a long short-term memory (LSTM) algorithm, to forecast wind, temperature and cloud cover. Weyn et al. (2019) developed an elementary deep learning NN to forecast a few meteorological fields. Their model was based on a convolutional NN architecture and was used to forecast 500-hPa geopotential height for up to 14 days lead time. The model was found to outperform persistence, climatology and barotropic


Fig. 17.11 Nonlinear PCs from the control simulation of the Canadian climate model showing the temporal variability along the NN PC1 with its PDF (top) and the nonlinear PC approximation of the data projected onto the leading two PCs (bold curve, bottom) along with the PDFs associated with the two branches. Adapted from Monahan et al.(2000)

Fig. 17.12 Composite SLP anomaly maps associated with the three representative points on the nonlinear PC shown in Fig. 17.11. Adapted from Monahan et al.(2000)

vorticity models, in terms of root mean square errors (RMSEs) at forecast lead time of 3 days. The model does not, however, beat an operational weather forecasting system and climate forecasting system (CFS), as expected, as the latter contains full physics. Weyn et al. (2019) found, however, that the model does a good job of forecasting realistic states at a lead time of 14 days and captures reasonably well the


Fig. 17.13 As in Fig. 17.11, but for the climate change simulation. Adapted from Monahan et al.(2000)

500-hPa climatology and annual variability. Figure 17.14 shows the RMSE of 500hPa heights for up to 3 days lead time of different convolutional NN architectures compared to the other models. An example of 24-hr 500-hPa forecast is shown in Fig. 17.15. The main features of the field are very well captured. Another important topic in weather prediction is forecasting uncertainty. Forecast uncertainty is an integral component of weather (and climate) prediction, which is used by the end users for planning and design. In weather forecasting centres, forecast uncertainty is usually obtained using a computationally expensive ensemble of numerical weather predictions. A number of authors have proposed machine learning as an alternative to ensemble methods. An example where this is important is tropical cyclone (Richman and Leslie 2010, Richman et al. 2017) and Typhoon (Haghroosta 2019) forecast. For example, Richman and Leslie (2012) used machine learning approaches, based on support vector regression (a subclass of support vector machine), to provide seasonal prediction of tropical cyclone frequency and intensity over Australia. We remind that the support vector regression is a special architecture of neural net with two layers, an input and an output layer, and where each input observation is mapped into a high-dimensional feature space. As mentioned above, the architecture belongs to the class of radial basis function


Fig. 17.14 RMSE forecast error of 500-hPa height over the test period 2007–2009, obtained from different models: persistence, climatology, barotropic vorticity, the operational climate forecast system (CFS) and the different convolutional NN architectures. Adapted from Weyn et al. (2019)

Fig. 17.15 An example of a 24-hr 500-hPa height forecast at 0 UTC 13 April 2007, based on the NN architectures (bottom) compared to the barotropic (c) and the CFS (d). Coloured shading shows the difference between forecasts and the verification (b) in dekameter. (a) initial state, (e-h) forecasts from LSTM neural network. Adapted from Weyn et al. (2019)

networks (Haykin 2009, Chap. 5), in which the mapping is based on nonlinear radial basis functions (Appendix A). The authors obtained high values of R 2 of the order 0.56 compared to 0.18 obtained with conventional multiple linear regression.
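The idea of learning the one-step time evolution of a gridded field with a convolutional network can be sketched in a few lines. The following is a minimal illustration only, not the architecture of Weyn et al. (2019) or Scher (2018); the file names, array shapes and hyperparameters are hypothetical, and a Keras-style API (TensorFlow) is assumed.

import numpy as np
import tensorflow as tf

# Hypothetical training pairs: gridded 500-hPa height at time t and at t + 24 h,
# stored as arrays of shape (n_samples, nlat, nlon, 1). File names are placeholders.
x_train = np.load("z500_t.npy")
y_train = np.load("z500_t_plus_24h.npy")

nlat, nlon = x_train.shape[1:3]
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(nlat, nlon, 1)),
    tf.keras.layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
    tf.keras.layers.Conv2D(32, (5, 5), padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, (5, 5), padding="same"),   # predicted field at t + 24 h
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_train, y_train, epochs=10, batch_size=32)

# iterating the trained one-step network gives forecasts at longer lead times
state = x_train[:1]
for step in range(14):          # 14 successive 24-h steps
    state = model.predict(state)

Iterating a one-step network in this way is the same strategy used to reach the longer lead times discussed above.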


Fig. 17.16 Schematic illustration of the convolutional NN used by Scher and Messori (2019) to predict weather forecast uncertainty. The network is fed with gridded atmospheric fields and generates a scalar representing the uncertainty forecast (Source: Modified from Scher and Messori (2019))

Richman et al. (2017) used the same machine learning architecture to reduce tropical cyclone prediction error in the North Atlantic region. A review of the application of machine learning to tropical cyclone forecasting can be found in Chen et al. (2020). Scher and Messori (2019) proposed machine learning to predict weather forecast uncertainty. They considered a convolutional NN (Krizhevsky et al. 2012; LeCun et al. 2015) trained on past weather forecasts. As discussed above, a convolutional NN is not fully connected; it is characterised by local (i.e. not full) connections and weight sharing (i.e. sharing similar weights), and involves a convolution operation, and hence is faster than fully connected nets. In this network, training is done with the forecast errors and the ensemble spread of forecasts. An uncertainty is then predicted, given an initial forecast field (Scher 2020). Figure 17.16 shows a schematic illustration of the approach. They suggest that while the obtained skill is lower than that of ensemble methods, the network-based method is computationally very efficient and hence offers a promising avenue to explore.

17.5.2 SOM Application

SOM has been applied since the early 1990s, and is still being applied, in atmospheric science and oceanography to reduce the dimensionality of the system and to identify patterns and clusters (e.g. Hewitson and Crane 1994; Ambroise et al. 2000; Liu et al. 2006; Liu and Weisberg 2005; Cassano et al. 2015; Huva et al. 2015; Gibson et al. 2017; Meza-Padilla 2019). These applications span a wide range of topics including synoptic climatology, cloud classification, weather/climate extremes, downscaling and climate change. SOM has also been suggested for time series prediction (Vesanto 1997). SOM is particularly convenient and quite useful in synoptic weather categorisation (Sheridan and Lee 2010; Hewitson and Crane 2002). The obtained weather types are then used to study the relationship between large scale teleconnections and local surface climate variables such as surface temperature or precipitation. Surface weather maps and mid-tropospheric fields have been used by a number of authors to study changes in atmospheric synoptic circulations and their relation to precipitation (e.g. Hewitson and Crane 2002). The identification of synoptic patterns from reanalyses as well as climate models was also performed via SOM by Schuenemann et al. (2009) and Schuenemann and Cassano (2010). Mass and moisture fields were also used to study the North American monsoon (Cavazos et al. 2002), precipitation downscaling (Ohba et al. 2016), Antarctic climate (Reusch et al. 2005) and local Mediterranean climate (Khedairia and Khadir 2008); see e.g. Huth et al. (2008) for a review of SOM application to synoptic analysis.

Clouds are known to have complex features and constitute a convenient test-bed for SOM application and feature extraction (Ambroise et al. 2000; McDonald et al. 2016). For example, McDonald et al. (2016) show that SOM analysis enables the identification of a wide range of cloud clusters representative of low-level cloud states, which are related to geographical position. They also suggest that SOM enables an objective identification of the different cloud regimes. SOM has also been applied to study weather and climate extremes and climate change (Gibson et al. 2016; Horton et al. 2015; Gibson et al. 2017; Cassano et al. 2016). These studies show that SOM can reveal correspondence between changes in the frequency of geopotential height patterns and temperature and rainfall extremes (Horton et al. 2015; Cassano et al. 2015, 2016). Gibson et al. (2017) found that synoptic circulation patterns are well represented during heat waves in Australia but also highlighted the importance of critically assessing the SOM features.

SOM has also been used in oceanography to explore and analyse SST and sea surface height (Leloup et al. 2007; Telszewski et al. 2009; Iskandar 2009), ocean circulation (e.g. Meza-Padilla et al. 2019) and ocean current forecasting (Vilibić et al. 2016); see also the textbook of Thomson and Emery (2014, chapter 4). An interesting feature revealed by SOM in ocean circulation in the Gulf of Mexico (Meza-Padilla 2016) is the existence of areas of loop current eddies dominating the circulation compared to local regimes at the upper slope. Vilibić et al. (2016) found that a SOM-based forecasting system of ocean surface currents was slightly better than the operational ROMS-derived forecasting system, particularly during periods of strong wind conditions. Altogether, this shows that SOM, and machine learning in general, has potential for improving ocean surface current forecasts.

Figure 17.17 illustrates the application of the SOM algorithm to weather and climate fields. The two-dimensional (gridded) field data are transformed into an n × d data matrix, with n and d being the sample size and the number of grid points (or the number of variables), respectively. Each observation x_t (t = 1, ..., n) is then used to update the weights of the SOM following Eqs. (17.15)–(17.18). The obtained weight vectors of the SOM lattice are then transformed back to physical space to yield the SOM patterns (Fig. 17.17).

Fig. 17.17 Schematic illustration of the different steps used by SOM in meteorological application (Source: Modified from Liu et al. (2006))
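The update step sketched in Fig. 17.17 can be written compactly. The following is a minimal sketch of online SOM training for an n × d data matrix; the grid size, learning rate and neighbourhood decay schedules are illustrative assumptions and do not reproduce the exact formulation of Eqs. (17.15)–(17.18).

import numpy as np

def train_som(X, nx=4, ny=5, n_epochs=10, eta0=0.5, sigma0=2.0):
    """Basic online SOM training for an (n, d) data matrix X."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    # grid coordinates of the nx*ny neurons and random initial prototype vectors
    grid = np.array([(i, j) for i in range(nx) for j in range(ny)], dtype=float)
    W = X[rng.choice(n, nx * ny, replace=False)].astype(float)
    t, t_max = 0, n_epochs * n
    for epoch in range(n_epochs):
        for idx in rng.permutation(n):
            x = X[idx]
            # learning rate and neighbourhood radius decay with time
            eta = eta0 * np.exp(-t / t_max)
            sigma = sigma0 * np.exp(-t / t_max)
            # best-matching unit (BMU): prototype closest to the observation
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))
            # Gaussian neighbourhood on the SOM grid around the BMU
            dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-dist2 / (2.0 * sigma ** 2))
            # move the prototypes towards the observation
            W += eta * h[:, None] * (x - W)
            t += 1
    return W, grid

The rows of W, reshaped onto the grid and mapped back to physical space, correspond to the SOM patterns discussed in the text.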

To illustrate how SOM organises patterns, the large scale flow based on SLP is a convenient example to discuss. Johnson et al. (2008), for example, examined the SOM continuum of SLP over the winter (Dec–Mar) NH using NCEP-NCAR reanalysis. Figure 17.18 shows an example of NH SLP 4 × 5 SOM maps obtained from Johnson et al. (2008), based on daily winter NCEP-NCAR SLP reanalysis over the period 1958–2005. By construction (a small number of SOM patterns), the figure shows large scale and low-frequency patterns. One of the main features of Fig. 17.18 is the emergence of familiar teleconnection patterns, e.g. −NAO (bottom left) and +NAO (bottom right). The occurrence frequency of those patterns is used as a measure of the climate change signal reflected by the NAO shift. The SOM analysis also shows that interdecadal SLP variability can be understood in terms of changes in the frequency of occurrence of the teleconnection patterns. The SOM analysis of Johnson et al. (2008) reveals a change from a westward-displaced −NAO-like pattern to an eastward-displaced +NAO-like pattern. More examples and references of SOM application to synoptic climatology and large scale phenomena can be found in Hannachi et al. (2017).

Fig. 17.18 Illustration of 4 × 5 SOM maps of daily winter (Dec–Mar) SLP field from NCEP-NCAR reanalysis for the period 1958–2005. Positive and negative values are shown by continuous and dashed lines, respectively. Percentages of occurrence of the patterns are shown in the bottom right corner for the whole period and in the top right corners for the periods 1958–1977 (top), 1978–1997 (middle) and 1998–2005 (bottom). Contour interval: 2 hPa. Adapted from Johnson et al. (2008). ©American Meteorological Society. Used with permission

Due to its local character and the proximity neighbourhood, SOM seems to offer some advantages compared to classical methods such as PCA and k-means (Reusch et al. 2005; Astel et al. 2007; Lin and Chen 2006; Solidoro et al. 2007). Reusch et al. (2005) compared the performance of SOM and PCA using synthetic climatological data with and without noise contamination. They conclude that SOM is more robust than PCA. For instance, SOM is able to isolate the predefined patterns with the correct explained variance, whereas PCA fails to identify the patterns due to mixing. This conclusion is shared by other researchers (Liu and Weisberg 2007; Annas et al. 2007; Astel et al. 2007). Liu and Weisberg (2007) compared SOM and EOFs in capturing ocean current patterns using velocity fields from a moored ADCP array. They found that SOM was readily more accurate in revealing, for example, the asymmetric features (between upwelling and downwelling current patterns) in current strength, jet location and the veering of velocity with depth.

As a last example, we discuss the application of SOM to rainfall events. Daily rainfall in any region of the globe can be analysed in terms of events with a number of features. Derouiche et al. (2020) transformed winter daily rainfall series in northern Tunisia into six features (or variables), on a seasonal basis, namely number of events, number of rainy days, seasonal total rainfall, average accumulation per event, average event duration and average accumulation per rainy day. A rainfall event is defined as consecutive rainy days separated by at least two non-rainy days. These features are computed from a 50-year rainfall series observed over the period 1960–2009, using a rain gauge network of 70 stations in northern Tunisia. SOM was applied to this feature space, with 3500 space-time observations, using a two-dimensional (2D) hexagonal grid for the SOM map, with 320 neurons.

One of the main outstanding features of SOM is its prominent visualisation property, which enables a sensible data survey. This is made possible because SOM transforms, using a topology-preserving projection, the data from its original (usually high-dimensional) state space onto a low-dimensional (usually two-dimensional) space, i.e. the SOM map. This SOM map, represented as an ordered grid, contains prototype vectors representing the data (e.g. Vesanto and Alhoniemi 2000). This map can then be used to construct, for example, clusters (Fig. 17.19). This algorithm of clustering the SOM rather than the original data is referred to as the two-level approach (e.g. Vesanto and Alhoniemi 2000).

Fig. 17.19 Schematic representation of the two-level approach of SOM clustering

SOM therefore provides a sensible tool to classify, for example, rainfall regimes in a given location. Figure 17.21 (top left) shows the obtained SOM map of rainfall events in northern Tunisia. Note that each neuron, or prototype vector (individual hexagon), contains a number of observations in its neighbourhood. These prototype vectors are then agglomerated to obtain clusters. One particularly interesting method to obtain the number of clusters is to use the data image method (Minnotte and West 1999). The data image is a powerful visualisation tool showing the dissimilarity matrix as an image where each pixel shows the distance between two observations (Martinez et al. 2010). Several variants of this image can be used. Precisely, rows and columns of the dissimilarity matrix can be reordered, for example, based on some clustering algorithm, such as hierarchical clustering, allowing clusters to emerge as blocks along the main diagonal. An example of the application of the data image to the stratosphere can be found in Hannachi et al. (2011).

In the hierarchical clustering algorithm a collection of fully nested sets is obtained. The smallest sets are the clusters given by the individual elements of the dataset, whereas the largest set is the whole dataset. Starting, for example, from the individual data elements as clusters, the algorithm proceeds by successively merging the closest clusters, based on a chosen similarity measure, until we are left with only one single cluster. This can be achieved using a linkage algorithm, such as single or complete linkage (e.g. Gordon 1999; Hastie et al. 2009). The result of the hierarchical clustering is presented in the form of a tree-like graph or dendrogram. This dendrogram is composed of branches linking the whole cluster to the individual elements. Cutting through the dendrogram at a specific level yields a specific number of clusters. Figure 17.20 shows an illustrative example of two clusters (Fig. 17.20, top left) and the associated dendrogram (Fig. 17.20, top right). The interpoint distance matrix between data points is shown as a data matrix in Fig. 17.20 (bottom left). Dark and light contrasts represent, respectively, small and large distances. The dark diagonal line represents the zero value. The figure shows scattered dark- and light-coloured areas. When the lines and columns of the interpoint distance matrix are reordered following the two clusters obtained from the dendrogram, see the vertical dashed line in Fig. 17.20 (top right), the data matrix (Fig. 17.20, bottom right) now shows two dark blocks along the main diagonal, with light contrast in the background. A code sketch of this construction is given after the figure caption below.

Fig. 17.20 Two-dimensional scatter plot of two Gaussian clusters, with sample size of 50 each (top left), dendrogram (top right), data matrix showing the interpoint distance between any two data points (bottom left) and data matrix when the data are reordered so that the top left and bottom right blocks represent, respectively, the first and the second clusters. Adapted from Hannachi et al. (2011). ©American Meteorological Society. Used with permission
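The dendrogram and data image construction of Fig. 17.20 can be reproduced with standard tools. The sketch below, assuming SciPy is available, builds two synthetic Gaussian clusters, computes the interpoint distance matrix, reorders it following the dendrogram leaves (so that clusters appear as dark diagonal blocks) and cuts the tree at two clusters.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
# two synthetic Gaussian clusters, 50 points each (as in Fig. 17.20)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

D = squareform(pdist(X))           # interpoint distance matrix (the "data image")
Z = linkage(X, method="complete")  # hierarchical clustering, complete linkage

# reorder rows/columns following the dendrogram leaves so that clusters
# emerge as blocks along the main diagonal
order = dendrogram(Z, no_plot=True)["leaves"]
D_ordered = D[np.ix_(order, order)]

# cutting the tree at two clusters
labels = fcluster(Z, t=2, criterion="maxclust")

The same steps applied to the SOM prototype vectors, rather than the raw observations, give the two-level approach described next.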

The application of the two-level approach, i.e. SOM plus clustering, is shown in Fig. 17.21. The bottom left panel of Fig. 17.21 shows the data image obtained from the interpoint distances between the SOM prototypes (SOM map). As in the example above, dark and light contrasts reflect, respectively, a state of proximity and remoteness of the SOM prototype vectors. Those prototypes that are close to each other can be agglomerated by the clustering algorithm. Figure 17.21 shows the image when the SOM (prototype) data are reordered based on two (Fig. 17.21, bottom centre) and three (Fig. 17.21, bottom right) clusters obtained from a dendrogram or clustering tree of hierarchical clustering. The contrast between the diagonal blocks and the background is stronger in the three-cluster case (Fig. 17.21, bottom right) compared to the two-cluster case (Fig. 17.21, bottom centre), suggesting three clusters, which are shown in the SOM map (Fig. 17.21, top right). These clusters are found to represent three rainfall regimes in the studied area, namely wet, dry and semi-dry.

Fig. 17.21 SOM map with hexagonal grid of the rainfall events (top left), SOM map with three clusters on the SOM map (top right), data image of the SOM prototype vectors (bottom left), data image with two (bottom centre) and three (bottom right) clusters. The numbers in the SOM map represent the numbers of observations within a neighbourhood of the neurons (or prototype vectors). Courtesy of Sabrine Derouiche

17.5.3 Random Forest Application

Random forest (RF) is quite new to the field of weather/climate. It has been applied recently to weather prediction (Karthick et al. 2018), temperature downscaling (Pang et al. 2017) and a few other related fields such as agriculture, e.g. crop yield (Jeong et al. 2016), greenhouse soil temperature (Tsai et al. 2020) and forest fire (Su et al. 2018). In weather prediction, for example, Karthick et al. (2018) compared a few techniques and found that RF was the best, with around 87% accuracy, with only one disadvantage, namely overfitting. Temperature downscaling using RF was performed by Pang et al. (2017) in the Pearl river basin in southern China. The method was compared to three other methods, namely artificial NN, multilinear regression and support vector machines. The authors found, based on five different criteria, that RF outperforms all the other methods. For example, RF could identify the best predictor combination compared to all the other methods. In crop yield prediction, Jeong et al. (2016) used RF to predict three types of crops in response to climate and biophysical variables and compared it to multiple linear regression as a benchmark. RF was found to outperform the multilinear regression. For example, the root mean square error was in the range of 6–14% compared to 14–49% for multiple linear regression. Though this suggests that RF is an effective and versatile tool for crop yield prediction, the authors also caution that it may result in a loss of accuracy when applied to extremes or responses beyond the boundaries of the training set, a weakness that characterises machine learning approaches in general.
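As an illustration of the kind of comparison reported in these studies, the sketch below fits a random forest and a multiple linear regression to synthetic data and compares their RMSEs. The data and settings are invented for illustration and do not reproduce any of the cited experiments; scikit-learn is assumed to be available.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# synthetic example: nonlinear response to a set of predictors
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 8))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
lin = LinearRegression().fit(X_tr, y_tr)

print("RF RMSE:    ", mean_squared_error(y_te, rf.predict(X_te)) ** 0.5)
print("Linear RMSE:", mean_squared_error(y_te, lin.predict(X_te)) ** 0.5)

On this kind of nonlinear example the random forest typically yields a markedly lower RMSE than the linear benchmark, mirroring the behaviour reported above; extrapolation beyond the training range remains its main weakness.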

Appendix A

Smoothing Techniques

A.1 Smoothing Splines

Splines provide a nonlinear and smooth fitting to a unidimensional or multivariate scatter of points. Spline smoothing can be regarded as a nonlinear and nonparametric regression model. We assume that at each point $x_i$ (independent variable) we observe $y_i$ (dependent variable), $i = 1, \ldots n$, and we are interested in seeking a nonlinear relationship linking y to x of the form:

$$ y = f(x) + \varepsilon, \qquad (A.1) $$

for which the objective is to estimate f(.). It is of course easier if we knew the general form of the function f(.). In practice, however, this information is very seldom available. Spline smoothing considers f(.) to be a polynomial. One of the most familiar polynomial smoothers is the cubic spline, which corresponds to the case of a piece-wise cubic function, i.e.

$$ f(x) = f_i(x) = a_i + b_i x + c_i x^2 + d_i x^3 \quad \mbox{for } x_i \le x \le x_{i+1}, \qquad (A.2) $$

for $i = 1, \ldots n-1$. In addition, to get smoothness the first two derivatives are assumed to be continuous, i.e.

$$ \frac{d^\alpha}{dx^\alpha} f_i(x_i) = \frac{d^\alpha}{dx^\alpha} f_{i-1}(x_i), \qquad (A.3) $$

for $\alpha = 0, 1$, and 2. The constraints given by Eq. (A.2) and Eq. (A.3) lead to a smooth function. However, the problem is not closed, and we need extra conditions. The problem is normally simplified by minimising the quantity $\sum_{i=1}^{n} (y_i - f(x_i))^2$ with a smoothness condition that takes the form of an integral of the squared second derivative. The functional to be minimised is

$$ F = \sum_{k=1}^{n} \left[ y_k - f(x_k) \right]^2 + \lambda \int \left( \frac{d^2 f(x)}{dx^2} \right)^2 dx. \qquad (A.4) $$

The first part of Eq. (A.4) is a measure of the goodness of fit, whereas the second part provides a measure of the overall smoothness. In the theory of elastic rods the latter term is proportional to the energy of the rod when it is bent under constraints. Note that the functional F(.), Eq. (A.4), can be extended to two dimensions, and the final surface will behave like a smooth plate. The function F in Eq. (A.4) is known as the penalised residual sum of squares. Also, in Eq. (A.4) λ represents the smoothing parameter and controls the relative weight given to the roughness penalty and the goodness of fit. It controls therefore the balance between goodness of fit and smoothness. For example, the larger the parameter λ, the smoother the function f.

Remark Note that if ε = 0 in Eq. (A.1) the spline simply interpolates the data. This means that the spline solves Eq. (A.4) with λ → 0, which is equivalent to

$$ \min \int \left( f''(x) \right)^2 dx \quad \mbox{subject to } y_k = f(x_k), \; k = 1, \ldots n. \qquad (A.5) $$

Equation (A.4) can be extended to higher dimensions by replacing the second derivative roughness measure $\int (f'')^2$ by high-order derivatives with respect to all coordinates to ensure consistency with all directions. In this case the quantity to be minimised takes the form

$$ \sum_{j=1}^{n} \left[ y_j - f(\mathbf{x}_j) \right]^2 + \lambda \sum_{k_1 + \ldots + k_m = k} \frac{k!}{k_1! \ldots k_m!} \int_{R^m} \left( \frac{\partial^k}{\partial x_1^{k_1} \ldots \partial x_m^{k_m}} f(\mathbf{x}) \right)^2 d\mathbf{x}, \qquad (A.6) $$

where k is a fixed positive integer. The obtained solution is known as a thin-plate spline. The solution is generally obtained as a linear combination of the $\binom{m+k-1}{m}$ monomials of degree less than k and a set of n radial basis functions (Wahba 1990).

The minimisation of Eq. (A.4), when λ is known, yields the cubic spline. The determination of λ, however, is more important since it controls the smoothness of the fitting. One way to obtain an appropriate estimate of λ is to use cross-validation. The idea behind cross-validation is to have estimates that minimise the effect of omitted observations. If $f_{\lambda,k}(x)$ is an estimate of the spline with parameter λ when the kth observation is omitted, the mis-fit at the point $x_k$ is given by $\left( y_k - f_{\lambda,k}(x_k) \right)^2$. The best value of λ is the one that minimises the cross-validation criterion

$$ \sum_{k=1}^{n} w_k \left( y_k - f_{\lambda,k}(x_k) \right)^2, $$

where $w_k$, $k = 1, \ldots n$, is a set of weights, see Sect. A.1.2.
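As a practical illustration of cubic spline smoothing, the sketch below uses SciPy's UnivariateSpline on synthetic data. Note that its smoothing factor s is parameterised as a bound on the residual sum of squares rather than as the penalty weight λ of Eq. (A.4), but it plays the same role of trading goodness of fit against smoothness.

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + 0.3 * rng.normal(size=x.size)

# large s -> smoother curve; s = 0 interpolates the data exactly
smooth = UnivariateSpline(x, y, k=3, s=10.0)
interp = UnivariateSpline(x, y, k=3, s=0.0)

x_new = np.linspace(0, 10, 500)
y_smooth = smooth(x_new)      # smoothed curve evaluated on a fine grid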


A.1.1 More on Smoothing Splines

Originally, splines were used as a way to smoothly interpolate the set of points $(x_k, y_k)$, $k = 1, \ldots n$, where $x_1 < x_2 < \ldots < x_n$ belong to some interval [a, b], by means of piece-wise polynomials. The name spline was coined by Schoenberg (1964), see also Wahba (2000). The function is obtained as a solution to a variational problem such as

$$ \min \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \mu \int_a^b \left( f^{(m)}(x) \right)^2 dx \qquad (A.7) $$

for some μ > 0, over the set of functions with 2m − 2 continuous derivatives over [a, b], $C^{2m-2}([a,b])$. (The space over which Eq. (A.7) is defined is normally referred to as the Sobolev space of functions defined over [a, b] with m − 1 absolutely continuous derivatives and satisfying $\int_a^b (f^{(m)}(x))^2 dx < \infty$.) The solution is a piece-wise polynomial of degree 2m − 1 inside each interval $[x_i, x_{i+1}]$, $i = 1, \ldots n-1$, and of degree m − 1 inside the outer intervals $[a, x_1]$ and $[x_n, b]$.

In general, the smoothing spline can be formulated as a regularisation problem (Tikhonov 1963; Morozov 1984). Given a set of points $\mathbf{x}_1, \ldots \mathbf{x}_n$ in $R^d$ and n numbers $y_1, \ldots y_n$, we seek a smooth function f(x) from $R^d$ into R that best fits the data $(\mathbf{x}_1, y_1), \ldots (\mathbf{x}_n, y_n)$. This problem is normally solved by seeking to minimise the functional:

$$ F(f) = \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2 + \mu \| L(f) \|^2. \qquad (A.8) $$

Note that the first part measures the mis-fit and the second part is a penalty measuring the smoothness of the function f(.). The operator L is in general a differential operator, and μ is the smoothing parameter, which controls the trade-off between both attributes. By computing $F(f + \delta f) - F(f)$, where δf is a small "departure" from f, the stationary solutions of Eq. (A.8) can be shown to satisfy

$$ \mu \, L^{\ast} L (f)(\mathbf{x}) = \sum_{i=1}^{n} \left[ y_i - f(\mathbf{x}) \right] \delta(\mathbf{x} - \mathbf{x}_i), \qquad (A.9) $$

where $L^{\ast}$ is the adjoint (Appendix F) of L.


Exercise Derive Eq. (A.9).

Hint Using the $L_2$ norm, where $\| L(f) \|^2 = \langle L(f), L(f) \rangle$, we obtain, after discarding the second-order terms in δf, $F(f+\delta f) - F(f) = -2 \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i)) \, \delta f(\mathbf{x}_i) + 2\mu \langle L^{\ast}L(f), \delta f \rangle$. The solution is then obtained as the stationary points of the differential $F'(f)$:

$$ \langle F'(f), v \rangle = -2 \sum_{i=1}^{n} \left[ y_i - f(\mathbf{x}_i) \right] v(\mathbf{x}_i) + 2\mu \langle L^{\ast}L(f), v \rangle = -2 \Big\langle \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i)) \, \delta(\mathbf{x} - \mathbf{x}_i), v \Big\rangle + 2\mu \langle L^{\ast}L(f), v \rangle. $$

The solution to Eq. (A.9) can be expressed in the form of an integral as

$$ f(\mathbf{x}) = \frac{1}{\mu} \int G(\mathbf{x}, \mathbf{y}) \sum_{i=1}^{n} (y_i - f(\mathbf{y})) \, \delta(\mathbf{y} - \mathbf{x}_i) \, d\mathbf{y}, $$

where G(x, y) is the Green's function of the operator $L^{\ast}L$ (see Sect. A.3 below); hence,

$$ f(\mathbf{x}) = \frac{1}{\mu} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i)) \, G(\mathbf{x}, \mathbf{x}_i) = \sum_{i=1}^{n} \mu_i G(\mathbf{x}, \mathbf{x}_i). \qquad (A.10) $$

The coefficients $\mu_j$, $j = 1, \ldots n$, are computed by applying this at $\mathbf{x} = \mathbf{x}_j$, i.e. $y_j = \sum_{i=1}^{n} \mu_i G_{ji}$, or

$$ \mathbf{G} \boldsymbol{\mu} = \mathbf{y}, \qquad (A.11) $$

where $\mathbf{G} = (G_{ij}) = \left( G(\mathbf{x}_i, \mathbf{x}_j) \right)$, $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_n)^T$ and $\mathbf{y} = (y_1, \ldots, y_n)^T$. Note that Eq. (A.11) can be extended to include a homogeneous solution p(x) of the partial differential equation (PDE) $L^{\ast}L$, to yield

$$ f(\mathbf{x}) = \sum_{i=1}^{n} \mu_i G(\mathbf{x}, \mathbf{x}_i) + p(\mathbf{x}) $$

with $L^{\ast}L(p) = 0$. Examples of functions p(.) are given below.

The popular thin-plate spline corresponds to the case where the differential operator is an extension of that given in Eq. (A.7) to the multidimensional space, i.e.

$$ \| L(f) \|^2 = \sum_{k_1 + \ldots + k_d = m} \frac{m!}{k_1! \ldots k_d!} \int_{R^d} \left( \frac{\partial^m}{\partial x_1^{k_1} \ldots \partial x_d^{k_d}} f(\mathbf{x}) \right)^2 d\mathbf{x}, \qquad (A.12) $$

and where the functional or energy to be minimised takes the form

$$ F(f) = \sum_{k=1}^{n} (y_k - f(\mathbf{x}_k))^2 + \lambda \| L f \|^2 \qquad (A.13) $$

for a fixed positive integer m. The function f(x) used in Eq. (A.12) or Eq. (A.13) is of class $C^m$, i.e. with (m − 1) continuous derivatives and an mth derivative satisfying $\| L f^{(m)} \| < \infty$. The corresponding $L^{\ast}L$ operator is invariant under translation and rotation; hence, the corresponding Green's function G(x, y) is a radial function and satisfies

$$ (-1)^m \Delta^m G(\mathbf{x}) = \delta(\mathbf{x}), \qquad (A.14) $$

whose solution, see e.g. Gelfand and Vilenkin (1964), is a thin-plate spline, i.e.

$$ G(\mathbf{x}) = \begin{cases} \| \mathbf{x} \|^{2m-d} \log \| \mathbf{x} \| & \mbox{for } 2m - d > 0 \mbox{ and } d \mbox{ even} \\ \| \mathbf{x} \|^{2m-d} & \mbox{when } d \mbox{ is odd.} \end{cases} \qquad (A.15) $$

The general thin-plate spline is then given by

$$ f(\mathbf{x}) = \sum_{j=1}^{n} \mu_j G\left( \| \mathbf{x} - \mathbf{x}_j \| \right) + p(\mathbf{x}), \qquad (A.16) $$

where p(x) is a polynomial of degree m − 1 that can be expressed as a linear combination of $l = \binom{d+m-1}{d}$ monomials in $R^d$ of degree less than m, i.e. $p(\mathbf{x}) = \sum_{k=1}^{l} \lambda_k p_k(\mathbf{x})$. The parameters $\mu_j$, $j = 1, \ldots n$, and $\lambda_k$, $k = 1, \ldots l$, are obtained by taking $f(\mathbf{x}_j) = y_j$ and imposing further conditions on the polynomial p(x) in order to close the problem. This is a well known radial basis function problem.

Noisy data have not been explicitly mentioned, but the formulation given in Eq. (A.7) takes account of an uncorrelated noise (see Sect. A.3) when the interpolation is not exact. If the noise is not autocorrelated, e.g. $y_i = f(\mathbf{x}_i) + \varepsilon_i$ for $i = 1, \ldots n$, with zero-mean multinormal noise with $E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^T) = \mathbf{W}$, where $\boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^T$, then the functional to be minimised takes the form of a penalised likelihood: $F(f) = (\mathbf{y} - \mathbf{f})^T \mathbf{W}^{-1} (\mathbf{y} - \mathbf{f}) + \mu \| L(f) \|^2$, where $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^T$. In the case of the thin-plate spline, where $\| L(f) \|^2$ is given by Eq. (A.12), the solution is similar to Eq. (A.16) and the parameters $\boldsymbol{\mu} = (\mu_1, \ldots, \mu_n)^T$ and $\boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_l)^T$ are the solution of a linear system of the form:

$$ \begin{pmatrix} \mathbf{G} + n\mu\mathbf{W} & \mathbf{P} \\ \mathbf{P}^T & \mathbf{O} \end{pmatrix} \begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\lambda} \end{pmatrix} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0} \end{pmatrix}, $$

where $\mathbf{G} = (G_{ij}) = \left( G(\| \mathbf{x}_i - \mathbf{x}_j \|) \right)$ for $i, j = 1, \ldots n$ and $\mathbf{P} = (P_{ik}) = (p_k(\mathbf{x}_i))$ for $i = 1, \ldots n$ and $k = 1, \ldots l$.

A.1.2 Choice of the Smoothing Parameter

So far the parameter μ was assumed to be fixed but unknown. One way to deal with the problem would be to choose an arbitrary value based on experience. A more concise way is to compute it from the data using an elegant procedure known as cross-validation, see also Chap. 15. Suppose that one would like to solve the problem given in Eq. (A.7) and would like to have an optimal estimate of μ. The idea of cross-validation is to take one or more points out of the sample and find the value of μ that minimises the mis-fit. Suppose in fact that $x_k$ was taken out. Then the spline $f_\mu^{(k)}(.)$ that fits the remaining data minimises the functional:

$$ \frac{1}{n} \sum_{i=1, i \ne k}^{n} \left[ y_i - f(x_i) \right]^2 + \mu \int_a^b \left( f^{(m)}(t) \right)^2 dt. \qquad (A.17) $$

The overall optimal value of μ is the one that minimises the overall mis-fit or cross-validation criterion:

$$ cv(\mu) = \frac{1}{n} \sum_{k=1}^{n} \left( y_k - f_\mu^{(k)}(x_k) \right)^2. \qquad (A.18) $$

Let us designate by $f_\mu(x)$ the spline function fitted to the whole sample for a given μ. Let also $\mathbf{A}(\mu) = \left( a_{ij}(\mu) \right)$, $i, j = 1, \ldots n$, be the matrix relating $\mathbf{y} = (y_1, \ldots, y_n)^T$ to $\mathbf{f}_\mu = \left( f_\mu(x_1), \ldots, f_\mu(x_n) \right)^T$, i.e. satisfying $\mathbf{A}(\mu)\mathbf{y} = \mathbf{f}_\mu$. Then Craven and Wahba (1979) have shown that

$$ y_k - f_\mu^{(k)}(x_k) = \frac{y_k - f_\mu(x_k)}{1 - a_{kk}(\mu)}, $$

and therefore,

$$ cv(\mu) = \frac{1}{n} \sum_{k=1}^{n} \left( \frac{y_k - f_\mu(x_k)}{1 - a_{kk}(\mu)} \right)^2. \qquad (A.19) $$

The generalised cross-validation is obtained by substituting $\bar{a}(\mu) = \frac{1}{n} \mathrm{tr}\left( \mathbf{A}(\mu) \right)$ for $a_{kk}(\mu)$ to yield

$$ gcv(\mu) = n \left( \frac{\| \left( \mathbf{I} - \mathbf{A}(\mu) \right) \mathbf{y} \|}{\mathrm{tr}\left( \mathbf{I} - \mathbf{A}(\mu) \right)} \right)^2. \qquad (A.20) $$

Then, Eq. (A.19) or Eq. (A.20) is minimised to yield an optimal value of μ.
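The leave-one-out criterion of Eq. (A.18) can also be evaluated numerically by brute force. The sketch below, again using SciPy's UnivariateSpline on synthetic data, scans a few candidate smoothing factors and retains the one with the smallest leave-one-out score; for large samples the closed forms (A.19) and (A.20) are of course much cheaper. The candidate values are illustrative.

import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + 0.3 * rng.normal(size=x.size)

def loo_cv(s):
    """Leave-one-out cross-validation score, in the spirit of Eq. (A.18)."""
    err = 0.0
    for k in range(x.size):
        mask = np.arange(x.size) != k
        spl = UnivariateSpline(x[mask], y[mask], s=s)
        err += (y[k] - spl(x[k])) ** 2
    return err / x.size

candidates = [1, 2, 5, 10, 20, 50]
scores = [loo_cv(s) for s in candidates]
s_best = candidates[int(np.argmin(scores))]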

A.2 Radial Basis Functions

A.2.1 Exact Interpolation

Radial basis functions (RBFs) constitute one of the attractive tools to interpolate and/or smooth scattered data. RBFs were formally introduced and coined by Powell (1987) in exact multivariate interpolation, although the technique was around before that, see e.g. Hardy (1971) and Franke (1982). Given n distinct points $\mathbf{x}_i$, $i = 1, \ldots n$, in $R^d$ (generally known as nodes, knots, or centres of interpolation), and n real numbers $f_i$, $i = 1, \ldots n$, these numbers can be regarded as the values at $\mathbf{x}_i$ of a certain unknown function f(x). The problem of RBF interpolation is to find a smooth real function s(x) satisfying the interpolation conditions

$$ s(\mathbf{x}_i) = f_i, \quad i = 1, \ldots n, \qquad (A.21) $$

such that the interpolating function is of the form

$$ s(\mathbf{x}) = \sum_{k=1}^{n} \lambda_k \phi_k(\mathbf{x}) = \sum_{k=1}^{n} \lambda_k \phi\left( \| \mathbf{x} - \mathbf{x}_k \| \right). \qquad (A.22) $$

The functions $\phi_k(\mathbf{x}) = \phi(\| \mathbf{x} - \mathbf{x}_k \|)$ are known as radial basis functions. The real function φ(.) is defined over the positive numbers, and $\| . \|$ is any Euclidean norm or Mahalanobis distance. Thus the radial basis function s(x) is a simple linear combination of the shifted radial basis functions φ(.).

Examples of Radial Basis Functions
• $\phi(r) = r^k$, for a positive integer k. The cases k = 1, 2 and 3 correspond, respectively, to the linear, quadratic and cubic RBF.
• $\phi(r) = \left( r^2 + c^2 \right)^{1/2}$ for c > 0, which corresponds to the multiquadric case.
• $\phi(r) = e^{-a r^2}$ for a > 0, the Gaussian RBF.
• $\phi(r) = r^2 \log r$, which is the thin-plate spline.
• $\phi(r) = \left( 1 + r^2 \right)^{-1}$, which corresponds to the inverse quadratic.

Equations (A.21) and (A.22) lead to the following linear system:

$$ \mathbf{A} \boldsymbol{\lambda} = \mathbf{f}, \qquad (A.23) $$

where $\mathbf{A} = (a_{ij}) = \left( \phi(\| \mathbf{x}_i - \mathbf{x}_j \|) \right)$, $i, j = 1, \ldots n$, and $\mathbf{f} = (f_1, \ldots f_n)^T$. The more general RBF interpolation problem is obtained by extending Eq. (A.22) to yield

$$ s(\mathbf{x}) = p_m(\mathbf{x}) + \sum_{k=1}^{n} \lambda_k \phi\left( \| \mathbf{x} - \mathbf{x}_k \| \right), \qquad (A.24) $$

where $p_m(\mathbf{x})$ is a low-order polynomial of degree at most m in $R^d$. Apart from interpolation, RBFs constitute an attractive tool that can be used for various other purposes such as minimisation of multivariate functions, see e.g. Powell (1987) for a discussion, filtering and pattern recognition (Carr et al. 1997), and can also be used in PDEs and neural networks (Larsson and Fornberg 2003). Note also that RBFs arise naturally in other fields such as gravitation. (For example, in the N-body problem the gravitational potential at a point y takes the form $\phi(\mathbf{y}) = \sum_{i=1}^{N} \alpha_i / \| \mathbf{x}_i - \mathbf{y} \|$. Similarly, the heat equation $\partial h/\partial t - \nabla^2 h = 0$, with initial condition $h(0, \mathbf{x}) = g(\mathbf{x})$, has, for t > 0, the solution $h(t, \mathbf{x}) = (4\pi t)^{-3/2} \int e^{-\| \mathbf{x} - \mathbf{y} \|^2/(4t)} g(\mathbf{y}) \, d\mathbf{y}$, which looks like Eq. (A.22) when it is discretised.)

Because there are more parameters than constraints in the case of Eq. (A.24), further constraints are imposed, namely,

$$ \sum_{j=1}^{n} \lambda_j p(\mathbf{x}_j) = 0 \qquad (A.25) $$

for all polynomials p(x) of degree at most m. Apart from introducing more equations, the system of Eq. (A.25) can be used to measure the smoothness of the RBF (Powell 1990). It also controls the rate of growth at infinity of the non-polynomial part of s(x) in Eq. (A.24) (Beatson et al. 1999). If $(p_1, \ldots p_l)$, with $l = \binom{m+d}{d} = \frac{(m+d)!}{m! \, d!}$, is a basis of the space of algebraic polynomials of degree less than or equal to m in $R^d$, then Eq. (A.25) becomes

$$ \sum_{j=1}^{n} \lambda_j p_k(\mathbf{x}_j) = 0, \quad k = 1, \ldots l. \qquad (A.26) $$

Note also that $p_m(\mathbf{x})$ in Eq. (A.24) can be substituted for the combination $\sum_{k=1}^{l} \mu_k p_k(\mathbf{x})$, which, when combined with Eq. (A.26), yields the following system:

$$ \tilde{\mathbf{A}} \begin{pmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\mu} \end{pmatrix} = \begin{pmatrix} \mathbf{A} & \mathbf{P} \\ \mathbf{P}^T & \mathbf{O} \end{pmatrix} \begin{pmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\mu} \end{pmatrix} = \begin{pmatrix} \mathbf{f} \\ \mathbf{0} \end{pmatrix}, \qquad (A.27) $$

where $\mathbf{P} = (p_{ij}) = (p_j(\mathbf{x}_i))$, $i = 1, \ldots n$, $j = 1, \ldots l$. The next important question in RBF is related to the invertibility of the systems of Eq. (A.23) and Eq. (A.27). For many choices of φ(.), the matrix A in Eq. (A.23) is invertible. For example, for the linear and multiquadric cases, A is always invertible for every n and d, provided the points are distinct (Michelli 1986). For the quadratic case, where s(x) becomes quadratic in x, A becomes singular if the number of points n exceeds the dimension of the space of quadratic polynomials, i.e. $n > \frac{1}{2}(d+1)(d+2)$, whereas for the cubic case A can be singular if d ≥ 2 but is always nonsingular in the unidimensional case. (In this case the matrix is nonsingular for all $\phi(r) = r^{2\alpha+1}$, α a positive integer, and the interpolation function $s(x) = \sum_{i=1}^{n} \lambda_i |x - x_i|^{2\alpha+1}$ is a spline function.) Powell (1987) also gives further examples of nonsingularity, such as $\phi(r) = \left( r^2 + 1 \right)^{-\beta}$, (β > 0).

Consider now the extended interpolation in Eq. (A.24); the matrix $\tilde{\mathbf{A}}$ in Eq. (A.27) is nonsingular only if the columns of P are linearly independent. Michelli (1986) gives sufficient conditions for the invertibility of the system of Eq. (A.27) based on conditional positivity, i.e. when φ(r) is conditionally strictly positive. (A real function φ(r) defined on the set of non-negative real numbers is conditionally (strictly) positive definite of order m + 1 if for any distinct points $\mathbf{x}_1, \ldots \mathbf{x}_n$ and scalars $\lambda_1, \ldots \lambda_n$ satisfying Eq. (A.26) the quadratic form $\boldsymbol{\lambda}^T \mathbf{A} \boldsymbol{\lambda} = \sum_{ij} \lambda_i \, \phi(\| \mathbf{x}_i - \mathbf{x}_j \|) \, \lambda_j$ is non-negative (positive). The "conditionally" in the definition refers to Eq. (A.26). The set of conditionally positive definite functions of order m has been characterised by Michelli (1986). If a continuous function φ(.) defined on the set of non-negative real numbers is such that $(-1)^k \frac{d^k}{dr^k}\phi(r)$ is completely monotonic, then $\phi(r^2)$ is conditionally positive definite of order k. Note that if $(-1)^k \frac{d^k}{dr^k}\phi(r) \ge 0$ for all positive integers k, then φ(r) is said to be completely monotonic. The following important result is also given in Michelli (1986): if the continuous and positive function φ(r) defined on the set of non-negative numbers has its first derivative completely monotonic (not constant), then for any distinct points $\mathbf{x}_1, \ldots \mathbf{x}_n$, $(-1)^{n-1} \det\left( \phi(\| \mathbf{x}_i - \mathbf{x}_j \|^2) \right) > 0$.)

The previous two sufficient conditions allow for various choices of radial basis functions, such as $\left( r^2 + a^2 \right)^{-\alpha}$ for α > 0, and $\left( r^2 + a^2 \right)^{\beta}$ for 0 < β < 1. For instance, the functions $\phi_1(r) = r^{3/2}$ and $\phi_2(r) = \frac{1}{2} r \log r$ have their second derivatives completely monotonic for r > 0. The functions $\phi_1(r^2) = r^3$ and $\phi_2(r^2) = r^2 \log r$ can also be used as RBFs; the latter case corresponds to the thin-plate spline. Another case of nonsingularity was provided by Powell (1987) and corresponds to $\phi(r) = \int_0^{\infty} e^{-x r^2} \psi(x) \, dx$, where ψ(x) is non-negative with $\int_a^b \psi(t) \, dt > 0$ for some constants a and b. There is also another set of functions such as

$$ \phi(r) = \begin{cases} r^{2(m+1)-d} & \mbox{for } d \mbox{ odd and } 2(m+1) > d \\ r^{2(m+1)-d} \log r & \mbox{for } d \mbox{ even,} \end{cases} $$

where m is the largest degree of the polynomial included in s(x).

Remark The system of Eq. (A.27) can be solved using SVD. Alternatively, one can define the n × (n − l) matrix Q whose columns span the orthogonal complement of the columns of P. The constraint $\mathbf{P}^T \boldsymbol{\lambda} = \mathbf{0}$ then implies $\boldsymbol{\lambda} = \mathbf{Q}\boldsymbol{\gamma}$ for a unique γ. The first block of Eq. (A.27) yields $\mathbf{Q}^T \mathbf{A} \mathbf{Q} \boldsymbol{\gamma} = \mathbf{Q}^T \mathbf{f}$, which is invertible since Q is of full rank and A is strictly conditionally positive definite of order m + 1. The vector μ is then obtained from $\mathbf{P}\boldsymbol{\mu} = \mathbf{f} - \mathbf{A}\mathbf{Q}\boldsymbol{\gamma}$. A possible choice of Q is given by Beatson et al. (2000), namely

$$ \mathbf{Q} = - \begin{pmatrix} p_{1,l+1} & p_{1,l+2} & \ldots & p_{1,n} \\ \vdots & & & \vdots \\ p_{l,l+1} & p_{l,l+2} & \ldots & p_{l,n} \\ 1 & 0 & \ldots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \ldots & 1 \end{pmatrix}, $$

see also Beatson et al. (1999) for fast fitting and evaluation of RBFs.

Example: Thin-Plate Spline In this case we have $\phi(r) = r^2 \log r$ and $s(\mathbf{x}) = p(\mathbf{x}) + \sum_{k=1}^{n} \lambda_k \phi(\| \mathbf{x} - \mathbf{x}_k \|)$ with $p(\mathbf{x}) = \mu_0 + \mu_1 x + \mu_2 y$. The matrix P in this case is given by

$$ \mathbf{P} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{pmatrix}. $$

Note that thin-plate (or biharmonic) splines serve to model the deformation of an infinite thin plate (Bookstein 1989) and are $C^1$ functions that minimise the energy

$$ E(s) = \int_{R^2} \left[ \left( \frac{\partial^2 s}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 s}{\partial x \partial y} \right)^2 + \left( \frac{\partial^2 s}{\partial y^2} \right)^2 \right] dx \, dy. $$
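The exact interpolation problem of Eqs. (A.21)–(A.23) is straightforward to set up numerically. The following minimal sketch uses the Gaussian RBF from the list above on synthetic scattered data in two dimensions; the shape parameter is an arbitrary illustrative choice, and no polynomial part p_m(x) is included.

import numpy as np

rng = np.random.default_rng(0)
# scattered nodes in R^2 and values of an "unknown" function at those nodes
x_nodes = rng.uniform(-1, 1, size=(30, 2))
f_vals = np.sin(3 * x_nodes[:, 0]) * np.cos(2 * x_nodes[:, 1])

def phi(r, a=4.0):
    """Gaussian radial basis function (one of the choices listed above)."""
    return np.exp(-a * r ** 2)

# interpolation matrix A_ij = phi(||x_i - x_j||) and system A lambda = f, Eq. (A.23)
r_ij = np.linalg.norm(x_nodes[:, None, :] - x_nodes[None, :, :], axis=-1)
lam = np.linalg.solve(phi(r_ij), f_vals)

def s(x):
    """RBF interpolant s(x) of Eq. (A.22) evaluated at a single point x."""
    r = np.linalg.norm(x_nodes - x, axis=1)
    return phi(r) @ lam

print(s(x_nodes[0]), f_vals[0])   # the interpolant reproduces the data at the nodes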

A.2.2 RBF and Noisy Data

In the previous section the emphasis was on exact interpolation, where the fitted function goes exactly through the data $(\mathbf{x}_i, f_i)$. This corresponds to the case when the data are noise-free. If the data are contaminated with noise, then instead of the condition given by Eq. (A.21) we seek a function s(x) that minimises the functional

$$ \frac{1}{n} \sum_{i=1}^{n} \left[ s(\mathbf{x}_i) - f_i \right]^2 + \rho \| s \|^2, \qquad (A.28) $$

where the penalty function, given by $\rho \| s \|^2$, provides a measure of the smoothness of s(x) and ρ ≥ 0 is the smoothing parameter. Equation (A.12) provides an example of a norm, which is used in thin-plate splines (Cox 1984; Wahba 1979; Craven and Wahba 1979).

Remark If we use the semi-norm in Eq. (A.12) with d = 3 and m = 2, the solution to Eq. (A.28) is given (Wahba 1990) by

$$ s(\mathbf{x}) = p(\mathbf{x}) + \sum_{i=1}^{n} \lambda_i \| \mathbf{x} - \mathbf{x}_i \|, $$

where p(x) is a polynomial of degree 1, i.e. $p(\mathbf{x}) = \mu_0 + \mu_1 x_1 + \mu_2 x_2 + \mu_3 x_3$, and the coefficients are given by the linear system:

$$ \begin{pmatrix} \mathbf{A} - 8n\pi\rho\mathbf{I} & \mathbf{P} \\ \mathbf{P}^T & \mathbf{O} \end{pmatrix} \begin{pmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\mu} \end{pmatrix} = \begin{pmatrix} \mathbf{f} \\ \mathbf{0} \end{pmatrix}, $$

where A and P are defined as in Eq. (A.27).


A.2.3 Relation to PDEs and Other Techniques

In many problems in mathematical physics, one seeks to solve the following PDE:

$$ L u = f, \qquad (A.29) $$

in a domain D within $R^d$ under specific boundary conditions, where L is a differential operator. The Green's function G of the operator L is the (generalised) function satisfying

$$ L G(\mathbf{x}, \mathbf{y}) = \delta(\mathbf{x} - \mathbf{y}), \qquad (A.30) $$

where δ is the Dirac (or impulse) function. (The Green's function G depends only on the operator L and has various properties. For example, if L is self-adjoint, then G is symmetric. If L is invariant under translation, then G(x, y) = G(x − y), and if L is also invariant under rotation, then G is a radial function, i.e. G(x, y) = G(‖x − y‖).) The solution to Eq. (A.29) is then given by the following convolution:

$$ u(\mathbf{x}) = \int_D f(\mathbf{y}) G(\mathbf{x}, \mathbf{y}) \, d\mathbf{y} + p(\mathbf{x}), \qquad (A.31) $$

where p(x) is a solution to the homogeneous equation Lp = 0. Note that Eq. (A.31) is to be compared with Eq. (A.24). In fact, if there is an operator L satisfying $L\phi(\mathbf{x} - \mathbf{x}_i) = \delta(\mathbf{x} - \mathbf{x}_i)$ and also $L p_m(\mathbf{x}) = 0$, then clearly the RBF φ(r) is the Green's function of the differential operator L, and the radial basis function s(x) given by Eq. (A.24) is the solution to

$$ L u = \sum_{k=1}^{n} \lambda_k \delta(\mathbf{x} - \mathbf{x}_k). $$

As such, it is possible to use a PDE solver to solve an interpolation problem (see e.g. Press et al. (1992)). In general, given φ, the operator L can be determined using filtering techniques from time series. RBF interpolation is also related to kriging. For example, when $p_m(\mathbf{x}) = 0$ in Eq. (A.24), the equations are similar to those of kriging, where φ plays the role of an (isotropic) covariance function. The relation to splines has also been outlined. For example, when the radial function φ(.) is the cubic or the thin-plate spline, we have a spline interpolant function. In this case the function minimises the bending energy of an infinite thin plate in two dimensions, see Poggio and Girosi (1990) for a review. For instance, if $\phi(r) = r^{2m+1}$ (m a positive integer), then the function $s(x) = \sum_{i=1}^{n} \lambda_i |x - x_i|^{2m+1} + \sum_{k=1}^{m} \mu_k x^k$ is a natural spline.


A.3 Kernel Smoother

This is a kind of local average smoother where a weighted average is obtained around each target point. Unlike linear smoothers, the kernel smoother uses a particular function K(.) known as the kernel. Given the data points $(x_j, y_j)$, $j = 1, \ldots n$, for each target point $x_i$ the weighted average $\hat{y}_i$ is obtained by

$$ \hat{y}_i = \sum_{j=1}^{n} K_{ij} y_j, \qquad (A.32) $$

where the weights are given by

$$ K_{ij} = \left[ \sum_{m=1}^{n} K\left( \frac{x_i - x_m}{b} \right) \right]^{-1} K\left( \frac{x_i - x_j}{b} \right). $$

Clearly the weights are non-negative and add up to one, i.e. $K_{ij} \ge 0$ and, for each i, $\sum_{j=1}^{n} K_{ij} = 1$. The kernel function K(.) satisfies the following properties:
(1) K(t) ≥ 0 for all t.
(2) $\int_{-\infty}^{\infty} K(t) \, dt = 1$.
(3) K(−t) = K(t) for all t.

Hence, K(.) is typically a symmetric probability density function. Note that the parameter b gives a measure of the size of the neighbourhood in the averaging process around each target point $x_i$. Basically, the parameter b controls the "width" of the function K(x/b). In the limit b → 0, we get a Dirac function $\delta_0$; in this case the smoothed function is identical to the original scatter, i.e. $\hat{y}_i = y_i$. On the other hand, in the limit b → ∞ we get a uniform weight function, and the smoothed curve reduces to the mean, i.e. $\hat{y}_i = \bar{y} = \frac{1}{n} \sum y_i$. A familiar example of kernels is given by the Gaussian PDF:

$$ K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}. $$

There are several other kernels used in the literature. The following are examples:
• Box kernel: $K(x) = 1_{[-\frac{1}{2}, \frac{1}{2}]}(x)$.
• Triangle kernel: $K(x) = \left( 1 - \frac{|x|}{a} \right) 1_{[-a, a]}(x)$ (for some a > 0).
• Parzen kernel:
$$ K(x) = \begin{cases} 1 - 6\left( \frac{x}{M} \right)^2 + 6\left( \frac{|x|}{M} \right)^3 & \mbox{for } |x| \le \frac{M}{2} \\ 2\left( 1 - \frac{|x|}{M} \right)^3 & \mbox{for } \frac{M}{2} \le |x| \le M \\ 0 & \mbox{for } |x| > M. \end{cases} $$

These kernels can also extend easily to the multidimensional case.
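A minimal implementation of the kernel smoother of Eq. (A.32) with the Gaussian kernel is given below; the bandwidth b and the synthetic data are illustrative choices.

import numpy as np

def kernel_smoother(x, y, x_target, b=0.5):
    """Kernel (local weighted average) smoother with a Gaussian kernel, Eq. (A.32)."""
    x, y = np.asarray(x), np.asarray(y)
    y_hat = np.empty(len(x_target))
    for i, xi in enumerate(x_target):
        w = np.exp(-0.5 * ((xi - x) / b) ** 2)   # Gaussian kernel K((x_i - x_j)/b)
        y_hat[i] = np.sum(w * y) / np.sum(w)     # weights normalised to sum to one
    return y_hat

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + 0.3 * rng.normal(size=x.size)
y_smooth = kernel_smoother(x, y, x, b=0.4)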

Appendix B

Introduction to Probability and Random Variables

B.1 Background

Probability is a branch of mathematics that deals with chance, randomness or uncertainty. When tossing a coin, for example, one talks of the probability of getting head or tail. For an operator receiving phone calls at a telephone switchboard, one also talks about the probability of receiving a given number of phone calls within a given time interval. We also talk about the probability of having rain tomorrow at 13:00 at a given location. Games of chance also constitute other good examples involving probability. Games of chance are very old indeed, and it has been found that cubic dice were used by ancient Egyptians around 2000 BC (DeGroot and Shervish 2002). Probability calculus was apparently popularised around the mid-seventeenth century by Blaise Pascal and Pierre de Fermat, and it was in 1933 that A. N. Kolmogorov axiomatised probability theory using sets and measure theory (Kolmogorov 1933).

Despite its common use by most scientists, no unique interpretation of probability exists among scientists and philosophers. There are two main schools of thought:

(i) The frequentist school, led by von Mises (1928) and Reichenbach (1937), holds that the probability p of an event is the relative frequency k/n of the occurrence of that event in an infinite sequence of similar (and independent) trials, i.e.

$$ p = \lim_{n \to \infty} \frac{k}{n}, $$

where k is the number of times that the event occurred in n trials.

(ii) The subjective or "Bayesian" school, which holds that the probability of an event is a subjective or personal judgement of the likelihood of that event. This interpretation goes back to Thomas Bayes (1763) and Pierre Simon Laplace in the early nineteenth century (see Laplace 1951). This school argues that randomness is not an objectively measurable phenomenon but rather a "knowledge" phenomenon, i.e. it regards probability as an epistemological rather than an ontological concept.

Besides these two main schools, there is another one: the classical school, which interprets probability based on the concept of equally likely outcomes. According to this interpretation, when performing a random experiment, one can assign the same probability to events that are equally likely. This interpretation can be useful in practice, although it has a few difficulties, such as how to define equally likely events before even computing their probabilities, and also how to define probabilities of events that are not equally likely.

B.2 Sets Theory and Probability

B.2.1 Elements of Sets Theory

Sets and Subsets
Let S be a set of elements $s_1, s_2, \ldots, s_n$, finite or infinite. Note that in the case of infinite sets one distinguishes two types: (1) countable sets, whose elements can be counted using the natural numbers 1, 2, ..., and (2) uncountable sets, which are infinite but whose elements cannot be counted. The set of rational numbers is countable, whereas the set of real numbers is uncountable. A subset A of S, noted A ⊂ S, is a set whose elements belong to S. The empty set Ø and the set S itself are examples of (trivial) subsets.

Operations on Subsets
Given a set S and subsets A, B, and C, one can perform the following operations:
• Union—The union of A and B, noted A ∪ B, is the subset containing the elements of A or B. It is clear that A ∪ Ø = A and A ∪ S = S. Also, if A ⊂ B, then A ∪ B = B. This definition can be extended to an infinite sequence of subsets $A_1, A_2, \ldots$ to yield $\cup_{k=1}^{\infty} A_k$.
• Intersection—The intersection of two subsets A and B, noted A ∩ B, is the set containing only the elements common to A and B. If no common elements exist, then A ∩ B = Ø, and the two subsets are said to be mutually exclusive or disjoint. It can be seen that A ∩ S = A and that if D ⊂ B then D ∩ B = D. The definition also extends to an infinite sequence of subsets.
• Complements—The complement of A, noted $A^c$, is the subset of elements that are not in A. One has $(A^c)^c = A$; $S^c = \mbox{Ø}$; $A \cup A^c = S$ and $A \cap A^c = \mbox{Ø}$.

B.2.2 Definition of Probability

Link to Sets Theory
An experiment involving different outcomes when it is repeated under similar conditions is a random experiment. For example, throwing a (fair) dice yields in general a random experiment. The outcome of a random experiment is called an event. The set of all possible outcomes of a random experiment is named the sample space S. For the case of the dice, S = {1, 2, 3, 4, 5, 6}, and any subset of S is an event. For example, A = {1, 3, 5} corresponds to the event of odd outcomes. The empty subset Ø corresponds to the impossible event. A sample space is discrete if it is finite or countably infinite. All the operations on subsets mentioned above can be readily transferred to operations between events. For example, two events A and B are disjoint or mutually exclusive if A ∩ B = Ø.

Definition/Axioms of Probability
Given a sample space S, a probability is a function defined on the events of S, assigning a number Pr(A) to each event A and satisfying the following properties (axioms):
(1) For any event A, 0 ≤ Pr(A) ≤ 1.
(2) Pr(S) = 1.
(3) $Pr\left( \cup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} Pr(A_i)$ for any sequence of disjoint events $A_1, A_2, \ldots$.

Properties of Probability
• Direct consequences
(1) Pr(Ø) = 0.
(2) $Pr(A^c) = 1 - Pr(A)$.
(3) Pr(A ∪ B) = Pr(A) + Pr(B) − Pr(A ∩ B).
(4) If A ⊂ B, then Pr(A) ≤ Pr(B).
(5) If A and B are exclusive, then Pr(A ∩ B) = 0.

Exercise Derive the above properties.
Exercise Compute Pr(A ∪ B ∪ C).
Answer Pr(A) + Pr(B) + Pr(C) − Pr(A ∩ B) − Pr(A ∩ C) − Pr(B ∩ C) + Pr(A ∩ B ∩ C).

• Conditional Probability—Given two events A and B, with Pr(B) > 0, the conditional probability of A given B, denoted Pr(A|B), is defined by Pr(A|B) = Pr(A ∩ B)/Pr(B).
• Independence—Two events A and B are independent if and only if Pr(A ∩ B) = Pr(A)Pr(B). This is equivalent to Pr(A|B) = Pr(A). This definition also extends to more than two independent events. As a consequence, one has the following property: Pr(A|B) = Pr(B|A)Pr(A)/Pr(B). Note the difference between independent and exclusive/disjoint events.
• Bayes theorem—For n events $B_1, \ldots, B_n$ forming a partition of the sample space S, i.e. mutually exclusive events whose union is S, and any event A,

$$ Pr(B_i | A) = \frac{Pr(B_i) \, Pr(A | B_i)}{\sum_{j=1}^{n} Pr(B_j) \, Pr(A | B_j)}. $$
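As a small worked example of Bayes' theorem, suppose a partition into two regimes B1 and B2 with prior probabilities 0.3 and 0.7, and an event A with conditional probabilities Pr(A|B1) = 0.6 and Pr(A|B2) = 0.1 (numbers invented purely for illustration). The posterior probabilities follow directly:

# prior probabilities of the two regimes and likelihoods of the event A
prior = [0.3, 0.7]
like = [0.6, 0.1]

# Bayes' theorem: posterior_i = prior_i * like_i / sum_j prior_j * like_j
evidence = sum(p * l for p, l in zip(prior, like))
posterior = [p * l / evidence for p, l in zip(prior, like)]
print(posterior)   # [0.72, 0.28]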

B.3 Random Variables and Probability Distributions

Definition A random variable is a real valued function defined on a sample space S of a random experiment. A random variable is usually noted by a capital letter, e.g. X, Y or Z, and the values it takes by a lower case letter, e.g. x, y or z. Hence a random variable X assigns a value x to each outcome in S. Depending on the sample space, one can have either discrete or continuous random variables. Sometimes we can also have a mixed random variable. Here we mainly describe discrete and continuous random variables.

B.3.1 Discrete Probability Distributions

Let X be a discrete random variable taking discrete values $x_1, \ldots, x_k$, and $p_j = Pr(X = x_j)$, $j = 1, \ldots, k$. Then the function f(x) = Pr(X = x) is the probability function of X. One immediately sees that $\sum_{j=1}^{k} f(x_j) = \sum_{j=1}^{k} p_j = 1$. The function F(x) defined by

$$ F(x) = Pr(X \le x) = \sum_{x_i \le x} f(x_i) $$

is the cumulative distribution function (cdf) of X. The cdf of a discrete random variable is a piece-wise constant function between 0 and 1. Various other characteristics can be defined from X, which are included in the continuous case discussed below.

B.3.2 Continuous Probability Distributions

Definition Let X be a continuous random variable taking values in a continuous subset I of the real axis. The function f(x) defined by

$$ Pr(a \le X \le b) = \int_a^b f(x) \, dx $$

for any interval [a, b] in I is the probability density function (pdf) of X. Hence the quantity f(x)dx represents the probability of the event x ≤ X ≤ x + dx, i.e. Pr(x ≤ X ≤ x + dx) = f(x)dx. The pdf satisfies the following properties:
(1) f(x) ≥ 0 for all x.
(2) $\int_{-\infty}^{\infty} f(x) \, dx = 1$.

The cumulative distribution function of X is given by

$$ F(x) = \int_{-\infty}^{x} f(u) \, du. $$

Remark Let X be a discrete random variable taking values $x_1, \ldots, x_k$, with probabilities $p_1, \ldots, p_k$. Designate by $\delta_x(.)$ the Dirac impulse function, i.e. $\delta_x(y) = 1$ only if y = x, and zero otherwise. Then the probability function f(x) can be written as $f(x) = \sum_{j=1}^{k} p_j \delta_{x_j}(x)$. Hence, by using the rule of integration of a Dirac impulse function, i.e. $\int_I \delta_x(y) g(y) \, dy = g(x) 1_I(x)$, where $1_I(.)$ is the indicator of the interval I, X can be analysed as if it were continuous.

Moments of a Random Variable
Let X be a continuous random variable with pdf f(.) and cdf F(.). The quantity

$$ E(X) = \int x f(x) \, dx $$

is the expected value or first-order moment of X. Note that for a discrete random variable one obtains, using the above remark, $E(X) = \sum x_i p_i$. The kth-order moment of X is defined by $m_k = E(X^k) = \int x^k f(x) \, dx$. The centred kth-order moment is $\mu_k = E\left[ (X - E(X))^k \right] = \int (x - \mu)^k f(x) \, dx$. The second-order centred moment $\mu_2$ is the variance, $var(X) = \sigma^2$, of X, and we have $\sigma^2 = E(X^2) - E(X)^2$. One can define addition and multiplication of two (or more) random variables over a sample space S and also multiply a random variable by a scalar. The expectation operator is a linear operator over the set of random variables on S, i.e. E(λX + Y) = λE(X) + E(Y). We also have $var(\lambda X) = \lambda^2 var(X)$.

Cumulants
The (non-centred) moments $\mu_m$, m = 1, 2, ..., of a random variable X with pdf f(x) are defined by

$$ \mu_m = E\left( X^m \right) = \int x^m f(x) \, dx. $$

The centred moments are defined with respect to the centred random variable X − E(X). The characteristic function is given by

$$ \phi(s) = E\left( e^{isX} \right) = \int e^{isx} f(x) \, dx, $$

and the moment generating function is given by

$$ g(s) = E\left( e^{sX} \right) = \int e^{sx} f(x) \, dx. $$

We have, in particular, $\mu_m = \frac{1}{i^m} \frac{d^m}{ds^m} \phi(s) \big|_{s=0}$. The cumulant of order m of X, $\kappa_m$, is given by

$$ \kappa_m = \frac{1}{i^m} \frac{d^m}{ds^m} \log\left( \phi(s) \right) \Big|_{s=0}. $$

For example, the third-order cumulant is the skewness, which provides a measure of the symmetry of the pdf (with respect to the mean when the centred moment is used), and $\kappa_3 = \mu_3 - 3\mu_2\mu_1 + 2\mu_1^3$. For the fourth-order cumulant, also called the kurtosis of the distribution, $\kappa_4 = \mu_4 - 4\mu_3\mu_1 - 3\mu_2^2 + 12\mu_1^2\mu_2 - 6\mu_1^4$. Note that for zero-mean distributions $\kappa_4 = \mu_4 - 3\mu_2^2$. A distribution with zero kurtosis is known as mesokurtic, like the normal distribution. A distribution with positive kurtosis is known as super-Gaussian or leptokurtic; it is characterised by a higher maximum and heavier tails than the normal distribution with the same variance. A distribution with negative kurtosis is known as sub-Gaussian or platykurtic and has a lower peak and thinner tails than the normal distribution with the same variance.

B.3.3 Joint Probability Distributions

Let X and Y be two random variables over a sample space S with respective pdfs $f_X(.)$ and $f_Y(.)$. For any x and y, the function f(x, y) defined by

$$ Pr(X \le x; Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v) \, du \, dv $$

is the joint probability density function. The definition can be extended in a similar fashion to p random variables $X_1, \ldots, X_p$. The vector $\mathbf{x} = \left( X_1, \ldots, X_p \right)^T$ is called a random vector, and its pdf is given by the joint pdf f(x) of these random variables. Two random variables X and Y are said to be independent if $f(x, y) = f_X(x) f_Y(y)$ for all x and y. The pdfs $f_X(.)$ and $f_Y(.)$ and associated cdfs $F_X(.)$ and $F_Y(.)$ are called marginal pdfs and marginal cdfs of X and Y, respectively. The marginal pdfs and cdfs are linked to the joint cdf via $F_X(x) = F(x, \infty)$ and $f_X(x) = \frac{d}{dx} F_X(x)$, and similarly for the second variable. The expectation of any function h(X, Y) is given by

$$ E\left( h(X, Y) \right) = \int\!\!\int h(x, y) f(x, y) \, dx \, dy. $$

The covariance between X and Y is given by cov(X, Y) = E(XY) − E(X)E(Y). The correlation between X and Y is given by

$$ \rho_{X,Y} = \frac{cov(X, Y)}{\sqrt{var(X) \, var(Y)}} $$

and satisfies $-1 \le \rho_{X,Y} \le 1$. If $\rho_{X,Y} = 0$, the random variables X and Y are said to be uncorrelated. Two independent random variables are uncorrelated, but the converse is not true.

For a random vector $\mathbf{x} = \left( X_1, \ldots, X_p \right)^T$ with joint pdf $f(\mathbf{x}) = f(x_1, \ldots, x_p)$, the joint probability (or cumulative) distribution function is given by

$$ F(x_1, \ldots, x_p) = \int_{-\infty}^{x_p} \ldots \int_{-\infty}^{x_1} f(u_1, \ldots, u_p) \, du_1 \ldots du_p. $$

The joint pdf is then given by

$$ f(x_1, \ldots, x_p) = \frac{\partial^p F(x_1, \ldots, x_p)}{\partial x_1 \ldots \partial x_p}. $$

Like the bivariate case, p random variables $X_1, \ldots, X_p$ are independent if the joint cdf F(.) can be factorised into a product of the marginal cdfs as $F(x_1, \ldots, x_p) = F_{X_1}(x_1) \ldots F_{X_p}(x_p)$, and similarly for the joint pdf. Also, we have $f_{X_1}(x_1) = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} f(\mathbf{x}) \, dx_2 \ldots dx_p$, and similarly for the remaining marginal pdfs.

B.3.4 Expectation and Covariance Matrix of Random Vectors

Let $\mathbf{x} = \left( X_1, \ldots, X_p \right)^T$ be a random vector with pdf f(.) and cdf F(.). The expectation of a function g(x) is defined by

$$ E\left[ g(\mathbf{x}) \right] = \int g(\mathbf{x}) f(\mathbf{x}) \, d\mathbf{x}. $$

The mean $\boldsymbol{\mu}$ of x is obtained when g(.) is the identity, i.e. $\boldsymbol{\mu} = \int \mathbf{x} f(\mathbf{x}) \, d\mathbf{x}$. Assuming the random variables $X_1, \ldots, X_p$ have finite variance, the covariance matrix $\Sigma_{xx}$ of x is given by

$$ \Sigma_{xx} = E\left[ (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \right] = E\left( \mathbf{x}\mathbf{x}^T \right) - \boldsymbol{\mu}\boldsymbol{\mu}^T, $$

with components $\left[ \Sigma_{xx} \right]_{ij} = cov(X_i, X_j)$. The covariance matrix is symmetric positive semi-definite. Let us now designate by $\mathbf{D}_{xx} = diag\left( \sigma_1^2, \ldots, \sigma_p^2 \right)$ the diagonal matrix containing the individual variances of $X_1, X_2, \ldots, X_p$; then the correlation matrix $\left( \rho_{X_i, X_j} \right)$ is given by:

$$ E\left[ \mathbf{D}_{xx}^{-1/2} (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{D}_{xx}^{-1/2} \right] = \mathbf{D}_{xx}^{-1/2} \, \Sigma_{xx} \, \mathbf{D}_{xx}^{-1/2}. $$

B.3.5 Conditional Distributions

Let x and y be two random vectors over some state space with joint pdf $f_{x,y}(.)$. The conditional probability density of y given x = x is given by $f_{y|x}(\mathbf{y}|\mathbf{x}) = f_{x,y}(\mathbf{x}, \mathbf{y})/f_x(\mathbf{x})$ when $f_x(\mathbf{x}) \ne 0$; otherwise, one takes $f_{y|x}(\mathbf{y}|\mathbf{x}) = f_{x,y}(\mathbf{x}, \mathbf{y})$. Using this conditional pdf, one can obtain the expectation of any function h(y) given x = x, i.e.

$$ E\left( h(\mathbf{y}) | \mathbf{x} = \mathbf{x} \right) = \int_{-\infty}^{\infty} h(\mathbf{y}) f_{y|x}(\mathbf{y}|\mathbf{x}) \, d\mathbf{y}, $$

which is a function of x only. As in the two-variable case, x and y are independent if $f_{y|x}(\mathbf{y}|\mathbf{x}) = f_y(\mathbf{y})$ or equally $f_{x,y}(.) = f_x(.) f_y(.)$. In particular, two (zero-mean) random vectors x and y are uncorrelated if the covariance matrix vanishes, i.e. $E\left( \mathbf{x}\mathbf{y}^T \right) = \mathbf{O}$.

B.4 Examples of Probability Distributions

B.4.1 Discrete Case

Bernoulli Distribution
A Bernoulli random variable X takes only two values, 0 and 1, i.e. X has two outcomes: success or failure (true or false) with respective probabilities Pr(X = 1) = p and Pr(X = 0) = q = 1 − p. The pdf of this distribution can be written as $f(x) = p^x (1-p)^{1-x}$, where x is either 0 or 1. A familiar example of a Bernoulli trial is the tossing of a coin.

Binomial Distribution
A binomial random variable X with parameters n and 0 ≤ p ≤ 1, noted X ∼ B(n, p), takes the n + 1 values 0, 1, ..., n with probabilities

$$ Pr(X = j) = \binom{n}{j} p^j (1-p)^{n-j}, $$

where $\binom{n}{j} = \frac{n!}{j!(n-j)!}$. Given a Bernoulli trial with probability of success p, a binomial trial B(n, p) consists of n repeated and independent Bernoulli trials. Formally, if $X_1, \ldots, X_n$ are independent and identically distributed (IID) Bernoulli random variables with probability of success p, then $\sum_{k=1}^{n} X_k$ follows a binomial distribution B(n, p). A typical example consists of tossing a coin n times; the number of heads is then a binomial random variable.

Exercise Let X ∼ B(n, p); show that μ = E(X) = np and $\sigma^2 = var(X) = np(1-p)$. Show that the characteristic function $\phi(t) = E(e^{iXt})$ is $\left( p e^{it} + q \right)^n$.

Negative Binomial Distribution
In a series of independent Bernoulli trials, with constant probability of success $p$, the random variable $X$ representing the number of trials until $r$ successes are obtained is a negative binomial with parameters $p$ and $r$. The parameter $r$ can take values $1, 2, \ldots$, and for each value, we have a distribution, e.g.

$$Pr(X = j) = \binom{j-1}{r-1} (1 - p)^{j-r} p^r,$$

for $j = r, r + 1, \ldots$. If we are interested in the first success, i.e. $r = 1$, one gets the geometric distribution.

Exercise Show that the mean and variance of the negative binomial distribution are, respectively, $\mu = r/p$ and $\sigma^2 = r(1 - p)/p^2$.

Poisson Distribution
A Poisson random variable with parameter $\lambda > 0$ can take all the integer numbers and satisfies

$$Pr(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, \ldots.$$

Poisson distributions are typically used to analyse processes involving counts.

Exercise Show that for a Poisson distribution one has $E(X) = \lambda = \mathrm{var}(X)$ and $\phi(t) = \exp\left[\lambda(e^{it} - 1)\right]$.

B.4.2 Continuous Distributions

The Uniform Distribution
A continuous uniform random variable over the interval $[a, b]$ has the following pdf:

$$f(x) = \frac{1}{b-a}\, \mathbf{1}_{[a,b]}(x),$$

where $\mathbf{1}_I(\cdot)$ is the indicator of $I$, i.e. with a value of one inside the interval and zero elsewhere.

Exercise Show that for a uniform random variable $X$ over $[a, b]$, $E(X) = (a+b)/2$ and $\mathrm{var}(X) = (b-a)^2/12$.

The Normal Distribution
The normal (or Gaussian) distribution, $N(\mu, \sigma^2)$, has the following pdf:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Exercise Show that for the above normal distribution $E(X) = \mu$ and $\mathrm{var}(X) = \sigma^2$.

For a normal distribution, the random variable $(X - \mu)/\sigma$ has zero mean and unit variance and is referred to as the standard normal. The cdf of $X$ is generally noted as $\Phi(x) = \int_{-\infty}^{x} f(u)\, du$ and is known as the error function. The normal distribution is very useful and can be reached using a number of ways. For example, if $Y$ is binomial, $Y \sim B(n, p)$, then $(Y - np)/\sqrt{np(1-p)}$ approximates the standard normal for large $np$. The same result holds for $(Y - \lambda)/\sqrt{\lambda}$ when $Y$ follows a Poisson distribution with parameter $\lambda$. This result constitutes a particular case of a more general result, namely the central limit theorem (see e.g. DeGroot and Shervish 2002, p. 282).

The Central Limit Theorem
Let $X_1, \ldots, X_n$ be a sequence of $n$ IID random variables with mean $\mu$ and variance $0 < \sigma^2 < \infty$ each; then for every number $x$

$$\lim_{n \to \infty} Pr\left(\frac{\overline{X}_n - \mu}{\sigma/\sqrt{n}} \le x\right) = \Phi(x),$$

where $\Phi(\cdot)$ is the standard normal cdf, and $\overline{X}_n = \frac{1}{n}\sum_{k=1}^{n} X_k$. The theorem says that the (properly scaled) sum of a sequence of independent random variables with same mean and (finite) variance is approximately normal.
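A small Monte Carlo experiment (a sketch only; the exponential population, the sample size and the number of repetitions are arbitrary assumptions) illustrates the theorem: the standardised sample mean of IID non-Gaussian variables behaves approximately like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 1.0                 # mean and standard deviation of an Exp(1) population
n, n_rep = 200, 10000                # sample size and number of repeated samples

x = rng.exponential(scale=1.0, size=(n_rep, n))
z = (x.mean(axis=1) - mu) / (sigma / np.sqrt(n))     # standardised sample means

# Empirical probabilities should be close to Phi(0) = 0.5 and Phi(1.96) ~ 0.975
print(round((z < 0).mean(), 3), round((z < 1.96).mean(), 3))
```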

The Exponential Distribution
The pdf of the exponential distribution with parameter $\lambda > 0$ is given by

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

The Gamma Distribution
The pdf of the gamma distribution with parameters $\lambda > 0$ and $\beta > 0$ is given by

$$f(x) = \begin{cases} \dfrac{\lambda^\beta}{\Gamma(\beta)}\, x^{\beta-1} e^{-\lambda x} & \text{if } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $\Gamma(y) = \int_0^\infty e^{-t} t^{y-1}\, dt$, for $y > 0$. When the parameter $\beta$ is a positive integer, the distribution is known as the Erlang distribution.

Exercise Show that for the above gamma distribution $E(X) = \beta/\lambda$ and $\mathrm{var}(X) = \beta/\lambda^2$. Show that $\phi(t) = (1 - it/\lambda)^{-\beta}$.

The Chi-Square Distribution
The chi-square random variable with $n$ degrees of freedom (dof), noted as $\chi_n^2$, has the following pdf:

$$f(x) = \begin{cases} \dfrac{2^{-n/2}}{\Gamma(n/2)}\, x^{n/2-1} e^{-x/2} & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Exercise Show that $E(\chi_n^2) = n$ and $\mathrm{var}(\chi_n^2) = 2n$.

If $X_1, \ldots, X_n$ are independent $N(0, 1)$, the random variable $\sum_{k=1}^{n} X_k^2$ is distributed as $\chi_n^2$ with $n$ dof. If $X_k \sim N(0, \sigma^2)$, then the obtained $\chi_n^2$ follows the $\sigma^2$ chi-square distribution.

Exercise Find the pdf of the $\sigma^2$ chi-square distribution.
Answer $f(x) = \dfrac{2^{-n/2}}{\sigma^n\, \Gamma(n/2)}\, x^{n/2-1} e^{-x/(2\sigma^2)}$ for $x > 0$.

The Student Distribution
The Student $t$ distribution with $n$ dof has the following pdf:

$$f(x) = \frac{n^{-1/2}\, \Gamma\left(\frac{n+1}{2}\right)}{\Gamma(1/2)\, \Gamma(n/2)} \left(1 + \frac{x^2}{n}\right)^{-\frac{n+1}{2}}.$$

If $X \sim N(0, 1)$ and $Y \sim \chi_n^2$ are independent, then $T = X/\sqrt{Y/n}$ has a Student distribution with $n$ dof.

The Fisher–Snedecor Distribution
The Fisher–Snedecor random variable with $n$ and $m$ dof, $F_{n,m}$, has the following pdf:

$$f(x) = \begin{cases} \left(\dfrac{n}{m}\right)^{n/2} \dfrac{\Gamma\left(\frac{n+m}{2}\right)}{\Gamma(n/2)\Gamma(m/2)}\, x^{\frac{n}{2}-1} \left(1 + \dfrac{nx}{m}\right)^{-\frac{n+m}{2}} & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Exercise Show that $E(F_{n,m}) = \dfrac{m}{m-2}$ and $\mathrm{var}(F_{n,m}) = \dfrac{2m^2(n+m-2)}{n(m-2)^2(m-4)}$.

If $X \sim \chi_n^2$ and $Y \sim \chi_m^2$ are independent, then $F_{n,m} = \dfrac{X/n}{Y/m}$ follows a Fisher–Snedecor distribution with $n$ and $m$ dof.

The Multivariate Normal Distribution
A multinormally distributed random vector $\mathbf{x}$, noted as $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, has the pdf

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right],$$

where $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are, respectively, the mean and the covariance matrix of $\mathbf{x}$. The characteristic function of this distribution is $\phi(\mathbf{t}) = \exp\left(i\boldsymbol{\mu}^T\mathbf{t} - \frac{1}{2}\mathbf{t}^T\boldsymbol{\Sigma}\mathbf{t}\right)$. The multivariate normal distribution is widely used and has some very useful properties that are given below:
• Let $\mathbf{A}$ be a $m \times p$ matrix and $\mathbf{y} = \mathbf{A}\mathbf{x}$; then $\mathbf{y} \sim N_m(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T)$.
• If $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and $\mathrm{rank}(\boldsymbol{\Sigma}) = p$, then $(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \sim \chi_p^2$.
• Let the random vector $\mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ be partitioned as $\mathbf{x}^T = (\mathbf{x}_1^T, \mathbf{x}_2^T)$, where $\mathbf{x}_1$ is $q$-dimensional ($q < p$), and similarly for the mean and the covariance matrix, i.e. $\boldsymbol{\mu}^T = (\boldsymbol{\mu}_1^T, \boldsymbol{\mu}_2^T)$ and

$$\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix},$$

then
(1) the marginal distribution of $\mathbf{x}_1$ is multinormal $N_q(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11})$;
(2) $\mathbf{x}_1$ and $\mathbf{x}_2$ are independent if and only if $\boldsymbol{\Sigma}_{12} = \mathbf{O}$;
(3) if $\boldsymbol{\Sigma}_{22}$ is of full rank, then the conditional distribution of $\mathbf{x}_1$ given $\mathbf{x}_2 = x_2$ is multinormal with $E(\mathbf{x}_1|\mathbf{x}_2 = x_2) = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(x_2 - \boldsymbol{\mu}_2)$ and $\mathrm{var}(\mathbf{x}_1|\mathbf{x}_2) = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}$.
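The conditional moments in property (3) are easy to verify numerically; the following sketch (with an arbitrary 3-dimensional example and partition $q = 1$, both assumptions made for illustration) evaluates them with NumPy.

```python
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

q = 1                                        # x1 = first component, x2 = remaining two
S11, S12 = Sigma[:q, :q], Sigma[:q, q:]
S21, S22 = Sigma[q:, :q], Sigma[q:, q:]

x2 = np.array([0.5, -0.5])                   # conditioning value of x2
cond_mean = mu[:q] + S12 @ np.linalg.solve(S22, x2 - mu[q:])
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
print(cond_mean, cond_cov)
```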

The Wishart Distribution
The Wishart distribution with $n$ dof and parameter $\boldsymbol{\Sigma}$, a $p \times p$ symmetric positive semi-definite matrix (essentially a covariance matrix), is the distribution of a $p \times p$ random matrix $\mathbf{X}$ (a matrix whose elements are random variables) with pdf

$$f(\mathbf{X}) = \begin{cases} \dfrac{|\mathbf{X}|^{\frac{n-p-1}{2}} \exp\left(-\frac{1}{2}\mathrm{tr}\left[\boldsymbol{\Sigma}^{-1}\mathbf{X}\right]\right)}{2^{np/2}\, \pi^{p(p-1)/4}\, |\boldsymbol{\Sigma}|^{n/2} \prod_{k=1}^{p} \Gamma\left(\frac{n+1-k}{2}\right)} & \text{if } \mathbf{X} \text{ is positive definite} \\ 0 & \text{otherwise.} \end{cases}$$

If $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are IID $N_p(\mathbf{0}, \boldsymbol{\Sigma})$, $p \le n$, then the $p \times p$ random matrix

$$\mathbf{W} = \sum_{k=1}^{n} \mathbf{X}_k \mathbf{X}_k^T$$

has a Wishart probability distribution with $n$ dof.
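A direct construction (a minimal sketch; the dimension, the degrees of freedom and the matrix $\boldsymbol{\Sigma}$ below are arbitrary) draws Wishart matrices as sums of outer products of multinormal vectors and checks that their average is close to $n\boldsymbol{\Sigma}$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 3, 50
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])

def wishart_draw():
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # rows are IID N_p(0, Sigma)
    return X.T @ X                                            # sum over k of x_k x_k^T

W_bar = sum(wishart_draw() for _ in range(2000)) / 2000
print(np.round(W_bar / n, 2))     # should be close to Sigma, since E(W) = n * Sigma
```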

B.5 Stationary Processes

A (discrete) stochastic process is a sequence of random variables $X_1, X_2, \ldots$, which is a realisation of some random variable $X$, noted as $X_k \sim X$. This stochastic process is entirely characterised by specifying the joint probability distribution of any finite set $(X_{k_1}, \ldots, X_{k_m})$ from the sequence. The sequence is also sometimes called a time series, when the indices $t = 1, 2, \ldots$ are identified with "time". When one observes a finite realisation $x_1, x_2, \ldots, x_n$ of the previous sequence, one also talks of a finite sample time series. Let $\mu_t = E(X_t)$ and $\gamma(t, k) = \mathrm{cov}(X_t, X_{t+k})$, for $t = 1, 2, \ldots$ and $k = 0, 1, \ldots$. The process (or time series) is said to be stationary if $\mu_t$ and $\gamma(t, k)$ are independent of $t$. In this case one obtains $E(X_t) = \mu$ and $\gamma(k) = \gamma_k = \mathrm{cov}(X_t, X_{t+k})$. The function $\gamma(\cdot)$, defined on the integers, is the autocovariance function of the stationary stochastic process. The function

$$\rho_k = \frac{\gamma_k}{\gamma_0}$$

is the autocorrelation function.

We assume that we have a finite sample $x_1, \ldots, x_n$, supposed to be an independent realisation of some random variable $X$ of finite mean and variance. Let $\overline{x}$ and $s^2$ be the sample mean and the sample variance, respectively, i.e. $\overline{x} = \frac{1}{n}\sum_{k=1}^{n} x_k$ and $s^2 = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - \overline{x})^2$. Note that because the sample is random, these estimators are also random. These estimators satisfy, respectively, $E(\overline{x}) = E(X)$ and $E(s^2) = \mathrm{var}(X)$, and for this reason, they are referred to as unbiased estimators of the mean and the variance of $X$, respectively. Also, the function $\hat{F}(x) = \frac{1}{n}\#\{x_k,\ x_k \le x\}$ represents an estimator of the cdf $F(\cdot)$ of $X$, or empirical distribution function (edf). Given a finite sample $x_1, \ldots, x_n$, an estimator of the autocovariance function is given by

$$\hat{\gamma}_k = \frac{1}{n}\sum_{i=1}^{n-k} (x_i - \overline{x})(x_{i+k} - \overline{x}),$$

and the estimator of the autocorrelation is $\hat{\rho}_k = \hat{\gamma}_k/\hat{\gamma}_0$. It can be shown that $\mathrm{var}(\hat{\rho}_k) \approx \frac{1}{n}\sum_{i=1}^{\infty}\left(\rho_i^2 + \rho_{i+k}\rho_{i-k} - 4\rho_k\rho_i\rho_{i-k} + 2\rho_i^2\rho_k^2\right)$. This expression can be simplified further if the autocorrelation decays, e.g. exponentially. The computation of the variance of the sample estimators is useful in defining a confidence interval for the estimators.
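These estimators translate directly into code; the sketch below (the autocorrelated test series is an arbitrary choice) computes $\hat{\gamma}_k$ and $\hat{\rho}_k$ for the first few lags.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Autocovariance estimate with the 1/n normalisation and the derived autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    gamma = np.array([np.sum(xc[:n - k] * xc[k:]) / n for k in range(max_lag + 1)])
    return gamma, gamma / gamma[0]

rng = np.random.default_rng(3)
x = np.zeros(500)
for t in range(1, 500):                       # simple autocorrelated test series
    x[t] = 0.6 * x[t - 1] + rng.standard_normal()

gamma_hat, rho_hat = sample_acf(x, 10)
print(np.round(rho_hat[:5], 2))               # roughly 1, 0.6, 0.36, ... for this series
```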

Appendix C

Stationary Time Series Analysis

This appendix gives a brief introduction to stationary time series analysis for the univariate and multivariate cases.

C.1 Autocorrelation Structure: One-Dimensional Case A (discrete) time series is a sequence of numbers xt , t = 1, 2 . . . , n. In time series exploration and modelling, a time series is considered as a realisation of some stochastic process, i.e. a sequence of random variables. So, conceptually the time series xt , t = 1, 2, . . . is considered as a sequence of random variables and the corresponding observed series is simply a realisation of these random variables. The time series is said to be (second-order) stationary if the mean is constant and the covariance between any xt and xs is a function of t − s, i.e. E (xt ) = μ and cov (xt , xs ) = γ (t − s).

(C.1)

C.1.1 Autocovariance/Correlation Function The autocovariance function of a stationary time series xt , t = 1, 2 . . ., is defined by γ (τ ) = cov (xt+τ , xt ) = E (xt+τ − μ) (xt − μ) .

(C.2)

It is clear that the variance of the time series is simply σ 2 = γ (0). The autocorrelation function ρ() is given by


$$\rho(\tau) = \frac{\gamma(\tau)}{\sigma^2}. \qquad \text{(C.3)}$$

Properties of the Autocovariance Function
The autocovariance function $\gamma(\cdot)$ satisfies the following properties:
• $|\gamma(\tau)| \le \gamma(0) = \sigma^2$.
• $\gamma(\tau) = \gamma(-\tau)$.
• For any $p$ integers $\tau_1, \tau_2, \ldots, \tau_p$ and real numbers $a_1, a_2, \ldots, a_p$, we have

$$\sum_{i,j=1}^{p} \gamma(\tau_i - \tau_j)\, a_i a_j \ge 0, \qquad \text{(C.4)}$$

and the autocovariance function is said to be non-negative definite or positive semi-definite.

Exercise Derive the above properties.
Hint Use the fact that $\mathrm{var}(\lambda x_t + x_{t+\tau}) \ge 0$ for any real $\lambda$. For the last one, use the fact that $\mathrm{var}(\sum_i a_i x_{\tau_i}) \ge 0$.

C.1.2 Time Series Models Let εt , t ≥ 1, be a sequence of IID random variables with zero mean and variance σε2 . This sequence is called white noise. The autocovariance of such a process is simply a Dirac pulse, i.e. γε (τ ) = δτ , i.e. one at τ = 0, and zero elsewhere. Although the white noise process is the simplest time series, it remains, however, hypothetical because it does not exist in practice. Climate and other time series are autocorrelated. Simple linear time series models have been formulated to explain this autocorrelation. The models we are reviewing here have been formulated in the early 1970s and are known as autoregressive moving average (ARMA) models (Box and Jenkins 1970; see also Box et al. (1994)).

Some Basic Notations Given a time series (xt ) where t is either continuous or discrete, various operations can be defined.


• Backward shift or delay operator $B$—This is defined for discrete time series by

$$Bx_t = x_{t-1}. \qquad \text{(C.5)}$$

More generally, for any integer $m \ge 1$, $B^m x_t = x_{t-m}$. By analogy, one can define the inverse operator $B^{-1}$, which is the forward operator. It is clear from Eq. (C.5) that for a constant $c$, $Bc = c$. Also for any integers $m$ and $n$, $B^m B^n = B^{m+n}$. Furthermore, for any time series $(x_t)$, $t = 1, 2, \ldots$, we have

$$\frac{1}{1 - \alpha B}\, x_t = (1 + \alpha B + \alpha^2 B^2 + \ldots)x_t = x_t + \alpha x_{t-1} + \ldots \qquad \text{(C.6)}$$

whenever |α| < 1. Remark Consider the mean x of a finite time series (xt ), t = 1, . . . n. Then, x=

  n n 1 1  i 1 1 − Bn xn . xi = B xn = n n n 1−B i=1

i=0

• Differencing operator $\nabla = 1 - B$—This is defined by

$$\nabla x_t = (1 - B)x_t = x_t - x_{t-1}. \qquad \text{(C.7)}$$

For example, $\nabla^2 x_t = (1 - B)^2 x_t = x_t - 2x_{t-1} + x_{t-2}$.
• Seasonal differencing $\nabla_k = 1 - B^k$—This operator is frequently used to deal with seasonality.
• Gain operator—It is a simple linear multiplication of the time series, i.e. $ax_t$. The parameter $a$ is referred to as gain.
• Differencing operator $D$—For a continuous time series, $\{y(t),\ a \le t \le b\}$, the differencing operator is simply a differentiation $D$, i.e.

$$Dy(t) = \frac{dy(t)}{dt} \qquad \text{(C.8)}$$

whenever this differentiation is possible.
• Continuous shift operator—Another useful operator normally encountered in filtering is the shift operator in continuous time series, $B^u$, defined by

$$B^u y(t) = y(t - u). \qquad \text{(C.9)}$$

This operator is equivalent to the backward shift operator in discrete time series. It can be shown that

$$B^u = e^{-uD} = e^{-u\frac{d}{dt}}. \qquad \text{(C.10)}$$


Fig. C.1 Examples of time series of AR(1) models with lag-1 autocorrelation 0.5 (a) and −0.5 (b)

Exercise Derive Eq. (C.10). Hint Use a Taylor expansion of y(t − u).

ARMA Models
• Autoregressive schemes: AR(p)
Autoregressive models of order $p$ are given by

$$x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + \varepsilon_t = \left(\sum_{k=1}^{p} \phi_k B^k\right) x_t + \varepsilon_t. \qquad \text{(C.11)}$$

The white noise $\varepsilon_t$ is only correlated with $x_s$ for $s \ge t$. When $p = 1$, one gets the familiar Markov or first-order autoregressive, AR(1), model, also known as red noise. Figure C.1 shows an example of generated time series of an AR(1) model with opposite lag-1 autocorrelations. The red noise is a particularly simple model that is frequently used in climate research and constitutes a reasonably good model for many climate processes, see e.g. Hasselmann (1976, 1988), von Storch (1995a,b), Penland and Sardeshmukh (1995), Hall and Manabe (1997), Feldstein (2000) and Wunsch (2003) to name just a few.
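The red noise model is straightforward to simulate; the sketch below (arbitrary series length and unit innovation variance, both assumptions) generates AR(1) series with lag-1 autocorrelation of 0.5 and of -0.5, in the spirit of Fig. C.1.

```python
import numpy as np

def simulate_ar1(phi, n, rng):
    """Generate an AR(1) (red noise) series x_t = phi * x_{t-1} + eps_t."""
    x = np.zeros(n)
    eps = rng.standard_normal(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + eps[t]
    return x

rng = np.random.default_rng(4)
for phi in (0.5, -0.5):
    x = simulate_ar1(phi, 2000, rng)
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]     # empirical lag-1 autocorrelation
    print(phi, round(r1, 2))
```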

• Moving average scheme: MA(q)
Moving average models of order $q$, MA(q), are defined by

$$x_t = \varepsilon_t + \phi_1 \varepsilon_{t-1} + \ldots + \phi_q \varepsilon_{t-q} = \left(1 + \sum_{k=1}^{q} \phi_k B^k\right) \varepsilon_t. \qquad \text{(C.12)}$$


It is possible to combine both the above models, AR(p) and MA(q), into just one single model, the ARMA model.
• Autoregressive moving average scheme: ARMA(p, q)
It is given by

$$\left(1 - \sum_{k=1}^{p} \phi_k B^k\right) x_t = \left(1 + \sum_{k=1}^{q} \theta_k B^k\right) \varepsilon_t. \qquad \text{(C.13)}$$

The ARMA(p, q) model can also be written in a more compact form as $\phi(B)x_t = \theta(B)\varepsilon_t$, where $\phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k$ and $\theta(z) = 1 + \sum_{k=1}^{q} \theta_k z^k$. Stationarity of the ARMA(p, q) model, Eq. (C.13), requires that the roots of

$$\phi(z) = 0 \qquad \text{(C.14)}$$

be outside the unit circle, see e.g. Box et al. (1994) for details. Various ways exist to identify possible models for a given time series. For example, the autocorrelation function of an ARMA model is a damped exponential and/or sine waves that could be used as a guide to select models. Another useful measure is the partial autocorrelation function. It exploits the fact that, for example, for an AR(p) model the autocorrelation function can be entirely described by the first $p$ lagged autocorrelations whose behaviour is described by the partial autocorrelation, which is a function that cuts off after lag $p$ for the AR(p) model. Alternatively, one can use concepts from information theory (Akaike 1969, 1974) by fitting a whole range of models, computing the residual estimates $\hat{\varepsilon}$ and their variances (the mean squared errors) $\hat{\sigma}^2$ and then deriving, for example, the Akaike information criterion (AIC) given by

$$AIC = \log(\hat{\sigma}^2) + \frac{2}{n}(P + 1), \qquad \text{(C.15)}$$

where $P$ is the number of parameters to be estimated. The best model corresponds to the smallest AIC.
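The order-selection recipe can be sketched as follows (an illustration only: the AR coefficients are fitted by ordinary least squares as a simple stand-in for maximum likelihood, and the simulated AR(2) test series is an arbitrary assumption).

```python
import numpy as np

def ar_aic(x, p):
    """Least-squares AR(p) fit; returns AIC = log(sigma^2_hat) + 2*(p+1)/n, as in Eq. (C.15)."""
    n = len(x)
    X = np.column_stack([x[p - k - 1:n - k - 1] for k in range(p)])   # lagged predictors
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ coef) ** 2)
    return np.log(sigma2) + 2.0 * (p + 1) / n

rng = np.random.default_rng(5)
x = np.zeros(1000)
for t in range(2, 1000):                      # AR(2) test series
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()

aic = {p: ar_aic(x, p) for p in range(1, 6)}
print(min(aic, key=aic.get), aic)             # the smallest AIC typically points to p = 2
```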

C.2 Power Spectrum

We assume that we have a stationary time series $x_t$, $t = 1, 2, \ldots$, with summable autocovariance function $\gamma(\cdot)$, i.e. $\sum_k \gamma(k) < \infty$. The spectral density function, or power spectrum, $f(\omega)$ is defined by

$$f(\omega) = \frac{1}{2\pi}\sum_{k=-\infty}^{\infty} \gamma(k)\, e^{-ik\omega}. \qquad \text{(C.16)}$$


Fig. C.2 Autocorrelation function of AR(1) models with lag-1 autocorrelations 0.5 (a) and −0.5(b)

Using the symmetry of the autocovariance function, the power spectrum becomes

$$f(\omega) = \frac{\sigma^2}{2\pi}\left(1 + 2\sum_{k=1}^{\infty} \rho(k)\cos k\omega\right). \qquad \text{(C.17)}$$

Remark Similar to the power spectrum, the bispectrum is the Fourier transform of the bicovariance function and is related to the skewness (e.g. Pires and Hannachi 2021).

Properties of the Power Spectrum
• $f(\cdot)$ is even, i.e. $f(-\omega) = f(\omega)$.
• $f(\omega) \ge 0$ for all $\omega$ in $[-\pi, \pi]$.
• $\gamma(\tau) = \int_{-\pi}^{\pi} e^{i\omega\tau} f(\omega)\, d\omega = \int_{-\pi}^{\pi} \cos\tau\omega\, f(\omega)\, d\omega$, i.e. the autocovariance function is the inverse Fourier transform of the power spectrum.
Note that from the last property, one gets, in particular, the familiar result $\sigma^2 = \int_{-\pi}^{\pi} f(\omega)\, d\omega$, i.e. the power spectrum distributes the variance.

Examples
• The power spectrum of a white noise is constant, i.e. $f(\omega) = \frac{\sigma^2}{2\pi}$.
• For a red noise time series (of zero mean), $x_t = \alpha x_{t-1} + \varepsilon_t$, the autocorrelation function is $\rho(\tau) = \alpha^{|\tau|}$, and its power spectrum is $f(\omega) = \frac{\sigma^2}{2\pi}\left(1 - 2\alpha\cos\omega + \alpha^2\right)^{-1}$ (Figs. C.2, C.3).

Exercise Derive the relationship between the variance of $x_t$ and that of the innovation $\varepsilon_t$ in the red noise model.
Hint $\sigma^2 = \sigma_\varepsilon^2 (1 - \alpha^2)^{-1}$.

Exercise
1. Compute the autocorrelation function $\rho(\cdot)$ of the AR(1) model $x_t = \phi_1 x_{t-1} + \varepsilon_t$.
2. Compute $\sum_{k \ge 0} \rho(k)$.


Fig. C.3 Power spectra of two AR(1) models with lag-1 autocorrelation 0.5 and −0.5

3. Write $\rho(\tau) = e^{-\tau/T_0}$, and calculate $T_0$ as a function of $\phi_1$.
4. Calculate $\int_0^\infty e^{-\tau/T_0}\, d\tau$.
5. Reconcile the expression from 2 with that from 4.

Hint
1. For $k \ge 1$, $x_t x_{t-k} = \phi_1 x_{t-1} x_{t-k} + \varepsilon_t x_{t-k}$ yields $\rho(k) = \phi_1^k$.
2. $\frac{1}{1-\phi_1}$.
3. $T_0 = -\frac{1}{\log\phi_1}$.
4. $T_0$.
5. $T_0^{-1} = -\log(1 - (1 - \phi_1)) \approx 1 - \phi_1$.

General Expression of the ARMA Spectra
A direct method to compute the spectra of ARMA processes is to make use of results from linear filtering as outlined in Sect. 2.6 of Chap. 2.

Exercise Consider the delay operation $y_t = Bx_t$. Find the relation between the Fourier transforms $y(\omega)$ and $x(\omega)$ of $y_t$ and $x_t$, respectively. Find a similar relationship when $y_t = \alpha x_t + \beta Bx_t$.
Answer $y(\omega) = (\alpha + \beta e^{i\omega})x(\omega)$.

Let $x_t$, $t = 0, 1, 2, \ldots$, be a stationary time series, and consider the filtering equation:

$$y_t = \sum_{k=1}^{p} \alpha_k x_{t-k}.$$

Using the above exercise, we get $y(\omega) = A(e^{i\omega})x(\omega)$, where

$$A(z) = \sum_{k=1}^{p} \alpha_k z^k,$$

where the function $a(\omega) = A(e^{i\omega})$ is the frequency response function, which is the Fourier transform of the transfer function. Now, the power spectrum of $y_t$ is linked to that of $x_t$ through $f_y(\omega) = |a(\omega)|^2 f_x(\omega)$. The application of this to the ARMA time series model (C.13), see also Chap. 2, yields

$$f_x(\omega) = \sigma_\varepsilon^2 \left|\frac{\theta(e^{i\omega})}{\phi(e^{i\omega})}\right|^2. \qquad \text{(C.18)}$$

In the above equation it is assumed that the roots of $\phi(z)$ are outside the unit circle (stationarity) and similarly for $\theta(z)$ (for invertibility, i.e. $\varepsilon_t$ is written as a convergent power series in $x_t, x_{t-1}, \ldots$).
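Equation (C.18) can be evaluated directly; the sketch below (with arbitrary AR(1) parameters) includes the $1/(2\pi)$ normalisation used for the white noise spectrum earlier in this appendix, which is an assumption about the convention, and checks the result against the closed-form red noise spectrum.

```python
import numpy as np

def arma_spectrum(phi, theta, sigma2_eps, omega):
    """Rational ARMA spectrum sigma_eps^2/(2*pi) * |theta(e^{iw})/phi(e^{iw})|^2."""
    z = np.exp(1j * omega)
    num = 1.0 + sum(t * z ** (k + 1) for k, t in enumerate(theta))
    den = 1.0 - sum(a * z ** (k + 1) for k, a in enumerate(phi))
    return sigma2_eps / (2 * np.pi) * np.abs(num / den) ** 2

omega = np.linspace(0.01, np.pi, 200)
f_ar1 = arma_spectrum(phi=[0.5], theta=[], sigma2_eps=1.0, omega=omega)
f_red = 1.0 / (2 * np.pi) / (1.0 - 2 * 0.5 * np.cos(omega) + 0.5 ** 2)   # red noise spectrum
print(np.allclose(f_ar1, f_red))                                         # True
```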

C.3 The Multivariate Case

Let $\mathbf{x}_t$, $t = 1, 2, \ldots$, be a multivariate time series where each element $\mathbf{x}_t = (x_{t1}, x_{t2}, \ldots, x_{tp})^T$ is now $p$-dimensional. We suppose that $\mathbf{x}_t$ is of mean zero and covariance matrix $\boldsymbol{\Sigma}_0$.

C.3.1 Autocovariance Structure

The lagged cross- or autocovariance matrix $\boldsymbol{\Gamma}(\tau)$ is defined by

$$\boldsymbol{\Gamma}(\tau) = E\left[\mathbf{x}_{t+\tau}\mathbf{x}_t^T\right]. \qquad \text{(C.19)}$$

The elements of $\boldsymbol{\Gamma}(\tau)$ are $[\boldsymbol{\Gamma}(\tau)]_{ij} = E\left[x_{t+\tau,i}\, x_{t,j}\right]$. The diagonal elements are the autocovariances of the individual unidimensional time series forming $\mathbf{x}_t$, whereas its off-diagonal elements are the lagged cross-covariances. The lagged covariance matrix has the following properties:
• $\boldsymbol{\Gamma}(-\tau) = [\boldsymbol{\Gamma}(\tau)]^T$.
• $\boldsymbol{\Gamma}(0)$ is the covariance matrix $\boldsymbol{\Sigma}_0$ of $\mathbf{x}_t$.

• $\boldsymbol{\Gamma}(\tau)$ is positive semi-definite, i.e. for any integer $m > 0$ and real vectors $\mathbf{a}_1, \ldots, \mathbf{a}_m$,

$$\sum_{i,j=1}^{m} \mathbf{a}_i^T\, \boldsymbol{\Gamma}(i - j)\, \mathbf{a}_j \ge 0. \qquad \text{(C.20)}$$

ϒ(τ ) =  0

−1/2

(τ ) 0

,

 − 1 whose elements ρij (τ ) are [ϒ(τ )]ij = γij (τ ) γii (0)γjj (0) 2 , has similar properties. Furthermore, we have |ρij (τ )| ≤ 1. Note that the inequality γij (τ ) ≤ γij (0), for i = j , is not true in general.

C.3.2 Cross-Spectrum As for the univariate case, we can define the spectral density matrix F(ω) of xt , t = 1, 2, . . . for −π ≤ ω ≤ π as the Fourier transform of the autocovariance matrix: F(ω) =

∞ 1  −iτ ω e (τ ) 2π τ =−∞

(C.21)

 whenever τ (τ ) < ∞, where . is a matrix norm. For example, if  |γ (τ )| < ∞, for i, j = 1, 2, . . . p, then F(ω) exists. Unlike the univariate case, ij τ however, the spectral density matrix can be complex because () is not symmetric. The diagonal elements of F(ω) are real because they represent the power spectra of the individual univariate time series that constitute xt . The real part of F(ω) is the co-spectrum matrix, whereas the imaginary part is the quadrature spectrum matrix. The spectral density matrix has the following properties: • F(ω) is Hermitian, i.e. F(−ω) = [F(ω)]∗T , where (∗ ) represents the complex conjugate. • The autocovariance matrix is the inverse Fourier transform of F(ω), i.e.

492

C Stationary Time Series Analysis

 (τ ) =

π

F(ω)eiτ ω dω.

−π

(C.22)

π  •  0 = −π F(ω)dω, and 2π F(0) = k (k). • F(ω) is semi-definite (Hermitian), i.e. for any integer m > 0, and complex numbers c1 , c2 , . . . , cp , we have ∗T

c

F(ω)c =

p 

ci∗ Fij (ω)cj ≥ 0,

(C.23)

i,j =1

T

where c = c1 , c2 , . . . , cp . The coherence and phase between xt,i and xt,j , t = 1, 2, . . ., for i = j , are, respectively, given by cij (ω) =

|Fij (ω)|2 , Fii (ω)Fjj (ω)

(C.24)

and  φij (ω) = Atan

 I m(Fij (ω)) . Re(Fij (ω))

(C.25)

The coherence, Eq. (C.24), gives essentially a measure of the square of the correlation coefficient between both the time series in the frequency domain. The phase, Eq. (C.25), on the other hand, gives a measure of the time lag between the time series.

C.4 Autocorrelation Structure in the Sample Space C.4.1 Autocovariance/Autocorrelation Estimates We assume that we have a finite sample of a time series, xt , t = 1, 2 . . . n. There are various ways to estimate the autocovariance function γ (). The most widely used estimators are 1 (xt − x) (xt+τ − x) n

(C.26)

n−τ 1  (xt − x) (xt+τ − x) . n−τ

(C.27)

n−τ

γˆ1 (τ ) =

t=1

and γˆ2 (τ ) =

t=1

C

Stationary Time Series Analysis

493

We can assume for simplicity that the sample mean is zero. It is clear from Eq. (C.26) and Eq. (C.27) that γˆ1 () is slightly biased, with bias of order n1 , i.e. asymptotically unbiased, whereas γˆ2 () is unbiased. The asymptotically unbiased estimator γˆ1 () is, however, consistent, i.e. its variance goes to zero as the sample size goes to infinity, whereas the estimator γˆ2 () is inconsistent with its variance tending to infinity with the sample size (see e.g. Jenkins and Watts 1968). But, for a fixed lag both the estimators are asymptotically unbiased and with approximate variances satisfying (Priestly 1981, p. 328)



1 var γˆ1 (τ ) ≈ O( n1 ) and var γˆ2 (τ ) ≈ O( n−k ). Similarly, the autocorrelation function can be estimated by ρ(τ ˆ )=

γˆ (τ ) , σˆ 2

(C.28)

whereγˆ () is an estimator of the autocovariance function, and σˆ 2 = (n − 1)−1 nt=1 (xt − x)2 is the sample variance. The sample estimate ρˆ1 (τ ), τ = 0, 1, . . . n − 1, is positive semi-definite, see Eq. (C.4), whereas this is not in general true for ρˆ2 (.), see e.g. Priestly (1981, p. 331). The graph showing the sample autocorrelation function ρ(τ ˆ ) versus τ is normally referred to as correlogram. A simple and useful significance test for the sample autocorrelation function is one based on asymptotic normality and white noise, namely,

E ρ(τ ˆ ) ≈ 0 for τ = 0 and

var ρ(τ ˆ ) ≈

1 n

for τ = 0.

These approximations can be used to construct confidence intervals for the sample autocorrelation function.

C.4.2 The Periodogram Raw Periodogram We consider again a centred sample of a time series, xt , t = 1, 2 . . . n, with sample autocovariance function γˆ (). In spectral estimate we normally consider the Fourier n frequencies ωk = 2πn k , for k = −[ n−1 2 ], . . . , [ 2 ], where [x] is the integer part of x. 2π (radians/time unit), where t is the sampling interval, is known The frequency 2t

494

C Stationary Time Series Analysis

as the Nyquist frequency.1 The Nyquist frequency represents the highest frequency that can be resolved, and therefore, the power spectrum can only be estimated for frequencies less than the Nyquist frequency. The sequence of the following complex vectors:

T 1 ck = √ eiωk , e2iωk , . . . , einωk n for k = 1, 2, . . . n, is orthonormal, i.e. c∗T k cl = δkl , and therefore, any ndimensional complex vector x can be expressed as n

[2] 

x=

αk ck ,

(C.29)

k=−[ n−1 2 ] T where αk = c∗T 1 , . . . , xn ) yields k x. The application of Eq. (C.29) to the vector (x  the discrete Fourier transform of the time series, i.e. αk = √1n nt=1 xt e−itωk . The periodogram of the time series is defined as the squared amplitude of the Fourier coefficients, i.e.

) n ) 1 ))  −itωk ))2 In (ωk ) = ) xt e ) . n

(C.30)

t=1

Now, from Eq. (C.29) one easily gets (n − 1)σˆ = 2

n 

n

xt2

=

t=1

[2] 

n

|αk | = 2

k=−[ n−1 2 ]

[2] 

In (ωk ).

(C.31)

k=−[ n−1 2 ]

As for the power spectrum, the periodogram also distributes the sample variance, i.e. the periodogram In (ωk ) represents the contribution to the sample variance from the frequency ωk . The periodogram can be seen as an estimator of the power spectrum, Eq. (C.16). In fact, by expanding Eq. (C.30) one gets ⎤ ⎡ n n−1

1 ⎣ 2   In (ωp ) = xt + xt xτ eikωp + e−ikωp ⎦ n t=1

1 Or

1 2t

k=1 |t−τ |=k

if the frequency is expressed in (1/time unit). For example, if the sampling time interval is unity, then the Nyquist frequency is 12 .

C

Stationary Time Series Analysis

=

n−1 

495

γˆ (k) cos(ωp k).

(C.32)

k=−(n−1) 1 In (ωp ) is a candidate estimator for the power spectrum f (ωp ). FurTherefore 2π



 thermore, it can be seen from Eq. (C.32) that E In (ωp ) = n−1 k=−(n−1) E γˆ (k) cos(ωp k), i.e.



E In (ωp ) ≈ 2πf (ωp ).

(C.33)

The periodogram is therefore an asymptotically unbiased estimator of the power spectrum. However, it is not consistent because it can be shown to have a constant variance. The periodogram is also highly erratic with sampling fluctuations that do not vanish as the sample size increases, and therefore, some smoothing is required.
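The raw periodogram is conveniently computed with the FFT; the sketch below (an arbitrary red noise test series) also verifies the variance-partition property of Eq. (C.31), which is just Parseval's identity.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1024
x = np.zeros(n)
for t in range(1, n):                          # red noise test series
    x[t] = 0.5 * x[t - 1] + rng.standard_normal()
x = x - x.mean()                               # centre the sample

I_n = np.abs(np.fft.fft(x)) ** 2 / n           # raw periodogram at the Fourier frequencies
print(np.isclose(I_n.sum(), np.sum(x ** 2)))   # True: the periodogram distributes the variance
```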

Periodogram Smoothing Various ways exist to construct a consistent estimator of the spectral density function. Smoothing is the most widely used way to achieve consistency. The smoothed periodogram is obtained by convolving the (raw) periodogram using a “spectral window” W () as n

1 fˆ(ω) = 2π

[2] 

W (ω − ωk )In (ωk ).

(C.34)

k=−[ n−1 2 ]

The spectral window is a symmetric kernel function that integrates to unity and decays at large values. This smoothing is equivalent to a discrete Fourier transform of the weighted autocovariance estimator using a (time domain) lag window λ(.) as 1 fˆ(ω) = 2π

n−1 

λ(k)γˆ (k) cos(ωp k).

(C.35)

k=−(n−1)

The sum in Eq. (C.35) is normally truncated at the truncation point of the lag window. The spectral window W () is the Fourier transform of the lag window, whose aim is to neglect the contribution, in the sample autocovariance function, from large lags. This means that localisation in time is associated with broadness in the spectral domain and vice versa. Figure C.4 illustrates the relationship between time (or lag) window and spectral window. Various lag/spectral windows exist in the literature. Two examples are given below, namely, the Bartlett (1950) and Parzen (1961) windows:

496

C Stationary Time Series Analysis

Fig. C.4 Illustration of the relationship between time and spectral windows

• Bartlett window for which the lag window is defined by  λ(τ ) =

τ for |τ | < M 1− M 0 otherwise,

(C.36)

and the corresponding spectral window is M W (ω) = n



sin(π Mω) π Mω

2 (C.37)

.

• Parzen window: ⎧

τ 2

τ 3 M ⎪ ⎨ 1 − 6 M + 6 M for |τ | ≤ 2

3 τ λ(τ ) = 2 1− M ⎪ ⎩ 0 otherwise,

(C.38)

and W (ω) =

6 πM3



sin(Mω/4) sinω/2

4 .

(C.39)

C

Stationary Time Series Analysis

497

Fig. C.5 Parzen window showing W (ω) in ordinate versus ω in abscissa for different values of the parameter M

Figure C.5 shows an example of the Parzen window for different values of the parameter M. Notice in particular that as M increases the lag window becomes narrower. Since M can be regarded as a time resolution, it is clear that the variance increases with M and vice versa. Remark There are other ways to estimate the power spectrum such as the maximum entropy method (MEM). The MEM estimator is achieved by fitting an autoregressive model to the time series and then using the model parameters to compute the power spectrum, see e.g. Burg (1972), Ulrych and Bishop (1975), and Priestly (1981). The cross-covariance and the cross-spectrum can be estimated in a similar way to the sample covariance function and sample spectrum. For example, the cross-covariance between two zero-mean time series samples xt , and yt , t = 1, . . . n, can be estimated using 1 xt yt+τ n n−τ

γˆ12 (τ ) =

(C.40)

t=1

for τ = 0, 1, . . . , n − 1, which is then complemented by symmetry, i.e. γˆ12 (−τ ) = γˆ21 (τ ). Similarly, the cross-spectrum can be estimated using M 1  fˆ12 (ω) = λ(k)γˆ12 (k)eiωk . 2π k=−M

(C.41)

Appendix D

Matrix Algebra and Matrix Function

D.1 Background

D.1.1 Matrices and Linear Operators

Matrices
Given two $n$-dimensional vectors $\mathbf{x} = (x_1, \ldots, x_n)^T$ and $\mathbf{y} = (y_1, \ldots, y_n)^T$ and scalar $\lambda$, then $\mathbf{x} + \mathbf{y}$ and $\lambda\mathbf{x}$ are also $n$-dimensional vectors given, respectively, by $(x_1 + y_1, \ldots, x_n + y_n)^T$ and $(\lambda x_1, \ldots, \lambda x_n)^T$. The set $E_n$ of all $n$-dimensional vectors is called a linear (or vector) space. It is $n$-dimensional if it is real and $2n$-dimensional if it is complex. For the real case, for example, a natural basis of the space is $(\mathbf{e}_1, \ldots, \mathbf{e}_n)$, where $\mathbf{e}_k$ contains zero everywhere except at the $k$th position where it is one. A matrix $\mathbf{X}$ of order $n \times p$ is a collection of (real or complex) numbers $x_{ij}$, $i = 1, \ldots, n$, $j = 1, \ldots, p$, taking the following form:

$$\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \ldots & x_{1p} \\ x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \ldots & x_{np} \end{pmatrix} = \left(x_{ij}\right).$$

When $p = 1$, one obtains a $n$-dimensional vector, i.e. one column of $n$ numbers $\mathbf{x} = (x_1, \ldots, x_n)^T$. When $n = p$, the matrix is called square. Similar operations can be defined on matrices, i.e. for any two $n \times p$ matrices $\mathbf{X} = (x_{ij})$ and $\mathbf{Y} = (y_{ij})$, and scalar $\lambda$, we have $\mathbf{X} + \mathbf{Y} = (x_{ij} + y_{ij})$ and $\lambda\mathbf{X} = (\lambda x_{ij})$. The set of all $n \times p$ real matrices is a linear space with dimension $np$.


Matrices and Linear Operators Any n×p matrix X is a representation of a linear operator from a linear space Ep into a linear space En . For example, if the space En is real, then one gets En = Rn . Let us denote by xk = (x1k , . . . , xnk )T , and then the matrix is written as X = x1 , . . . , xp . The kth column xk of X represents the image of the kth basis vector ek of Ep , i.e. Xek = xk .

D.1.2 Operation on Matrices Transpose

The transpose of a n × p matrix X = (xij ) is the p × n matrix XT = yij where yij = xj i .

Product The product of n × p and p × q matrices X and Y, respectively, is the n × q matrix p Z = XY, defined by Z = zij , with zij = k=1 xik ykj . Diagonal

A diagonal matrix is n × n matrix of the form A = xij δij , where δij is the Kronecker symbol. For a n × p matrix A, the main diagonal is given by all the elements aii , i = 1, . . . , min(n, p).

Trace The trace of a square n × n matrix X = (xij ) is given by tr (X) =

n

k=1 xkk .

Determinant Let X = (xij ) be a p × p matrix, and then, the determinant |X| of X is a multilinear function of the columns of X and is defined by det (X) = |X| =

p *  (−1)|π | xkπ(k) , π

k=1

(D.1)

D

Matrix Algebra and Matrix Function

501

where the sum is over all permutations π() of {1, 2, . . . , p} and |π | is either +1 or −1 depending on whether π() is written as the product of an even or odd number of transpositions, respectively. The determinant can also be defined in a recurrent manner as follows. For a scalar x, the determinant is simply x, i.e. det (x) = x. Then, for a p × p matrix X, the determinant is given by |X| =



(−1)i+j xij ij =

j

 (−1)i+j xij ij , i

where ij is the determinant of the (p − 1) × (p − 1) matrix X−(i,j ) obtained by deleting the ith line and j th column. The determinant ij is referred to as the minor of xij , and the term cij = (−1)i+j ij as the cofactor of xij . It can be shown that p 

xik cj k = |X|δij ,

(D.2)

k=1

where δij is the Kronecker symbol. The matrix C = (cij ) is the matrix of cofactors of X.

Matrix Inversion • Conventional inverse When |X| = 0, the square p × p matrix X = (xij ) is invertible and its inverse X−1 satisfies XX−1 = X−1 X = Ip . It is clear from Eq. (D.2) that when X is invertible the inverse is given by X−1 =

1 T C , |X|

(D.3)

where C is the matrix of cofactors of X. In what follows, the elements of X−1 are denoted by x ij , i, j = 1, . . . n, i.e. X−1 = (x ij ). • Generalised inverse Let X be a n × p matrix, and the generalised inverse of X is the p × n matrix X− satisfying the following properties: XX− and X− X are symmetric XX− X = X, and X− XX− = X− . The generalised inverse is unique and is also known as pseudoinverse or Moore– Penrose inverse. • Rank The rank of a n × p matrix X is the number of columns of X or its transpose that are linearly independent. It is the number of rows (and columns) of the largest

502

D Matrix Algebra and Matrix Function

invertible square submatrix of X. We have automatically: rank(X) ≤ min(n, p). The matrix is said to be of full rank if rank(X) = min(n, p).

Symmetry, Orthogonality and Normality Let X be a real p × p square matrix, and then • X is symmetric when XT = X; • it is orthogonal (or unitary) when XXT = XT X = Ip ; • it is normal when it commutes with its transpose, i.e. XXT = XT X, when the matrix is complex; • X is Hermitian when X∗T = X. For the complex case, the other two properties remain the same except that the transpose (T ) is replaced by the complex conjugate transpose (∗T ).

Direct Product Let A = (aij ) and B = (bij ) two matrices of respective order n × p and q × r. The direct product of A and B, noted as A × B or A ⊗ B, is the nq × pr matrix defined by ⎛

⎞ a11 B a12 B . . . a1p B ⎜ a21 B a22 B . . . a2p B ⎟ ⎜ ⎟ A⊗B=⎜ . .. .. ⎟ . ⎝ .. . . ⎠ an1 B an2 B . . . anp B. The above product is indeed a left direct product. A direct product is also known as Kronecker product. There is also another type of product between two n × p matrices of the same order A = (aij ) and B = (bij ), and that is the Hadamard product given by

A  B = aij bij .

Positivity A square p × p matrix A is positive semi-definite if xT Ax ≥ 0, for any pdimensional vector x. It is definite positive when xT Ax > 0 for any non-zero p-dimensional vector x.

D

Matrix Algebra and Matrix Function

503

Eigenvalues/Eigenvectors Let A a p ×p matrix. The eigenvalues of A are given by the set of complex numbers λ1 , . . . , λp solution to the algebraic polynomial equation: |A − λIp | = 0. The eigenvectors of u1 , . . . , up of A are the solutions to the eigenvalue problem: Au = λu, where λ is an eigenvalue. The eignevectors are normally chosen to have unit-length. For any invertible p × p matrix B, the eigenvalues of A and B−1 AB are identical.

Some Properties of Square Matrices Let A and B be two p × p matrices. Then we have: tr(αA + B) = αtr(A) + tr(B), for any number α. tr(AB) = tr(BA). −1 AP) for any nonsingular p × p matrix P. tr(A)

=T tr(P tr Axx = xT Ax, where x is a vector. (AB)−1 = B−1 A−1 . det (AB) = |AB| = |A||B|. |A ⊗ B| =|A|p |B|p and tr (A ⊗ B) = tr(A)tr(B). p tr (A) = k=1 λk , where λ1 , . . . , λp are the eigenvalues of A. The eigenvectors corresponding to different eigenvalues are orthogonal. rank (A) = #{λk ; λk = 0}. If A is (real) symmetric, then its eigenvalues λ1 , . . . , λp and its eigenvectors P = u1 , . . . , up are real. If it is positive semi-definite, then its eigenvalues are all non-negative. If the matrix is Hermitian, then its eigenvalues

are all nonnegative. For both cases, we have A = P P∗T , where = diag λ1 , . . . , λp . • If A is normal, i.e. commuting with its Hermitian transpose, then it is diagonalisable and has a complete set of orthogonal eigenvectors.

• • • • • • • • • • •

Singular Value Decomposition (SVD)
The SVD theorem has different forms, see e.g. Golub and van Loan (1996) and Linz and Wang (2003). In its simplest form, any $n \times p$ real matrix $\mathbf{X}$, of rank $r$, can be decomposed as

$$\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T, \qquad \text{(D.4)}$$

where the $n \times r$ and $p \times r$ matrices $\mathbf{U}$ and $\mathbf{V}$ are orthogonal, i.e. $\mathbf{U}^T\mathbf{U} = \mathbf{V}^T\mathbf{V} = \mathbf{I}_r$, and $\mathbf{D} = \mathrm{diag}(d_1, \ldots, d_r)$, where $d_k > 0$, $k = 1, \ldots, r$, are the singular values of $\mathbf{X}$.
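Numerically, the decomposition is directly available; the sketch below (with an arbitrary random matrix) uses the reduced form, which matches the $n \times r$, $p \times r$ convention of Eq. (D.4) when $\mathbf{X}$ has full column rank.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((6, 4))

U, d, Vt = np.linalg.svd(X, full_matrices=False)    # reduced SVD: U (6x4), d (4,), Vt (4x4)
print(np.allclose(X, U @ np.diag(d) @ Vt))          # X = U D V^T
print(np.allclose(U.T @ U, np.eye(4)), np.allclose(Vt @ Vt.T, np.eye(4)))
```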

Theorem of Sums of Products Let A, B, C and D be p × p, p × q, q × q and q × p matrices, respectively, then

−1 • (A + BCD)−1 = A−1 − A−1 B C−1 + DA−1 B DA−1 and • |A + BD| = |A||Ip + A−1 BD| = |A||Iq + DA−1 B|, when all necessary inverses exist.

Theorem of Partitioned Matrices Let A be a block partitioned matrix as  A=

A11 A12 A21 A22

 ,

then we have −1 |A| = |A11 ||A22 − A21 A−1 11 A12 | = |A22 ||A11 − A12 A22 A21 |,

when all necessary inverses exist. Furthermore, if A is invertible with inverse denoted by −1

A

 =

A11 A12 A21 A22

 ,

then

−1 A11 = A11 − A12 A−1 A , 21 22 −1 22 A12 = −A11 A12 A−1 22 = −A11 A12 A ,

−1 A , and A22 = A22 − A21 A−1 12 11 −1 −1 A21 = −A22 A21 A−1 11 = −A22 A21 A11 .

D

Matrix Algebra and Matrix Function

505

D.2 Most Useful Matrix Transformations

• LU decomposition
For any nonsingular $n \times n$ matrix $\mathbf{A}$, there exists some permutation matrix $\mathbf{P}$ such that $\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U}$, where $\mathbf{L}$ is a lower triangular matrix with ones in the main diagonal and $\mathbf{U}$ is an upper triangular matrix.
• Cholesky factorisation
For any symmetric positive semi-definite matrix $\mathbf{A}$, there exists a lower triangular matrix $\mathbf{L}$ such that $\mathbf{A} = \mathbf{L}\mathbf{L}^T$.
• QR decomposition
For any $m \times n$ matrix $\mathbf{A}$, with $m \ge n$ say, there exist a $m \times m$ unitary matrix $\mathbf{Q}$ and a $m \times n$ upper triangular matrix $\mathbf{R}$ such that

$$\mathbf{A} = \mathbf{Q}\mathbf{R}. \qquad \text{(D.5)}$$

The proof of this result is based on the Householder transformation and finds a sequence of $n$ unitary matrices $\mathbf{Q}_1, \ldots, \mathbf{Q}_n$ such that $\mathbf{Q}_n \ldots \mathbf{Q}_1 \mathbf{A} = \mathbf{R}$. If at step $k$ we have, say,

$$\mathbf{Q}_k \ldots \mathbf{Q}_1 \mathbf{A} = \begin{pmatrix} \mathbf{L}_k & \mathbf{B} \\ \mathbf{O}_{m-k,k} & \mathbf{c}\,|\,\mathbf{C} \end{pmatrix},$$

where $\mathbf{L}_k$ is a $k \times k$ upper triangular matrix, then $\mathbf{Q}_{k+1}$ will transform the vector $\mathbf{c} = (c_{k+1}, \ldots, c_m)^T$ into the vector $\mathbf{d} = (d, 0, \ldots, 0)^T$ without changing the structure of $\mathbf{L}_k$ and the $(m-k) \times k$ null matrix $\mathbf{O}_{m-k,k}$. This matrix is known as the Householder transformation and has the form:

$$\mathbf{Q}_{k+1} = \begin{pmatrix} \mathbf{I}_k & \mathbf{O}_{k,m-k} \\ \mathbf{O}_{m-k,k} & \mathbf{P}_{m-k} \end{pmatrix}, \quad \text{where} \quad \mathbf{P}_{m-k} = \mathbf{I}_{m-k} - \frac{2}{\mathbf{u}^T\mathbf{u}}\,\mathbf{u}\mathbf{u}^T,$$

where $\mathbf{u} = \mathbf{c} + \|\mathbf{c}\|\,(1, 0, \ldots, 0)^T$.
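A plain, unoptimised sketch of this construction is given below; note that it uses the numerically safer sign convention $\mathbf{u} = \mathbf{c} + \mathrm{sign}(c_1)\|\mathbf{c}\|\,\mathbf{e}_1$, a standard variant of the choice above, and that the test matrix is arbitrary.

```python
import numpy as np

def householder_qr(A):
    """QR factorisation by successive Householder reflections (textbook version)."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    Q, R = np.eye(m), A.copy()
    for k in range(min(n, m - 1)):
        c = R[k:, k].copy()
        u = c.copy()
        u[0] += (np.sign(c[0]) if c[0] != 0 else 1.0) * np.linalg.norm(c)
        if np.linalg.norm(u) == 0:            # column already zero below the diagonal
            continue
        P = np.eye(m - k) - 2.0 * np.outer(u, u) / (u @ u)   # Householder reflection
        Qk = np.eye(m)
        Qk[k:, k:] = P
        R = Qk @ R
        Q = Q @ Qk
    return Q, R

A = np.random.default_rng(9).standard_normal((5, 3))
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(5)))
```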

Remark The following formula can be useful when expressing matrix

products. Consider two p×q and r ×q matrices U and V, respectively, with U = u1 , . . . , uq

q j and V = v1 , . . . , vq . Since the ith and j th element of UVT is k=1 uik vk , and

506

D Matrix Algebra and Matrix Function j

because uik vk is the ith and j th element of uk vTk , one gets UVT =

q 

uk vTk .

k=1

Similarly, if = λ1 , . . . , λq is a diagonal matrix, then we also have U VT = q T k=1 λk uk vk .

D.3 Matrix Derivative D.3.1 Vector Derivative T

Let f (.) be a scalar function of a p-dimensional vector x = x1 , . . . , xp . The ∂f and is defined in the usual partial derivative of f (.) with respect to xk is noted ∂x k way. The derivative of f (.) with respect to x is given by ∂f = ∇f (x) = ∂x



∂f ∂f ,..., ∂x1 ∂xp

T (D.6)

and is also known as the gradient of f (.) at x. The differential of f () is then written T

p ∂f dxk = ∇f (x)T dx, where dx = dx1 , . . . , dxp . as df = K=1 ∂x k Examples • For a linear form f (x) = aT x, ∇f (x) = a. • For a quadratic form f (x) = xT Ax, ∇x f = 2Ax.

For a vector function f(x) = f1 (x), . . . , fq (x) , where f1 (.), . . . , fq (.) are scalar functions of x, the gradient in this case is called the Jacobian matrix of f(.) and is given by 

 ∂f j T T Df(x) = ∇f1 (x) , . . . , ∇fq (x) = (x) . ∂xi

(D.7)

D.3.2 Matrix Derivative





Definition Let X = xij = x1 , . . . , xq be a p ×q matrix and Y = yij = F (X) a r × s matrix function of X. We assume that the elements yij of Y are differentiable scalar function with respect to the elements xij of X. We distinguish two cases:

D

Matrix Algebra and Matrix Function

507

1. Scalar Case If Y = F (X) is a scalar function, then to define the derivative of F () we first use

T transforming X into a pqthe vec (.) notation given by vec (X) = xT1 , . . . , xTq dimensional vector. The differential of F (X) is then obtained by considering F () as a function of vec (X). One gets the following expression: ∂F = ∂X The derivative

∂F ∂X



∂F ∂xij

 (D.8)

.

is then a p × q matrix.

2. Matrix Case If Y = F (X) is a r × s matrix, where each yij = Fij (X) is a differentiable scalar function of X, the partial derivative of Y with respect to xmn is the r × s matrix: ∂Y = ∂xmn



 ∂Fij (X) . ∂xmn

(D.9)

The partial derivative of Y with respect to X, based on Eq. (D.9), is the pr × qs matrix given by ⎛

∂Y ∂x11

∂Y ⎜ = ⎜ .. ∂X ⎝ .

∂Y ∂xp1

... ... ...

∂Y ∂x1q



.. ⎟ ⎟ . ⎠.

(D.10)

∂Y ∂xqq

Equation (D.10) also defines the Jacobian matrix DY (X) of the transformation. Another definition of the Jacobian matrix is given in Magnus and Neudecker (1995, p. 173) based on the vec transformation, namely, DF (X) =

∂vecF (X) ∂ (vecX)T

(D.11)

.

Equation (D.11) is useful to compute the Jacobian matrices using the vec transformation of X and Y and then get the Jacobian of a vector function. Note that Eqs. (D.9) or (D.10) can also be written as a Kronecker product: ∂ ∂Y =Y⊗ = ∂X ∂X



∂yij ∂X

 .

(D.12)

In this appendix we adopt the componentwise derivative concept as in Dwyer (1967).

508

D Matrix Algebra and Matrix Function

D.3.3 Examples In the following examples the p × q matrix Jij will denote the matrix

whose ith

and j th element is one and zero elsewhere, i.e. Jij = δm−i,n−j = δmi δnj , and similarly for the r × s matrix Kαβ . For instance, if X = (xmn ), then Y = Jij X is the matrix whose ith line is the j th line of X and zero elsewhere (i.e. ymn = δmi xj n ), and Z = XJij is the matrix whose j column is the ith column of X and zero elsewhere (i.e. zmn = δj n xmi ). The matrices Jij and Kij are essentially identical, but they are obtained differently, see the remark below.

Case of Independent Elements

We assume that the matrix X = xij is composed of pq independent variables. We j will also use interchangeably the notation δij or δi for Kronecker symbol.  • Let X be a p × p matrix, and f (X) = tr(X) = k xkk . The derivative of f () is ∂ ∂xmn (tr (X)) = δmn . Hence, ∂ ∂ tr (X) = Ip = tr XT . ∂X ∂X • f (X) = tr (AX). Here f (X) = anm ; hence,

  i

k aik xki ,

and

∂f ∂xmn

(D.13) =



k n i,k aik δm δi

∂f = AT . ∂X

=

(D.14)

• g (X) = g (f (X)), where f (.) is a scalar function of X and g(y) is a differentiable scalar function of y. In this case we have ∂g dg ∂f = . (f (X)) ∂X dy ∂X ∂ tr(XA) For example, ∂X e = etr(XA) AT . • f (X) = det (X) = |X|. For this case, one can use Eq. (D.2), i.e. |X| = j xαj Xαj where Xαj is the cofactor of xαj . Since Xαj is independent of xαk ,

for k = 1, . . . n, one gets

∂|X| ∂xαβ

= Xαβ , and using Eq. (D.3), one gets ∂|X| = |X|X−T . ∂X

(D.15)

Consequently, if g(y) is any real differentiable scalar function of y, then we get

D

Matrix Algebra and Matrix Function

509

∂ dg g (|X|) = (|X|) |X|X−T . ∂X dy

(D.16)

• f (X) = g (H(X)), where g(Y) is a scalar function of matrix Y and H(X) is a matrix function of X, both differentiable. Using a similar argument from the derivative of a scalar function of a vector, one gets  ∂g ∂ (H(X))ij ∂yij ∂f (X)  ∂g = = . (H(X)) (H(X)) ∂xαβ ∂yij ∂xαβ ∂yij ∂xαβ i,j

i,j

For example, if Y = Y (X) is any differentiable matrix T function of X, then  ∂|Y| ∂yij   ∂yij ∂|Y(X)| ∂Y = = Y = Y . That is, i,j ∂yij ∂xαβ i,j ij ∂xαβ i,j ij ∂xαβ ∂xαβ ji

   T  ∂Y −1 ∂|Y(X)| −T ∂Y = |Y|tr . = tr |Y|Y Y ∂xαβ ∂xαβ ∂xαβ

(D.17)

Remark We can also compute the derivative with respect to an element and derivative of an element with respect to a matrix as in the following examples. • Let f (X) = X, then

T

∂X = Jαβ , and ∂x = Jβα = JTαβ . αβ

∂y • For a r × s matrix Y = yij , we have ∂Yij = Kij . ∂X ∂xαβ

• For f (X) = AXB, one obtains Exercise Compute

∂f (X) ∂xαβ

= AJαβ B and

∂[f (X)]ij ∂X

= AT Kij BT .

∂Xn ∂xαβ .

Hint Use a recursive relationship. Write Un =

∂XXn−1 ∂xαβ ,

and then Un = Jαβ Xn−1 +

n−1

n−1 + XU X ∂X n−1 . By induction, one finds that ∂xαβ = Jαβ X

Un = Xn−1 Jαβ + XJαβ Xn−2 + X2 Jαβ Xn−3 + . . . + Xn−2 Jαβ X. Application 1. f (X) = XAX, in this case ∂f (X) = Jαβ AX + XAJαβ . ∂xαβ

(D.18)

This could be proven by expressing the ith and j th element [XAX]ij of XAX. Application 2. g(X) = XAX.

= tr(f (X)) where f (X)  (X) Since tr ∂f = tr J = AX + XAJ [AX]βα + [XA]βα , hence αβ αβ ∂xαβ ∂tr (XAX) = (AX + XA)T . ∂X

(D.19)

510

D Matrix Algebra and Matrix Function

Application 3.  

−1

−1 ∂|XAXT | XAT + XAXT XA . = |XAXT | XAT XT ∂X In particular, if A is symmetric, then

∂|XAXT | ∂X

= 2|XAXT |

(D.20)



 −1 XAT XT XA .



∂ XAXT = Jαβ AXT + XAJβα , see ∂xαβ     ∂xik  T  ∂ XAXT writing ∂xαβ ij = k ∂x AX kj αβ

One can use the fact that

also Eq. (D.18),    ∂ AXT which can be proven by + xik ∂xαβ kj .   β  The first sum in the right hand side of this expression is simply k δk δiα AXT kj , which is the (i, j )th element of Jαβ AXT (and also the (α, β)th element of Jij XAT ). Similarly, the second sum is the (i, j )th element of XAJαβ and, by applying the trace operator, provides the required answer. Exercise Complete the proof of Eq. (D.20).



−1

 Jαβ AXT+XAJβα . XAXT 

−1 Next use the same argument as that used in Eq. (D.19) to get tr XAXT Jαβ   

−1   T  −T  

T T T T = AX βi = AX XAX XA . A similar i XAX

Hint First use Eq. (D.17), i.e.

∂|XAXT | T ∂xαβ =|XAX |tr



αβ

reasoning yields    

−1

−1  

−1 T T T = XAX tr XAX XAJαβ = tr XAJβα XAX XA

,

αβ

which again yields Eq. (D.20). Application 4. ∂|AXB| = |AXB|AT (AXB)−T BT . (D.21) ∂X   In fact, one has ∂x∂αβ [AXB]ij = aiα bβj = AJαβ B ij . Hence ∂|AXB| = ∂xαβ  

 −1 −1 = |AXB|tr AJαβ B (AXB) . The last term equals i aiα B (AXB)   βi   −1 −1 B (AXB) βi aiα , which can be easily recognised as B (AXB) A βα =   iT A (AXB)−T BT αβ . −1

−1 • Derivative of the inverse. To compute ∂X ∂xαβ , one can use the fact that X X =  ik  ∂x ik Ip , i.e. k x xkj = δij , which yields after differentiation: k ∂xαβ xkj =  ik   ∂X−1 −1 − k x Jαβ kj , i.e. ∂xαβ X = −X Jαβ or

D

Matrix Algebra and Matrix Function

511

∂ X−1 = −X−1 Jαβ X−1 . ∂xαβ

(D.22)

• f (X) = Y = AX−1 B −. First, we have ∂x∂αβ Y = −AX−1 Jαβ X−1 B. Let us now find the derivative of each element of Y with respect to X. We first note that for any two matrices of respective orders n × p and q × m,





β A = aij and B = bij , one has AJαβ = aiα δj and AJαβ B = aiα bβj .       ∂yij Now, ∂xαβ = − AX−1 Jαβ X−1 B ij = − AX−1 iα X−1 B βj , which is also 



T   −1 T  T

T  X B = AX−1 Jij X−1 B , that is − AX−1 αi



αβ

∂yij = −X−T AT Jij BT X−T . (D.23) ∂X

• f (X) = y = tr X−1 A −. One uses the previous argument, i.e. ∂x∂yαβ =    

    −tr X−1 Jαβ X−1 A , which is − i X−1 iα X−1 A βi = − i X−T αi  T −T  A X . Therefore, iβ

∂ tr X−1 A = −X−T AT X−T . ∂X

(D.24)

Alternatively, one can also use the identity tr(X) = |X|tr(X−1 ) (e.g. Graybill 1969, p. 227).

Case of Symmetric Matrices The matrices dealt with in the previous examples have independent elements. When, however, the elements are not independent, the rules change. Here we consider the case of symmetric matrices, but there various other dependencies such as

are be a symmetric matrix, and J ij = normality, orthogonality etc. Let X = x ij

Jij + Jj i − diag Jij , i.e. the matrix with one for the (i, j )th and (j, i)th elements ∂X and zero elsewhere. We have ∂x = J ij . Now, if f (X) is a scalar function of ij the symmetric matrix X, then we can start first with the scalar function f (Y) for a general matrix, and we get (e.g. Rogers 1980)    ∂f (Y) ∂f (Y) ∂f (Y) ∂f (X) = + − diag . ∂X ∂Y ∂YT ∂Y Y=X The following examples illustrate this change.  ∂xki • ∂x∂αβ tr (AX) = i,k aik ∂x = aαβ + aβα ; therefore, αβ

512

D Matrix Algebra and Matrix Function

∂ tr (AX) = A + AT . ∂X Exercise Show that

∂ ∂xαβ AXB

(D.25)

= AJαβ B + BJβα B − AJαα B δαβ .

• Derivative of determinants 

 ∂ |X| = |X| 2X−1 − diag X−1 . ∂X

(D.26)



∂ |AXB|=|AXB| AT (AXB)−T BT +B (AXB)−1 A−diag B (AXB)−1 A . ∂X (D.27) Exercise Derive Eq. (D.26) and Eq. (D.27). Hint Apply (D.17) to the transformation Y(X) = X1 + XT1 − diag (X), where X1 is the lower triangular matrix whose elements are xij1 = xij , for i ≤ j . Then    ∂|Y| ∂yij 

ji J one gets ∂x∂αβ |Y(X)| = αβ ij . Keeping in mind ij ∂yij ∂xαβ = ij |Y|y

−1 that Y = X, the previous expression yields |X|tr X J αβ . To complete the



proof, remember that J αβ = Jαβ + Jβα − diag Jαβ ; hence, tr X−1 J αβ =  

αβ x βα + x αβ − x αβ δx = 2X−1 − diag X−1 αβ . Similarly, Eq. (D.27) is similar to Eq. (D.21) but involves symmetry, i.e. ∂ ∂ −1 AXB = AJ B. Therefore, |AXB| = |AXB|tr AJ B . (AXB) αβ αβ ∂xαβ ∂xαβ • Derivative of a matrix inverse ∂X−1 = −X−1 J αβ X−1 . ∂xαβ

(D.28)

The proof is similar to the non-symmetric case, but using J αβ instead. • Trace involving matrix inverse

∂ tr X−1 A = −X−1 A + AT X−1 + diag X−1 AX−1 . ∂X

(D.29)

D.4 Application D.4.1 MLE of the Parameters of a Multinormal Distribution Matrix derivatives find straight application in multivariate analysis. The most familiar example is perhaps the estimation of a p-dimensional multinormal distribution

D

Matrix Algebra and Matrix Function

513

N (μ, ) from a given sample of data. Let x1 , . . . , xn be a sample from such a distribution. The likelihood of this sample (see e.g. Anderson 1984) is   n  * 1 −p/2 −1/2 T −1 L= f (xt ; μ, ) = || exp − (xt − μ)  (xt −μ) . (2π ) 2 t=1 t=1 (D.30) The objective is then to estimate μ and  by maximising L. It is usually simpler to use the log-likelihood L = log L, which reads n *

np n 1 log 2π − log || − (xt − μ)T  −1 (xt − μ) . 2 2 2 n

L = log L =

(D.31)

t=1

L The estimates are obtained by solving the system of equations given by ∂∂μ = 0 and ∂L ∂

= O. The first of these yields is (assuming that  −1 exists) n 

(xt − μ) = 0,

(D.32)

t=1

which provides the sample mean. For the second, one can use Eqs. (D.16)–(D.26), and Eq. (D.29) for the last term, which can be written as a trace of a matrix product. This yields

2 −1 − diag  −1 − 2 −1 S −1 + diag  −1 S−1  −1 = O,





which can be simplified to 2 −1 Ip − S −1 − diag  −1 Ip − S −1 = O, i.e.

(D.33)  −1 Ip − S −1 = O, yielding the sample covariance matrix S.

D.4.2 Estimation of the Factor Model Parameters The estimation of the parameters of a factor model can be found in various text books, e.g. Anderson (1984), Mardia et al. (1979). The log-likelihood of the model has basically the same form as Eq. (D.31) except that now  is given by  =

+  T , where is a diagonal covariance matrix (see Chap. 10, Eq. (10.11)). ∂ Using Eq. (D.16) along with results from Eq. (D.20), we get ∂ log |  T +

−T ∂

| = 2  T +

. In a similar way we get ∂

log |  T + | =

514

D Matrix Algebra and Matrix Function

−1 ∂ diag  T + . Furthermore, using Eq. (D.27), one gets ∂ log |  T +





−1 −1

| = 2 T  T +

− diag T  T +

.

∂ tr H−1 S = H−T ST H−T XAT − Exercise Let H = XAXT . Show that ∂X H−1 SH−1 XA. Hint Let H = (hij ), then using arguments from Eq. (D.24) and Eq. (D.19) one gets

 ∂tr(H−1 S) ∂ ∂ tr H−1 S = hij ∂xαβ ∂hij ∂xαβ ij

    −H−T ST H−T Jαβ AXT + XAJβα . = ij

ij

ij



 This is precisely tr −H−1 SH−1 Jαβ AXT + XAJβα . Using an argument similar to that presented in Eqs. (D.23)–(D.24), one gets   ∂ trH−1 S = − H−T ST H−T XAT + H−1 SH−1 XA . αβ ∂xαβ Applying the above exercise and keeping in mind the symmetry of , see Eq. (D.29), yield

∂ tr  −1 S = 2 −2 −1 S −1 + diag  −1 S −1 . ∂



∂ Exercise Let H = AXB, and show that ∂X tr H−1 S = −AT H−T ST H−T BT .  

 







Hint As before, one finds − ij H−1 SH−1 ij AJαβ B ij = −tr AJαβ BH−T ST H−T ,      and this can be written as − i aiα BH−T ST H−T βi = − BH−T ST H−T A βα . Using the above result with A = , B = T and X = , keeping in mind the symmetry of  and , one gets  

−1

 S = 2 T −2 −1 S −1 + diag  −1 S −1 

T

+diag −2 −1 S −1 + diag  −1 S −1 . ∂ ∂ tr

For diagonal, one simply gets



 ∂ tr  −1 S = −2 −1 S −1 + diag  −1 S −1 , that is − diag  −1 S −1 . αα ∂ [ ]αα

Finally, one gets

D

Matrix Algebra and Matrix Function ∂L ∂ ∂L ∂ ∂L ∂

515

 



= − n2 2 −1  + 2 −2 −1 S −1 + diag  −1 S −1   −1 

= −n  ( − 2S)  −1 + diag  −1 S −1  





= − n2 2 T  −1 − diag T  −1 + 2 T −2 −1 S −1 + diag  −1 S −1 





− n2 diag T −2 −1 S −1 + diag  −1 S −1  



= − n2 diag −1 − diag  −1 S −1 = − n2 diag  −1 ( − S) −1 . (D.34)

Note that if one removes the terms pertaining to symmetry one finds what has been presented in the literature, e.g. Dwyer (1967), Magnus and Neudecker (1995). For example, in Dwyer (1967) symmetry was not explicitly taken into account in the differentiation. The symmetry condition can be considered via Lagrange multipliers (Magnus and Neudecker 1995). It turns out, in fact, that the stationary points of a (X) scalar function f (X) of the symmetric p × p matrix X, i.e. ∂f∂X = O are also the solutions to (Rogers 1980, th. 101, p. 80) 

∂f (Y) ∂Y

 = O,

(D.35)

Y=X

    (Y) where Y is non-symmetric, whenever ∂f∂Y = ∂f∂Y(Y) , which is T Y=X Y=X straightforward based on the definition of the derivative given in Eq. (D.8), and where the first differentiation reflects positional aspect, whereas the second one refers to the ordinary differentiation. This result simplifies calculation considerably. The stationary solutions to L are given by  −1 ( − S)  −1  = O T  −1 ( − S)  −1 = O

diag  −1 ( − S)  −1 = O.

(D.36)

D.4.3 Application to Results from PCA Matrix derivative also finds application in various other subjects. The eigenvalue problem of EOFs is a straightforward application. An interesting alternative to this eigenvalue problem, which uses matrix derivative, is provided by the following result (see Magnus and Neudecker 1995, th. 3, p. 355). For a given p × p positive semi-definite matrix , of rank r, the minimum of  (Y) = tr ( − Y)2 ,

(D.37)

where Y is positive semi-definite matrix of rank q ≤ p, is obtained for Y=

q  k=1

λ2k vk vTk ,

(D.38)

516

D Matrix Algebra and Matrix Function

where λ2k , and vk , k = 1, . . . q, are the leading eigenvalues and associated eigenvectors of . The matrix Y thus defined provides the best approximation to . Consequently, if represents the covariance matrix of some data matrix, then Y is simply the covariance matrix of the filtered data matrix obtained by removing the contribution from the last r − q eigenvectors of . Exercise Show that Y defined by Eq. (D.38) minimises Eq. (D.37). Hint Write Y = AAT and find A. A p × r matrix A is semi-orthogonal1 if AT A = Ir . Another connection to EOFs is provided by the following theorem (Magnus and Neudecker 1995). Theorem Let X be a n × p data matrix. The minimum of the following real valued function:



T φ(X) = tr X − ZAT X − ZAT = X − ZAT 2F ,

(D.39)

where the p × r matrix A is semi-orthogonal and .F stands for the Fröbenius norm, is obtained for A = (v1 , . . . , vr ) and Z = XA, where v1 , . . . , vr are the eigenvectors associated with the leading eigenvalues λ21 , . . . , λ2r of XT X. Further, p the minimum is k=r+1 λ2k . In other words, A is the set of the leading eigenvectors, and Z the matrix of the associated PCs. Exercise Find the stationary points of Eq. (D.39). Hint Use a Lagrange function (see exercise below). Exercise Show that the Lagrangian function of min φ(X) s.t. F(X) = O, where F(X) is a symmetric matrix function of X, is (X) = φ(X) − tr (LF(X)) , where L is a symmetric matrix. Hint The Lagrangian function is simply φ(X) − since F(X) is symmetric.

1 The

set of these matrices is known as Stiefel manifold.



i,j lij

[F(X)]ij , where lij = lj i

D

Matrix Algebra and Matrix Function

517

D.5 Common Algorithms for Linear Systems and Eigenvalue Problems D.5.1 Direct Methods A number of algorithms exist to solve linear systems of the kind Ax = b and Ax = λx. For the linear system, the m × m matrix A is normally assumed to be nonsingular. A large number of algorithms exists to solve those problems, see e.g. Golub and van Loan (1996). Some of those algorithms are more suited than others particularly for large and/or sparse matrices. Broadly speaking, two main classes of methods exist for solving linear systems and also eigenvalue problems, namely, direct and iterative. Direct methods are based mostly on decompositions. The main direct methods include essentially the SVD, LU and QR decompositions (Golub and van Loan 1996).

D.5.2 Iterative Methods

Case of Eigenvalue Problems

Iterative methods are based on what is known as Krylov subspaces. Given an m × m matrix A and a non-vanishing m-dimensional vector y, the sequence y, Ay, . . . , A^{n−1}y is referred to as a Krylov sequence. The Krylov subspace K_n(A, y) is defined as the space spanned by a Krylov sequence, i.e.

K_n(A, y) = Span{y, Ay, . . . , A^{n−1}y}.     (D.40)

Iterative methods to solve systems of linear equations (see below) or eigenvalue problems are generally referred to as Krylov space solvers. The Krylov sequence b, Ab, . . . , A^{n−1}b can be used in the approximation process of the eigenelements, but it is in general ill-conditioned, and an orthonormalisation is required. This is obtained using two main algorithms: the Lanczos and Arnoldi algorithms for Hermitian and non-Hermitian matrices, respectively (Watkins 2007; Saad 2003). The basic idea is to construct an orthonormal basis given by the m × n matrix Q_n = [q_1, . . . , q_n] of K_n, which is used to obtain a projection of the operator A onto K_n, H_n = Q_n^H A Q_n, where Q_n^H is the conjugate transpose of Q_n, i.e. Q_n^{*T}. The pair (λ, x), with H_n x = λx, provides an approximate pair of eigenvalue and associated eigenvector² of A.

² The number λ and vector Q_k x are, respectively, referred to as Ritz value and Ritz vector of A.


Lanczos Method

The Lanczos method is based on a tridiagonalisation algorithm of a Hermitian matrix (or symmetric for real cases) A, as

AQ_n = Q_n H_n,     (D.41)

where H_n = [h_1, . . . , h_n] is a tridiagonal matrix. Let us designate by [α_1, . . . , α_n] the main diagonal of H_n and [β_1, . . . , β_{n−1}] its upper and subdiagonals. Identifying the jth columns from both sides of Eq. (D.41) yields

β_j q_{j+1} = Aq_j − β_{j−1} q_{j−1} − α_j q_j.     (D.42)

The algorithm then starts from an initial vector q_1 (taking q_0 = 0) and obtains α_j, β_j and q_{j+1} at each iteration step. (The vectors q_i, i = 1, . . . , n, are orthonormal.) After k steps, one gets

AQ_k = Q_k H_k + β_k q_{k+1} e_k^T     (D.43)

with e_k being the k-element vector (0, . . . , 0, 1)^T. The algorithm stops when β_k = 0.

Arnoldi Method

The Arnoldi algorithm is similar to Lanczos's except that the matrix H_n = (h_ij) is an upper Hessenberg matrix, which satisfies h_ij = 0 for i ≥ j + 2. As for the Lanczos method, Eq. (D.42) yields h_{j+1,j} q_{j+1} = Aq_j − Σ_{i=1}^{j} h_ij q_i. After k steps, one obtains

AQ_k = Q_k H_k + h_{k+1,k} q_{k+1} e_k^T.     (D.44)

The above Eq. (D.44) can be written in a compact form as AQ_k = Q_{k+1} H̄_k, where H̄_k is the obtained (k + 1) × k Hessenberg matrix; the matrix H_k is obtained from H̄_k by deleting its last row. Note that the Arnoldi (and also Lanczos) methods are modified versions of the Gram–Schmidt orthogonalisation procedure, with Hessenberg and tridiagonal matrices involved, respectively, in the two methods. Note also that a non-symmetric Lanczos method exists, which yields a non-symmetric tridiagonal matrix (e.g. Parlett et al. 1985).
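A minimal sketch of the Arnoldi process of Eq. (D.44), assuming a real matrix A and modified Gram–Schmidt orthogonalisation, is given below; the function name, test matrix and tolerances are illustrative only. The eigenvalues of the square part of the Hessenberg matrix are the Ritz values approximating the spectrum of A.

```python
import numpy as np

def arnoldi(A, q1, k):
    """Arnoldi iteration: returns Q (m x (k+1)) and Hbar ((k+1) x k) with A Q_k = Q_{k+1} Hbar_k."""
    m = A.shape[0]
    Q = np.zeros((m, k + 1))
    H = np.zeros((k + 1, k))
    Q[:, 0] = q1 / np.linalg.norm(q1)
    for j in range(k):
        v = A @ Q[:, j]                      # expand the Krylov space
        for i in range(j + 1):               # modified Gram-Schmidt against previous q_i
            H[i, j] = Q[:, i] @ v
            v -= H[i, j] * Q[:, i]
        H[j + 1, j] = np.linalg.norm(v)
        if H[j + 1, j] < 1e-12:              # breakdown: an invariant subspace was found
            return Q[:, :j + 1], H[:j + 1, :j]
        Q[:, j + 1] = v / H[j + 1, j]
    return Q, H

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
Q, Hbar = arnoldi(A, rng.standard_normal(200), 30)
ritz = np.linalg.eigvals(Hbar[:-1, :])       # Ritz values approximating eigenvalues of A
```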

Case of Linear Systems

The simplest iterative method is the Jacobi iteration, which solves a fixed point problem. It transforms the linear system Ax = b into x = Âx + b̂, where Â = I_m − D^{-1}A and b̂ = D^{-1}b, with D being either the diagonal matrix of A or simply the identity matrix. The fixed point iterative algorithm is then given by x_{n+1} = Âx_n + b̂, with a given initial condition x_0. The algorithm converges when the spectral radius of Â is less than unity, i.e. ρ(Â) < 1. The computation of the nth residual vector r_n = b − Ax_n involves the Krylov subspace K_{n+1}(A, r_0). Other methods like gradient and semi-iterative methods are included among the Krylov space solvers. From an initial condition x_0, the residual takes the form

x_n − x_0 = p_{n−1}(A) r_0,     (D.45)

for a polynomial p_{n−1}(.) of degree n − 1, and belongs to K_n(A, r_0). The problem is then to find a good choice of x_n in the Krylov space. There are essentially two methods for this (Saad 2003), namely, Arnoldi's method (described above) or FOM (Full Orthogonalisation Method) and the GMRES (Generalised Minimum Residual Method) algorithms. The FOM algorithm is based on the above Arnoldi orthogonalisation procedure and looks for x_n − x_0 within K_n(A, r_0) such that (b − Ax_n) is orthogonal to this Krylov space (Galerkin condition). From the initial residual r_0, and letting r_0 = βq_1, with β = ‖r_0‖_2, the algorithm yields an equation similar to (D.41), i.e. Q_k^T A Q_k = H_k and Q_k^T r_0 = βq_1. The approximate solution at step k is then given by

x_k = x_0 + Q_k y_k = x_0 + βQ_k H_k^{-1} Q_k^T q_1 = x_0 + βQ_k H_k^{-1} e_1,     (D.46)

where e_1 = Q_k^T q_1, i.e. e_1 = (1, 0, . . . , 0)^T.

Exercise Derive the above approximation (Eq. (D.46)).
Hint At the kth step, the vector x_k − x_0 belongs to K_k and is therefore of the form Q_k y. The above Galerkin condition means that Q_k^T (b − Ax_k) = 0, that is, Q_k^T r_0 − Q_k^T A Q_k y = 0, and using Eq. (D.41) yields Eq. (D.46).

The GMRES algorithm seeks vectors x_n − x_0 within K_n(A, r_0) such that b − Ax_n is orthogonal to AK_n. This condition implies the minimisation of ‖Ax_n − b‖_2 (e.g. Saad and Schultz 1985). Variants of these methods and other algorithms exist for particular choices of the matrix A, see e.g. Saad (2003) for more details. An approximate solution at step k, which is in the space x_0 + K_k, is given by

x_k = x_0 + Q_k z*.     (D.47)

Using Eq. (D.44), one gets

b − Ax_k = r_0 − AQ_k z* = βq_1 − Q_{k+1} H̄_k z*,     (D.48)

which yields Q_{k+1}(βe_1 − H̄_k z*). The vector z* is then given by

z* = argmin_z ‖βq_1 − Q_{k+1} H̄_k z‖_2 = argmin_z ‖βe_1 − H̄_k z‖_2,     (D.49)


where the last equality holds because Q_{k+1} is orthonormal (‖Q_{k+1}x‖_2 = ‖x‖_2), and H̄_k is the matrix defined below Eq. (D.44).

Remark The Krylov space can be used, for example, to approximate the exponential of a matrix, which is useful particularly for large matrices. Given a matrix A and a vector v, an approximation of e^{tA}v, using K_k(A, v), is given by (Saad 1990)

e^{tA} v ≈ βQ_k e^{tH_k} e_1,     (D.50)

with β = ‖v‖_2. Equation (D.50) can be used, for example, to compute the solution of an inhomogeneous system of first-order ODEs. Also, and as pointed out by Saad (1990), Eq. (D.50) can be used to approximate the (matrix) integral:

X = ∫_0^T e^{uA} b b^T e^{uA^T} du,     (D.51)

which provides a solution to the Lyapunov equation AX^T + XA^T + bb^T = O.

Appendix E

Optimisation Algorithms

E.1 Background

Optimisation problems are ubiquitous in all fields of science. Various optimisation algorithms exist in the literature, and they depend on whether the first and/or second derivative of the function to be optimised is available. These algorithms also depend on the function to be optimised and the type of constraints. Since minimisation is the opposite of maximisation, we will mainly focus on the former. In general there are four types of objective functions (and constraints):
• linear;
• quadratic;
• smooth nonlinear;
• non-smooth.

A minimisation problem without constraints is an unconstrained minimisation problem. When the objective function is linear and the constraints are linear inequalities, one has a linear programme, see e.g. Foulds (1981, chap. 2). When the objective function and the constraints are nonlinear, one gets a nonlinear programme. Minimisation algorithms also vary according to the nature of the problem. For instance in the unconstrained case, Newton's method can be used in the multivariate case when the gradient and the second derivative of the objective function are provided. When only the first derivative is provided, a quasi-Newton method can be applied. When the dimension of the problem is large, conjugate gradient algorithms can be used. In the presence of nonlinear constraints, a whole class of gradient methods, such as reduced and projected gradient methods, can be used. Convex programming methods can be used when we have linear or nonlinear inequalities as constraints. In most cases, a constrained problem can be transformed into an unconstrained problem using Lagrange multipliers. In this appendix we provide a short review of the most widely used optimisation algorithms that are used in atmospheric science.


For a detailed review of the various optimisation problems and algorithms, the reader is referred to Gill et al. (1981). There is in general a large difference between one- and multidimensional problems. The univariate and bivariate minimisation problems are in general not difficult to solve since the function can be plotted and visualised, particularly when the function is smooth. When the first derivative is not provided, methods like the golden section can be used. The problem gets more difficult for many variables when there are multiple minima. In fact, the main obstacle to minimisation in the multivariate case is the problem of local minima. For example, the global minimum can be attained when the function is quadratic:

f(x) = ½ x^T A x − b^T x + c,     (E.1)

where A is a symmetric matrix. The quadratic Eq. (E.1) is a typical example that deserves attention. The gradient of f(.) is Ax − b, and a necessary condition for optimality is given by ∇f(x) = 0. The solution to this linear equation provides a partial answer, however. To get a complete answer, one has to compute the second derivative at the solution of the necessary condition, to yield the Hessian:

H = (h_ij) = (∂²f / ∂x_i ∂x_j) = A.     (E.2)

Clearly, if A is positive semi-definite, the solution of the necessary condition is a global minimum. In general, however, the function to be minimised is nonquadratic, and more advanced tools have to be applied, and this is the aim of this appendix.

E.2 Single Variable

In general the one-dimensional case is the easiest minimisation problem, particularly when the objective function is smooth.

E.2.1 Direct Search

A simple direct search method is based on successive function evaluation, aiming at reducing the length of the interval containing the minimum. The most widely used methods are:
• Dichotomous search—It is based on successively halving the interval containing the minimum. After n iterations, the length I_n of the interval containing the minimum is

I_n = (1/2^{n/2}) I_0,

where I_0 = x_2 − x_1 if [x_1, x_2] is the initial interval.
• Golden section—It is based on subdividing the initial interval [x_1, x_2] into three subintervals using two extra points x_3 and x_4, with x_1 < x_3 < x_4 < x_2. For example, if f(x_3) ≤ f(x_4), then the minimum is expected to lie within [x_1, x_4]; otherwise, it is in [x_3, x_2]. The iterative procedure takes the form:

x_3^{(i)} = ((τ − 1)/τ) (x_2^{(i)} − x_1^{(i)}) + x_1^{(i)}
x_4^{(i)} = (1/τ) (x_2^{(i)} − x_1^{(i)}) + x_1^{(i)},

where τ is the Golden number¹ (1 + √5)/2. There are various other methods such as quadratic interpolation and Powell's method, see Everitt (1987) and Box et al. (1969).

¹ It is the limit of u_{n+1}/u_n when n → ∞, where u_0 = u_1 = 1 and u_{n+1} = u_n + u_{n−1}.
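A minimal golden-section sketch in Python, assuming a unimodal objective on the initial interval (the test function and names are illustrative):

```python
def golden_section(f, a, b, tol=1e-8):
    """Golden-section search for a unimodal f on [a, b]."""
    tau = (1.0 + 5 ** 0.5) / 2.0
    while b - a > tol:
        x3 = b - (b - a) / tau          # interior point at about 38.2% of the interval
        x4 = a + (b - a) / tau          # interior point at about 61.8% of the interval
        if f(x3) <= f(x4):
            b = x4                      # minimum lies in [a, x4]
        else:
            a = x3                      # minimum lies in [x3, b]
    return 0.5 * (a + b)

# Example: minimum of (x - 2)^2 on [0, 5]
print(golden_section(lambda x: (x - 2.0) ** 2, 0.0, 5.0))
```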

E.2.2 Derivative Methods

When the first and perhaps the second derivatives are available, then it is known that the two conditions

df(x*)/dx = 0  and  d²f(x*)/dx² > 0     (E.3)

are sufficient conditions for x* to be a minimum of f(). In this case the most widely used method is based on the Newton algorithm, also known as Newton–Raphson, which aims at computing the zero of df(x)/dx based on its tangent line. The algorithm reads

x_{k+1} = x_k − (df(x_k)/dx) / (d²f(x_k)/dx²)     (E.4)

when d²f(x_k)/dx² ≠ 0. Note that when the second derivative is not provided, the denominator of Eq. (E.4) can be approximated using a finite difference scheme:

x_{k+1} = x_k − (df(x_k)/dx_k) (x_k − x_{k−1}) / (df(x_k) − df(x_{k−1})).
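The Newton–Raphson iteration of Eq. (E.4) can be sketched as follows, assuming the first and second derivatives are supplied (the test function and names are arbitrary):

```python
def newton_min_1d(df, d2f, x0, tol=1e-10, max_iter=100):
    """1-D Newton-Raphson minimiser, Eq. (E.4): x_{k+1} = x_k - f'(x_k)/f''(x_k)."""
    x = x0
    for _ in range(max_iter):
        step = df(x) / d2f(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: f(x) = x^4 - 3x^2 + x, supplying its first and second derivatives
df = lambda x: 4 * x ** 3 - 6 * x + 1
d2f = lambda x: 12 * x ** 2 - 6
print(newton_min_1d(df, d2f, x0=1.5))
```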


E.3 Direct Multivariate Search

As for the one-dimensional case, there are direct search methods and gradient-based algorithms. Among the most widely used direct search methods, one finds the following:

E.3.1 Downhill Simplex Method

This method is due to Nelder and Mead (1965) and was originally described by Spendley et al. (1962). The method is based on a simplex,² generally with mutually equidistant vertices, from which a new simplex is formed simply by reflection of the vertex (where the objective function is largest) through the opposite facet, i.e. through the hyperplane formed by the remaining m points (or vertices), to a "lower" vertex where the function is smaller. Details on the method can be found, e.g. in Box et al. (1969) and Press et al. (1992). The method can be useful for a quick search but can become inefficient for large dimensions.

E.3.2 Conjugate Direction/Powell's Method

Most multivariate minimisation algorithms attempt to find the best search direction along which the function can be minimised. The conjugate direction method is based on minimising a quadratic function and is known as quadratically convergent. Consider the quadratic function:

f(x) = ½ x^T G x + b^T x + c.     (E.5)

The directions u and v are said to be conjugate (with respect to G) if u^T G v = 0. The method is then based on finding a set of mutually conjugate search directions along which minimisation can proceed. Powell (1964) has shown that if a set of search directions u_1^{(i)}, . . . , u_n^{(i)}, at the ith iteration, are normalised so that u_k^{(i)T} G u_k^{(i)} = 1, k = 1, . . . , n, then det(u_1^{(i)}, . . . , u_n^{(i)}) is maximised only when the vectors are (linearly independent) mutually conjugate. This provides a way of finding a new search direction that can replace an existing one. The procedure is therefore to minimise the function along individual lines and proceeds as follows. Starting from

² A simplex in an m-dimensional space is a polygonal geometric figure with m + 1 vertices, or m + 1 facets. Triangles and pyramids are examples of simplexes in two- and three-dimensional spaces, respectively.


x_0 and a direction u_0, one minimises the univariate function f(x_0 + λu_0) and then replaces x_0 and u_0 by x_0 + λu_0 and λu_0, respectively. Powell's algorithm runs as follows:
0. Initialise u_i = e_i, i.e. the canonical basis vectors, i = 1, . . . , m.
1. Initialise x = x_0.
2. Minimise f(x_{i−1} + λu_i), and set x_i = x_{i−1} + λu_i, i = 1, . . . , m.
3. Set u_i = u_{i+1}, i = 1, . . . , m − 1, and u_m = x_m − x_0.
4. Minimise f(x_m + λu_m), set x_0 = x_m + λu_m, and then go to 1.

Powell (1964) showed that the procedure yields a set of k mutually conjugate directions after k iterations. The procedure has to be reinitialised with new vectors after every m iterations in order to avoid linear dependence of the obtained vectors, see Press et al. (1992) for further details.

Remark The reason for using one-dimensional minimisation is conjugacy. In fact, if u_1, . . . , u_m are mutually conjugate with respect to G, the required minimum is taken to be

x_1 = x_0 + Σ_{k=1}^{m} α_k u_k,

where the parameters α_k, k = 1, . . . , m, are chosen to minimise f(x_1) = f(x_0 + Σ_{k=1}^{m} α_k u_k). These coefficients therefore minimise

f(x_1) = Σ_{i=1}^{m} [½ α_i² u_i^T G u_i + α_i u_i^T (G x_0 + b)] + f(x_0)     (E.6)

based on conjugacy of u_i, i = 1, . . . , m. Hence, the effect of searching along u_i is to find α_i that minimises ½ α_i² u_i^T G u_i + α_i u_i^T (G x_0 + b), and this value of α_i is independent of the other terms in Eq. (E.6). It is shown precisely by Fletcher (1972) that for a quadratic function, with positive definite Hessian G, when the search directions u_i, i = 1, . . . , m, are conjugate of each other, then the minimum is found in at most m iterations. Furthermore, x^{(i+1)} = x^{(i)} + α^{(i)} u_i is the minimum point in the subspace generated by the initial approximation x^{(1)} and the directions u_1, . . . , u_i, and the identities g_{i+1}^T u_j = 0, j = 1, . . . , i, also hold (g_{i+1} = ∇f(x^{(i+1)})).
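In practice one rarely codes the conjugate-direction search by hand; a hedged sketch using SciPy's derivative-free implementation of Powell's method on the quadratic of Eq. (E.5) (with an arbitrary positive definite G) could look as follows:

```python
import numpy as np
from scipy.optimize import minimize

# Quadratic test function f(x) = 0.5 x^T G x + b^T x with an arbitrary positive definite G
G = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([-1.0, 2.0])
f = lambda x: 0.5 * x @ G @ x + b @ x

res = minimize(f, x0=np.zeros(2), method="Powell")   # derivative-free conjugate-direction search
print(res.x)                                          # should be close to -G^{-1} b
print(np.linalg.solve(G, -b))
```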

E.3.3 Simulated Annealing

This algorithm is based on concepts from statistical mechanics and makes use of the Boltzmann probability of energy distribution in thermodynamical systems in equilibrium (Metropolis et al. 1953). The method uses Monte Carlo simulation to generate moves and is particularly useful because it can escape local minima. The algorithm can be applied to continuous and discrete problems (Press et al. 1992), see also Hannachi and Legras (1995) for an application to atmospheric low-frequency variability.
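A minimal Metropolis-type annealing sketch, assuming a continuous objective and an ad hoc logarithmic cooling schedule (all names and tuning constants are illustrative, not a prescription from the text):

```python
import numpy as np

def simulated_annealing(f, x0, n_iter=20000, step=0.5, t0=1.0, seed=0):
    """Metropolis-type annealing for a continuous objective f."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    best_x, best_f = x.copy(), fx
    for k in range(1, n_iter + 1):
        temp = t0 / np.log(k + 1.0)                    # slowly decreasing temperature
        y = x + step * rng.standard_normal(x.size)     # random candidate move
        fy = f(y)
        # accept downhill moves always, uphill moves with Boltzmann probability
        if fy < fx or rng.random() < np.exp(-(fy - fx) / temp):
            x, fx = y, fy
            if fx < best_f:
                best_x, best_f = x.copy(), fx
    return best_x, best_f

# Example: a multimodal (Rastrigin-like) function with global minimum at the origin
f = lambda x: np.sum(x ** 2) + 10 * np.sum(1 - np.cos(2 * np.pi * x))
print(simulated_annealing(f, [3.0, -2.5]))
```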

E.4 Multivariate Gradient-Based Methods

Unlike direct search methods, gradient-based approaches use the gradient of the objective function. Here we assume that the (smooth) objective function can be approximated by

f(x + δx) = f(x) + g(x)^T δx + ½ δx^T H δx + o(|δx|²),     (E.7)

where g(x) = ∇f(x) and H = (∂²f/∂x_i∂x_j) are, respectively, the gradient vector and Hessian matrix of f(x). Gradient methods also belong to the class of descent algorithms where the approximation of the desired minimum at various iterations is perturbed in an additive manner as

x_{m+1} = x_m + λu.     (E.8)

Descent algorithms are distinguished by the manner in which the search direction u is chosen. Most gradient methods use the (negative) gradient as search direction since the gradient ∇f(x) points in the direction where the function increases most rapidly.

E.4.1 Steepest Descent

The steepest descent uses u = −g/‖g‖ and chooses λ that minimises the univariate objective function:

h(λ) = f(x_m + λu).     (E.9)

The solution at iteration m + 1 is then given by

x_{m+1} = x_m − λ ∇f(x_m)/‖∇f(x_m)‖.     (E.10)


Note that Eq. (E.9) is quadratic when Eq. (E.7) is used, in which case the solution is given by³

λ = ‖∇f(x_m)‖³ / (∇f(x_m)^T H ∇f(x_m)).     (E.11)

Note that because of the one-dimensional minimisation at each step, the method can be computationally expensive. Some authors use a decreasing step-size selection λ = α^k, (0 < α < 1), for k = 1, 2, . . . , until the first k where f() has decreased (Cadzow 1996).
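A minimal steepest descent sketch with a simple backtracking choice of λ (rather than an exact line search); the quadratic test problem and names are illustrative:

```python
import numpy as np

def steepest_descent(f, grad, x0, n_iter=200, shrink=0.5):
    """Steepest descent with normalised direction and backtracking step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < 1e-12:       # already at a stationary point
            break
        u = -g / np.linalg.norm(g)          # normalised descent direction, as in Eq. (E.10)
        lam = 1.0
        while f(x + lam * u) >= f(x):       # shrink lambda until f decreases
            lam *= shrink
            if lam < 1e-14:
                return x
        x = x + lam * u
    return x

# Example: quadratic f(x) = 0.5 x^T A x - b^T x, whose minimiser solves A x = b
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(steepest_descent(f, grad, [0.0, 0.0]), np.linalg.solve(A, b))
```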

E.4.2 Newton–Raphson Method

This is a generalisation of the one-dimensional Newton–Raphson method and is based on the minimisation of the quadratic form Eq. (E.1), where the search direction is given by

u = −H^{-1} ∇f(x).     (E.12)

At the (m + 1)th iteration, the approximation to the minimum is given by

x_{m+1} = x_m − H^{-1}(x_m) ∇f(x_m).     (E.13)

Note that it is also possible to choose x_{m+1} = x_m − λH^{-1}∇f(x_m), where λ can be found through a one-dimensional minimisation as in the steepest descent. The Newton method requires the inverse of the Hessian at each iteration, and this can be quite expensive particularly for large problems. There is also another drawback of the approach, namely, the convergence towards the minimum can be secured only if the Hessian is positive definite. Similarly, the steepest descent is no better since it is known to exhibit a linear convergence, i.e. a slow convergence rate. These drawbacks have led to the development of more advanced and improved algorithms. Among these methods, two main classes of algorithms stand out, namely, the conjugate gradient and the quasi-Newton methods discussed next.

³ One can eliminate the Hessian from Eq. (E.11) by choosing a first guess λ_0 for λ and then using Eq. (E.7) with δx = −λ_0 g/‖g‖, which yields λ = λ_0² ‖g‖ [2(f(x − λ_0 g/‖g‖) − f(x) + λ_0 ‖g‖)]^{-1}.


E.4.3 Conjugate Gradient

It is possible that the descent direction −g = −∇f(x) and the direction to the minimum may be near to orthogonality, which can explain the slow convergence rate of the steepest descent. For a quadratic function, for example, the best search direction is conjugate to that taken at the previous step (Fletcher 1972, th. 1). This is the basic idea of the conjugate gradient, for which the new search direction is constructed to be conjugate to the gradient of the previous step. The method can be thought of as an association of conjugacy with steepest descent (Fletcher 1972), and is also known as the Fletcher–Reeves (or projection) method. From the set of conjugate gradients −g_k, k = 1, . . . , m, a new set of conjugate directions is formed via linear combination as

u_k = −g_{k−1} + Σ_{j=1}^{k−1} α_{jk} u_j,     (E.14)

where α_{jk} = −g_{k−1}^T H u_j / (u_j^T H u_j), j = 1, . . . , k − 1, and g_k = ∇f(x_k). Since in a quadratic form, e.g. Eq. (E.5), one has δg_k = g_{k+1} − g_k = Hδx = H(x_{k+1} − x_k), and because in a linear (one-dimensional) search δx_k = λ_k u_k, one gets

α_{jk} = − g_{k−1}^T δg_{j−1} / (u_j^T δg_{j−1})     (E.15)

for j = 1, . . . , k − 1. Furthermore, α_{jk}, j = 0, . . . , k − 2, vanishes,⁴ yielding

u_k = −g_{k−1} + (g_{k−1}^T δg_{k−2} / u_{k−1}^T δg_{k−2}) u_{k−1},

which simplifies to

u_k = −g_{k−1} + (g_{k−1}^T g_{k−1} / g_{k−2}^T g_{k−2}) u_{k−1},     (E.16)

for k = 1, . . . , n with u_1 = −g_0. For a quadratic function, the algorithm converges in at most n iterations, where n is the problem dimension. For a general function, Eq. (E.16) can be used to update the search direction every iteration, and in practice u_k is reset to −g_{k−1} after every n iterations.

⁴ After k − 1 one-dimensional searches in (u_1, . . . , u_{k−1}), the quadratic form is minimised at x_{k−1}, then g_{k−1} is orthogonal to u_j, j = 1, . . . , k − 2, because of the one-dimensional requirement for minimisation in each direction u_m, m = 1, . . . , k − 2, (d/dα f(x_{k−2} + αu_j) = g_{k−1}^T u_j = 0). Furthermore, since the vectors u_j are linear combinations of g_i, i = 1, . . . , j, the vectors g_j are also linear combinations of u_1, . . . , u_j, i.e. g_j = Σ_{i=1}^{j} α_i u_i, hence g_{k−1}^T g_j = 0, j = 1, . . . , k − 2.
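A minimal Fletcher–Reeves sketch based on Eq. (E.16), using SciPy's scalar minimiser for the one-dimensional searches and a restart every n iterations (the test problem and names are illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, grad, x0, n_iter=100):
    """Fletcher-Reeves conjugate gradient with one-dimensional line searches."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    u = -g
    n = x.size
    for k in range(n_iter):
        lam = minimize_scalar(lambda t: f(x + t * u)).x   # one-dimensional search along u
        x = x + lam * u
        g_new = grad(x)
        if np.linalg.norm(g_new) < 1e-10:
            break
        beta = (g_new @ g_new) / (g @ g)                  # Fletcher-Reeves coefficient, Eq. (E.16)
        u = -g_new + beta * u
        g = g_new
        if (k + 1) % n == 0:                              # restart every n iterations
            u = -g
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ A @ x - b @ x
grad = lambda x: A @ x - b
print(fletcher_reeves(f, grad, np.zeros(2)), np.linalg.solve(A, b))
```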

E.4.4 Quasi-Newton Method

The Newton–Raphson direction −H^{-1}g may be thought of as an improvement (or correction) to the steepest descent direction −g = −∇f(x). The quasi-Newton approach attempts to take advantage of the steepest descent and quadratic convergence rates of the basic second-order Newton method. It is based on approximating the inverse of the Hessian matrix H. In the modified Newton–Raphson, Goldfeld et al. (1966) propose to use the following iterative scheme:

x_{k+1} = x_k − (λI_n + H)^{-1} g     (E.17)

based on the approximation:

(I_n + λ^{-1}H)^{-1} ≈ I_n − λ^{-1}H + λ^{-2}H² + . . . .     (E.18)

The most widely used quasi-Newton procedure, however, is the Davidon–Fletcher–Powell method (Fletcher and Powell 1963), sometimes referred to as the variable metric method, which is based on approximating the inverse H^{-1} by an iterative procedure for which the kth iteration reads

x_{k+1} = x_k − λ_k S_k g_k,     (E.19)

where S_k is a sequence that converges to H^{-1} and is given by

S_{k+1} = S_k − (S_k δg_k δg_k^T S_k) / (δg_k^T S_k δg_k) + (δx_k δx_k^T) / (δx_k^T δg_k),     (E.20)

where δg_k = g_{k+1} − g_k and δx_k = x_{k+1} − x_k = −λ_k S_k g_k. Note that there exist in the literature various other formulae for updating S_k, see e.g. Adby and Dempster (1974) and Press et al. (1992). These techniques can be adapted and simplified further depending on the objective function, such as the case of the sum of squares encountered in least square regression analysis, see e.g. Everitt (1987).
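A minimal sketch of the Davidon–Fletcher–Powell update, Eqs. (E.19)–(E.20), applied to the Rosenbrock function (the line search and stopping rule are ad hoc choices, not part of the original formulation):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad, x0, n_iter=50):
    """Davidon-Fletcher-Powell quasi-Newton iteration."""
    x = np.asarray(x0, dtype=float)
    S = np.eye(x.size)                      # initial approximation to H^{-1}
    g = grad(x)
    for _ in range(n_iter):
        d = -S @ g                          # quasi-Newton direction, Eq. (E.19)
        lam = minimize_scalar(lambda t: f(x + t * d)).x
        x_new = x + lam * d
        g_new = grad(x_new)
        dx, dg = x_new - x, g_new - g
        if np.linalg.norm(g_new) < 1e-10:
            return x_new
        # DFP update of the inverse-Hessian approximation, Eq. (E.20)
        S = S - np.outer(S @ dg, S @ dg) / (dg @ S @ dg) + np.outer(dx, dx) / (dx @ dg)
        x, g = x_new, g_new
    return x

rosen = lambda x: (1 - x[0]) ** 2 + 100 * (x[1] - x[0] ** 2) ** 2
rosen_grad = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0] ** 2),
                                 200 * (x[1] - x[0] ** 2)])
print(dfp(rosen, rosen_grad, [-1.2, 1.0]))
```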


E.4.5 Ordinary Differential Equations-Based Methods

Optimisation techniques based on solving systems of ordinary differential equations have also been proposed and used for some time, although not much in atmospheric science, see e.g. Hannachi et al. (2006); Hannachi (2007). The approach seeks the solution to

min_x F(x),     (E.21)

where x is an n-dimensional real vector, by following trajectories of a system of ordinary differential equations. For instance, we know that if x* is a solution to Eq. (E.21), then ∇F(x*) = 0. Therefore by integrating the dynamical system

dx/dt = −∇F(x),     (E.22)

starting from a suitable initial condition, one should converge in principle to x*. This method can be regarded as the continuous version of the steepest descent algorithm. In fact, Eq. (E.22) becomes equivalent to the steepest descent algorithm when dx/dt is approximated by the simple finite difference (x_{t+h} − x_t)/h. The system of ODEs, Eq. (E.22), can be interpreted as the equation describing a particle moving in a potential well given by F(.). Note that Eq. (E.22) can also be replaced by a continuous Newton equation of the form:

dx/dt = −H^{-1}(x) ∇F(x),     (E.23)

where H is the Hessian matrix of F() at x. Some of these techniques have been reviewed in Brown (1986), Botsaris (1978), Aluffi-Pentini et al. (1984) and Snyman (1982). It is argued, for example, in Brown (1986) that ordinary differential equation-based methods compare very favourably with conventional Newton and quasi-Newton algorithms, see Hannachi et al. (2006) for an application to simplified EOFs.
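A minimal sketch of the gradient-flow idea of Eq. (E.22), integrating dx/dt = −∇F(x) with a standard ODE solver (the quadratic F, the initial condition and the integration horizon are arbitrary):

```python
import numpy as np
from scipy.integrate import solve_ivp

# Gradient system dx/dt = -grad F(x) for F(x) = (x1 - 1)^2 + 2 (x2 + 0.5)^2
grad_F = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)])
rhs = lambda t, x: -grad_F(x)

sol = solve_ivp(rhs, t_span=(0.0, 50.0), y0=[3.0, 3.0], rtol=1e-8)
print(sol.y[:, -1])     # the trajectory approaches the stationary point (1, -0.5)
```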

E.5 Constrained Minimisation

E.5.1 Background

Constrained minimisation problems are more subtle than unconstrained problems. We give here a brief review, and for more details the reader is referred to more specialised textbooks on optimisation, see e.g. Gill et al. (1981). A typical (smooth) constrained minimisation problem takes the form:


min_x f(x)  s.t.  g_i(x) = 0, i = 1, . . . , r;  h_j(x) ≤ 0, j = 1, . . . , m.     (E.24)

When the functions involved in Eq. (E.24) are convex or polynomials, the problem is known under the name of mathematical programming. For instance, if f(.) is quadratic or convex and the constraints are linear, efficient programming procedures exist for the minimisation. In general, most algorithms attempt to transform Eq. (E.24) into an unconstrained problem. This can be done easily, via a change of variable, when the constraints are simple. The following examples illustrate this.
• For constraints of the form x ≥ 0, the change of variable is x = y².
• For a ≤ x ≤ b, one can have x = (a + b)/2 + ((b − a)/2) sin y.
In a number of cases the inequality h(x) ≤ 0 can be handled by introducing a slack variable y yielding h(x) + y² = 0. Equality constraints can be handled in general by introducing Lagrange multipliers. Under some regularity conditions,⁵ a necessary condition for x* to be a constrained local minimum of Eq. (E.24) is the existence of Lagrange multipliers u* = (u_{*1}, . . . , u_{*r})^T and v* = (v_{*1}, . . . , v_{*m})^T such that

∇f(x*) + Σ_{i=1}^{r} u_{*i} ∇g_i(x*) + Σ_{j=1}^{m} v_{*j} ∇h_j(x*) = 0
v_{*j} h_j(x*) = 0, j = 1, . . . , m;  v_{*j} ≥ 0, j = 1, . . . , m.     (E.25)

The conditions given by Eq. (E.25) are known as the Kuhn–Tucker optimality conditions and express the stationarity of the Lagrangian:

L(x; u, v) = f(x) + Σ_{i=1}^{r} u_i g_i(x) + Σ_{j=1}^{m} v_j h_j(x)     (E.26)

at x* for the optimum values u* and v*. Note that the first vector equation in Eq. (E.25) can be solved by minimising the sum of squares of its elements, i.e. min Σ_{k=1}^{n} (∂L/∂x_k)². In mathematical programming, the system Eq. (E.25) is generally referred to as the dual problem of Eq. (E.24).

⁵ Namely, linear independence between ∇h_j(x*) and ∇g_i(x*), i = 1, . . . , r, for all j satisfying h_j(x*) = 0.


E.5.2 Approaches for Constrained Minimisation

Lagrangian Method

It is based on minimising, at each iteration, the Lagrangian:

L(x; u, v) = f(x) + u_k^T g(x) + v_k^T h(x),     (E.27)

yielding a minimum x_{k+1} at the next iteration step k + 1. The multipliers u_{k+1} and v_{k+1} are taken to be the optimal multipliers for the linearised constraints:

g(x_{k+1}) + (x − x_{k+1})^T ∇g(x_{k+1}) = 0
h(x_{k+1}) + (x − x_{k+1})^T ∇h(x_{k+1}) ≤ 0.     (E.28)

This method is based on linearising the constraints about the current point x_{k+1}. More details can be found in Adby and Dempster (1974) and Gill et al. (1981). Note that in most iterative techniques, an initial feasible point can be obtained by minimising Σ_j h_j(x) + Σ_i g_i²(x).

Penalty Function

The basic idea of penalty is as follows. In the search of a constrained minimum of some function, one often encounters the common situation where the constraints are of the form g(x) ≤ 0, and at each iteration the newly formed x has to satisfy these constraints. A simple way to handle this is by forming a linear combination of the elements of g(x), i.e. u^T g(x), known as a penalty function, that accounts for the positive components of g(x). The components of u are zero whenever the corresponding components of g(x) do not violate the constraints (i.e. non-positive) and large positive otherwise. One then has to minimise the sum of the original objective function and the penalty function, i.e. the penalised objective function. Minimising the penalised objective function can prevent the search algorithm from choosing directions where the constraints are violated. In general terms, the penalty method is based on sequentially minimising an unconstrained problem of the form:

F(x) = f(x) + Σ_j w_j G(h_j(x), ρ) + Σ_i H(g_i(x), ρ),     (E.29)

where w_j, j = 1, . . . , m, and ρ are parameters that can change value during the minimisation, and usually ρ decreases to zero as the iteration number increases. The functions G() and H() are penalty functions. For example, the function


G(u, ρ) = u²/ρ     (E.30)

is one of the widely used penalties. When inequality constraints are present, and for a fixed ρ, the barrier function G() is non-zero in the interior of the feasible region (h_j(x) ≤ 0, j = 1, . . . , m) and infinite on its border. This maintains iterates x_k inside the feasible set, and as ρ → 0 the constrained minimum is approached. Examples of barrier functions in this case include log(−h(x)) and ρ/h²(x). The following penalty function

Σ_j (w_j/ρ³) h_j²(x) + (1/ρ) Σ_i g_i²(x)     (E.31)

has also been used for problem Eq. (E.24).
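A minimal quadratic-penalty sketch in the spirit of Eq. (E.30), for an equality-constrained toy problem; the schedule for ρ and the inner unconstrained solver (SciPy's minimize) are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize

# Constrained problem: minimise f(x) = x1^2 + x2^2 subject to g(x) = x1 + x2 - 1 = 0
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: x[0] + x[1] - 1.0

x = np.array([0.0, 0.0])
rho = 1.0
for _ in range(8):
    penalised = lambda x, r=rho: f(x) + g(x) ** 2 / r   # quadratic penalty term g^2 / rho
    x = minimize(penalised, x).x                         # warm-start from the previous solution
    rho *= 0.1                                           # let rho decrease towards zero
print(x)    # approaches the constrained minimum (0.5, 0.5)
```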

Gradient Projection

This method is based on finding search directions by projecting the gradient −g = −∇f(x) onto the hyperplane tangent to the feasible set, i.e. the set satisfying the constraints, at the current point x. The inequality constraints (that are not satisfied) and the equality constraints are linearised around the current point, i.e. by considering

Kx = 0,     (E.32)

where K is an (r + l_1) × n matrix, and l_1 is the number of constraints h(x) ≥ 0. Then at each iteration the constraints are linear, and the direction is obtained by projecting −g onto the tangent space to obtain u, i.e.

−g = u + K^T w.     (E.33)

Using Eq. (E.32), one gets

w = −(KK^T)^{-1} K g,     (E.34)

and the negative projected gradient reads

u = −(I_n − K^T (KK^T)^{-1} K) g.     (E.35)

The algorithm goes as follows:
0. Choose x_0 from the feasible set.


1. Linearise equations g_i(), i = 1, . . . , r, and the currently binding inequalities, i.e. those for which h_j(x_k) ≥ 0, to compute K in Eq. (E.32).
2. Compute the projected gradient u from Eq. (E.35).
3. Minimise the one-dimensional function f(x_k + λ_{k+1} u), and then set x_{k+1}^{(1)} = x_k + λ_{k+1} u. If x_{k+1}^{(1)} is feasible, then set x_{k+1} = x_{k+1}^{(1)}; otherwise use a suitable version of Newton's method applied to (1/ρ) Σ_j h_j²(x) to find a point x_{k+1}^{(2)} on the boundary of the feasible region.
4. If f(x_{k+1}^{(2)}) ≤ f(x_k), set x_{k+1} = x_{k+1}^{(2)} and then go to 1. Otherwise go to 3, and solve for λ_{k+1} by generating, e.g., a sequence x_{k+1}^{(t)} = x_k + (1/τ^{t−2}) λ_{k+1} u until f(x_{k+1}^{(t)}) ≤ f(x_k) is satisfied.
5. Iterate steps 1 to 4 until the constrained minimum is obtained.
The optimal multipliers v_i corresponding to the binding inequalities and u* are given by w in Eq. (E.34).

Other Gradient-Related Methods

Another gradient-related approach is the multiple gradient summation where the search direction is given by

u = − ∇f(x_k)/‖∇f(x_k)‖ − Σ_j ∇h_j(x_k)/‖∇h_j(x_k)‖,     (E.36)

where the summation is taken over the violated constraints at the current point x_k. Another search method, based on small step gradient, is given by

u = −∇f(x_k) − Σ_{j=1}^{m} w_j(x_k) ∇h_j(x_k),     (E.37)

where wj (xk ) = w if hj (xk ) > 0 (w is a suitably chosen large constant) and zero otherwise, see Adby and Dempster (1974). The ordinary differential equations-based method can also be used in constrained minimisation in a similar way after the problem has been transformed into an unconstrained minimisation problem, see e.g. Brown (1986) and Hannachi et al. (2006) for the case of simplified EOFs.

Appendix F

Hilbert Space

This appendix introduces some concepts of linear vector spaces, metrics and Hilbert spaces.

F.1 Linear Vector and Metric Spaces

F.1.1 Linear Vector Space

A linear vector space X is a set of elements x, y, . . . on which one can define addition x + y between elements x and y of X satisfying, for all elements x, y and z, the following properties:
• x + y = y + x.
• (x + y) + z = x + (y + z).
• The null element 0, satisfying x + 0 = x, belongs to X.
• The "inverse" −x of x, satisfying x + (−x) = 0, is also in X.
The first two properties are known, respectively, as commutativity and associativity. These properties make X a commutative group. In addition, a scalar multiplication has to be defined on X with the following properties:
• α(x + y) = αx + αy and (α + β)x = αx + βx,
• α(βx) = (αβ)x,
• 1x = x,
for any x and y in X, and scalars α and β.


F.1.2 Metric Space

A metric d(., .) defined on a set X is a real valued function defined over X × X with the following properties:
(i) d(x, y) = d(y, x),
(ii) d(x, y) = 0 if and only if x = y,
(iii) d(x, y) ≤ d(x, z) + d(z, y),
for all x, y and z in X. A set X with a metric d(., .) is referred to as a metric space (X, d).

F.2 Norm and Inner Products

F.2.1 Norm

A norm on a linear metric space X, denoted by ‖.‖, is a real valued function satisfying, for all vectors x and y in X and scalar λ, the following properties:
(i) ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0,
(ii) ‖λx‖ = |λ| ‖x‖,
(iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖.
A linear vector space with a norm is named a normed space.

F.2.2 Inner Product

An inner product defined on a linear vector space X is a scalar function defined on X × X, denoted by < ., . >, satisfying, for all vectors x, y and z in X and scalar λ, the following properties:
(i) < λx + y, z > = λ < x, z > + < y, z >,
(ii) < x, y > = (< y, x >)*, where the superscript (*) stands for the complex conjugate,
(iii) < x, x > ≥ 0, and < x, x > = 0 if and only if x = 0.
A linear vector space X with an inner product is an inner product space.

F.2.3 Consequences

The existence of a metric and/or a norm leads to defining various topologies. For example, given a metric space (X, d), a sequence {x_n}_{n=1}^∞ of elements in X converges to an element x_0 in X if

lim_{n→∞} d(x_n, x_0) = 0.

Similarly, a sequence {x_n}_{n=1}^∞ of elements in a normed space X, with norm ‖.‖, is said to converge to x_0 in X if

lim_{n→∞} ‖x_n − x_0‖ = 0.

Also, a sequence {x_n}_{n=1}^∞ of elements in an inner product space X converges to an element x_0 in X if

lim_{n→∞} < x_n − x_0, x_n − x_0 > = 0.

The existence of an inner product in a linear vector space X allows the definition of orthogonality as follows. Two vectors x and y are orthogonal, denoted by x ⊥ y, if < x, y > = 0.

F.2.4 Properties

1. A normed linear space, with norm ‖.‖, defines a metric space with the metric given by d(x, y) = ‖x − y‖.
2. An inner product space X, with inner product < ., . >, is a normed linear space with the norm defined by ‖x‖ = < x, x >^{1/2}, and is consequently a metric space.
3. For any x and y in an inner product space X, the following properties hold.
• | < x, y > | ≤ ‖x‖ ‖y‖,
• ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖². This is known as the parallelogram identity.
4. Given an n-dimensional linear vector space with an inner product, one can always construct an orthonormal basis (u_1, . . . , u_n), i.e. < u_k, u_l > = δ_kl.
5. Also, the limit of the sum of two sequences in an inner product space is the sum of the limits of the sequences. Similarly, the limit of the inner product of two sequences is the inner product of the limits of the corresponding sequences.


F.3 Hilbert Space

F.3.1 Completeness

Let (X, d) be a metric space. A sequence {x_n}_{n=1}^∞ of elements in X is a Cauchy sequence if for every real ε > 0 there exists a positive integer N for which d(x_n, x_m) < ε for all m ≥ N and n ≥ N. A metric space is said to be complete if every Cauchy sequence in the metric space converges to an element within the space.

F.3.2 Hilbert Space

A complete inner product space X with the metric

d(x, y) = < x − y, x − y >^{1/2} = ‖x − y‖

is a Hilbert space. A number of results can be drawn from a Hilbert space. For example, if the sequence {x_n}_{n=1}^∞ of elements in a given Hilbert space X is orthogonal, then the sequence {y_n}_{n=1}^∞, given by y_n = Σ_{k=1}^{n} x_k, converges in X if and only if the scalar series Σ_{k=1}^∞ ‖x_k‖² converges, see e.g. Kubáčková et al. (1987). A fundamental result in Hilbert spaces concerns the concept of approximation of vectors from the Hilbert space by vectors from subspaces. This result is expressed under the so-called projection theorem, given below (see e.g. Halmos 1951).

Projection Theorem Let U be a Hilbert space and V a Hilbert subspace of U. Let also u be a vector in U but not in V, and v a vector in V. Then there exists a unique vector v̂ in V such that

‖u − v̂‖ = min_{v in V} ‖u − v‖.

Furthermore, the vector v̂ is uniquely determined by the property that < u − v̂, v > = 0 for all v in V. The vector v̂ is called the (orthogonal) projection of u onto V.

The concept of Hilbert space finds its natural way into time series and prediction theory, and we provide a few examples below.

Examples of Hilbert Space

• Example 1
Consider the collection U of all (complex) random variables U, with zero mean and finite variance, i.e. E(U) = 0 and E(|U|²) < ∞, defined on some sample space. The following operation, defined for all random variables U and V in U by

< U, V > = E(U* V),

where U* is the complex conjugate of U, defines a scalar product over U and makes U a Hilbert space, see e.g. Priestley (1981, p. 190).

Exercise Show that the above operation is well defined.
Hint Use the fact that Var(λU + V) ≥ 0 for all scalar λ to deduce that < U, V > is well defined.

The theory of Hilbert space in stochastic processes and time series started towards the late 1940s (Loève 1948) and was lucidly formulated by Parzen (1959, 1961) in the context of random functions (stochastic processes). The concept of Hilbert space, and in particular the projection theorem, finds natural application in the theory of time series prediction.

• Example 2
Designate by T a subset of the real numbers,

and let {X_t, for t in T} be a stochastic process (or random function) satisfying E(|X_t|²) < ∞ for t in T. Such a stochastic process is said to be second order. Let U be the set of random variables of the form U = Σ_{k=1}^{n} c_k X_{t_k}, where n is a positive integer, c_1, . . . , c_n are scalars and t_1, . . . , t_n are elements in T. That is, U is the set of all finite linear combinations of random variables X_t for t in T and is known as the space spanned by the random function {X_t, for t in T}. The inner product < U, V > = E(U V*) induces an inner product on U. The space U, extended by including all random variables that are limits of sequences in U, i.e. random variables W satisfying

lim_{n→∞} ‖W_n − W‖ = 0

for some sequences {W_n}_{n=1}^∞ in U, is a Hilbert space (see e.g. Parzen 1959).

F.3.3 Application to Prediction

The Univariate Case

Let H be the Hilbert space defined in Example 1 above and {X_t, t = 0, ±1, ±2, . . .} a (discrete) stochastic process. Let now H_t be the subset spanned by the sequence X_t, X_{t−1}, X_{t−2}, . . .. Using the same reasoning as in Example 2 above, H_t is a Hilbert space.


Let now m be a given positive integer; our objective is to estimate X_{t+m} using elements from H_t. This is the classical prediction problem, which seeks an element X̂_{t+m} in H_t satisfying

‖X_{t+m} − X̂_{t+m}‖² = E(|X_{t+m} − X̂_{t+m}|²) = min_{Y in H_t} ‖X_{t+m} − Y‖².

Hence X̂_{t+m} is simply the orthogonal projection of X_{t+m} onto H_t. From the projection theorem, we get

E[(X_{t+m} − X̂_{t+m}) Y] = 0,

that is, X_{t+m} − X̂_{t+m} is orthogonal to Y, for any random variable Y in H_t. In prediction theory, the set H_t is sometimes referred to as the set of all possible predictors, and the predictor X̂_{t+m} provides the minimum mean square prediction error (see e.g. Karlin and Taylor 1975, p. 464). Since X̂_{t+m} is an element of H_t, the previous orthogonality also yields another orthogonality between X_{t+m} − X̂_{t+m} and X_{t+n} − X̂_{t+n} for n < m. This is because H_t is a subspace of H_s for t ≤ s. This also yields a general orthogonality between X_{t+k} − X̂_{t+k} and X_{t+l} − X̂_{t+l}, i.e.

E[(X_{t+k} − X̂_{t+k})(X_{t+l} − X̂_{t+l})] = 0,  for k ≠ l.

In probabilistic terms we consider the stochastic process (X_t) observed for t ≤ n, and we seek to "estimate" the random variable X_{n+h}. The conditional probability distribution of the random variable X_{n+h} given I_n = {X_t, t ≤ n} is f_h(x_{n+h}|x_t, t ≤ n) = Pr(X_{n+h} ≤ x | X_t, t ≤ n) = f_h(x). The knowledge of f_h(.) permits the determination of all the conditional properties of X_{n+h}|I_n. The estimate X̂_{n+h} of X_{n+h}|I_n is then chosen as a solution to the minimisation problem:

min_Y E[(X_{n+h} − Y)² | I_n] = min_y ∫ (x − y)² f_h(x) dx.     (F.1)

The solution is automatically given by

X̂_{n+h} = E(X_{n+h} | X_t, t ≤ n),     (F.2)

and the term ε_{n+h} = X_{n+h} − X̂_{n+h} is known as the forecast error.

Exercise Show that the solution to Eq. (F.1) is given by Eq. (F.2).
Hint Recall the condition ∫ f_h(x) dx = 1 and use a Lagrange multiplier.

An important result emerges when {X_t} is Gaussian, namely, E(X_{n+h}|X_t, t ≤ n) is a linear function of X_t, t ≤ n, and this is the reason behind choosing


linear predictors. The general linear predictor is then

X̂_{n+h} = Σ_{k≥0} α_k X_{t−k}.     (F.3)

The predictor X̂_{n+h} is meant to optimally approximate the (unobserved) future value X_{n+h} of the time series. In stationary time series the forecast error ε_{n+h} is also stationary, and its variance σ² = E(ε²_{n+h}) is the forecast error variance.

The Multivariate Case

Prediction of multivariate time series is more subtle than that of single-variable time series, not least because matrices are involved. Matrices have two main features, namely, they do not (in general) commute, and they can be singular without being null. In this appendix a brief review of the multivariate prediction problem is given. For a full discussion on prediction of vector time series, the reader is referred to Doob (1953), Wiener and Masani (1957, 1958), Helson and Lowdenslager (1958), Rozanov (1967), Masani (1966), Hannan (1970) and Koopmans (1974), and the up-to-date text by Wei (2019).

Let x_t = (X_{t1}, . . . , X_{tp})^T denote a p-dimensional second-order (E(|x_t|²) < ∞) zero-mean random vector. Let also {x_t, t = 0, ±1, ±2, . . .} be a second-order vector random function (or stochastic process, or time series), H the Hilbert space spanned by this random function, i.e. the space spanned by X_{t,k}, k = 1, . . . , p, t = 0, ±1, ±2, . . ., and finally, H_n the Hilbert space spanned by X_{t,k}, k = 1, . . . , p, t ≤ n. A p-dimensional random vector y = (Y_1, . . . , Y_p) is an element of H_n if each component Y_k, k = 1, . . . , p, belongs to H_n. Stated otherwise, H_n can be regarded as composed of random vectors y that are finite linear combinations of elements of the vector random function, of the form:

y = Σ_{k=1}^{m} A_k x_{t_k}

for some integers m, t_1, . . . , t_m, and p × p (complex) matrices A_k, k = 1, . . . , m. To be consistent with the definition of uncorrelated random vectors, a generalised inner product on H, known as the Gramian matrix (see e.g. Koopmans 1974), is defined by

< u, v >_p = E(u v*),


where (*) stands for the transpose complex conjugate.¹ Note that the norm over H is the trace of the Gramian matrix, i.e.

‖x‖² = Σ_{k=1}^{p} E|X_k|² = tr < x, x >_p.

It can be seen that the orthogonality of random vectors is equivalent to non-correlation (as in the univariate case). Let u = (U_1, . . . , U_p) be a random vector in H. The projection û = (Û_1, . . . , Û_p) of u onto H_n is a random vector whose components are the projections of the associated components of u. The projection theorem yields, for any y in H_n,

< u − û, y >_p = E[(u − û) y*] = O.

As for the univariate case, the predictor x̂_{t+m} of x_{t+m} is given by the orthogonal projection of x_{t+m} onto H_t. The prediction error ε_{t+m} = x_{t+m} − x̂_{t+m} is orthogonal to all vectors in H_t. Also, ε_k is orthogonal to ε_l for l ≠ k, i.e.

E(ε_k ε_l^T) = δ_kl Σ,

where Σ is the covariance matrix of the prediction error ε_k. The prediction error variance tr[E(ε_{t+1} ε_{t+1}^T)] of the one-step ahead prediction is given in Chap. 8.

¹ That is, the Gramian matrix consists of all the inner products between the individual components of u and v.

Appendix G

Systems of Linear Ordinary Differential Equations

This appendix gives the solutions of systems of ordinary differential equations (ODEs) of the form:

dx/dt = Ax + b     (G.1)

with the initial condition x0 = x(t0 ), where A is a m × m real (or complex) matrix and b is a m-dimensional real (or complex) vector. When A is constant, the solution is quite simple, but when it is time-dependent the solution is slightly more elaborate.

G.1 Case of a Constant Matrix A

G.1.1 Homogeneous System

By using the exponential form of matrices:

e^A = Σ_{k≥0} (1/k!) A^k,     (G.2)

which can also be extended to e^{tA}, for any scalar t, one gets

d(e^{tA})/dt = e^{tA} A = A e^{tA}.

Remark In general e^{A+B} ≠ e^A e^B. However, if A and B commute, then we get equality.


Using the above result, the solution of

dx/dt = Ax     (G.3)

with initial condition x_0 is

x(t) = e^{tA} x_0.     (G.4)

Remark The above result can be used to solve the differential equation:

d^m y/dt^m + a_{m−1} d^{m−1}y/dt^{m−1} + . . . + a_1 dy/dt + a_0 y = 0     (G.5)

with initial conditions y(t_0), dy(t_0)/dt, . . . , d^{m−1}y(t_0)/dt^{m−1}. The above ODE can be transformed into a system similar to Eq. (G.3), with the Frobenius matrix A given by

A = (   0      1      0     . . .     0
        0      0      1     . . .     0
        .      .      .               .
        0      0     . . .    0       1
      −a_0   −a_1    . . .  −a_{m−2} −a_{m−1} ),

and x(t) = (y(t), dy(t)/dt, . . . , d^{m−1}y(t)/dt^{m−1})^T, with initial condition x_0 = x(t_0).

G.1.2 Non-homogeneous System

Here we consider the non-homogeneous case corresponding to Eq. (G.1), and we assume that b = b(t), i.e. b is time-dependent. By noting that dx/dt − Ax = e^{tA} d(e^{−tA}x)/dt, the solution is given by

x(t) = e^{tA} x_0 + ∫_{t_0}^{t} e^{(t−s)A} b(s) ds.     (G.6)

Remark Equation (G.6) can be used to integrate an mth-order non-homogeneous differential equation.
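A minimal sketch of Eq. (G.4) using SciPy's matrix exponential (the 2 × 2 damped-oscillator matrix and the evaluation times are arbitrary examples):

```python
import numpy as np
from scipy.linalg import expm

# dx/dt = A x with constant A; solution x(t) = expm(t A) x0, cf. Eq. (G.4)
A = np.array([[0.0, 1.0], [-1.0, -0.1]])     # damped oscillator
x0 = np.array([1.0, 0.0])
for t in (0.0, 1.0, 5.0):
    print(t, expm(t * A) @ x0)
```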


G.2 Case of a Time-Dependent Matrix A

G.2.1 General Case

We consider now the following system of ODEs:

dx/dt = A(t) x     (G.7)

with initial condition x_0. The theory behind the integration of Eq. (G.7) is based on using a set of independent solutions of the differential equation. If x_1(t), . . . , x_m(t) is a set of m solutions of Eq. (G.7) with respective initial conditions x_1(t_0), . . . , x_m(t_0), assumed to be linearly independent, then the matrix M(t) = (x_1(t), . . . , x_m(t)) satisfies the following system of ODEs:

dM/dt = AM.     (G.8)

It turns out that if M(t_0) is invertible the solution to (G.8) is also invertible.

Remark It can be shown, see e.g. Said-Houari (2015) or Teschl (2012), that the Wronskian W(t) = det(M(t)) satisfies the ODE:

dW/dt = tr(A) W,     (G.9)

(or W(t) = W(t_0) exp(∫_{t_0}^{t} tr(A(u)) du)). The Wronskian can be used to show that, like M(t_0), M(t) is also invertible. The solution to Eq. (G.7) then takes the form:

x(t) = S(t, t_0) x(t_0),     (G.10)

where S(., .) is the propagator of Eq. (G.7) and is given by

S(t, u) = M(t) M^{-1}(u).     (G.11)

These results can be extended to the case of a non-homogeneous system:

dx/dt = A(t) x + b(t),     (G.12)

with initial condition x_0, whose solution takes the form:

x(t) = S(t, t_0) x(t_0) + ∫_{t_0}^{t} S(t, u) b(u) du.     (G.13)


The above solution can again be used to integrate an mth-order non-homogeneous differential equation with varying coefficients. A useful simplification of Eq. (G.11) can be obtained when the matrix A satisfies A(t)A(s) = A(s)A(t) for all t and s. In this case the propagator S(., .) takes a simple expression, namely

S(t, s) = exp(∫_{s}^{t} A(u) du).     (G.14)

It is worth mentioning here that Eq. (G.13) can be extended to the case when the term b is a random forcing in relation to time-dependent multivariate autoregressive models.

G.2.2 Particular Case of Periodic Matrix A: Floquet Theory

A particularly important case in physical sciences corresponds to a periodic A(t), with period T, i.e. A(t + T) = A(t). This case is particularly relevant to atmospheric science because of the strong seasonality. The theory of the solution of

ẋ = A(t) x,     (G.15)

with initial condition x_0 = x(t_0), for a periodic m × m matrix A(t) is covered by the so-called Floquet theory (Floquet 1883). The solution takes the form:

x(t) = e^{μt} y(t),     (G.16)

for some periodic function y(t), and therefore need not be periodic. A set of m independent solutions x_1(t), . . . , x_m(t) makes what is known as the fundamental matrix X(t), i.e. X(t) = [x_1(t), . . . , x_m(t)], and if the initial condition X_0 = X(t_0) is the identity matrix, i.e. X_0 = I_m, then X(t) is called the principal fundamental matrix. It is therefore clear that the solution to Eq. (G.15) is x(t) = X(t) X_0^{-1} x_0, where X(t) is a fundamental matrix. An important result from Floquet theory is that if X(t) is a fundamental matrix so is X(t + T), and that there exists a nonsingular matrix B such that X(t + T) = X(t) B. Using the Wronskian, Eq. (G.9), one gets the determinant

of B, i.e. |B| = exp(∫_0^T tr(A(u)) du). Furthermore, the eigenvalues of B, or characteristic multipliers, which can be written as e^{μ_1 T}, . . . , e^{μ_m T}, yield the so-called characteristic (or Floquet) exponents μ_1, . . . , μ_m.

Remark In terms of the resolvent, see Sect. G.2, the propagator S(t, τ) is the principal fundamental matrix.

The characteristic exponents, which may be complex, are not unique but the characteristic multipliers are. In addition, the system (or the origin) is asymptotically


stable if the characteristic exponents have negative real parts. It can be seen that if u is an eigenvector of B with eigenvalue ρ = e^{μT}, then x(t) = X(t) u is a solution to Eq. (G.15), and that x(t + T) = ρ x(t). The solution then takes the form x(t) = e^{μt}(x(t) e^{−μt}) = e^{μt} y(t), where precisely the vector y(t) = e^{−μt} x(t) is T-periodic.
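As a hedged illustration of Floquet theory, the following sketch integrates the principal fundamental matrix of a small periodic system over one period to obtain the monodromy matrix B, whose eigenvalues are the characteristic multipliers; the particular A(t), the solver tolerances and all names are illustrative choices only.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Floquet multipliers of dx/dt = A(t) x with A periodic of period T
T = 2 * np.pi
A = lambda t: np.array([[0.0, 1.0],
                        [-(1.0 + 0.3 * np.cos(t)), -0.1]])   # damped Mathieu-type system

def rhs(t, y):
    X = y.reshape(2, 2)                     # fundamental matrix stored as a flat vector
    return (A(t) @ X).ravel()

sol = solve_ivp(rhs, (0.0, T), np.eye(2).ravel(), rtol=1e-10, atol=1e-12)
B = sol.y[:, -1].reshape(2, 2)              # monodromy matrix X(T), with X(0) = I
multipliers = np.linalg.eigvals(B)          # characteristic multipliers e^{mu_k T}
exponents = np.log(multipliers.astype(complex)) / T   # Floquet exponents (principal branch)
print(multipliers, exponents.real)          # negative real parts indicate asymptotic stability
```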

Appendix H

Links for Software Resource Material

An EOF primer by the author can be found here: https://pdfs.semanticscholar.org/f492/b48483c83f70b8e6774d3cc88bec918ab630. pdf. A CRAN (R programming language) package for EOFs and EOF rotation by Alan Jassby is here: https://www.rdocumentation.org/packages/wq/versions/0.4.8/topics/eof. The site of David M. Kaplan provides Matlab codes for EOFs and varimax rotation: https://websites.pmc.ucsc.edu/~dmk/notes/EOFs/EOFs.html. Mathworks provides codes for PCA, factor analysis and factor rotation using different rotation criteria at: https://uk.mathworks.com/help/stats/rotatefactors.html. https://uk.mathworks.com/help/stats/analyze-stock-prices-using-factor-analysis. html. There are also freely available Matlab source codes of factor analysis at freesourcecode.net: http://freesourcecode.net/matlabprojects/57962/factor-analysis-by-the-principalcomponents-method.--in-matlab#.XysoXfJS80o. Python (and R) PCA and varimax rotation can be found at this site: https://mathtuition88.com/2019/09/13/python-code-for-pca-rotation-varimaxmatrix/. A R package provided by MetNorway, including EOF, CCA and more, can be found here:


https://rdrr.io/github/metno/esd/man/ERA5.CDS.html. The site of Imad Dabbura from HMS provides coding implementation in R and Python at: https://imaddabbura.github.io/. A step-by-step introduction to NN with programming codes in Python is provide by Dr Andy Thomas: “An Introduction to Neural Networks for Beginners” at https://adventuresinmachinelearning.com/wp-content/uploads/2017/07/. Mathworks also provides softwares for recurrent NN used in time series forecasting: https://uk.mathworks.com/help/deeplearning/. The following site provides various Matlab codes in Machine learning: http://codeforge.com/s/0/self-organizing-map-matlab-code. The site of Dr Qadri Hamarsheh “Neural Network and Fuzzy Logic: SelfOrganizing Map Using Matlab” here: http://www.philadelphia.edu.jo/academics/qhamarsheh/uploads/Lecture%2016_ Self-organizing%20map%20using%20matlab.pdf. Self-organising Maps Using Python, by James McCaffrey here: https://visualstudiomagazine.com/articles/2019/01/01/self-organizing-maps-python. aspx. The book by Nielsen (2015) provides hands-on approach on NN (and deep learning) with Python (2.7) here: http://neuralnetworksanddeeplearning.com/about.html. The book by Buduma (2017) provides codes for deep learning in Tensorflow at: https://github.com/darksigma/Fundamentals-of-Deep-Learning-Book. The book by Chollet (2018) provides an exploration of deep learning from scratch with Python codes here: https://www.manning.com/books/deep-learning-with-python. Random Forest: Simple Implementation with Python: https://holypython.com/rf/random-forest-simple-implementation/. Random Forest (Easily Explained), with Python, by Shubham Gupta: https://medium.com/@gupta020295/random-forest-easily-explained-4b8094feb90. An Implementation and Explanation of the Random Forest in Python by Will Koehrsen: https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-


forest-in-python-77bf308a9b76. Forecasting with Random Forest (Python implementation), by Eric D. Brown: https://pythondata.com/forecasting-with-random-forests/. Time series forecasting with random forest via time delay embedding (In R programming language), by Mauel Tilgner: https://www.statworx.com/at/blog/time-series-forecasting-with-random-forest/. A Python-based learning library to evaluate mathematical expression efficiently is found here: http://deeplearning.net/software/theano/. Other programming languages. Yann Lecun provides a set of softwares in Lush at: http://yann.lecun.com/ex/downloads/index.html. A toolkit for recurrent NN applied to language modelling is given by Tomas Mikolov at: http://www.fit.vutbr.cz/~imikolov/rnnlm/. A recurrent NN library for LSTM, multidimensional RNN, and more, can be found here: https://sourceforge.net/projects/rnnl/. A Matlab 5 SOM toolbox by Juha Vasento et al. can be found here: http://www.cis.hut.fi/projects/somtoolbox/. The following site provides links to a number of softwares on deep learning: http://deeplearning.net/software_links/.

References

Absil P-A, Mahony R, Sepulchre R (2010) Optimization on manifolds: Methods and applications. In: Diehl M, Glineur F, Michiels EJ (eds) Recent advances in optimizations and its application in engineering. Springer, pp 125–144 Achlioptas D (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J Comput Syst Sci 66:671–687 Ackoff RL (1989) From data to wisdom. J Appl Syst Anal 16:3–9 Adby PR, Dempster MAH (1974) Introduction to optimization methods. Chapman and Hall, London Aires F, Rossow WB, Chédin A (2002) Rotation of EOFs by the independent component analysis: toward a solution of the mixing problem in the decomposition of geophysical time series. J Atmospheric Sci 59:111–123 Aires F, Chédin A, Nadal J-P (2000) Independent component analysis of multivariate time series: application to the tropical SST variability. J Geophys Res 105(D13):17437–17455 Akaike H (1969) Fitting autoregressive models for prediction. Ann Inst Stat Math 21:243–247 Akaike H (1974) A new look at the statistical model identification. IEEE Trans Auto Control 19:716–723 Allen MR, Smith LA (1997) Optimal filtering in singular spectrum analysis. Phys Lett A 234:419– 423 Allen MR, Smith LA (1996) Monte Carlo SSA: Detecting irregular oscillations in the presence of colored noise. J Climate 9:3373–3404 Aluffi-Pentini F, Parisi V, Zirilli F (1984) Algorithm 617: DAFNE: a differential-equations algorithm for nonlinear equations. Trans Math Soft 10:317–324 Amari S-I (1990) Mathematical foundation of neurocomputing. Proc IEEE 78:1443–1463 Ambaum MHP, Hoskins BJ, Stephenson DB (2001) Arctic oscillation or North Atlantic oscillation? J Climate 14:3495–3507 Ambaum MHP, Hoskins BJ, Stephenson DB (2002) Corrigendum: Arctic oscillation or North Atlantic oscillation? J Climate 15:553 Ambrizzi T, Hoskins BJ, Hsu H-H (1995) Rossby wave propagation and teleconnection patterns in the austral winter. J Atmos Sci 52:3661–3672 Ambroise C, Seze G, Badran F, Thiria S (2000) Hierarchical clustering of self-organizing maps for cloud classification. Neurocomputing 30:47–52. ISSN: 0925–2312 Anderson JR, Rosen RD (1983) The latitude-height structure of 40–50 day variations in atmospheric angular momentum. J Atmos Sci 40:1584–1591 Anderson TW (1963) Asymptotic theory for principle component analysis. Ann Math Statist 34:122–148

Anderson TW (1984) An introduction to multivariate statistical analysis, 2nd edn. Wiley, New York Angell JK, Korshover J (1964) Quasi-biennial variations in temperature, total ozone, and tropopause height. J Atmos Sci 21:479–492 Ångström A (1935) Teleconnections of climatic changes in present time. Geografiska Annaler 17:242–258 Annas S, Kanai T, Koyama S (2007) Principal component analysis and self-organizing map for visualizing and classifying fire risks in forest regions. Agricul Inform Res 16:44–51. ISSN: 1881–5219 Asimov D (1985) The grand tour: A tool for viewing multidimensional data. SIAM J Sci Statist Comp 6:128–143 Adachi K, Trendafilov N (2019) Some inequalities contrasting principal component and factor analyses solutions. Jpn J Statist Data Sci. https://doi.org/10.1007/s42081-018-0024-4 Astel A, Tsakouski S, Barbieri P, Simeonov V (2007) Comparison of self-organizing maps classification approach with cluster and principal components analysis for large environmental data sets. Water Research 41:4566–4578. ISSN: 0043-1354 Bach F, Jorda M (2002) kernel independent component analysis. J Mach Learn Res 3:1–48 Bagrov NA (1959) Analytical presentation of the sequences of meteorological patterns by means of the empirical orthogonal functions. TSIP Proceeding 74:3–24 Bagrov NA (1969) On the equivalent number of independent data (in Russian). Tr Gidrometeor Cent 44:3–11 Baker CTH (1974) Methods for integro-differential equations. In: Delves LM, Walsh J (eds) Numerical solution of integral equations. Oxford University Press, Oxford Baldwin MP, Gray LJ, Dunkerton TJ, Hamilton K, Haynes PH, Randel WJ, Holton JR, Alexander MJ, Hirota I, Horinouchi T, Jones DBA, Kinnersley JS, Marquardt C, Sao K, Takahas M (2001) The Quasi-biennial oscillation. Rev Geophys 39:179–229 Baldwin MP, Stephenson DB, Jolliff IT (2009) Spatial weighting and iterative projection methods for EOFs. J Climate 22:234–243 Barbosa SM, Andersen OB (2009) Trend patterns in global sea surface temperature. Int J Climatol 29:2049–2055 Barlow HB (1989) Unsupervised learning. Neural Computation 1:295–311 Barnett TP (1983) Interaction of the monsoon and Pacific trade wind system at international time scales. Part I: The equatorial case. Mon Wea Rev 111:756–773 Barnston AG, Liveze BE (1987) Classification, seasonality, and persistence of low-frequency atmospheric circulation patterns. Mon Wea Rev 115:1083–1126 Barnett TP (1984a) Interaction of the monsoon and the Pacific trade wind systems at interannual time scales. Part II: The tropical band. Mon Wea Rev 112:2380–2387 Barnett TP (1984b) Interaction of the monsoon and the Pacific trade wind systems at interannual time scales. Part III: A partial anatomy of the Southern Oscillation. Mon Wea Rev 112:2388– 2400 Barnett TP, Preisendorfer R (1987) Origins and levels of monthly and seasonal forecast skill for United States srface air temperatures determined by canonical correlation analysis. Mon Wea Rev 115:1825–1850 Barnston AG, Ropelewski CF (1992) Prediction of ENSO episodes using canonical correlation analysis. J Climate 5:1316–1345 Barreiro M, Marti AC, Masoller C (2011) Inferring long memory processes in the climate network via ordinal pattern analysis. Chaos 21:13,101. https://doi.org/10.1063/1.3545273 Bartholomew DJ (1987) Latent variable models and factor analysis. Charles Griffin, London Bartlett MS (1939) The standard errors of discriminant function coefficients. J Roy Statist Soc Suppl. 6:169–173 Bartlett MS (1950) Periodogram analysis and continuous spectra. 
Biometrika 37:1–16 Bartlett MS (1955) An introduction to stochastic processes. Cambridge University Press, Cambridge

Basak J, Sudarshan A, Trivedi D, Santhanam MS (2004) Weather data mining using independent component analysis. J Mach Lear Res 5:239–253 Basilevsky A, Hum PJ (1979) Karhunen-Loève analysis of historical time series with application to Plantation birth in Jamaica. J Am Statist Ass 74:284–290 Basilevsky A (1983) Applied matrix algebra in the statistical science. North Holland, New York Bauckhage C, Thurau C (2009) Making archetypal analysis practical. In: Pattern recognition, Lecture Notes in Computer Science, vol 5748. Springer, Berlin, Heidelberg, pp 272–281. https://doi.org/10.1007/978-3-642-03798-6-28 Bayes T (1763) An essay towards solving a problem in the doctrine of chances. Phil Trans 53:370 Beatson RK, Cherrie JB, Mouat CT (1999) Fast fitting of radial basis functions: Methods based on preconditioned GMRES iteration. Adv Comput Math 11:253–270 Beatson RK, Light WA, Billings S (2000) Fast solution of the radial basis function interpolation equations: Domain decomposition methods. SIAM J Sci Comput 200:1717–1740 Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15:1373–1396 Bellman R (1961) Adaptive control processes: A guide tour. Princeton University Press, Princeton Bell AJ, Sejnowski TJ (1995) An information-maximisation approach to blind separation and blind deconvolution. Neural Computing 7:1004–1034 Bell AJ, Sejnowski TJ (1997) The “independent components” of natural scenes are edge filters. Vision Research 37:3327–3338 Belouchrani A, Abed-Meraim K, Cardoso J-F, Moulines E (1997) A blind source separation technique using second order statistics. IEEE Trans Signal Process 45:434–444 Bentler PM, Tanaka JS (1983) Problems with EM algorithms for ML factor analysis. Psychometrika 48:247–251 Berthouex PM, Brown LC (1994) Statistics for environmental engineers. Lewis Publishers, Boca Raton Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford, 482 p. Bishop CM (2006) Pattern recognition and machine learning. Springer series in information science and statistics. Springer, New York, 758 p. Bjerknes J (1969) Atmospheric teleconnections from the equatorial Pacific. Mon Wea Rev 97:163– 172 Björnsson H, Venegas SA (1997) A manual for EOF and SVD analyses of climate data. Report No 97-1, Department of Atmospheric and Oceanic Sciences and Centre for Climate and Global Change Research, McGill University, p 52 Blumenthal MB (1991) Predictability of a coupled ocean-atmosphere model. J Climate 4:766–784 Bloomfield P, Davis JM (1994) Orthogonal rotation of complex principal components. Int J Climatol 14:759–775 Bock H-H (1986) Multidimensional scaling in the framework of cluster analysis. In: Degens P, Hermes H-J, Opitz O (eds) Studien Zur Klasszfikation. INDEKS-Verlag, Frankfurt, pp 247– 258 Bock H-H (1987) On the interface between cluster analysis, principal component analysis, and multidimensional scaling. In: Bozdogan H, Kupta AK (eds) Multivariate statistical modelling and data analysis. Reidel, Boston Boers N, Donner RV, Bookhagen B, Kurths J (2014) Complex network analysis helps to identify impacts of the El Niño Southern Oscillation on moisture divergence in South America. Clim Dyn. https://doi.org/10.1007/s00382-014-2265-7 Bolton RJ, Krzanowski WJ (2003) Projection pursuit clutering for exploratory data analysis. J Comput Graph Statist 12:121–142 Bonnet G (1965) Theorie de linformation−sur l’interpolation optimale d’une fonction aléatoire èchantillonnée. 
C R Acad Sci Paris 260:297–343 Bookstein FL (1989) Principal warps: thin plate splines and the decomposition of deformations. IEEE Trans Pattern Anal Mach Intell 11:567–585 Borg I, Groenen P (2005) Modern multidimensional scaling. Theory and applications, 2nd edn. Springer, New York

Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifier. In: Haussler D (ed) Proceedings of the 5th anuual ACM workshop on computational learning theory. ACM Press, pp 144–152 Pittsburgh. Botsaris CA, Jacobson HD (1976) A Newton-type curvilinear search method for optimisation. J Math Anal Applics 54:217–229 Botsaris CA (1978) A class of methods for unconstrained minimisation based on stable numerical integration techniques. J Math Anal Applics 63:729–749 Botsaris CA (1979) A Newton-type curvilinear search method for constrained optimisation. J Math Anal Applics 69:372–397 Botsaris CA (1981) Constrained optimisation along geodesics. J Math Anal Applics 79:295–306 Box MJ, Davies D, Swann WH (1969) Non-linera optimization techniques. Oliver and Boyd, Edinburgh Box GEP, Jenkins MG, Reinsel CG (1994) Time series analysis: forecasting and control. Prentice Hall, New Jersey Box GEP, Jenkins MG (1970) Time series analysis. Forecasting and control. Holden-Day, San Fracisco (Revised and published in 1976) Branstator G, Berner J (2005) Linear and nonlinear Signatures in the planetary wave dynamics of an AGCM: Phase space tendencies. J Atmos Sci 62:1792–1811 Breakspear M, Brammer M, Robinson PA (2003) Construction of multivariate surrogate sets from nonlinear data using the wavelet transform. Physica D 182:1–22 Breiman L (2001) Random forests. Machine Learning 45:5–32 Bretherton CS, Smith C, Wallace JM (1992) An intercomparison of methods for finding coupled patterns in climate data. J Climate 5:541–560 Bretherton CS, Widmann M, Dymnykov VP, Wallace JM, Bladé I (1999) The effective number of spatial degrees of freedom of a time varying field. J Climate 12:1990–2009 Brillinger DR, Rosenblatt M (1967) Computation and interpretation of k-th order spectra. In: Harris B (ed) Spectral analysis of time series. Wiley, New York, pp 189–232 Brillinger DR (1981) Time series-data: analysis and theory. Holden-Day, San-Francisco Brink KH, Muench RD (1986) Circulation in the point conception-Santa Barbara channel region. J Geophys Res C 91:877–895 Brockwell PJ, Davis RA (1991) Time series: theory and methods, 2nd edn. Springer, New York Brockwell PJ, Davis RA (2002) Introduction to time series and forecasting. Springer, New York Brown AA (1986) Optimisation methods involving the solution of ordinary differential equations. Ph.D. thesis, the Hatfield polytechnic, available from the British library Brownlee J (2018) Statistical methods for machine learning. e-learning. ISBN-10. https://www. unquotebooks.com/get/ebook.php?id=386nDwAAQBAJ Broomhead DS, King GP (1986a) Extracting qualitative dynamics from experimental data. Physica D 20:217–236 Broomhead DS, King GP (1986b) On the qualitative analysis of experimental dynamical systems. In: Sarkar S (ed) Nonlinear phenomena and chaos. Adam Hilger, pp 113–144 Buduma N (2017) Fundamentals of deep learning, 1st edn. O’Reilly, Beijing Bürger G (1993) Complex principal oscillation pattern analysis. J Climate 6:1972–1986 Burg JP (1972) The relationship between maximum entropy spectra and maximum likelihood spectra. Geophysics 37:375–376 Cadzow JA, Li XK (1995) Blind deconvolution. Digital Signal Process J 5:3–20 Cadzow JA (1996) Blind deconvolution via cumulant extrema. IEEE Signal Process Mag (May 1996), 24–42 Cahalan RF, Wharton LE, Wu M-L (1996) Empirical orthogonal functions of monthly precipitation and temperature ever over the united States and homogeneous Stochastic models. 
J Geophys Res 101(D21): 26309–26318 Capua GD, Runge J, Donner RV, van den Hurk B, Turner AG, Vellore R, Krishnan R, Coumou D (2020) Dominant patterns of interaction between the tropics and mid-latitudes in boreal summer: Causal relationships and the role of time scales. Weather Climate Discuss. https:// doi.org/10.5194/wcd-2020-14.

Cardoso J-F (1989) Source separation using higher order moments. In: Proc. ICASSP’89, pp 2109– 2112 Cardoso J-F (1997) Infomax and maximum likelihood for source separation. IEEE Lett Signal Process 4:112–114 Cardoso J-F, Souloumiac A (1993) Blind beamforming for non-Gaussian signals. IEE Proc F 140:362–370 Cardoso J-F, Hvam Laheld B (1996) Equivalent adaptive source separation. IEEE Trans Signal Process 44:3017–3030 Carr JC, Fright RW, Beatson KR (1997) Surface interpolation with radial basis functions for medical imaging. IEEE Trans Med Imag 16:96–107 Carreira-Perpiñán MA (2001) Continuous latent variable models for dimensionality reduction and sequential data reconstruction. Ph.D. dissertation. Department of Computer Science, University of Sheffield Carroll JB (1953) An analytical solution for approximating simple structure in factor analysis. Psychometrika 18:23–38 Caroll JD, Chang JJ (1970) Analysis of individual differences in multidimensional scaling via an n-way generalization of ’Eckart-Young’ decomposition. Psychometrika 35:283–319 Cassano EN, Glisan JM, Cassano JJ, Gutowski Jr. WJ, Seefeldt MW (2015) Self-organizing map analysis of widespread temperature extremes in Alaska and Canada. Clim Res 62:199–218 Cassano JJ, Cassano EN, Seefeldt MW, Gutowski WJ, Glisan JM (2016) Synoptic conditions during wintertime temperature extremes in Alaska. J Geophys Res Atmos 121:3241–3262. https://doi.org/10.1002/2015JD024404 Causa A, Raciti F (2013) A purely geometric approach to the problem of computing the projection of a point on a simplex. JOTA 156:524–528 Cavazos T, Comrie AC, Liverman DM (2002) Intraseasonal variability associated with wet monsoons in southeast Arizona. J Climate 15:2477–2490. ISSN: 0894-8755 Chan JCL, Shi J-E (1997) Application of projection-pursuit principal component analysis method to climate studies. Int J Climatol 17(1):103–113 Charney JG, Devore J (1979) Multiple equilibria in the atmosphere and blocking. J Atmos Sci 36:1205–1216 Chatfield C (1996) The analysis of time series. An introduction 5th edn. Chapman and Hall, Boca Raton Chatfield C, Collins AJ (1980) Introduction to multivariate analysis. Chapman and Hall, London Chatfield C (1989) The analysis of time series: An introduction. Chapman and Hall, London, 241 p Chekroun MD, Kondrashov D (2017) Data-adaptive harmonic spectra and multilayer StuartLandau models. Chaos 27:093110 Chen J-M, Harr PA (1993) Interpretation of extended empirical orthogonal function (EEOF) analysis. Mon Wea Rev 121:2631–2636 Chen R, Zhang W, Wang X (2020) Machine learning in tropical cyclone forecast modeling: A Review. Atmosphere 11:676. https://doi.org/10.3390/atmos11070676 Cheng X, Nitsche G, Wallace MJ (1995) Robustness of low-frequency circulation patterns derived from EOF and rotated EOF analysis. J Climate 8:1709–1720 Chernoff H (1973) The use of faces to represent points in k-dimensional space graphically. J Am Stat Assoc 68:361–368 Coifman RR, Lafon S (2006) Diffusion maps. Appl Comput Harmon Anal 21:5–30. https://doi. org/10.1016/j.acha.2006.04.006 Chollet F (2018) Deep learning with Python. Manning Publications, New York, 361 p Christiansen B (2009) Is the atmosphere interesting? A projection pursuit study of the circulation in the northern hemisphere winter. J Climate 22:1239–1254 Cleveland WS, McGill R (1984) The many faces of a scatterplot. J Am Statist Assoc 79:807–822 Cleveland WS (1993) Visualising data. Hobart Press, New York Comon P, Jutten C, Herault J (1991) Blind separation of sourcesi, Part ii: Problems statement. 
Signal Process 24:11–20 Comon P (1994) Independent component analysis, a new concept? Signal Process 36:287–314

Cook D, Buja A, Cabrera J (1993) Projection pursuit indices based on expansions with orthonormal functions. J Comput Graph Statist 2:225–250 Cover TM, Thomas JA (1991) Elements of information theory. Wiley Series in Telecommunication. Wiley, New York Cox DD (1984) Multivariate smoothing spline functions. SIAM J Num Anal 21:789–813 Cox TF, Cox MAA (1994) Mulyidimensional scaling. Chapman and Hall, London Craddock JM (1965) A meteorological application of principal component analysis. Statistician 15:143–156 Craddock JM (1973) Problems and prospects for eigenvector analysis in meteorology. Statistician 22:133–145 Craven P, Wahba G (1979) Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numer Math 31:377–403 Cristianini N, Shawe-Taylor J, Lodhi H (2001) Latent semantic kernels. In: Brodley C, Danyluk A (eds) Proceedings of ICML-01, 18th international conference in machine learning. Morgan Kaufmann, San Francisco, pp 66–73 Crommelin DT, Majda AJ (2004) Strategies for model reduction: Comparing different optimal bases. J Atmos Sci 61:2206–2217 Cupta AS (2004) Calculus of variations with applications. PHI Learning, India, 256p. ISBN: 9788120311206 Cutler A, Breiman L (1994) Archetypal analysis. Technometrics 36:338–347 Cutler A, Cutler DR, Stevens JR (2012) Random forests. In: Zhang C, Ma YQ (eds) Ensemble machine learning. Springer, New York, pp 157–175 Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signal Sys 2:303–314 Daley R (1991) Atmospheric data assimilaltion. Cambridge University Press, Camnbridge, 457 p Dasgupta S, Gupta A (2003) An elementary proof of a theorem of Johnson and Lindenstrauss. Random Struct Algorithm 22:60–65 Daubechies I (1992) Ten lectures on wavelets. Soc. for Ind. and Appl. Math., Philadelphia, PA Davis JM, Estis FL, Bloomfield P, Monahan JF (1991) Complex principal components analysis of sea-level pressure over the eastern USA. Int J Climatol 11:27–54 de Lathauwer L, de Moor B, Vandewalle J (2000) A multilinear singular value decomposition. SIAM J Matrix Analy Appl 21:1253–1278 DeGroot MH, Shervish MJ (2002) Probability and statistics, 4th edn. Addison–Wesley, Boston, p 893 DelSole T (2001) Optimally persistent patterns in time varying fields. J Atmos Sci 58:1341–1356 DelSole T (2006) Low-frequency variations of surface temperature in observations and simulations. J Climate 19:4487–4507 DelSole T, Tippett MK (2009a) Average predictability time. Part I: theory. J Atmos Sci 66:1172– 1187 DelSole T, Tippett MK (2009b) Average predictability time. Part II: seamless diagnoses of predictability on multiple time scales. J Atmos Sci 66:1188–1204 Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Roy Statist Soc B 39:1–38 De Swart HE (1988) Low-order spectral models of the atmospheric circulation: A survey. Acta Appl Math 11:49–96 Derouiche S, Mallet C, Hannachi A, Bargaoui Z (2020) Rainfall analysis via event features and self-organizing maps with application to northern Tunisia. J Hydrolo revised Diaconis P, Freedman D (1984) Asymptotics of graphical projection pursuit. Ann Statist 12:793– 815 Diamantaras KI, Kung SY (1996) Principal component neural networks. Wiley, New York Ding C, Li T, Jordan IM (2010) Convex and semi-nonnegative matrix factorizations. IEEE Trans Pattern Anal Mach Intell 32:45–55

Donges JF, Petrova I, Loew A, Marwan N, Kurths J (2015) How complex climate networks complement eigen techniques for the statistical analysis of climatological data. Clim Dyn 45:2407–2424 Donges JF, Zou Y, Marwan N, Kurths J (2009) Complex networks in climate dynamics. Eur Phys J Spec Top 174:157–179. https://doi.org/10.1140/epjst/e2009--01098-2 Dommenget D, Latif M (2002) A cautionary note on the interpretation of EOFs. J Climate 15:216– 225 Dommenget D (2007) Evaluating EOF modes against a stochastic null hypothesis. Clim Dyn 28:517–331 Donner RV, Zou Y, Donges JF, Marwan N, Kurths J (2010) Recurrence networks—a novel paradigm for nonlinear time series analysis. New J Phys 12:033025. https://doi.org/10.1088/ 1367-2630/12/3/033025 Donohue KD, Hennemann J, Dietz HG (2007) Performance of phase transform for detecting sound sources with microphone arrays in reverberant and noisy environments. Signal Process 87:1677–1691 Doob JL (1953) Stochastic processes. Wiley, New York Dorn M, von Storch H (1999) Identification of regional persistent patterns through principal prediction patterns. Beitr Phys Atmos 72:15–111 Dwyer PS (1967) Some applications of matrix derivatives in multivariate analysis. J Am Statist Ass 62:607–625 Ebdon RA (1960) Notes on the wind flow at 50 mb in tropical and subtropical regions in January 1957 and in 1958. Q J R Meteorol Soc 86:540–542 Ebert-Uphoff I, Deng Y (2012) Causal discovery for climate research using graphical models. J Climate 25:5648–5665. https://doi.org/10.1175/JCLI-D-11-00387.1 Efron B (1979) Bootstrap methods: Another look at the Jackknife. Ann Stat 7:1–26 Efron B, Tibshirani RJ (1994) An introduction to bootstrap. Chapman-Hall, Boca-Raton. ISBN-13: 978-0412042317 Eslava G, Marriott FHC (1994) Some criteria for projection pursuit. Stat Comput 4:13–20 Eugster MJA, Leisch F (2011) Weighted and robustarchetypal analysis. Comput Stat Data Anal 55:1215–1225 Eugster MJA, Leisch F, (2013) Archetypes: Archetypal analysis. http://CRAN.R-project.org/ package=archetypes. R package version 2.1-2 Everitt BS (1978) Graphical techniques for multivariate data. Heinemann Educational Books, London Everitt BS (1984) An introduction to latent variable models. Chapman and Hall, London Everitt BS (1987) Introduction to optimization methods and their application in statistics. Chapman and Hall, London Everitt BS (1993) Cluster analysis, 3rd edn. Academic Press, London, 170pp Everitt BS, Dunn G (2001) Applied Multivariate Data Analysis, 2nd edn. Arnold, London Evtushenko JG (1974) Two numerical methods of solving non-linear programming problems. Sov Math Dokl 15:20–423 Evtushenko JG, Zhadan GV (1977) A relaxation method for solving problems of non-linear programming. USSR Comput Math Math Phys 17:73–87 Fang K-T, Zhang Y-T (1990) Generalized multivariate analysis. Springer, 220p Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy U (eds) (1996) Advances in knowledge discovery and data mining. AAAI Press/The MIT Press, Menlo Park, CA Ferguson GA (1954) The concept of parsimony in factor analysis. Psychometrika 18:23–38 Faddeev DK, Faddeeva NV (1963) Computational methods of linear algebra. W.H. Freeman and Company, San Francisco Fisher RA (1925) Statistical methods for research workers. Oliver & Boyd, Edinburgh Fischer MJ, Paterson AW (2014) Detecting trends that are nonlinear and asymmetric on diurnal and seasonal time scales. Clim Dyn 43:361–374 Fischer MJ (2016) Predictable components in global speleothem δ 18 O. Q Sci Rev 131:380–392

Fischer MJ (2015) Predictable components in Australian daily temperature data. J Climate 28:5969–5984 Fletcher R (1972) Conjugate direction methods. In: Murray W (ed) Numerical methods for unconstrained optimization. Academic Press, London, pp 73–86 Fletcher R, Powell MJD (1963) A rapidly convergent descent method for minimization. Comput J 6:163–168 Floquet G (1883) Sur les équations différentielles linéaires à coefficients periodiques. Annales de l’École Normale Supérieure 12:47–88 Flury BN (1988) Common principal components and related mutivariate models. Wiley, New York Flury BN (1984) Common principal components in k groups. J Am Statist Assoc 79:892–898 Flury BN (1983) Some relations between the comparison of covariance matrices and principal component analysis. Comput Statist Dana Anal 1:97–109 Fodor I, Kamath C (2003) On the use of independent component analysis to separate meaningful sources in global temperature series. Technical Report, Lawrence Livermore National Laboratory Foulds LR (1981) Optimization techniques: An introduction. Springer, New York Frankl P, Maehara H (1988) The Johnson-Lindenstrauss lemma and the sphericity of some graphs. J Combin Theor 44:355–362 Fraedrich K (1986) Estimating the dimensions of weather and climate attractors. J Atmos Sci 43:419–432 Franke R (1982) Scattered data interpolation: tests of some methods. Math Comput 38(157):181– 200 Franzke C, Feldstein SB (2005) The continuum and dynamics of Northern Hemisphere teleconnection patterns. J Atmos Sci 62:3250–3267 Franzke C, Majda AJ, Vanden-Eijnden E (2005) Low-order stochastic mode reduction for a realistic barotropic model climate. J Atmos Sci 62:1722–1745 Franzke C, Majda AJ, Branstator G (2007) The origin of nonlinear signatures of planetary wave dynamics: Mean phase space tendencies and contributions from non-Gaussianity. J Atmos Sci 64:3987–4003 Franzke C, Feldstein SB, Lee S (2011) Synoptic analysis of the Pacific-North American teleconnection pattern. Q J R Meterol Soc 137:329–346 Fraser AM, Dimitriadis A (1994) Forecasting probability densities by using hidden Markov models with mixed states. In: Weigend SA, Gershenfeld NA (eds) Time series prediction: forecasting the future and understanding the past. Persus Books, Reading, MA, pp 265–282 Frawley WJ, Piatetsky-Shapiro G, Mathews CJ (1992) Knowledge discovery in databases: an overview. Al Magazine 13:57–70 Frederiksen JS (1997) Adjoint sensitivity and finite-time normal mode disturbances during blocking. J Atmos Sci 54:1144–1165 Frederiksen JS, Branstator G (2001) Seasonal and intraseasonal variability of large-scale barotropic modes. J Atmos Sci 58:50–69 Frederiksen JS, Branstator G (2005) Seasonal variability of teleconnection patterns. J Atmos Sci 62:1346–1365 Friedman JH, Tukey JW (1974) A projection pursuit algorithm for exploratory data analysis. IEEE Trans Comput C23:881–890 Friedman JH, Stuetzle W, Schroeder A (1984) Projection pursuit density estimation. J Am Statist Assoc 79:599–608 Friedman JH (1987) Exploratory projection pursuit. J Am. Statist Assoc 82:249–266 Fuller WA (1976) Introduction to statistical time series. Wiley, New York Feldstein SB (2000) The timescale, power spectra, and climate noise properties of teleconnection patterns. J Climate 13:4430–4440 Friedman JH, Stuetzle W (1981) Projection pursuit regression. J Amer Statist Assoc 76:817–823 Fukunaga K, Koontz WLG (1970) Application of the Karhunen-Loève expansion to feature selection and ordering. IEEE Trans Comput C-19:311–318

Fukuoka A (1951) A study of 10-day forecast (A synthetic report). Geophys Mag Tokyo XXII:177– 218 Galton F (1885) Regression towards mediocrity in hereditary stature. J Anthropological Inst 15:246–263 Gámez AJ, Zhou CS, Timmermann A, Kurths J (2004) Nonlinear dimensionality reduction in climate data. Nonlin Process Geophys 11:393–398 Gardner WA, Napolitano A, Paura L (2006) Cyclostationarity: Half a century of research. Signal Process 86:639–697 Gardner WA (1994) Cyclostationarity in communications and signal processing. IEEE Press, 504 p Gardner WA, Franks LE (1975) Characterization of cyclostationary random signal processes. IEEE Trans Inform Theory 21:4–14 Gavrilov A, Mukhin D, Loskutov E, Volodin E, Feigin A, Kurths J (2016) Method for reconstructing nonlinear modes with adaptive structure from multidimensional data. Chaos 26:123101. https://doi.org/10.1063/1.4968852 Geary RC (1947) Testing for normality. Biometrika 34:209–242 Gelfand IM, Vilenkin NYa (1964) Generalized functions-vol 4: Applications of harmonic analysis. Academic Press Ghil M, Allen MR, Dettinger MD, Ide K, Kondrashov D, Mann ME, Robertson AW, Saunders A, Tian Y, Varadi F, Yiou P (2002) Advanced spectral methods for climatic time series. Rev Geophys 40:1.1–1.41 Giannakis D, Majda AJ (2012) Nonlinear laplacian spectral analysis for time series with intermittency and low-frequency variability. Proc Natl Sci USA 109:2222–2227 Giannakis D, Majda AJ (2013) Nonlinear laplacian spectral analysis: capturing intermittent and low-frequency spatiotemporal patterns in high-dimensional data. Stat Anal Data Mining 6. https://doi.org/10.1002/sam.11171 Gibbs JW (1902) Elementary principles in statistical mechanics developed with especial reference to the rational foundation of thermodynamics. Yale University Press, New Haven, CT. Republished by Dover, New York in 1960 Gibson J (1994) What is the interpretation of spectral entropy? In: Proceedings of IEEE international symposium on information theory, p 440 Gibson PB, Perkins-Kirkpatrick SE, Uotila P, Pepler AS, Alexander LV (2017) On the use of selforganizing maps for studying climate extremes. J Geophys Res Atmos 122:3891–3903. https:// doi.org/10.1002/2016JD026256 Gibson PB, Perkins-Kirkpatrick SE, Renwick JA (2016) Projected changes in synoptic weather patterns over New Zealand examined through self-organizing maps. Int J Climatol 36:3934– 3948. https://doi.org/10.1002/joc.4604 Gill PE, Murray W, Wright HM (1981) Practical optimization. Academic Press, London Gilman DL (1957) Empirical orthogonal functions applied to thirty-day forecasting. Sci Rep No 1, Department of Meteorology, Mass Inst of Tech, Cambridge, Mass, 129pp. Girshik MA (1939) On the sampling theory of roots of determinantal equations. Ann Math Statist 43:128–136 Glahn HR (1962) An experiment in forecasting rainfall probabilities by objective methods. Mon Wea Rev 90:59–67 Goerg GM (2013) Forecastable components analysis. J Mach Learn Res Workshop Conf Proc 28:64–72 Goldfeld SM, Quandt RE, Trotter HF (1966) Maximization by quadratic hill-climbing. Econometrica 34:541–551 Golub GH, van Loan CF (1996) Matrix computation. John Hopkins University Press, Baltimore, MD Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press, Cambridge, MA, 749 p. http://www.deeplearningbook.org Gordon AD (1999) Classification, 2nd edn. Chapman and Hall, 256 p Gordon AD (1981) Classification: methods for the exploratory analysis of multivariate data. Chapman and Hall, London

Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53:325–338 Graybill FA (1969)Introduction to matrices with application in statistics. Wadsworth, Belmont, CA Graystone P (1959) Meteorological office discussion−Tropical meteorology. Meteorol Mag 88:113–119 Grenander U, Rosenblatt M, (1957) Statistical analysis of time series. Wiley, New York Grimmer M (1963) The space-filtering of monthly surface temperature anomaly data in terms of pattern using empirical orthogonal functions. Q J Roy Meteorol Soc 89:395–408 Hackbusch W (1995) Integral equations: theory and numerical treatment. Birkhauser Verlag, Basel Haghroosta T (2019) Comparative study on typhoon’s wind speed prediction by a neural networks model and a hydrodynamical model. MethodsX 6:633–640 Haines K, Hannachi A (1995) Weather regimes in the Pacific from a GCM. J Atmos Sci 52:24442462 Hall A, Manabe S (1997) Can local linear stochastic theory explain sea surface temperature and salinity variability? Clim Dyn 13:167–180 Hall, P (1989) On polynomial-based projection indices for exploratory projection pursuit. Ann Statist 17:589–605 Halmos PR (1951) Introduction to Hilbert space. Chelsea, New York Halmos PR (1972) Positive approximants of operators. Indian Univ Math J 21:951–960 Hamlington BD, Leben RR, Nerem RS, Han W, Kim K-Y (2011) Reconstructing sea level using cyclostationary empirical orthogonal functions. J Geophys Res 116:C12015. https://doi.org/10. 1029/2011JC007529 Hamlington BD, Leben RR, Strassburg MW, Kim K-Y (2014) Cyclostationary empirical orthogonal function sea-level reconstruction. Geosci Data J 1:13–19 Hamming RW (1980) Coding and information theory. Prentice-Hall, Englewood Cliffs, New Jersey Hannachi A, Allen M (2001) Identifying signals from intermittent low-frequency behaving systems. Tellus A 53A:469–480 Hannachi A, Legras B (1995) Simulated annealing and weather regimes classification. Tellus 47A:955–973 Hannachi A, Iqbal W (2019) On the nonlinearity of winter northern hemisphere atmospheric variability. J Atmos Sci 76:333–356 Hannachi A, Turner AG (2013a) Isomap nonlinear dimensionality reduction and bimodality of Asian monsoon convection. Geophys Res Lett 40:1653–1658 Hannachi A, Turner GA (2013b) 20th century intraseasonal Asian monsoon dynamics viewed from isomap. Nonlin Process Geophys 20:725–741 Hannachi A, Dommenget D (2009) Is the Indian Ocean SST variability a homogeneous diffusion process. Clim Dyn 33:535–547 Hannachi A, Unkel S, Trendafilov NT, Jolliffe TI (2009) Independent component analysis of climate data: A new look at EOF rotation. J Climate 22:2797–2812 Hannachi, A (2010) On the origin of planetary-scale extratropical winter circulation regimes. J Atmos Sci 67:1382–1401 Hannachi A (1997) Low frequency variability in a GCM: three-dimensional flow regimes and their dynamics. J Climate 10:1357–1379 Hannachi A, O’Neill A (2001) Atmospheric multiple equilibria and non-Gaussian behaviour in model simulations. Q J R Meteorol Soc 127:939–958 Hannachi A (2008) A new set of orthogonal patterns in weather and climate: Optimall interpolated patterns. J Climate 21:6724–6738 Hannachi A, Jolliffe TI, Trendafilov N, Stephenson DB (2006) In search of simple structures in climate: Simplifying EOFs. Int J Climatol 26:7–28 Hannachi A, Jolliffe IT, Stephenson DB (2007) Empirical orthogonal functions and related techniques in atmospheric science: A review. 
Int J Climatol 27:1119–1152 Hannachi A (2007) Pattern hunting in climate: A new method for finding trends in gridded climate data. Int J Climatol 27:1–15 Hannachi A (2000) A probabilistic-based approach to optimal filtering. Phys Rev E 61:3610–3619

Hannachi A, Stephenson DB, Sperber KR (2003) Probability-based methods for quantifying nonlinearity in the ENSO. Climate Dynamics 20:241–256 Hannachi A, Mitchell D, Gray L, Charlton-Perez A (2011) On the use of geometric moments to examine the continuum of sudden stratospheric warmings. J Atmos Sci 68:657–674 Hannachi A, Woollings T, Fraedrich K (2012) The North Atlantic jet stream: a look at preferred positions, paths and transitions. Q J Roy Meteorol Soc 138:862–877 Hannachi A (2016) Regularised empirical orthogonal functions. Tellus A 68:31723. https://doi. org/10.3402/tellusa.v68.31723 Hannachi A, Stendel M (2016) What is the NAO? In: Colijn (ed) Appendix 1 in Quante, North sea region climate change assessment. Springer, Berlin, Heidelberg, pp 489–493 Hannachi A, Trendafilov N (2017) Archetypal analysis: Mining weather and climate extremes. J Climate 30:6927–6944 Hannachi A, Straus DM, Franzke CLE, Corti S, Woollings T (2017) Low-frequency nonlinearity and regime behavior in the Northern Hemisphere extratropical atmosphere. Rev Geophys 55:199–234. https://doi.org/10.1002/2015RG000509 Hannan EJ (1970) Multiple time series. Wiley, New York Harada Y, Kamahori H, Kobayashi C, Endo H, Kobayashi S, Ota Y, Onoda H, Onogi K, Miyaoka K, Takahashi K (2016) The JRA-55 reanalysis: Representation of atmospheric circulation and climate variability. J Meteor Soc Jpn 94:269–302 Hardy RL (1971) Multiquadric equations of topography and other irregular surfaces. J Geophys Res 76:1905–1915 Hardoon DR, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Comput 16:2639–2664 Harman HH (1976) Modern factor analysis, 3d edn. The University of Chicago Press, Chicago Harshman RA (1970) Foundation of the PARAFAC procedure: models and methods for an ’Explanatory’ multi-mode factor analysis. In: UCLA working papers in phonetics, vol 16, pp 1– 84 Hartigan JA (1975) Clutering algorithms. Wiley, New York Hasselmann K (1976) Stochastic climate models. Part I. Theory. Tellus 28:474–485 Hasselmann K (1988) PIPs and POPs−A general formalism for the reduction of dynamical systems in terms of principal interaction patterns and principal oscillation patterns. J Geophys Res 93:11015–11020 Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer series in statistics. Springer, New York Haykin S (1999) Neural networks: A comprehensive foundation, 2nd edn. Prentice Hall International, New Jersey, 897 p Haykin S (2009) Neural networks and learning machines, 3rd edn. Prentice Hall, New York, 938 p Hayashi Y (1973) A method of analyzing transient waves by space-time cross spectra. J Appl Meteorol 12:404–408 Haykin S (ed) (1994) Blind deconvolution. Prentice-Hall, Englewood Cliffs, New Jersey Heinlein RA (1973) Time enough for love. New English Library, London Heiser WJ, Groenen PJF (1997) Cluster differences scaling with a within-clusters loss component and a fuzzy successive approximation strategy to avoid local minima. Psychometrika 62:63–83 Held IM (1983) Stationary and quasi-stationary eddies in the extratropical troposphere: theory. In: Hoskins BJ, Pearce RP (eds) Large-scale dynamical processes in the atmosphere. Academic Press, pp 127–168 Helsen H, Lowdenslager D (1958) Prediction theory and Fourier series in several variables. Acta Math 99:165–202 Hendon HH, Salby ML (1994) The life cycle of the Madden-Julian oscillation. 
J Atmos Sci 51:2225–2237 Hertz JA, Krogh AS, Palmer RG (1991) Introduction to the theory of neural computation. Lecture Notes Volume I, Santa Fe Institute Series. Addison-Wesley Publishing Company, Reading, MA Hewitson BC, Crane RG (2002) Self-organizing maps: applications to synoptic climatology. Clim Res 22:13–26. ISSN: 0936-577X

Hewitson BC, Crane RG (1994) Neural nets: Applications in geography. Springer, New York. ISBN: 978-07-923-2746-2 Higham NJ (1988) Computing nearest symmetric positive semi-definite matrix. Linear Algebra Appl 103:103–118 Hill T, Marquez L, O’Connor M, Remus W (1994) Artificial neural network models for forecasting and decision making. Int J Forecast 10:5–15 Hinton GE, Dayan P, Revow M (1997) Modeling the manifolds of images of hand written digits. IEEE Trans Neural Netw 8:65–74 Hirsch MW, Smale S (1974) Differential equations, dynamical systems, and linear algebra. Academic Press, London Hochstadt H (1973) Integral equations. Wiley, New York Hodges JL, Lehmann EL (1956) The efficiency of some non-parametric competitors of the t-test. Ann Math Statist 27:324–335 Hoerl AE, Kennard RW (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12:55–67 Holsheimer M, Siebes A (1994) Data mining: The search for knowledge in databases. Technical Report CS-R9406, CWI Amsterdam Horel JD (1981) A rotated principal component analysis of the interannual variability variability of the Northern Hemisphere 500 mb height field. Mon Wea Rev 109:2080–2092 Horel JD (1984) Complex principal component analysis: Theory and examples. J Climate Appl Meteor 23:1660–1673 Horn RA, Johnson CA (1985) Matrix analysis. Cambridge University Press, Cambridge Hornik K (1991) Approximation capabilities of multilayer feedforward networks. Neural Networks 4:251–257 Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2:359–366 Horton DE, Johnson NC, Singh D, Swain DL, Rajaratnam B, Diffenbaugh NS (2015) Contribution of changes in atmospheric circulation patterns to extreme temperature trends. Nature 522:465– 469. https://doi.org/10.1038/nature14550 Hosking JRM (1990) L-moments: analysis and estimation of distributions using linear combinations of order statistics. J R Statist Soc B 52:105–124 Hoskins BJ, Karoly DJ (1981) The steady linear response to a spherical atmosphere to thermal and orographic forcing. J Atmos Sci 38:1179–1196 Hoskins BJ, Ambrizzi T (1993) Rossby wave propagation on a realistic longitudinally varying flow. J Atmos Sci 50:1661–1671 Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psych 24:417–520, 498–520 Hotelling H (1935) The most predictable criterion. J Educ Psych 26:139–142 Hotelling H (1936a) Simplified calculation of principal components. Psychometrika 1:27–35 Hotelling H (1936b) Relation between two sets of variables. Biometrika 28:321–377 Hsieh WW (2001a) Nonlinear canonical correlation analysis of the tropical Pacific climate variability using a neural network approach. J Climate 14:2528–2539 Hsieh WW (2001b) Nonlinear principal component analysis by neural networks. Tellus 53A:599– 615 Hsieh WW (2009) Machine learning methods in the environmental sciences: neural networks and kernels. Cambridge University Press, Cambridge Hsieh W, Tang B (1998) Applying neural network models to prediction and data analysis in meteorology and oceanography. Bull Am Meteorol Soc 79:1855–1870 Hubbert S, Baxter B (2001) Radial basis functions for the sphere. In: Haussmann W, Jetter K, Reimer M (eds) Recent progress in multivariate approximation, 4th international conference, September 2000, Witten-Bommerholz. International Series of Numerical Mathematics, vol. 137. Birkhäuser, Basel, pp 33–47 Huber PJ (1985) Projection pursuit. 
Ann Statist 13:435–475 Huber PJ (1981) Robust statistics. Wiley, New York, 308 p

Hunter JS (1988) The digidot plot. Am Statistician 42:54 Hurrell JW (1996) Influence of variations in extratropical wintertime teleconnections on Northern Hemisphere temperature. Geophys Res Lett 23:665–668 Hurrell JW, Kushnir Y, Ottersen G, Visbeck M (2003) An overview of the North Atlantic Oscillation. In: Hurrell JW, Kushnir Y, Ottersen G, Visbeck M (eds) The North Atlantic Oscillation, climate significance and environmental impact, Geophysical Monograph, vol 134. American Geophysical Union, Washington, pp 1–35 Huth, R., C. Beck, A. Philipp, M. Demuzere, Z. Ustrnul, M. Cahynová, J. Kyselý, O. E. Tveito, (2008) Classifications of atmospheric circulation patterns, Recent advances and applications. Ann. NY Acad Sci 1146(1):105–152. ISSN: 0077-8923 Huva R, Dargaville R, Rayner P (2015) The impact of filtering self-organizing maps: A case study with Australian pressure and rainfall. Int J Climatol 35:624–633. https://doi.org/10.1002/joc. 4008 Hyvärinen A (1998) New approximations of differential entropy for independent component analysis and projection. In: Jordan MA, Kearns MJ, Solla SA (eds) Advances in neural information processing systems, vol 10. MIT Press, Cambridge, MA, pp 273–279 Hyvärinen A (1999) Survey on independent component analysis. Neural Comput Surv 2:94–128 Hyvärinen A, Oja E (2000) Independent component analysis: Algorithms and applications. Neural Net 13:411–430 Hyvärubeb A, Karhunen J, Oja E (2001) Independent component analysis. Wiley, 481pp Iskandar I (2009) Variability of satellite-observed sea surface height in the tropical Indian Ocean: comparison of EOF and SOM analysis. Makara Seri Sains 13:173–179. ISSN: 1693-6671 Izenman AJ (2008) Modern multivariate statistical techniques, regression, classification and manofold learning. Springer, New York Jackson JE (2003) A user’s guide to principal components. Wiley, New Jersey, 569pp James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning: with application in R. Springer texts in statistics. Springer, New York. https://doi.org/10.1007/9781-4614-7138-7-5 Jee JR (1985) A study of projection pursuit methods. Technical Report TR 776-311-4-85, Rice University Jenkins MG, Watts DG (1968) Spectral analysis and its applications. Holden-Day, San Francisco Jennrich RI (2001) A simple general procedure for orthogonal rotation. Psychometrika 66:289–306 Jennrich RI (2002) A simple general procedure for oblique rotation. Psychometrika 67:7–19 Jennrich RI (2004) Rotation to simple loadings using component loss function: The orthogonal case. Psychometrika 69:257–273 Jenssen R (2000) Image denoising based on independent component analysis. M.Sc. Thesis, the University of Tromso Jeong JH, Resop JP, Mueller ND, Fleisher DH, Yun K, Butler EE, et al. (2016) Random forests for global and regional crop yield predictions. PLOS ONE 11:e0156571. https://doi.org/10.1371/ journal.pone.0156571 Johnson W, Lindenstrauss J (1984) Extensions of Lipschitz mappings into a Hilbert space. In: Conference in modern analysis and probability (New Haven, Conn., 1982). Contemporary mathematics, vol 26. American Mathematical Society, pp 189–206 Johnson ES, McPhaden MJ (1993) On the structure of intraseasonal Kelevin waves in the equatorial Pacific ocean. J Phys Oceanogr 23:608–625 Johnson NC, Feldstein SB, Tremblay B (2008) The continuum of northern hemisphere teleconnection patterns and a description of the NAO Shift with the use of self-organizing maps. 
J Climate 21:6354–6371 Johansson JK (1981) An extension of Wollenberg’s redundancy analysis. Psychometrika 46:93–103 Jolliffe IT (2003) A cautionary note on artificial examples of EOFs. J Climate 16:1084–1086 Jolliffe IT, Cadima J (2016) Principal components analysis: a review and recent developments. Phil Trans R Soc A 374:20150202 Jolliffe IT, Uddin M, Vines KS (2002) Simplified EOFs−three alternatives to retain. Clim Res 20:271–279

Jolliffe IT, Trendafilov TN, Uddin M (2003) A modified principal component technique based on the LASSO. J Comput Graph Stat 12:531–547 Jolliffe IT (1987) Rotation of principal components: Some comments. J Climatol 7:507–510 Jolliffe IT (1995) Rotation of principal components: Choice of normalization constraints. J Appl Stat 22:29–35 Jolliffe IT (2002) Principal component analysis, 2nd edn. Springer, New York Jones MC (1983) The projection pursuit algorithm for exploratory data analysis. Ph.D. Thesis, University of Bath Jones MC, Sibson R (1987) What is projection pursuit? J R Statist Soc A 150:1–36 Jones RH (1975) Estimating the variance of time averages. J Appl Meteor 14:159–163 Jöreskog KG (1967) Some contributions to maximum likelihood factor analysis. Psychometrika 32:443–482 Jöreskog KG (1969) A general approach to confirmatory maximum likelihood factor analysis. Psychometrika 34:183–202 Jung T-P, Makeig S, Mckeown MJ, Bell AJ, Lee T-W, Sejnowski TJ (2001) Imaging brain dynamics using independent component analysis. Proc IEEE 89:1107–1122 Jungclaus J (2008) MPI-M earth system modelling framework: millennium full forcing experiment (ensemble member 1). World Data Center for climate. CERA-DB “mil0010”. http://cera-www. dkrz.de/WDCC/ui/Compact.jsp?acronym=mil0010 Jutten C, Herault J (1991) Blind separation of sources, part i: An adaptive algorithm based on neuromimetic architecture. Signal Process 24:1–10 Kaiser HF (1958) The varimax criterion for analytic rotation in favor analysis. Psychometrika 23:187–200 Kano Y, Miyamoto Y, Shimizu S (2003) Factor rotation and ICA. In: Proceedings of the 4th international symposium on independent component analysis and blind source separation (Nara, Japan), pp 101–105 Kao SK (1968) Governing equations and spectra for atmospheric motion and transports in frequency-wavenumber space. J Atmos Sci 25:32–38 Kapur JN (1989) Maximum-entropy models in science and engineering. Wiley, New York Karlin S, Taylor HM (1975) A first course in stochastic processes, 2nd edn. Academic Press, Boston Karthick S, Malathi D, Arun C (2018) Weather prediction analysis using random forest algorithm. Int J Pure Appl Math 118:255–262 Keller LB (1935) Expanding of limit theorems of probability theory on functions with continuous arguments (in Russian). Works Main Geophys Observ 4:5–19 Kendall MG (1994) Advanced theory of statistics. Vol I: distribution theory, 6th edn. In: Stuart A, Ord JK (eds). Arnold, London. Kendall MG, Stuart A (1961) The advanced theory of statistics: Inference and relationships, 3rd edn. Griffin, London. Kendall MG, Stuart A (1977) The advanced Theory of Statistics. Volume 1: distribution theory, 4th edn. Griffin, London Keogh EJ, Chu S, Hart D, Pazzani MJ (2001) An online algorithm for segmenting time series. In: Proceedings 2001 IEEE international conference on data mining, pp 289–296. https://doi.org/ 10.1109/ICDM.2001.989531 Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58:433–451 Khatri CG (1976) A note on multiple and canonical correlation for a singular covariance matrix. Psychometrika 41:465–470 Khedairia S, Khadir MT (2008) Self-organizing map and k-means for meteorological day type identification for the region of Annaba–Algeria. In: 7th computer information systems and industrial management applications, Ostrava, pp 91–96. ISBN: 978-0-7695-3184-7 Kiers HAL (1994) Simplimax: Oblique rotation to an optimal target with simple structure. 
Psychometrika 59:567–579 Kikkawa S, Ishida M (1988) Number of degrees of freedom, correlation times, and equivalent bandwidths of a random process. IEEE Trans Inf Theory 34:151–155

Kiladis GN, Weickmann KM (1992) Circulation anomalies associated with tropical convection during northern winter. Mon Wea Rev 120:1900–1923 Killworth PD, McIntyre ME (1985) Do Rossby-wave critical layers absorb, reflect or over-reflect? J Fluid Mech 161:449–492 Kim K-Y, Hamlington B, Na H (2015) Theoretical foundation of cyclostationary EOF analysis for geophysical and climate variables: concept and examples. Eart Sci Rev 150:201–218 Kim K-Y, North GR (1999) A comparison of study of EOF techniques: analysis of non-stationary data with periodic statistics. J Climate 12:185–199 Kim K-Y, Wu Q (1999) A comparison study of EOF techniques: Analysis of nonstationary data with periodic statistics. J Climate 12:185–199 Kim K-Y, North GR, Huang J (1996) EOFs of one-dimensional cyclostationary time series: Computations, examples, and stochastic modeling. J Atmos Sci 53:1007–1017 Kim K-Y, North GR (1997) EOFs of harmonizable cyclostationary processes. J Atmos Sci 54:2416–2427 Kimoto M, Ghil M, Mo KC (1991) Spatial structure of the extratropical 40-day oscillation. In: Proc. 8’th conf. atmos. oceanic waves and stability. Amer. Meteor. Soc., Boston, pp 115–116 Knighton J, Pleiss G, Carter E, Walter MT, Steinschneider S (2019) Potential predictability of regional precipitation and discharge extremes using synoptic-scale climate information via machine learning: an evaluation for the eastern continental United States. J Hydrometeorol 20:883–900 Knutson TR, Weickmann KM (1987) 30–60 day atmospheric oscillation: Composite life cycles of convection and circulation anomalies. Mon Wea Rev 115:1407–1436 Kobayashi S, Ota Y, Harada Y, Ebita A, Moriya M, Onoda H, Onogi K, Kamahori H, Kobayashi C, Endo H, Miyaoka K, Takahashi K (2015) The JRA-55 Reanalysis: General specifications and basic characteristics. J Meteor Soc Jpn 93:5–48 Kohonen T (2001) Self-organizing maps, 3rd edn. Springer, Berlin, 501 p Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biological Cybernetics 43:59–69 Kohonen T (1990) The self-organizing map. Proc IEEE 78:1464–1480 Kolmogorov AN (1933) Foundations of the theory of probability (Grundbegriffe der Wahrscheinlichkeitsrechnung). Translated by Nathan Morrison and Published by Chelsea Publishing Company, New York, 1950 Kolmogorov AN (1939) Sur l’interpolation et l’extrapolation des suites stationaires. Comptes Rendus Acad Sci Paris 208:2043–2045 Kolmogorov AN (1941) Stationary sequences in Hilbert space. Bull Math Univ Moscow 2:1–40 Kondrashov D, Chekroun MD, Yuan X, Ghil M (2018a) Data-adaptive harmonic decomposition and stochastic modeling of Arctic sea ice. Dyn Statist Clim Syst 3:179–205 Kondrashov, D., M. D. Chekroun, P. Berloff, (2018b) Multiscale Stuart-Landau emulators: Application wind-driven ocean gyres. Fluids 3:21. https://doi.org/10.3390/fluids3010021 Kooperberg C, O’Sullivan F (1996) Prediction oscillation patterns: A synthesis of methods for spatial-temporal decomposition of random fields. J Am. Statist Assoc 91:1485–1496 Koopmans LH (1974) The spectral analysis of time series. Academic Press, New York Kramer MA (1991) Nonlinear principal component analysis using autoassociative neural networks. AIChE J 37:233–243 Kress R, Martensen E (1970) Anwendung der rechteckregel auf die reelle Hilbertransformation mit unendlichem intervall. Z Angew Math Mech 50:T61–T64 Krishnamurthi TN, Chakraborty DR, Cubucku N, Stefanova L, Vijaya Kumar TSV (2003) A mechanism of the Madden-Julian oscillation based on interactions in the frequency domain. 
Q J R Meteorol Soc 129:2559–2590 Kruskal JB (1964a) Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29:1–27 Kruskal JB (1964b) Nonmetric multidimensional scaling: a numerical method. Psychometrika 29:115–129

Kruskal JB (1969) Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new ‘index of condensation’. In: Milton RC, Nelder JA (eds) Statistical computation, New York Kruskal JB (1972) Linear transformations of multivariate data to reveal clustering. In: Multidimensional scaling: theory and application in the behavioural sciences, I, theory. Seminra Press, New York Krzanowski WJ, Marriott FHC (1994) Multivariate analysis, Part 1. Distributions, ordination and inference. Arnold, London Krzanowski WJ (2000) Principles of multivariate analysis: A user’s perspective, 2nd edn. Oxford University Press, Oxford Krzanowski WJ (1984) Principal component analysis in the presence of group structure. Appl Statist 33:164–168 Krzanowski WJ (1979) Between-groups comparison of principal components. J Am Statist Assoc 74:703–707 Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25. Curran Associates, Red Hook, NY, pp 1097–1105 Kubáˇckouá L, Kubáˇcek L, Kukuˇca J (1987) Probability and statistics in geodesy and geophysics. Elsevier, Amsterdam Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22:79–86 Kundu PK, Allen JS (1976) Some three-dimensional characteristics of low-frequency current fluctuations near the Oregon coast. J Phys Oceanogr 6:181–199 Kutzbach JE (1967) Empirical eigenvectors of sea-level pressure, surface temperature and precipitation complexes over North America. J Appl Meteor 6:791–802 Kwon S (1999) Clutering in multivariate data: visualization, case and variable reduction. Ph.D. Thesis, Iowa State University Kwasniok F (1996) The reduction of complex dynamical systems using principal interaction patterns. Physica D 92:28–60 Kwasniok F (1997) Optimal Galerkin approximations of partial differential equations using principal interaction patterns. Phys Rev E 55:5365–5375 Kwasniok F (2004) Empirical low-order models of barotropic flow. J Atmos Sci 61:235–245 Labitzke K, van Loon H (1999) The stratosphere. Springer, New York Laplace PS (1951) A philosophical essay on probabilities. Dover Publications, New York Larsson E, Fornberg B (2003) A numerical study of some radial basis function based solution methods for elliptic PDEs. Comput Math Appli 47:37–55 Laughlin S (1981) A simple coding procedure enhances a neuron’s information capacity. Z Natureforsch 36c:910–912 Lawley DN (1956) Tests of significance for the latent roots of covariance and correlation matrices. Biometrika 43:128–136 Lawley DN, Maxwell AE (1971) Factor analysis as a statistical method, 2nd edn. Butterworth, London Lazante JR (1990) The leading modes of 10–30 day variability in the extratropics of the Northern Hemisphere during the cold season. J Atmos Sci 47:2115–2140 LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444. https://doi.org/10.1038/ nature14539 Ledoit O, Wolf M (2004) A well-conditioned estimator for large-dimensional covariance matrices. J Multivar Anal 88:365–411 Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401:788–791 Legates DR (1991) The effect of domain shape on principal components analyses. Int J Climatol 11:135–146 Legates DR (1993) The effect of domain shape on principal components analyses: A reply. Int J Climatol 13:219–228

Leith CE (1973) The standard error of time-average estimates of climatic means. J Appl Meteorol 12:1066–1069 Leloup JA, Lachkar Z, Boulanger JP, Thiria S (2007) Detecting decadal changes in ENSO using neural networks. Clim Dyn 28:147–162. https://doi.org/10.1007/s00382-006-0173-1. ISSN: 0930-7575 Leurgans SE, RA Moyeed, Silverman BW (1993) Canonical correlation analysis when the data are curves. J R Statist Soc B 55:725–740 Li G, Ren B, Yang C, Zheng J (2011a) Revisiting the trend of the tropical and subtropical Pacific surface latent heat fluxduring 1977–2006. J Geophys Res 116:D10115. https://doi.org/10.1029/ 2010JD015444 Li G, Ren B, Zheng J, Yang C (2011b) Trend singular value decomposition analysis and its application to the global ocean surfacelatent heat flux and SST anomalies. J Climate 24:2931– 2948 Lin G-F, Chen L-H (2006) Identification of homogeneous regions for regional frequency analysis using the self-organizing map. J Hydrology 324:1–9. ISSN: 0022-1694 Lingoes JC, Roskam EE (1973) A mathematical and empirical analysis of two multidimensional analysis scaling algorithms. Psychometrika 38(Monograph supplement):1–93 Linz P, Wang RLC (2003) Exploring numerical methods: An introduction to scientific computing using MATLAB. Jones and Bartlett Publishers, Sudbury, MA Lim Y-K, Kim K-Y (2006) A new perspective on the climate prediction of Asian summer monsoon precipitation. J Climate 19:4840–4853 Lim Y-K, Cocke S, Shin DW, Schoof JT, LaRow TE, O’Brien JJ (2010) Downscaling large-scale NCEP CFS to resolve fine-scale seasonal precipitation and extremes for the crop growing seasons over the southeastern United States. Clim Dyn 35:449–471 Liu Y, Weisberg RH (2007) Ocean currents and sea surface heights estimated across the West Florida Shelf. J Phys Oceanog 37:1697–1713. ISSN: 0022-3670 Liu Y, Weisberg RH, Mooers CNK (2006) Performance evaluation of the selforganizing map for feature extraction. J Geophys Res 111:C05018. https://doi.org/10.1029/2005JC003117. ISSN: 0148-0227 Liu Y, Weisberg RH (2005) Patterns of ocean current variability on the West Florida Shelf using the self-organizing map. J Geophys Res 110:C06003. https://doi.org/10.1029/2004JC002786 Loève M (1948) Functions aléatoires du second order. Suplement to P. Lévy: Processus Stochastiques et Mouvement Brownien. Gauthier-Villars, Paris Loève M (1963) Probability theory. Van Nostrand Reinhold, New York Loève M (1978) Probability theory, vol II, 4th edn. Springer, 413 p Lorenz EN (1963) Deterministic non-periodic flow. J Atmos Sci 20:130–141 Lorenz EN (1970) Climate change as a mathematical problem. J Appl Meteor 9:325–329 Lorenz EN (1956) Empirical orthogonal functions and statistical weather prediction. Technical report, Statistical Forecast Project Report 1, Dept. of Meteor., MIT, 49 p Losada IJ, Reguero BG, Méndez FJ, Castanedo S, Abascal AJ, Minguez R (2013) Long-term changes in sea-level components in Latin America and the Caribbean. Global Planetary Change 104:34–50 Lucio JH, Valdés R, Rodríguez LR (2012) Improvements to surrogate data methods for nonstationary time series. Phys Rev E 85:056202 Luxburg U (2007) A tutorial on spectral clustering. Statist Comput 17:395–416 Lütkepoch H (2006) New introduction to multiple time series analysis. Springer, Berlin Madden RA, Julian PR (1971) Detection of a 40–50 day oscillation in the zonal wind in the tropical pacific. J Atmos Sci 28:702–708 Madden RA, Julian PR (1972) Description of global-scale circulation cells in the tropics with a 40–50 day period. 
J Atmos Sci 29:1109–1123 Madden RA, Julian PR (1994) Observations of the 40–50-day tropical oscillation−A review. Mon Wea Rev 122:814–837 Magnus JR, Neudecker H (1995) Matrix differential calculus with applications in statistics and econometrics. Wiley, Chichester

570

References

Malik N, Bookhagen B, Marwan N, Kurths J (2012) Analysis of spatial and temporal extreme monsoonal rainfall over South Asia using complex networks. Clim Dyn 39:971–987. https:// doi.org/10.1007/s00382-011-1156-4 Malozemov VN, Pevnyi AB (1992) Fast algorithm of the projection of a point onto the simplex. Vestnik St. Petersburg University 1(1):112–113 Mansour A, Jutten C (1996) A direct solution for blind separation of sources. IEEE Trans Signal Process 44:746–748 Mardia KV, Kent TJ, Bibby MJ (1979) Multivariate analysis. Academic Press, London Mardia KV (1980) Tests of univariate and multivariate normality. In: Krishnaiah PR (ed) Handbook of statistics 1: Analysis of variance. North-Holland Publishing, pp 279–320 Martinez WL, Martinez AR (2002) Computational statistics handbook with MATLAB. Chapman and Hall, Boca Raton Martinez AR, Solka J, Martinez WL (2010) Exploratory data analysis with MATLAB, 2nd edn. CRS Press, 530 p Maruyama T (1997) The quasi-biennial oscillation (QBO) and equatorial waves−A historical review. Pap Meteorol Geophys 47:1–17 Marwan N, Donges JF, Zou Y, Donner RV, Kurths J (2009) Complex network approach for recurrence analysis of time series. Phys Lett A 373:4246–4254 Mathar R (1985) The best Euclidean fit to a given distance matrix in prescribed dimensions. Linear Algebra Appl 67:1–6 Matsubara Y, Sakurai Y, van Panhuis WG, Faloutsos C (2014) FUNNEL: automatic mining of spatially coevolving epidemics. In: KDD, pp 105–114 https://doi.org/10.1145/2623330. 2623624 Matthews AJ (2000) Propagation mechanisms for the Madden-Julian oscillation. Q J R Meteorol Soc 126:2637–2651 Masani P (1966) Recent trends in multivariate prediction theory. In: Krishnaiah P (ed) Multivariate analysis – I. Academic Press, New York, pp 351–382 Mazloff MR, Heimbach P, Wunch C (2010) An eddy-permitting Southern Ocean State Estimate. J Phys Oceano 40:880–899 McCullagh P, Nelder JA (1989) Generalized linear models, 2nd edn. Chapman & Hall, London, 511 p McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull Math Biophys 5:115–133 McDonald AJ, Cassano JJ, Jolly B, Parsons S, Schuddeboom A (2016) An automated satellitecloud classification scheme usingself-organizing maps: Alternative ISCCP weather states. J Geophys Res Atmos 121:13,009–13,030. https://doi.org/10.1002/2016JD025199 McEliece RJ (1977) The theory of information and coding. Addison-Wesley, Reading, MA McGee VE (1968) Multidimensional scaling of N sets of similarity measures: a nonmetric individual differences approach. Multivar Behav Res 3:233–248 McLachlan GJ, Krishnan T (1997) The EM algorithm and extensions. Wiley, New York McLachlan GJ (2004) Discriminant analysis and statistical pattern recognition. Wiley Interscience, 545 p Meila M, Shi J (2000) Learning segmentation by random walks. In: Proceedings of NIPS, pp 873– 879 Mercer T (1909) Functions of positive and negative type and their connection with the theory of integral equations. Trans Lond Phil Soc A 209:415–446 Merrifield MA, Winant CD (1989) Shelf circulation in the Gulf of California: A description of the variability. J Geophys Res 94:18133–18160 Merrifield MA, Guza RT (1990) Detecting propagating signals with complex empirical orthogonal functions: A cautionary note. J Phys Oceanogr 20:1628–1633 Mestas-Nuñez AM (2000) Orthogonality properties of rotated empirical modes. Int J Climatol 20:1509–1516 Metropolis N, Rosenbluth A, Rosenbluth M, Teller A, Teller E (1953) Equation of state calculation by fast computing machines. 
J Chem Phys 21:1087–1092

References

571

Meyer Y (1992) Wavelets and operators. Cambridge University Press, New York, 223 p Meza–Padilla R, Enriquez C, Liu Y, Appendini CM (2019) Ocean circulation in the western Gulf of Mexico using self–organizing maps. J Geophys Res Oceans 124:4152–4167. https://doi.org/ 10.1029/2018JC014377 Michelli CA (1986) Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constr Approx 2:11–22 Michelot C (1986) A finite algorithm for finding the projection of a point onto the canonical simplex of Rn . JOTA 50:195–200 Mirsky L (1955) An introduction to linear algebra. Oxford University Press, Oxford, 896pp Mitchell TM (1998) Machine learning. McGraw-Hill, New York, 432 p Mikhlin SG (1964) Integral equations, 2nd edn. Pergamon Press, London Minnotte MC, West RW (1999) The data image: A tool for exploring high dimensional data sets. In: Proc. ASA section on stat. graphics, Dallas, TX, American Statistical Association, pp 25–33 Moiseiwitsch BL (1977) Integral equations. Longman, London Monahan AH, DelSole T (2009) Information theoretic measures of dependence, compactness, and non-Gaussianity for multivariate probability distributions. Nonlin Process Geophys 16:57–64 Monahan AH, Fyfe CJ (2007) Comment on the shortcomings of nonlinear principal component analysis in identifying circulation regimes. J Climate 20:374–377 Monahan, A.H., L. Pandolfo, Fyfe JC (2001) The prefered structure of variability of the Northern Hemisphere atmospheric circulation. Geophys Res Lett27:1139–1142 Monahan AH, Tangang FT, Hsieh WW (1999) A potential problem with extended EOF analysis of standing wave fields. Atmosphere-Ocean 3:241–254 Monahan AH, Fyfe JC, Flato GM (2000) A regime view of northern hemisphere atmospheric variability and change under global warming. Geophys Res Lett 27:1139–1142 Monahan AH (2000) Nonlinear principal component analysis by neural networks: theory and application to the Lorenz system. J Climate 13:821–835 Monahan AH (2001) Nonlinear principal component analysis: tropical Indo–Pacific sea surface temperature and sea level pressure. J Climate 14:219–233 Moody J, Darken CJ (1989) Fast learning in networks of locally-tuned processing units. Neural Comput 1:281–294 Moon TK (1996) The expectation maximization algorithm. IEEE Signal Process Mag, 47–60 Mori A, Kawasaki N, Yamazaki K, Honda M, Nakamura H (2006) A reexamination of the northern hemisphere sea level pressure variability by the independent component analysis. SOLA 2:5–8 Morup M, Hansen LK (2012) Archetypal analysis for machine Learning and data mining. Neurocomputing 80:54–63 Morozov VA (1984) Methods for solving incorrectly posed problems. Springer, Berlin. ISBN: 3540-96059-7 Morrison DF (1967) Multivariate statistical methods. McGraw-Hill, New York Morton SC (1989) Interpretable projection pursuit. Technical Report 106. Department of Statistics, Stanford University, Stanford. https://www.osti.gov/biblio/5005529-interpretable-projectionpursuit Mukhin D, Gavrilov A, Feigin A, Loskutov E, Kurths J (2015) Principal nonlinear dynamical modes of climate variability. Sci Rep 5:15510. https://doi.org/10.1038/srep15510 Munk WH (1950) On the wind-driven ocean circulation. J Metorol 7:79–93 Nadler B, Lafon S, Coifman RR, Kevrikedes I (2006) Diffusion maps, spectral clustering, and reaction coordinates of dynamical systems. Appl Comput Harmon Anal 21:113–127 Nason G (1992) Design and choice of projection indices. Ph.D. Thesis, The University of Bath Nason G (1995) Three-dimensional projection pursuit. 
Appl Statist 44:411–430 Nason GP, Sibson R (1992) Measuring multimodality. Stat Comput 2:153–160 Nelder JA, Mead R (1965) A simplex method for function minimization. Comput J 7:308–313 Newman MEJ (2006) Modularity and community structure in networks. PNAS 103:8577–8582. www.pnas.org/cgi/doi/10.1073/pnas.0601602103 Newman MEJ, Girvan M (2004) Finding and evaluating community structure in networks. Phys Rev E 69:026113. https://doi.org/10.1103/PhysRevE.69.026113

572

References

Nielsen MA (2015) Neural networks and deep learning. Determination Press North GR (1984) Empirical orthogonal functions and normal modes. J Atmos Sci 41:879–887 North GR, Bell TL, Cahalan FR, Moeng JF (1982) Sampling errors in the estimation of empirical orthogonal functions. Mon Wea Rev 110:699–706 Neumaier A, Schneider T (2001) Estimation of parameters and eigenmodes of multivariate autoregressive models. ACL Trans Math Soft 27:27–57 Nuttal AH, Carter GC (1982) Spectral estimation and lag using combined time weighting. Proc IEEE 70:1111–1125 Obukhov AM (1947) Statistically homogeneous fields on a sphere. Usp Mat Navk 2:196–198 Obukhov AM (1960) The statistically orthogonal expansion of empirical functions. Bull Acad Sci USSR Geophys Ser (English Transl.), 288–291 Ohba M, Kadokura S, Nohara D, Toyoda Y (2016) Rainfall downscaling of weekly ensemble forecasts using self-organising maps. Tellus A 68:29293. https://doi.org/10.3402/tellusa.v68. 29293 Oja E (1982) A simplified neuron model as a principal component analyzer. J Math Biol 15:267– 273 Önskog T, Franzke C, Hannachi A (2018) Predictability and non-Gaussian characteristics of the North Atlantic Oscillation. J Climate 31:537–554 Önskog T, Franzke C, Hannachi A (2020) Nonlinear time series models for the North Atlantic Oscillation. Adv Statist Clim Meteorol Oceanog 6:1–17 Osborne AR, Kirwan AD, Provenzale A, Bergamasco L (1986) A search for chaotic behavior in large and mesoscale motions in the pacific ocean. Physica D Nonlinear Phenomena 23:75–83 Overland JE, Preisendorfer RW (1982) A significance test for principal components applied to a cyclone climatology. Mon Wea Rev 110:1–4 Packard NH, Crutchfield JP, Farmer JDR, Shaw RS (1980) Geometry from a time series. Phys Rev Lett 45:712–716 Palmer CE (1954) The general circulation between 200 mb and 10 mb over the equatorial Pacific, Weather 9:3541–3549 Pang B, Yue J, Zhao G, Xu Z (2017) Statistical downscaling of temperature with the random forest model. Hindawi Adv Meteorol Article ID 7265178:11 p. https://doi.org/10.1155/2017/7265178 Panagiotopoulos F, Shahgedanova M, Hannachi A, Stephenson DB (2005) Observed trends and teleconnections of the Siberian High: a recently declining center of action. J Climate 18:1411– 1422 Parlett BN, Taylor DR, Liu ZS (1985) A look-ahead Lanczos algorithm for nonsymmetric matrices. Math Comput 44:105–124 Parzen E (1959) Statistical inference on time series by Hilbert space methods, I. Technical Report No. 23, Department of Statistics, Stanford University. (Published in Time Series Analysis Papers by E. Parzen, Holden-Day, San Francisco Parzen E (1961) An approach to time series. Ann Math Statist 32:951–989 Parzen E (1963) A new approach to the synthesis of optimal smoothing and prediction systems. In: Bellman R (ed) Proceedings of a symposium on optimization. University of California Press, Berkeley, pp 75–108 Pasmanter RA, Selten MF (2010) Decomposing data sets into skewness modes. Physica D 239:1503–1508 Pauthenet E (2018) Unraveling the thermohaline of the Southern Ocean using functional data analysis. Ph.D. thesis, Stockholm University Pauthenet E, Roquet F, Madec G, Nerini D (2017) A linear decomposition of the Southern Ocean thermohaline structure. J Phys Oceano 47:29–47 Pavan V, Tibaldi S, Brankovich C (2000) Seasonal prediction of blocking frequency: Results from winter ensemble experiments. Q J R Meteorol Soc 126:2125–2142 Pawitan Y (2001) In all likelihood: statistical modelling and inference using likelihood. 
Oxford University Press, Oxford Pearson K (1901) On lines and planes of closest fit to systems of points in space. Phil Mag 2:559– 572

References

573

Pearson K (1895) Notes on regression and inheritance in the case of two parents. Proc R Soc London 58:240–242 Pearson K (1920) Notes on the history of correlation. Biometrika 13:25–45 Pham D-T, Garrat P, Jutten C (1992) Separation of mixture of independent sources through maximum likelihood approach. In: Proc EUSIPCO, pp 771–774 Pires CAL, Hannachi A (2021) Bispectral analysis of nonlinear interaction, predictability and stochastic modelling with application to ENSO. Tellus A 73, 1–30 Plaut G, Vautard R (1994) Spells of low-frequency oscillations and weather regimes in the northern hemisphere. J Atmos sci 51:210–236 Pearson K (1902) On lines and planes of closest fit to systems of points in space. Phil Mag 2:559– 572 Penland C (1989) Random forcing and forecasting using principal oscillation patterns. Mon Wea Rev 117:2165–2185 Penland C, Sardeshmukh PD (1995) The optimal growth of tropical sea surface temperature anomalies. J Climate 8:1999–2024 Pezzulli S, Hannachi A, Stephenson DB (2005) The variability of seasonality. J Climate 18:71–88 Philippon N, Jarlan L, Martiny N, Camberlin P, Mougin E (2007) Characterization of the interannual and intraseasonal variability of west African vegetation between 1982 and 2002 by means of NOAA AVHRR NDVI data. J Climate 20:1202–1218 Pires CAL, Hannachi A (2017) Independent subspace analysis of the sea surface temperature variability: non-Gaussian sources and sensitivity to sampling and dimensionality. Complexity. https://doi.org/10.1155/2017/3076810 Pires CAL, Ribeiro AFS (2017) Separation of the atmospheric variability into non-Gaussian multidimensional sources by projection pursuit techniques. Climate Dynamics 48:821–850 Poggio T, Girosi F (1990) Networks for approximation and learning. Proc IEEE 78:1481–1497 Polya G, Latta G (1974) Complex variables. Wiley, New York, 334pp Powell MJD (1964) An efficient method for finding the minimum of a function of several variables without calculating derivatives. Comput J 7:155–162 Powell MJD (1987) Radial basis functions for multivariate interpolation: a review. In: Mason JC, Cox MG (eds) Algorithms for the approximation of functions and data. Oxford University Press, Oxford, pp 143–167 Powell MJD (1990) The theory of radial basis function approximation in (1990) In: Light W (ed) Advances in numerical analysis, Volume 2: wavelets, subdivision algorithms and radial basis functions. Oxford University Press, Oxford Preisendorfer RW, Mobley CD (1988) Principal component analysis in meteorology and oceanography. Elsevier, Amsterdam Press WH, et al (1992) Numerical recipes in Fortran: The Art of scientific computing. Cambridge University Press, Cambridge Priestly MB (1981) Spectral analysis of time series. Academic-Press, London Posse C (1995) Tools for two-dimensional exploratory projection pursuit. J Comput Graph Statist 4:83–100 Ramsay JO, Silverman BW (2006) Functional data analysis, 2nd edn. Springer Series in Statistics, New York Rasmusson EM, Arkin PA, Chen W-Y, Jalickee JB (1981) Biennial variations in surface temperature over the United States as revealed by singular decomposition. Mon Wea Rev 109:587–598 Rayner NA, Parker DE, Horton EB, Folland CK, Alexander LV, Rowell DP, Kent EC, Kaplan A (2003) Global analyses of sea surface temperature, sea ice, and night marine air temperature since the late nineteenth century. J Geophys Res 108:014, 4407. Rényi A (1961) On measures of entropy and information. 
In: Neyman J (ed) Proceedings of the Fourth Bekeley symposium on mathematical statistics and probability, vol I. The University of California Press, Berkeley, pp 547–561 Rényi A (1970) Probability theory. North Holland, Amsterdam, 666pp Reed RJ, Campbell WJ, Rasmussen LA, Rogers RG (1961) Evidence of a downward propagating annual wind reversal in the equatorial stratosphere. J Geophys Res 66:813–818

574

References

Reichenback H (1937) Les fondements logiques du calcul des probabilités. Ann Inst H Poincaré 7:267–348 Rennert KJ, Wallace MJ (2009) Cross-frequency coupling, skewness and blocking in the Northern Hemisphere winter circulation. J Climate 22:5650–5666 Renwick AJ, Wallace JM (1995) Predictable anomaly patterns and the forecast skill of northern hemisphere wintertime 500-mb height fields. Mon Wea Rev 123:2114–2131 Reusch DB, Alley RB, Hewitson BC (2005) Relative performance of self-organizing maps and principal component analysis in pattern extraction from synthetic climatological data. Polar Geography 29(3):188–212. https://doi.org/10.1080/789610199 Reyment RA, Jvreskog KG (1996) Applied factor analysis in the natural sciences. Cambridge University Press, Cambridge Richman MB (1981) Obliquely rotated principal components: An improved meteorological map typing technique. J Appl Meteor 20:1145–1159 Richman MB (1986) Rotation of principal components. J Climatol 6:293–335 Richman MB (1987) Rotation of principal components: A reply. J Climatol 7:511–520 Richman MB (1987) Rotation of principal components: A reply. J Climatol 7:511–520 Richman M (1993) Comments on: The effect of domain shape on principal components analyses. Int J Climatol 13:203–218 Richman M, Adrianto I (2010) Classification and regionalization through kernel principal component analysis. Phys Chem Earth 35:316–328 Richman MB, Leslie LM (2012) Adaptive machine learning approaches to seasonal prediction of tropical cyclones. Procedia Comput Sci 12:276–281 Richman MB, Leslie LM, Ramsay HA, Klotzbach PJ (2017) Reducing tropical cyclone prediction errors using machine learning approaches. Procedia Comput Sci 114:314–323 Ripley BD (1994) Neural networks and related methods for classification. J Roy Statist Soc B 56:409–456 Riskin H (1984) The Fokker-Planck quation. Springer Ritter H (1995) Self-organizing feature maps: Kohonen maps. In: Arbib MA (ed) The handbook of brain theory and neural networks. MIT Press, Cambridge, MA, pp 846–851 Roach GF (1970) Greens functions: introductory theory with applications. Van Nostrand Reinhold Campany, London Rodgers JL, Nicewander WA (1988) Thirten ways to look at the correlation coefficients. Am Statistician 42:59–66 Rodwell MJ, Hoskins BJ (1996) Monsoons and the dynamics of deserts. Q J Roy Meteorol Soc 122:1385–1404 Rogers GS (1980) Matrix derivatives. Marcel Dekker, New York Rojas R (1996) Neural networks: A systematic introduction. Springer, Berlin, 509 p Rosenblatt F (1962) Principles of neurodynamics. Spartman, New York Rosenblatt F (1958) The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Rev 65:386–408 Ross SM (1998) A first course in probability, 5th edn. Prentice-Hall, New Jersey Roweis ST (1998) The EM algorithm for PCA and SPCA. In: Jordan MI, Kearns MJ, Solla SA (eds) Advances in neural information processing systems, vol 10. MIT Press, Cambridge, MA Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linera embedding. Science 290:2323–2326 Rozanov YuA (1967) Stationary random processes. Holden-Day, San-Francisco Rubin DB, Thayer DT (1982) EM algorithms for ML factor analysis. Psychometrika 47:69–76 Rubin DB, Thayer DT (1983) More on EM for ML factor analysis. Psychometrika 48:253–257 Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagation errors. Nature 323:533–536 Rumelhart DE, Widrow B, Lehr AM (1994) The basic ideas in neural networks. Commun ACM 37:87–92

References

575

Runge J, Petoukhov V, Kurths J (2014) Quantifying the strength and delay of climatic interactions: the ambiguities of cross correlation and a novel measure based on graphical models. J Climate 27:720–739 Runge J, Heitzig J, Kurths J (2012) Escaping the curse of dimensionality in estimating multivariate transfer entropy. Phys Rev Lett 108:258701. https://doi.org/10.1103/PhysRevLett.108.258701 Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, Philadelphia Saad Y (1990) Numerical solution of large Lyapunov equations. In: Kaashoek AM, van Schuppen JH, Ran AC (eds) Signal processing, scattering, operator theory, and numerical methods, Proceedings of the international symposium MTNS-89, vol III, pp 503–511, Boston, Birkhauser Saad Y, Schultz MH (1985) Conjugate gradient-like algorithms for solving nonsymmetric linear systems. Math Comput 44:417–424 Said-Houari B (2015) Diffierential equations: Methods and applications. Springer, Cham, 212pp Salim A, Pawitan Y, Bond K (2005) Modelling association between two irregularly observed spatiotemporal processes by using maximum covariance analysis. Appl Statist 54:555–573 Sammon JW Jr (1969) A nonlinear mapping for data structure analysis. IEEE Trans Comput C18:401–409 Samuel AL (1959) Some studies in machine learning using the game of of checkers. IBM J Res Dev 3:211–229 Saunders DR (1961) The rationale for an “oblimax” method of transformation in factor analysis. Psychometrika 26:317–324 Scher S (2020) Artificial intelligence in weather and climate prediction. Ph.D. Thesis in Atmospheric Sciences and Oceanography, Stockholm University, Sweden 2020 Scher S (2018) Toward data-driven weather and climate forecasting: Approximating a simple general circulation model with deep learning. Geophys Res Lett 45:12,616–12,622. https:// doi.org/10.1029/2018GL080704 Scher S, Messori G (2019) Weather and climate forecasting with neural networks: using general circulation models (GCMs) with different complexity as a study ground. Geosci Model Dev 12:2797–2809 Schmidtko S, Johnson GC, Lyman JM (2013) MIMOC: A global monthly isopycnal upper-ocean climatology with mixed layers. J Geophys Res, 118. https://doi.org/10.1002/jgrc.20122 Schneider T, Neumaier A (2001) Algorithm 808: ARFit − A Matlab package for the estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Trans Math Soft 27:58–65 Schölkopf B, Smola A, Müller K-R (1998) Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput 10:1299–1319 Schölkopf B, Mika S, Burgers CJS, Knirsch P, Müller K-R, Rätsch G, Smola A (1999) Input space vs. feature space in kernel-based methods. IEEE Trans Neural Netw 10:1000–1017 Schoenberg IJ (1935) Remarks to Maurice Fréchet’s article ‘sur la définition axiomatique d’une classe e’espace distanciés vectoriellement applicable sur l’espace de Hilbert’. Ann Math (2nd series) 36:724–732 Schoenberg IJ (1964) Spline interpolation and best quadrature formulae. Bull Am Soc 70:143–148 Schott JR (1991) Some tests for common principal component subspaces in several groups. Biometrika 78:771–778 Schott JR (1988) Common principal component subspaces in two groups. Biometrika 75:229–236 Scott DW (1992) Multivariate density estimation: theory, practice, and vizualization. Wiley, New York Schuenemann KC, Cassano JJ (2010) Changes in synoptic weather patterns and Greenland precipitation in the 20th and 21st centuries: 2. Analysis of 21st century atmospheric changes using self-organizing maps, J Geophys Res 115:D05108. 
https://doi.org/10.1029/2009JD011706. ISSN: 0148-0227 Schuenemann KC, Cassano JJ, Finnis J (2009) Forcing of precipitation over Greenland: Synoptic climatology for 1961–99. J Hydrometeorol 10:60–78. https://doi.org/10.1175/2008JHM1014. 1. ISSN: 1525-7541

576

References

Scott DW, Thompson JR (1983) Probability density estimation in higher dimensions. In: Computer science and statistics: Proceedings of the fifteenth symposium on the interface, pp 173–179 Seal HL (1967) Multivariate statistical analysis for biologists. Methuen, London Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech 656(1):5–28 Seitola T, Mikkola V, Silen J, Järvinen H (2014) Random projections in reducing the dimensionality of climate simulation data. Tellus A, 66. Available at www.tellusa.net/index.php/tellusa/ article/view/25274 Seitola T, Silén J, Järvinen H (2015) Randomized multi-channel singular spectrum analysis of the 20th century climate data. Tellus A 67:28876. Available at https://doi.org/10.3402/tellusa.v67. 28876. Seltman HJ (2018) Experimental design and analysis. http://www.stat.cmu.edu/~hseltman/309/ Book/Book.pdf Seth S, Eugster MJA (2015) Probabilistic archetypal analysis. Machine Learning. https://doi.org/ 10.1007/s10994-015-5498-8 Shalvi O, Weinstein E (1990) New criteria for blind deconvolution of nonminimum phase systems (channels). IEEE Trans Inf Theory 36:312–321 Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27:379–423, 623– 656 Shepard RN (1962a) The analysis of proximities: multidimensional scaling with unknown distance function. Part I. Psychometrika 27:125–140 Shepard RN (1962b) The analysis of proximities: multidimensional scaling with unknown distance function. Part II. Psychometrika 27:219–246 Sheridan SC, Lee CC (2010) Synoptic climatology and the general circulation model. Progress Phys Geography 34:101–109. ISSN: 1477-0296 Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22:888–905 Schnur R, Schmitz G, Grieger N, von Storch H (1993) Normal modes of the atmosphere as estimated by principal oscillation patterns and derived from quasi-geostrophic theory. J Atmos Sci 50:2386–2400 Sibson R (1972) Order invariant methods for data analysis. J Roy Statist Soc B 34:311–349 Sibson R (1978) Studies in the robustness of multidimensional scaling: procrustes statistics. J Roy Statist Soc B 40:234–238 Sibson R (1979) Studies in the robustness of multidimensional scaling: Perturbational analysis of classical scaling. J Roy Statist Soc B 41:217–229 Sibson R (1981) Multidimensional scaling. Wiley, Chichester Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London Simmons AJ, Wallace MJ, Branstator WG (1983) Barotropic wave propagation and instability, and atmospheric teleconnection patterns. J Atmos Sci 40:1363–1392 Smith S (1994) Optimization techniques on Riemannian manifolds. In Hamiltonian and gradient flows, algorithm and control (Bloch A, Ed.), Field Institute Communications, Vol 3, Amer Math Soc, 113–136. Snyman JA (1982) A new and dynamic method for unconstrained optimisation. Appl Math Modell 6:449–462 Solidoro C, Bandelj V, Barbieri P, Cossarini G, Fonda Umani S (2007) Understanding dynamic of biogeochemical properties in the northern Adriatic Sea by using self-organizing maps and kmeans clustering. J Geophys Res 112:C07S90. https://doi.org/10.1029/2006JC003553. ISSN: 0148-0227 Soˇcan G (2003) The incremental value of minimum rank factor analysis. Ph.D. Thesis, University of Groningen, Groningen Spearman C (1904a) General intelligence, objectively determined and measured. Am J Psy 15:201–293

References

577

Spearman C (1904b) The proof and measurement of association between two things. Am J Psy 15:72, and 202 Spence I, Garrison RF (1993) A remarkable scatterplot. Am Statistician 47:12–19 Spendley W, Hext GR, Humsworth FR (1962) Sequential applications of simplex designs in optimization and evolutionary operations. Technometrics 4:441–461 Stewart D, Love W (1968) A general canonical correlation index. Psy Bull 70:160–163 Steinschneiders S, Lall U (2015) Daily precipitation and tropical moisture exports across the eastern United States: An application of archetypal analysis to identify spatiotemporal structure. J Climate 28:8585–8602 Stephenson G (1973) Mathematical methods for science students, 2nd edn. Dover Publication, Mineola, 526 p Stigler SM (1986) The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press, Cambridge, MA Stommel H (1948) The westward intensification of wind-driven ocean currents. EOS Trans Amer Geophys Union 29:202–206 Stone M, Brooks RJ (1990) Continuum regression: cross-validation sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. J Roy Statist Soc B52:237–269 Su Z, Hu H, Wang G, Ma Y, Yang X, Guo F (2018) Using GIS and Random Forests to identify fire drivers in a forest city, Yichun, China. Geomatics Natural Hazards Risk 9:1207–1229. https:// doi.org/10.1080/19475705.2018.1505667 Subashini A, Thamarai SM, Meyyappan T (2019) Advanced weather forecasting Prediction using deep learning. Int J Res Appl Sci Eng Tech IJRASET 7:939–945. www.ijraset.com Sura P, Hannachi A (2015) Perspectives of non-Gaussianity in atmospheric synoptic and lowfrequency variability. J Cliamte 28:5091–5114 Swenson ET (2015) Continuum power CCA: A unified approach for isolating coupled modes. J Climate 28:1016–1030 Takens F (1981) Detecting strange attractors in turbulence. In: Rand D, Young LS (eds) Dynamical systems and turbulence, warwick 1980. Lecture Notes in Mathematics, vol 898. Springer, New York, pp 366–381 Talley LD (2008) Freshwater transport estimates and the global overturning circulation: shallow, deep and throughflow components. Progress Ocenaography 78:257–303 Taylor GI (1921) Diffusion by continuous movement. Proc Lond Math Soc 20(2):196–212 Telszewski M, Chazottes A, Schuster U, Watson AJ, Moulin C, Bakker DCE, Gonzalez-Davila M, Johannessen T, Kortzinger A, Luger H, Olsen A, Omar A, Padin XA, Rios AF, Steinhoff T, Santana-Casiano M, Wallace DWR, Wanninkhof R (2009) Estimating the monthly pCO2 distribution in the North Atlantic using a self-organizing neural network. Biogeosciences 6:1405–1421. ISSN: 1726–4170 Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323 TerMegreditchian MG (1969) On the determination of the number of independent stations which are equivalent to prescribed systems of correlated stations (in Russian). Meteor Hydrol 2:24–36 Teschl G (2012) Ordinary differential equations and dynamical systems. Graduate Studies in Mathematics, vol 140, Amer Math Soc, Providence, RI, 345pp Thacker WC (1996) Metric-based principal components: data uncertainties. Tellus 48A:584–592 Thacker WC (1999) Principal predictors. Int J Climatol 19:821–834 Tikhonov AN (1963) Solution of incorrectly formulated problems and the regularization method. 
Sov Math Dokl 4:1035–1038 Theiler J, Eubank S, Longtin A, Galdrikian B, Farmer JD (1992) Testing for nonlinearity in time series: the method of surrogate data. Physica D 58:77–94 Thiebaux HJ (1994) Statistical data analyses for ocean and atmospheric sciences. Academic Press Thomas JB (1969) An introduction to statistical communication theory. Wiley Thomson RE, Emery WJ (2014) Data analysis methods in physical oceanography, 3rd edn. Elsevier, Amsterdam, 716 p

578

References

Thompson DWJ, Wallace MJ (1998) The arctic oscillation signature in wintertime geopotential height and temperature fields. Geophys Res Lett 25:1297–1300 Thompson DWJ, Wallace MJ (2000) Annular modes in the extratropical circulation. Part I: Monthto-month variability. J Climate 13:1000–1016 Thompson DWJ, Wallace JM, Hegerl GC (2000) Annular modes in the extratropical circulation, Part II: Trends. J Climate 13:1018–1036 Thurstone LL (1940) Current issues in factor analysis. Psychological Bulletin 37:189–236 Thurstone LL (1947) Multiple factor analysis. The University of Chicago Press, Chicago Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288 Tippett MK, DelSole T, Mason SJ, Barnston AG (2008) Regression based methods for finding coupled patterns. J Climate 21:4384–4398 Tipping ME, Bishop CM (1999) Probabilistic principal components. J Roy Statist Soc B 61:611– 622 Toumazou V, Cretaux J-F (2001) Using a Lanczos eigensolver in the computation of empirical orthogonal functions. Mon Wea Rev 129:1243–1250 Torgerson WS (1952) Multidimensional scaling I: Theory and method. Psychometrika 17:401–419 Torgerson WS (1958) Theory and methods of scaling. Wiley, New York Trenberth KE, Jones DP, Ambenje P, Bojariu R, Easterling D, Klein Tank A, Parker D, Rahimzadeh F, Renwick AJ, Rusticucci M, Soden B, Zhai P (2007) Observations: surface and atmospheric climate change. In: Solomon S, Qin D, Manning M, et al. (eds) Climate Change (2007) The physical science basis. Contribution of working Group I to the fourth assessment report of the intergovernmental panel on climate change. Cambridge University Press, p 235–336 Trenberth KE, Shin W-TK (1984) Quasi-biennial fluctuations is sea level pressures over the Northern Hemisphere. Mon Wea Rev 111:761–777 Trendafilov NT (2010) Stepwise estimation of common principal components. Comput Statist Data Anal 54:3446–3457 Trendafilov NT, Jolliffe IT (2006) Projected gradient approach to the numerical solution of the SCoTLASS. Comput Statist Data Anal 50:242–253 Tsai YZ, Hsu K-S, Wu H-Y, Lin S-I, Yu H-L, Huang K-T, Hu M-C, Hsu S-Y (2020) Application of random forest and ICON models combined with weather forecasts to predict soil temperature and water content in a greenhouse. Water 12:1176 Tsonis AA, Roebber PJ (2004) The architecture of the climate network. Phys A 333:497–504. https://doi.org/10.1016/j.physa.2003.10.045 Tsonis AA, Swanson KL, Roebber PJ (2006) What do networks have to do with climate? Bull Am Meteor Soc 87:585–595. https://doi.org/10.1175/BAMS-87-5-585 Tsonis AA, Swanson KL, Wang G (2008) On the role of atmospheric teleconnections in climate. J Climate 21(2990):3001 Tu JH, Rowley CW, Luchtenburg DM, Brunton SL, Kutz JN (2014) On dynamic mode decomposition: Theory and applications. J Comput Dyn 1:391–421. https://doi.org/10.3934/jcd.2014.1. 391 Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31:279– 311 Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading, MA Tukey PA, Tukey JW (1981) Preparation, prechosen sequences of views. In: Barnett V (ed) Interpreting multivariate data. Wiley, Chichester, pp 189–213 Tyler DE (1982) On the optimality of the simultaneous redundancy transformations. Psychometrika 47:77–86 Ulrych TJ, Bishop TN (1975) Maximum entropy spectral analysis and autoregressive decomposition. 
Rev Geophys Space Phys 13:183–200 Unkel S, Trendafilov NT, Hannachi A, Jolliffe IT (2010) Independent exploratory factor analysis with application to atmospheric science data. J Appl Stat 37:1847–1862 Unkel S, Trendafilov NT, Hannachi A, Jolliffe IT (2011) Independent component analysis for threeway data with an application from atmospheric science. J Agr Biol Environ Stat 16:319–338

References

579

Uppala SM, Kallberg PW, Simmons AJ, Andrae U, Bechtold VDC, Fiorino M, Gibson JK, Haseler J, Hernandez A, Kelly GA, Li X, Onogi K, Saarinen S, Sokka N, Allan RP, Andersson E, Arpe K, Balmaseda MA, Beljaars ACM, Berg LVD, Bidlot J, Bormann N, Caires S, Chevallier F, Dethof A, Dragosavac M, Fisher M, Fuentes M, Hagemann S, Hólm E, Hoskins BJ, Isaksen L, Janssen PAEM, Jenne R, Mcnally AP, Mahfouf J-F, Morcrette J-J, Rayner NA, Saunders RW, Simon P, Sterl A, Trenberth KE, Untch A, Vasiljevic D, Viterbo P, Woollen J (2005) The ERA-40 re-analysis. Q J Roy Meteorol Soc 131:2961–3012 van den Dool HM, Saha S, Johansson Å(2000) Empirical orthogonal teleconnections. J Climate 13:1421–1435 van den Dool HM (2011) An iterative projection method to calculate EOFs successively without use of the covariance matrix. In: Science and technology infusion climate bulletin NOAA’s National Weather Service. 36th NOAA annual climate diagnostics and prediction workshop, Fort Worth, TX, 3–6 October 2011. www.nws.noaa.gov/ost/climate/STIP/36CDPW/36cdpwvandendool.pdf van den Wollenberg AL (1977) Redundancy analysis: an alternative to canonical correlation analysis. Psychometrika 42:207–219 Vasicek O (1976) A test for normality based on sample entropy. J R Statist Soc B 38:54–59 Vautard R, Ghil M (1989) Singular spectrum analysis in nonlinear dynamics, with applications to paleoclimatic time series. Physica D 35:395–424 Vautard R, Yiou P, Ghil M (1992) Singular spectrum analysis: A toolkit for short, noisy chaotic signals. Physica D 58:95–126 Venables WN, Ripley BD (1994) Modern applied statistics with S-plus. McGraw-Hill, New York Vesanto J, Alhoniemi E (2000) Clustering of the self-organizing map. IEEE Trans Neural Net 11:586–600 Vesanto J (1997) Using the SOM and local models in time series prediction. In Proceedings of workshop on self-organizing maps (WSOM’97), Espo, Finland, pp 209–214 Vinnikov KY, Robock A, Grody NC, Basist A (2004) Analysis of diurnal and seasonal cycles and trends in climate records with arbitrary observations times. Geophys Res Lett 31. https://doi. org/10.1029/2003GL019196 Vilibi´c I, et al (2016) Self-organizing maps-based ocean currents forecasting system. Sci Rep 6:22924. https://doi.org/10.1038/srep22924 von Mises R (1928) Wahrscheinlichkeit, Statistik und Wahrheit, 3rd rev. edn. Springer, Vienna, 1936; trans. as Probability, statistics and truth, 1939. W. Hodge, London von Storch H (1995a) Spatial patterns: EOFs and CCA. In: von Storch H, Navarra A (eds) Analysis of climate variability: Application of statistical techniques. Springer, pp 227–257 von Storch J (1995b) Multivariate statistical modelling: POP model as a first order approximation. In: von Storch H, Navarra A (eds) Analysis of climate variability: application of statistical techniques. Springer, pp 281–279 von Storch H, Zwiers FW (1999) Statistical analysis in climate research. Cambridge University Press, Cambridge von Storch H, Xu J (1990) Principal oscillation pattern analysis of the tropical 30- to 60-day oscillation. Part I: Definition of an index and its prediction. Climate Dynamics 4:175–190 von Storch H, Bruns T, Fisher-Bruns I, Hasselmann KF (1988) Principal oscillation pattern analysis of the 30- to 60-day oscillation in a general circulation model equatorial troposphere. J Geophys Res 93:11022–11036 von Storch H, Bürger G, Schnur R, Storch J-S (1995) Principal ocillation patterns. A review. 
J Climate 8:377–400 von Storch H, Baumhefner D (1991) Principal oscillation pattern analysis of the tropical 30- to 60-day oscillation. Part II: The prediction of equatorial velocity potential and its skill. Climate Dynamics 5:1–12 Wahba G (1979) Convergence rates of “Thin Plate” smoothing splines when the data are noisy. In: Gasser T, Rosenblatt M (eds) Smoothing techniques for curve estimation. Lecture notes in mathematics, vol 757. Springer, pp 232–246

580

References

Wahba G (1990) Spline models for observational data SIAM. Society for Industrial and Applied Mathematics, Philadelphia, PA, 169 p Wahba G (2000) Smoothing splines in nonparametric regression. Technical Report No 1024, Department of Statistics, University of Wisconsin. https://www.stat.wisc.edu/sites/default/files/ tr1024.pdf Walker GT (1909) Correlation in seasonal variation of climate. Mem Ind Met Dept 20:122 Walker GT (1923) Correlation in seasonal variation of weather, VIII, a preliminary study of world weather. Mem Ind Met Dept 24:75–131 Walker GT (1924) Correlation in seasonal variation of weather, IX. Mem Ind Met Dept 25:275–332 Walker GT, Bliss EW (1932) World weather V. Mem Roy Met Soc 4:53–84 Wallace JM (2000) North Atlantic Oscillation/annular mode: Two paradigms–one phenomenon. QJR Meteorol Soc 126:791–805 Wallace JM, Dickinson RE (1972) Empirical orthogonal representation of time series in the frequency domain. Part I: Theoretical consideration. J Appl Meteor 11:887–892 Wallace JM (1972) Empirical orthogonal representation of time series in the frequency domain. Part II: Application to the study of tropical wave disturbances. J Appl Meteor 11:893–900 Wallace JM, Gutzler DS (1981) Teleconnections in the geopotential height field during the Northern Hemisphere winter. Mon Wea Rev 109:784–812 Wallace JM, Smith C, Bretherton CS (1992) Singular value decomposition of wintertime sea surface temperature and 500-mb height anomalies. J Climate 5:561–576 Wallace JM, Thompson DWJ (2002) The Pacific Center of Action of the Northern Hemisphere annular mode: Real or artifact? J Climate 15:1987–1991 Walsh JE, Richman MB (1981) Seasonality in the associations between surface temperatures over the United States and the North Pacific Ocean. Mon Wea Rev 109:767–783 Wan EA (1994) Time series prediction by using a connectionist network with internal delay lines. In: Weigend AS, Gershenfeld NA (eds) Time series prediction: forecasting the future and understanding the past. Addison-Wesley, Boston, MA, pp 195–217 Wang D, Arapostathis A, Wilke CO, Markey MK (2012) Principal-oscillation-pattern analysis of gene expression. PLoS ONE 7 7:1–10. https://doi.org/10.1371/journal.pone.0028805 Wang Y-H, Magnusdottir G, Stern H, Tian X, Yu Y (2014) Uncertainty estimates of the EOFderived North Atlantic Oscillation. J Climate 27:1290–1301 Wang D-P, Mooers CNK (1977) Long coastal-trapped waves off the west coast of the United States, summer (1973) J Phys Oceano 7:856–864 Wang XL, Zwiers F (1999) Interannual variability of precipitation in an ensemble of AMIP climate simulations conducted with the CCC GCM2. J Climate 12:1322–1335 Watkins DS (2007) The matrix eigenvalue problem: GR and Krylov subspace methods. SIAM, Philadelphia Watt J, Borhani R, Katsaggelos AK (2020) Machine learning refined: foundation, algorithms and applications, 2nd edn. Cambridge University Press, Cambridge, 574 p Weare BC, Nasstrom JS (1982) Examples of extended empirical orthogonal function analysis. Mon Wea Rev 110:481–485 Wegman E (1990) Hyperdimensional data analysis using parallel coordinates. J Am Stat Assoc 78:310–322 Wei WWS (2019) Multivariate time series analysis and applications. Wiley, Oxford, 518 p Weideman JAC (1995) Computing the Hilbert transform on the real line. Math Comput 64:745–762 Weyn JA, Durran DR, Caruana R (2019) Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential heighjt from historical weather data. J Adv Model Earth Syst 11:2680–2693. 
https://doi.org/10.1029/2019MS001705 Werbos PJ (1990) Backpropagation through time: What it does and how to do it. Proc IEEE, 78:1550–1560 Whittle P (1951) Hypothesis testing in time series. Almqvist and Wicksell, Uppsala Whittle P (1953a) The analysis of multiple stationary time series. J Roy Statist Soc B 15:125–139 Whittle P (1953b) Estimation and information in stationary time series. Ark Math 2:423–434

References

581

Whittle P (1983) Prediction and regulation by linear least-square methods, 2nd edn. University of Minnesota, Minneapolis Widrow B, Stearns PN (1985) Adaptive signal processing. Prentice-Hall, Englewood Cliffs, NJ Wikle CK (2004) Spatio-temporal methods in climatology. In: El-Shaarawi AH, Jureckova J (eds) UNESCO encyclopedia of life support systems (EOLSS). EOLSS Publishers, Oxford, UK. Available: https://pdfs.semanticscholar.org/e11f/f4c7986840caf112541282990682f7896199. pdf Wiener N, Masani L (1957) The prediction theory of multivariate stochastic processes, I. Acta Math 98:111–150 Wiener N, Masani L (1958) The prediction theory of multivariate stochastic processes, II. Acta Math 99:93–137 Wilkinson JH (1988) The algebraic eigenvalue problem. Clarendon Oxford Science Publications, Oxford Wilks DS (2011) Statistical methods in the atmospheric sciences. Academic Press, San Diego Williams MO, Kevrekidis IG, Rowley CW (2015) A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. J Nonlin Sci 25:1307–1346 Wiskott L, Sejnowski TJ (2002) Slow feature analysis: unsupervised learning of invariances. Neural Comput 14:715–770 Wise J (1955) The autocorrelation function and the spectral density function. Biometrika 42:151– 159 Woollings T, Hannachi A, Hoskins BJ, Turner A (2010) A regime view of the North Atlantic Oscillation and its response to anthropogenic forcing. J Climate 23:1291–1307 Wright RM, Switzer P (1971) Numerical classification applied to certain Jamaican eocene numuulitids. Math Geol 3:297–311 Wunsch C (2003) The spectral description of climate change including the 100 ky energy. Clim Dyn 20:353–363 Wu C-J (1996) Large optimal truncated low-dimensional dynamical systems. Discr Cont Dyn Syst 2:559–583 Xinhua C, Dunkerton TJ (1995) Orthogonal rotation of spatial patterns derived from singular value decomposition analysis. J Climate 8:2631–2643 Xu J-S (1993) The joint modes of the coupled atmosphere-ocean system observed from 1967 to 1986. J Climate 6:816–838 Xue Y, Cane MA, Zebiak SE, Blumenthal MB (1994) On the prediction of ENSO: A study with a low order Markov model. Tellus 46A:512–540 Young GA, Smith RL (2005) Essentials of statistical inference. Cambridge University Press, New York, 226 p. ISBN-10: 0-521-54866-7 Young FW (1987) Multidimensional scaling: history, theory and applications. Lawrence Erlbaum, Hillsdale, New Jersey Young FW, Hamer RM (1994) Theory and applications of multidimensional scaling. Eribaum Associates, Hillsdale, NJ Young G, Householder AS (1938) Discussion of a set of points in terms of their mutual distances. Psychometrika 3:19–22 Young G, Householder AS (1941) A note on multidimensional psycho-physical analysis. Psychometrika 6:331–333 Yu Z-P, Chu P-S, Schroeder T (1997) Predictive skills of seasonal to annual rainfall variations in the U.S. Affiliated Pacific Islands: Canonical correlation analysis and multivariate principal component regression approaches. J Climate 10:2586–2599 Zveryaev II, Hannachi AA (2012) Interannual variability of Mediterranean evaporation and its relation to regional climate. Clim Dyn. https://doi.org/10.1007/s00382-011-1218-7 Zveryaev II, Hannachi AA (2016) Interdecadal changes in the links between Mediterranean evaporation and regional atmospheric dynamics during extended cold season. Int J Climatol. https://doi.org/10.1002/joc.4779 Zeleny, M (1987) Management support systems: towards integrated knowledge management. Human Syst Manag 7:59–70

582

References

Zhang G, Patuwo BE, Hu MY (1997) Forecasting with artificial neural networks: The state of the art. Int J Forecast 14:35–62 Zhu Y, Shasha D (2002) Statstream: Statistical monitoring of thousands of data streams in real time. In: VLDB, pp 358–369. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8732

Index

A
Absolutely summable, 157 Activation, 419 Activation function, 416, 425 Activation level, 423 Active phase, 214 Adaptive filter, 150 Adjacency matrix, 67, 169, 305 Adjoint, 344 Adjoint operator, 422 Adjoint patterns, 134 Adjoint vector, 134 Advection, 118 African jet, 164 Agulhas currents, 405 Air temperature, 199 Algebraic topology, 403 Altimetry data, 374 Amplitude, 96 Analytic functions, 102 Analytic signal, 102 Angular momentum, 101 Annual cycle, 160 Annular mode, 184 Anthropogenic, 3 APHRODITE, 216 Approximation theorem, 416 Archetypal analysis, 55, 397 Arctic-like oscillation, 233 Arctic Oscillation (AO), 42, 56, 180, 285, 288, 332, 387 ARMA process, 154 Arnoldi algorithm, 40, 138 Arnoldi method, 518 AR(1) process, 51

Artificial intelligence, 415 Artificial neural network, 416 Asimov, D., 12 Aspect ratio, 167 Assimilation, 2 Asymptotically unbiased, 256 Asymptotic approximation, 45 Asymptotic distribution, 228 Asymptotic limit, 88 Asymptotic uncertainty, 46 Atlantic Multidecadal Oscillation (AMO), 293 Atlantic Niño, 293 Atmosphere-Land-Ocean-Ice system, 2 Atmospheric models, 119 Attractors, 146–148, 295 Augmenting function, 259 Autocorrelation, 49, 50 Autocorrelation function, 48, 152, 172, 176, 481 Autocovariance function, 26, 27, 98, 156, 483 Autoregression matrix, 119 Autoregressive model, 125, 311, 425, 486, 546 Autoregressive moving-average (ARMA) processes, 48 Auxiliary matrix, 152 Average information content, 244 Average predictability, 183 Average predictability pattern (APP), 183

B
Back fitting algorithm, 258 Background noise, 149 Backpropagation, 422, 425, 426 Back-transformation, 381

Band pass filter, 108 Bandwidth, 256 Baroclinic structures, 125 Baroclinic waves, 135 Barotropic models, 56 Barotropic quasi-geostrophic model, 141 Barotropic vorticity, 440 Bartlett’s factor score, 229 Basis functions, 35, 320, 357 Bayesian framework, 411 Bayes theorem, 470 Bernoulli distribution, 475 Beta-plane, 142 Between-groups sums-of-squares, 254 Betweenness, 68 Bias, 419 Bias parameter, 421, 423 Bias-variance trade-off, 47 Biennial cycles, 132 Bimodal, 259 Bimodal behaviour, 440 Bimodality, 213, 259, 311, 314 Binary split, 435 Binomial distribution, 475–476 Bi-orthogonality, 178 Bi-quartimin criterion, 231 Bivariate cumulant, 250 Blind deconvolution, 267, 271 Blind separation, 4 Blind source separation (BSS), 266, 268 Bloch’s wave, 373 Blocked flow, 313 Bloc-Toeplitz, 159 Bock’s procedure, 254 Boltzmann H-function, 245, 271 Bootstrap, 45–47, 49, 437 Bootstrap blocks, 49 Bootstrap resampling, 48 Boundary condition, 329 Boundary currents, 405 Box plots, 17 Branch, 434 Break phases, 214 Broadband, 115 Broad-banded waves, 114 B-splines, 322, 326 Bubble, 430 Burg, 196
C
Calculus of variations, 273 Canberra, 203 Canonical correlation, 339 Canonical correlation analysis (CCA), 4, 14 Canonical correlation patterns, 341 Canonical covariance analysis (CCOVA), 344 Canonical covariance pattern, 345 Canonical variates, 254, 339 Carbon dating, 2 Categorical data, 203 Categorical predictors, 435 Cauchy distribution, 424 Cauchy principal value, 101 Cauchy sequence, 538 Cauchy’s integral formula, 101 Causal interactions, 68 Caveats, 168 Centered, 256 Centering, 23 Centering operator, 341 Central England temperature, 149 Central limit theorem (CLT), 45, 247, 277 Centroid, 167, 205 Chaotic attractor, 147 Chaotic system, 147 Chaotic time series, 48 Characteristic multipliers, 546 Chernoff, H., 17 Chi-square, 342, 385, 436 Chi-squared distance, 253 Chi-square distribution, 478 Cholesky decomposition, 320, 397 Cholesky factorisation, 505 Circular, 154 Circular autocorrelation matrix, 154 Circular covariance matrix, 154 Circulation patterns, 445 Circulation regime, 314 Classical MDS, 206 Classical scaling, 207–209 Classical scaling problem, 204 Classification, 3, 434 Climate analysis, 16 Climate change, 442 Climate change signal, 184 Climate dynamics, 3 Climate extremes, 444, 445 Climate forecast system (CFS), 443 Climate Modeling Intercomparison Project (CMIP), 368 Climate models, 2, 445 Climate modes, 3 Climate networks, 67, 68, 169, 281 Climate prediction, 440 Climatic extreme events, 68 Climatic sub-processes, 68 Climatological covariance, 183 Climatology, 23, 160 Climbing algorithms, 247 Closeness, 68 Clouds, 243 Cluster analysis, 254 Clustering, 4, 260, 397, 416, 429 Clustering index, 254 Clustering techniques, 242 Clusters, 243, 253 CMIP5, 369, 383 CMIP models, 387 Cocktail-party problem, 268 Codebook, 430 Co-decorrelation time matrix, 177 Coding principle, 280 Coherence, 492 Coherent structures, 33, 296 Cokurtosis, 293 Collinearity, 82 Combined field, 338 Common EOF analysis, 368 Common EOFs, 383 Common factors, 220, 221 Common PC, 383 Communality, 222 Communities, 306 Compact support, 173 Competitive process, 431 Complex conjugate, 26, 29, 106, 187, 191 Complex conjugate operator, 96 Complex covariance matrix, 95 Complex data matrix, 97 Complex EOFs (CEOFs), 4, 13, 94, 95 Complex frequency domain, 97 Complexified fields, 35, 95, 96 Complexified multivariate signal, 106 Complexified signal, 106 Complexified time series, 99, 100 Complexity, 1–3, 11, 55, 67, 241, 265 Complex network, 169 Complex nonlinear dynamical system, 2 Complex principal components, 94, 96 Composition operator, 138 Comprehensive Ocean-Atmosphere Data Set (COADS), 169 Conditional distribution, 223 Conditional expectations, 189, 228 Conditional probability, 129, 470 Condition number, 42 Confidence interval, 109 Confidence limits, 41, 46, 58 Conjugate directions, 525 Conjugate gradient, 422, 528 Conjugate information, 136 Connection, 68 Constrained minimization, 76, 530 Contingency table, 203 Continuous AR(1), 222 Continuous CCA, 352 Continuous curves, 321 Continuous predictor, 435 Continuum, 5 Continuum power regression, 388 Convective processes, 118 Convex combination, 398 Convex hull, 397, 398 Convex least square problems, 398, 400 Convex set, 398 Convolution, 26, 38, 107, 266 Convolutional, 422 Convolutional linear (integral) operator, 169 Convolutional NN, 440, 442 Convolving filter, 267 Coriolis parameter, 310 Correlation, 16 Correlation coefficient, 16, 21, 193 Correlation integral, 178 Correlation matrix, 24, 61 Coskewness, 293 Co-spectrum, 29, 114 Cost function, 402, 435 Coupled, 2 Coupled pattern, 33, 337 Coupled univariate AR(1), 138 Covariability, 201, 338 Covariance, 16, 21, 24 Covariance function, 36 Covariance matrix, 24, 25, 40, 53, 54, 58, 61, 88 Covariance matrix spectra, 49 Covariance matrix spectrum, 57 Critical region, 229 Cross-correlations, 93 Cross-covariance function, 27 Cross-covariance matrix, 106, 134, 137, 339, 341, 392 Cross-entropy, 417 Cross-spectra, 98, 106 Cross-spectral analysis, 93 Cross-spectral covariances, 94 Cross-spectral matrix, 97 Cross-spectrum, 28, 100, 497 Cross-spectrum matrix, 28, 29, 98, 105, 113, 114 Cross-validation (CV), 45–47, 238, 325, 330, 362, 437, 454, 458 Cross-validation score, 391 Cubic convergence, 283 Cubic spline, 19, 453, 454 Cumulant, 249, 250, 267, 275, 472–473 Cumulative distribution function, 20, 251, 302, 419, 471 Currents, 405 Curse of dimensionality, 9 Curvilinear coordinate, 321 Curvilinear trajectory, 85 Cyclone frequencies, 61 Cyclo-stationarity, 132, 373 Cyclo-stationary, 100, 125, 132, 370 Cyclo-stationary EOFs, 367, 372 Cyclo-stationary processes, 372
D
Damp, 164 Damped oscillators, 138 Damped system, 135 Damping, 56 Damping times, 138 Data-adaptive harmonic decomposition (DAHD), 169 Data analysis, 11 Data assimilation, 426 Database, 3 Data image, 448 Data-Information-Knowledge-Wisdom (DIKW), 10 Data mapping, 11 Data matrix, 22, 25, 36, 38, 41, 42, 45, 58, 61, 63, 158 Data mining, 3, 4 Data space, 223 Davidson–Fletcher–Powell, 529 Dawson’s integral, 103 Decadal modes, 317 Decay phase, 164 Decision making, 4 Decision node, 434 Decision trees, 433 Deconvolution, 4, 266, 267 Decorrelating matrix, 276 Decorrelation time, 171, 172, 176, 178, 180, 241 Degeneracy, 168, 169 Degenerate, 91, 110, 152, 165, 168 Degenerate eigenvalue, 155 Degrees of freedom (dof), 2, 11, 50 Delay coordinates, 148, 167, 317 Delayed vector, 158 Delay embedding, 169 Delay operator, 485 Delays, 148 Delay space, 149 Dendrogram, 448 Descent algorithm, 83, 187, 209, 331, 526 Descent numerical algorithm, 236 Descriptive data mining, 3 Deseasonalised, 17 Determinant, 500 Determinantal equation, 153 Deterministic, 175 Detrending, 17 Diagonalisable, 40, 66, 96 Diagonal matrix, 40 Dichotomous search, 522–523 Differenced data, 58 Differencing operator, 58 Differentiable, 242 Differentiable function, 209 Differential entropy, 244, 271 Differential manifold, 400 Differential operator, 464 Diffusion, 56, 310 Diffusion map, 304 Diffusion process, 56–58 Digital filter, 124 Dimensionality reduction, 138, 429 Dimension reduction, 3, 11, 38 Dirac delta function, 27, 103 Direct product, 502 Discontinuous spectrum, 380 Discrepancy measures, 259 Discrete categories, 433 Discrete Fourier transform, 104, 107 Discretised Laplacian, 362 Discriminant analysis, 254, 295 Discriminant function, 423 Disjoint, 469 Disorder, 244 Dispersive, 114, 164 Dispersive waves, 115 Dissimilarities, 202 Dissimilarity matrix, 202, 205, 448 Distance matrix, 207 Distortion errors, 210 Distribution ellipsoid, 63 Domain dependence, 71 Double centered dissimilarity matrix, 206 Double diagonal operator, 401 Doubly periodic, 373 Downscaling, 38, 374, 445, 451 Downward propagating signal, 93, 110 Downward signal propagating, 113 Downwelling current patterns, 447 Dual form, 396 Duality, 176 Dynamical mode decomposition (DMD), 138 Dynamical reconstruction, 147, 157 Dynamical systems, 86, 88, 118, 138, 146, 147, 169
E
Earth System Model, 369 East Atlantic pattern, 345 Easterly, 93 Eastward propagation, 125 Eastward shift, 49 East-west dipolar structure, 345 ECMWF analyses, 125 Edgeworth expansion, 275 Edgeworth polynomial expansion, 250 Effective number of d.o.f, 53 Effective numbers of spatial d.o.f, 53 Effective sample size, 46, 50 Efficiency, 247 E-folding, 138 E-folding time, 50, 51, 123, 127, 173, 310 Eigenanalysis, 35 Eigenmode, 111 Eigenspectrum, 96, 382 Eigenvalue, 37, 39, 503 Eigenvalue problems, 34, 134 Eigenvector, 39, 503 Ekman dissipation, 310 Elbow, 405 Ellipsoid of the distribution, 58 Ellipsoids, 62 Elliptical, 311 Elliptical distributions, 61 Elliptically contoured distributions, 375 Elliptical region, 65 Ellipticity, 375 El-Niño, 33, 55, 157, 293, 405 El-Niño Southern Oscillation (ENSO), 2, 11, 33, 132, 391, 412 EM algorithm, 238 Embeddings, 148, 151, 202, 204, 211 Embedding space, 148 Empirical distribution function (edf), 20, 21 Empirical orthogonal functions (EOFs), 13, 22, 34, 38 Empirical orthogonal teleconnection (EOT), 67 Emptiness, 243 Emptiness paradox, 9 Empty space phenomena, 243 Empty space phenomenon, 6, 9, 10 Energy, 456, 463 Entropy, 196, 232, 243, 244, 271, 278, 436 Entropy index, 248, 250, 255 Envelope, 103, 397 EOF rotation, 55 Epanechnikov kernel, 248 Equiprobable, 9 ERA-40 reanalyses, 167, 263 ERA-40 zonal mean zonal wind, 113 Error covariance matrix, 175, 189 E-step, 227 Euclidean distance, 203 European Centre for Medium Range Weather Forecasting (ECMWF), 92 European Re-analyses (ERA-40), 92, 212 Expansion coefficients, 38 Expansion functions, 35 Expectation (E), 227 Expectation maximisation (EM), 226 Expectation operator, 251, 262 Explained variance, 41, 53, 109, 238 Exploratory data analysis (EDA), 3 Exploratory factor analysis (EFA), 233, 239, 290 Exponential distribution, 477–478 Exponential family, 402 Exponential smoothing, 18 Exponential smoothing filter, 27 Extended EOFs, 35, 94, 139, 146, 316 Extremes, 397, 398
F
Factor analysis (FA), 4, 12, 46, 219, 224 Factor loading matrix, 234 Factor loading patterns, 233 Factor loadings, 220, 221, 230, 237 Factor model, 223, 228, 233, 238 Factor model parameters, 513–515 Factor rotation, 73 Factors, 219, 223 Factor scores, 229, 237 Factor-scores matrix, 220 Fastest growing modes, 135 FastICA, 282, 283 FastICA algorithm, 282 Fat spectrum, 149 Feasible set, 86 Feature analysis, 199 Feature extraction, 3, 11, 416 Feature space, 296, 297, 306, 392, 422 Feedback matrix, 119, 121, 122, 130, 146 Filter, 174 Filtered data matrix, 63 Filtered time series, 99 Filtering, 112, 166, 342 Filtering problem, 177 Filter matrix, 276 Filter patterns, 178 Finear filter, 26 Finite difference scheme, 361 First-order auto-regressive model, 56 First-order Markov model, 179 First-order optimality condition, 386 First-order spatial autoregressive process, 56 First-order system, 135 Fisher information, 243, 245 Fisher’s linear discrimination function, 254 Fisher–Snedecor distribution, 479 Fitted model, 228 Fixed point, 394 Fletcher-Powell method, 226 Fletcher–Reeves, 528 Floquet theory, 546 Floyd’s algorithm, 211 Fluctuation-dissipation relation, 129 Forecast, 172, 185 Forecastability, 172, 197 Forecastable component analysis (ForeCA), 196 Forecastable patterns, 196 Forecasting, 38, 416, 422 Forecasting accuracy, 130 Forecasting models, 179 Forecasting uncertainty, 442 Forecast models, 179 Forecast skill, 185 Forward-backward, 421 Forward stepwise procedure, 257 Fourier analysis, 17 Fourier decomposition, 107 Fourier series, 104, 373 Fourier spectrum, 103 Fourier transform (FT), 27, 48, 99, 102, 125, 176, 183, 187, 192, 267, 494 Fourth order cumulant, 170 Fourth order moment, 75 Fractal dimensions, 149 Fredholm eigen problem, 37 Fredholm equation, 320 Fredholm homogeneous integral equation, 359 Frequency-band, 97 Frequency domain, 94, 97, 176 Frequency response function, 29, 104, 108, 189, 191, 267 Frequency-time, 103 Friction, 93 Friedman’s index, 251 Fröbenius matrix norm, 233 Fröbenius norm, 217, 230, 285, 391, 398 Fröbenius product, 230 Fröbenius structure, 139 Full model, 228 Full rank, 54, 187 Functional analysis, 300 Functional CCA, 353 Functional EOFs, 319, 321 Functional PCs, 322 Fundamental matrix, 546
G
Gain, 27 Gamma distribution, 478 Gaussian, 19, 63, 129, 192, 243 Gaussian grid, 44, 328 Gaussianity, 375 Gaussian kernel, 301, 393 Gaussian mixture, 214 Gaussian noise, 221 General circulation models (GCMs), 387 Generalised AR(1), 139 Generalised eigenvalue problem, 61, 66, 177, 324, 327, 357, 361, 396 Generalised inverse, 501 Generalised scalar product, 190 Generating kernels, 300 Geodesic distances, 211 Geometric constraints, 70, 71 Geometric moments, 167 Geometric properties, 72 Geometric sense, 62 Geopotential height, 62, 66, 68, 125, 180, 188, 260, 262, 382, 445 Geopotential height anomalies, 181 Geopotential height re-analyses, 157 Gibbs inequality, 273 Gini index, 435 Global scaling, 209 Global temperature, 284 Golden section, 523 Goodness-of-fit, 209, 259 Gradient, 85, 242, 386 Gradient ascent, 282 Gradient-based algorithms, 283 Gradient-based approaches, 526 Gradient-based method, 268 Gradient methods, 247 Gradient optimisation algorithms, 256 Gradient types algorithms, 282 Gram matrix, 301 Gram-Schmidt orthogonalization, 397 Grand covariance matrix, 160, 165, 338 Grand data matrix, 161 Grand tour, 12 Green function, 119 Greenhouse, 450 Greenland blocking, 146 Green’s function, 129, 457, 464 Growing phase, 135 Growth phase, 164 Growth rates, 127, 138 Gulf Stream, 405 Gyres, 405

C Calculus of variations, 273 Canberra, 203

Index Canonical correlation, 339 Canonical correlation analysis (CCA), 4, 14 Canonical correlation patterns, 341 Canonical covariance analysis (CCOVA), 344 Canonical covariance pattern, 345 Canonical variates, 254, 339 Caotic time series, 48 Carbone dating, 2 Categorical data, 203 Categorical predictors, 435 Cauchy distribution, 424 Cauchy principal value, 101 Cauchy sequence, 538 Cauchy’s integral formula, 101 Causal interactions, 68 Caveats, 168 Centered, 256 Centering, 23 Centering operator, 341 Central England temperature, 149 Central limit theorem (CLT), 45, 247, 277 Centroid, 167, 205 Chaotic attractor, 147 Chaotic system, 147 Characteristic multipliers, 546 Chernoff, H., 17 Chi-square, 342, 385, 436 Chi-squared distance, 253 Chi-square distribution, 478 Cholesky decomposition, 320, 397 Cholesky factorisation, 505 Circular, 154 Circular autocorrelation matrix, 154 Circular covariance matrix, 154 Circulation patterns, 445 Circulation regime, 314 Classical MDS, 206 Classical scaling, 207–209 Classical scaling problem, 204 Classification, 3, 434 Climate analysis, 16 Climate change, 442 Climate change signal, 184 Climate dynamics, 3 Climate extremes, 444, 445 Climate forecast system (CFS), 443 Climate Modeling Intercomparison Project (CMIP), 368 Climate models, 2, 445 Climate modes, 3 Climate networks, 67, 68, 169, 281 Climate prediction, 440 Climatic extreme events, 68 Climatic sub-processes, 68

Index Climatological covariance, 183 Climatology, 23, 160 Climbing algorithms, 247 Closeness, 68 Clouds, 243 Cluster analysis, 254 Clustering, 4, 260, 397, 416, 429 Clustering index, 254 Clustering techniques, 242 Clusters, 243, 253 CMIP5, 369, 383 CMIP models, 387 Cocktail-party problem, 268 Codebook, 430 Co-decorrelation time matrix, 177 Coding principle, 280 Coherence, 492 Coherent structures, 33, 296 Cokurtosis, 293 Colinearity, 82 Combined field, 338 Common EOF analysis, 368 Common EOFs, 383 Common factors, 220, 221 Common PC, 383 Communality, 222 Communities, 306 Compact support, 173 Competitive process, 431 Complex conjugate, 26, 29, 106, 187, 191 Complex conjugate operator, 96 Complex covariance matrix, 95 Complex data matrix, 97 Complex EOFs (CEOFs), 4, 13, 94, 95 Complex frequency domain, 97 Complexified fileds, 35, 95, 96 Complexified multivariate signal, 106 Complexified signal, 106 Complexified time series, 99, 100 Complexity, 1–3, 11, 55, 67, 241, 265 Complex network, 169 Complex nonlinear dynamical system, 2 Complex principal components, 94, 96 Composition operator, 138 Comprehensive Ocean-Atmosphere Data Set (COADS), 169 Conditional distribution, 223 Conditional expectations, 189, 228 Conditional probability, 129, 470 Condition number, 42 Confidence interval, 109 Confidence limits, 41, 46, 58 Conjugate directions, 525 Conjugate gradient, 422, 528

585 Conjugate information, 136 Connection, 68 Constrained minimization, 76, 530 Contingency table, 203 Continuous AR(1), 222 Continuous CCA, 352 Continuous curves, 321 Continuous predictor, 435 Continuum, 5 Continuum power regression, 388 Convective processes, 118 Convex combination, 398 Convex hull, 397, 398 Convex least square problems, 398, 400 Convex set, 398 Convolution, 26, 38, 107, 266 Convolutional, 422 Convolutional linear (integral) operator, 169 Convolutional NN, 440, 442 Convolving fIlter, 267 Coriolis parameter, 310 Correlation, 16 Correlation coefficient, 16, 21, 193 Correlation integral, 178 Correlation matrix, 24, 61 Coskewness, 293 Co-spectrum, 29, 114 Costfunction, 402, 435 Coupled, 2 Coupled pattern, 33, 337 Coupled univariate AR(1), 138 Covariability, 201, 338 Covariance, 16, 21, 24 Covariance function, 36 Covariance matrix, 24, 25, 40, 53, 54, 58, 61, 88 Covariance matrix spectra, 49 Covariance matrix spectrum, 57 Critical region, 229 Cross-correlations, 93 Cross-covariance function, 27 Cross-covariance matrix, 106, 134, 137, 339, 341, 392 Cross-entropy, 417 Cross-spectra, 98, 106 Cross-spectral analysis, 93 Cross-spectral covariances, 94 Cross-spectral matrix, 97 Cross-spectrum, 28, 100, 497 Cross-spectrum matrix, 28, 29, 98, 105, 113, 114 Cross-validation (CV), 45–47, 238, 325, 330, 362, 437, 454, 458 Cross-validation score, 391

586 Cubic convergence, 283 Cubic spline, 19, 453, 454 Cumulant, 249, 250, 267, 275, 472–473 Cumulative distribution function, 20, 251, 302, 419, 471 Currents, 405 Curse of dimensionality, 9 Curvilinear coordinate, 321 Curvilinear trajectory, 85 Cyclone frequencies, 61 Cyclo-stationarity, 132, 373 Cyclo-stationary, 100, 125, 132, 370 Cyclo-stationary EOFs, 367, 372 Cyclo-stationary processes, 372

D Damp, 164 Damped oscillators, 138 Damped system, 135 Damping, 56 Damping times, 138 Data-adaptive harmonic decomposition (DAHD), 169 Data analysis, 11 Data assimilation, 426 Database, 3 Data image, 448 Data-Information-Knowledge-Wisdom (DIKW), 10 Data mapping, 11 Data matrix, 22, 25, 36, 38, 41, 42, 45, 58, 61, 63, 158 Data mining, 3, 4 Data space, 223 Davidson–Fletcher–Powell, 529 Dawson’s integral, 103 Deaseasonalised, 17 Decadal modes, 317 Decay phase, 164 Decision making, 4 Decision node, 434 Decision trees, 433 Deconvolution, 4, 266, 267 Decorrelating matrix, 276 Decorrelation time, 171, 172, 176, 178, 180, 241 Degeneracy, 168, 169 Degenerate, 91, 110, 152, 165, 168 Degenerate eigenvalue, 155 Degrees of freedom (dof), 2, 11, 50 Delay coordinates, 148, 167, 317 Delayed vector, 158 Delay embedding, 169

Index Delay operator, 485 Delays, 148 Delay space, 149 Dendrogram, 448 Descent algorithm, 83, 187, 209, 331, 526 Descent numerical algorithm, 236 Descriptive data mining, 3 Determinant, 500 Determinantal equation, 153 Deterministic, 175 Detrending, 17 Diagonalisable, 40, 66, 96 Diagonal matrix, 40 Dicholomus search, 522–523 Differenced data, 58 Differencing operator, 58 Differentiable function, 209 Differential entropy, 244, 271 Differential manifold, 400 Differential operator, 464 Diffierentiable, 242 Diffusion, 56, 310 Diffusion map, 304 Diffusion process, 56–58 Digital filter, 124 Dimensionality reduction, 138, 429 Dimension reduction, 3, 11, 38 Dirac delta function, 27, 103 Direct product, 502 Discontinuous spectrum, 380 Discrepancy measures, 259 Discrete categories, 433 Discrete fourier transform, 104, 107 Discretised Laplacian, 362 Discriminant analysis, 254, 295 Discriminant function, 423 Disjoint, 469 Disorder, 244 Dispersive, 114, 164 Dispersive waves, 115 Dissimilarities, 202 Dissimilarity matrix, 202, 205, 448 Distance matrix, 207 Distortion errors, 210 Distribution ellipsoid, 63 Domain dependence, 71 Double centered dissimilarity matrix, 206 Double diagonal operator, 401 Doubly periodic, 373 Downscaling, 38, 374, 445, 451 Downward propagating signal, 93, 110 Downward signal propagating, 113 Downwelling current patterns, 447 Dual form, 396

Index Duality, 176 Dynamical mode decomposition (DMD), 138 Dynamical reconstruction, 147, 157 Dynamical systems, 86, 88, 118, 138, 146, 147, 169

E Earth System Model, 369 East Atlantic pattern, 345 Easterly, 93 Eastward propagation, 125 Eastward shift, 49 East-west dipolar structure, 345 ECMWF analyses, 125 Edgeworth expansion, 275 Edgeworth polynomial expansion, 250 Effective number of d.o.f, 53 Effective numbers of spatial d.o.f, 53 Effective sample size, 46, 50 Efficiency, 247 E-folding, 138 E-folding time, 50, 51, 123, 127, 173, 310 Eigenanalysis, 35 Eigenmode, 111 Eigenspectrum, 96, 382 Eigenvalue, 37, 39, 503 Eigenvalue problems, 34, 134 Eigenvector, 39, 503 Ekman dissipation, 310 Elbow, 405 Ellipsoid of the distribution, 58 Ellipsoids, 62 Elliptical, 311 Elliptical distributions, 61 Elliptically contoured distributions, 375 Elliptical region, 65 Ellipticity, 375 El-Niño, 33, 55, 157, 293, 405 El-Niño Southern Oscillation (ENSO), 2, 11, 33, 132, 391, 412 EM algorithm, 238 Embeddings, 148, 151, 202, 204, 211 Embedding space, 148 Empirical distribution function (edf), 20, 21 Empirical orthogonal functions (EOFs), 13, 22, 34, 38 Empirical orthogonal teleconnection (EOT), 67 Emptiness paradox, 9 Emptyness, 243 Empty space phenomena, 243 Empty space phenomenon, 6, 9, 10 Energy, 456, 463 Entropy, 196, 232, 243, 244, 271, 278, 436

587 Entropy index, 248, 250, 255 Envelope, 103, 397 EOF rotation, 55 Epaneshnikov kernel, 248 Equiprobable, 9 ERA-40 reanalyses, 167, 263 ERA-40 zonal mean zonal wind, 113 Error covariance matrix, 175, 189 E-step, 227 Euclidean distance, 203 European Centre for Medium Range Weather Forecasting (ECMWF), 92 European Re-analyses (ERA-40), 92, 212 Expalined variance, 109 Expansion coefficients, 38 Expansion functions, 35 Expectation (E), 227 Expectation maximisation (EM), 226 Expectation operator, 251, 262 Explained variance, 41, 53, 238 Exploratory data analysis (EDA), 3 Exploratory factor analysis (EFA), 233, 239, 290 Exponential distribution, 477–478 Exponential family, 402 Exponential smoothing, 18 Exponential smoothing filter, 27 Extended EOFs, 35, 94, 139, 146, 316 Extremes, 397, 398

F Factor analysis (FA), 4, 12, 46, 219, 224 Factor loading matrix, 234 Factor loading patterns, 233 Factor loadings, 220, 221, 230, 237 Factor model, 223, 228, 233, 238 Factor model parameters, 513–515 Factor rotation, 73 Factors, 219, 223 Factor scores, 229, 237 Factor-scores matrix, 220 Fastest growing modes, 135 FastICA, 282, 283 FastICA algorithm, 282 Fat spectrum, 149 Feasible set, 86 Feature analysis, 199 Feature extraction, 3, 11, 416 Feature space, 296, 297, 306, 392, 422 Feedback matrix, 119, 121, 122, 130, 146 Filter, 174 Filtered data matrix, 63 Filtered time series, 99

Filtering, 112, 166, 342 Filtering problem, 177 Filter matrix, 276 Filter patterns, 178 Finite difference scheme, 361 First-order auto-regressive model, 56 First-order Markov model, 179 First-order optimality condition, 386 First-order spatial autoregressive process, 56 First-order system, 135 Fisher information, 243, 245 Fisher's linear discrimination function, 254 Fisher–Snedecor distribution, 479 Fitted model, 228 Fixed point, 394 Fletcher–Powell method, 226 Fletcher–Reeves, 528 Floquet theory, 546 Floyd's algorithm, 211 Fluctuation-dissipation relation, 129 Forecast, 172, 185 Forecastability, 172, 197 Forecastable component analysis (ForeCA), 196 Forecastable patterns, 196 Forecasting, 38, 416, 422 Forecasting accuracy, 130 Forecasting models, 179 Forecasting uncertainty, 442 Forecast models, 179 Forecast skill, 185 Forward-backward, 421 Forward stepwise procedure, 257 Fourier analysis, 17 Fourier decomposition, 107 Fourier series, 104, 373 Fourier spectrum, 103 Fourier transform (FT), 27, 48, 99, 102, 125, 176, 183, 187, 192, 267, 494 Fourth order cumulant, 170 Fourth order moment, 75 Fractal dimensions, 149 Fredholm eigen problem, 37 Fredholm equation, 320 Fredholm homogeneous integral equation, 359 Frequency-band, 97 Frequency domain, 94, 97, 176 Frequency response function, 29, 104, 108, 189, 191, 267 Frequency-time, 103 Friction, 93 Friedman's index, 251 Frobenius matrix norm, 233

Frobenius norm, 217, 230, 285, 391, 398 Frobenius product, 230 Frobenius structure, 139 Full model, 228 Full rank, 54, 187 Functional analysis, 300 Functional CCA, 353 Functional EOFs, 319, 321 Functional PCs, 322 Fundamental matrix, 546

G Gain, 27 Gamma distribution, 478 Gaussian, 19, 63, 129, 192, 243 Gaussian grid, 44, 328 Gaussianity, 375 Gaussian kernel, 301, 393 Gaussian mixture, 214 Gaussian noise, 221 General circulation models (GCMs), 387 Generalised AR(1), 139 Generalised eigenvalue problem, 61, 66, 177, 324, 327, 357, 361, 396 Generalised inverse, 501 Generalised scalar product, 190 Generating kernels, 300 Geodesic distances, 211 Geometric constraints, 70, 71 Geometric moments, 167 Geometric properties, 72 Geometric sense, 62 Geopotential height, 62, 66, 68, 125, 180, 188, 260, 262, 382, 445 Geopotential height anomalies, 181 Geopotential height re-analyses, 157 Gibbs inequality, 273 Gini index, 435 Global scaling, 209 Global temperature, 284 Golden section, 523 Goodness-of-fit, 209, 259 Gradient, 85, 242, 386 Gradient ascent, 282 Gradient-based algorithms, 283 Gradient-based approaches, 526 Gradient-based method, 268 Gradient methods, 247 Gradient optimisation algorithms, 256 Gradient-type algorithms, 282 Gram matrix, 301 Gram-Schmidt orthogonalization, 397

Grand covariance matrix, 160, 165, 338 Grand data matrix, 161 Grand tour, 12 Green's function, 119, 129, 457, 464 Greenhouse, 450 Greenland blocking, 146 Growing phase, 135 Growth phase, 164 Growth rates, 127, 138 Gulf Stream, 405 Gyres, 405

H Hadamard, 285, 401 Hadamard product, 328, 502 HadCRUT2, 183 HadGEM2-ES, 369 HadISST, 332 Hadley Centre sea ice and sea surface temperature (HadISST), 290 Hamiltonian systems, 136 Hankel matrix, 169 Heavy tailed distributions, 252 Hellinger distance, 259 Henderson filter, 18 Hermite polynomials, 249, 252, 410 Hermitian, 29, 96, 98, 137, 492, 502 Hermitian covariance matrix, 106, 109 Hermitian matrix, 109 Hessian matrix, 526 Hexagonal, 430 Hidden, 4 Hidden dimension, 12 Hidden factors, 269 Hidden variables, 220 Hierarchical clustering, 448 High dimensionality, 3, 11 Higher-order cumulants, 284 Higher order moments, 266, 267 High-order singular value decomposition, 293 Hilbert canonical correlation analysis (HCCA), 137 Hilbert EOFs, 95, 97, 100, 109, 113, 161 Hilbert filter, 102 Hilbert PC, 113 Hilbert POPs (HPOPs), 136 Hilbert singular decomposition, 101 Hilbert space, 36, 138, 190, 344 Hilbert transform, 97, 101, 102, 105, 107, 109, 136, 145 Homogeneous diffusion processes, 56 Hovmöller diagram, 31

Hybrid, 185 Hyperbolic tangent, 421, 440 Hypercube, 6, 8 Hyperspaces, 241 Hypersphere, 5, 7 Hypersurface, 86, 296 Hypervolumes, 6, 243 Hypothesis of independence, 45

I ICA rotation, 286 Ill-posed, 344 Impulse response function, 27 Independence, 266, 470 Independent and identically distributed (IID), 50, 224 Independent component analysis (ICA), 55, 266 Independent components, 63, 268 Independent principal components, 286 Independent sample size, 50 Independent sources, 265, 293 Indeterminacy, 352 Indian monsoon rainfall, 66 Indian Ocean dipole (IOD), 58, 412 Indian Ocean SST anomalies, 58 Inference, 3 Infomax, 280, 282, 283 Info-max approach, 278 Information, 244 Information capacity, 280 Information-theoretic approaches, 270 Information theory, 243, 244, 271 Initial condition, 140 Initial random configurations, 209 Inner product, 36, 401, 536 Insolation, 199 Instability, 71 Instantaneous frequency, 102, 112, 113 Integrability, 190, 191 Integrable functions, 299 Integral operator, 299, 300 Integrated power, 177 Integro-differential equations, 320, 326, 357, 359 Integro-differential operator, 328 Integro-differential system, 360 Interacting molecules, 1 Interacting space/time scales, 2 Inter-dependencies, 67, 69 Interesting, 242, 243 Interesting features, 15 Interestingness, 243–245

Interesting structures, 243, 264 Intergovernmental Panel on Climate Change (IPCC), 183 Intermediate water, 322 Interpoint distance, 201, 368, 430 Interpoint distance matrix, 449 Interpolated, 17 Interpolation, 189 Interpolation covariance matrix, 194 Interpolation error, 190, 241 Interpolation filter, 190 Interpretation, 11 Inter-tropical convergence zone (ITCZ), 147 Intraseasonal time scale, 132 Intrinsic mode of variability, 58 Invariance, 72 Invariance principle, 237 Inverse Fourier transform, 48 Inverse mapping, 307 Invertibility, 191 Invertible, 395 Invertible linear transformation, 121 Irish precipitation, 391 Isomap, 211, 212, 411 Isopycnal, 322 Isotropic, 105, 283 Isotropic kernels, 393 Isotropic turbulence, 53 Isotropic uniqueness, 237 Isotropy, 253 Iterative methods, 43 Iterative process, 260

J Jacobian, 272, 278, 283 Jacobian operator, 310 JADE, 284 Japanese reanalyses, 314 Japan Meteorological Agency, 383 Johnson-Lindenstrauss Lemma, 368–370 Joint distribution, 223 Joint entropy, 280 Joint probability density, 269 Joint probability density function, 473 JRA-55, 314 Jump in the spectrum, 387

K Karhunen–Loève decomposition, 155 Karhunen–Loève equation, 373 Karhunen–Loève expansion, 36, 37, 91

Kelvin wave, 100, 162 Kernel, 169, 300, 465 Kernel CCA, 395 Kernel density estimate, 280 Kernel density estimation, 255 Kernel EOF, 297 Kernel function, 19, 37, 299 Kernel matrix, 301, 397 Kernel MCA, 392 Kernel methods, 280 Kernel PCA, 297 Kernel PDF, 259 Kernel POPs, 317 Kernel smoother, 256, 260, 280, 465 Kernel smoothing, 19 Kernel transformation, 301 Kernel trick, 299 k-fold CV, 47 Khatri-Rao, 291 K-L divergence, 273, 274 k-means, 254, 305, 397 k-means clustering, 399 k-nearest neighbors, 212 Kohonen network, 429 Kolmogorov formula, 175 Kolmogorov-Smirnov distance, 253 Kolmogorov-Wiener approach, 174 Koopman operator, 138 Kriging, 464 Kronecker matrix product, 291 Kronecker symbol, 133 Krylov method, 138 Krylov subspace, 40, 43 Kullback-Leibler distance, 259 Kullback-Leibler (K-L) divergence, 272, 277 Kuroshio, 405 Kuroshio current, 317 Kurtosis, 53, 170, 259, 264, 267, 281, 282

L Lag-1 autocorrelations, 379 Lagged autocorrelation matrix, 150 Lagged autocovariance, 50 Lagged covariance matrix, 94, 98, 99, 159, 175, 180 Lagrange function, 516 Lagrange multiplier, 39, 76, 255, 273, 283, 340, 395 Lagrangian, 77, 331, 396 Lagrangian function, 327 Lagrangian method, 532 Lagrangian multipliers, 358 Lanczos, 40, 42

Lanczos method, 517–518 La-Niña, 33, 405 Laplace probability density function, 270 Laplace-Beltrami differential operator, 304 Laplacian, 305, 310, 360, 361 Laplacian matrix, 304, 317 Laplacian operator, 327, 361 Laplacian spectral analysis, 317 Large scale atmosphere, 264 Large scale atmospheric flow, 295 Large scale flow, 384 Large scale processes, 167 Large scale teleconnections, 445 Latent, 220 Latent heat fluxes, 383 Latent patterns, 4 Latent space, 223 Latent variable, 11, 12, 56 Lattice, 431 Leading mode, 44 Leaf, 434 Learning, 416, 425 Least Absolute Shrinkage and Selection Operator (LASSO), 82 Least square, 64, 82, 165, 174 Least squares regression, 388 Leave-one-out CV, 47 Leave-one-out procedure, 391 Legendre polynomials, 251 Leptokurtic, 270, 473 Likelihood, 385 Likelihood function, 385 Likelihood ratio statistic, 228 Lilliefors test, 59 Linear convergence, 283 Linear discriminant analysis, 296 Linear filter, 26, 29, 102, 189, 210, 266, 267 Linear growth, 125 Linear integral operator, 299 Linear inverse modeling (LIM), 129 Linearisation, 135 Linearised physical models, 71 Linear operator, 26, 500 Linear programming, 521 Linear projection, 241, 243 Linear space, 295 Linear superposition, 55 Linear transformation, 268 Linear trend, 193 Linkage, 448 Loading coefficients, 54 Loadings, 38, 54, 74, 75, 223 Loading vectors, 372

Local averages, 11 Local concepts, 11 Localized kernel, 302 Local linear embedding, 211 Logistic, 419 Logistic function, 279, 280 Logistic regression, 417 Log-likelihood, 224, 238 Long-memory, 173, 179, 180 Long short-term memory (LSTM), 440 Long-term statistics, 1 Long term trends, 180 Lorenz, E.N., 147 Lorenz model, 440 Lorenz system, 157 Low frequency, 35, 180 Low-frequency modes, 184 Low-frequency patterns, 194, 446 Low-frequency persistent components, 199 Low-level cloud, 445 Low-level Somali jet, 212 Low-order chaotic models, 310 Low-order chaotic systems, 148 Lyapunov equation, 520 Lyapunov function, 263

M Machine learning, 3, 415 Madden-Julian oscillation (MJO), 91, 132, 146, 164, 184 Mahalanobis distance, 459 Mahalanobis metrics, 203 Mahalanobis signal, 183 Manifold, 86, 211, 223, 295, 303 Map, 22 Marginal density function, 269 Marginal distribution, 223 Marginal pdfs, 473 Marginal probability density, 279, 280 Marginal probability density functions, 65 Markov chains, 305 Markovian time series, 173 Markov process, 118 Matching unit, 430 Mathematical programming, 531 Matlab, 23, 24, 26, 40, 77, 109, 161, 345, 377 Matrix derivative, 506–512 Matrix inversion, 342 Matrix norm, 216 Matrix of coordinates, 205 Matrix optimisation, 398 Matrix optimisation problem, 63 Maximization, 227

Maximization problem, 74 Maximum covariance analysis (MCA), 337, 344 Maximum entropy, 274 Maximum likelihood, 46, 62 Maximum likelihood method (MLE), 224 Maximum variance, 38, 40 Mean sea level, 383 Mean square error, 37, 47 Mediterranean evaporation, 66, 345 Mercer kernel, 37 Mercer's theorem, 299 Meridional, 94 Meridional overturning circulation, 199 Mesokurtic, 473 Metric, 202 Mid-tropospheric level, 68 Minimum-square error, 175 Minkowski distance, 202 Minkowski norm, 217 Mis-fit, 458 Mixed-layer, 322 Mixing, 42, 55, 412 Mixing matrix, 268, 269, 276 Mixing problem, 55, 284, 375 Mixing property, 375 Mixture, 268 Mixture model, 238 Mode analysis, 64 Model evaluation, 384 Model simulations, 2 Mode mixing, 373 Modes, 56 Modes of variability, 38 Modularity, 306 Modularity matrix, 306 Moisture, 164, 445 Moment matching, 53 Moments, 250, 259, 269 Momentum, 136 Monomials, 299 Monotone regression, 208 Monotonicity, 430 Monotonic order, 385 Monotonic transformation, 210, 375 Monsoon, 114, 345 Monte Carlo, 45, 46, 48, 343, 379, 525 Monte Carlo approach, 260 Monte-Carlo bootstrap, 49 Monthly mean SLP, 77 Moore–Penrose inverse, 501 Most predictable patterns, 185 Moving average, 18, 150 Moving average filter, 27

MPI-ESM-MR, 369 M-step, 227 Multichannel SSA (MSSA), 157 Multicollinearity, 343 Multidimensional scaling (MDS), 4, 201, 242, 254 Multilayer perceptron (MLP), 423 Multimodal, 243 Multimodal data, 249 Multinormal, 58 Multinormal distribution, 130 Multinormality, 46 Multiplicative decomposition, 259 Multiplicity, 154 Multiquadratic, 424 Multispectral images, 256 Multivariate AR(1), 219 Multivariate filtering problem, 29 Multivariate Gaussian distribution, 8 Multivariate normal, 62, 245 Multivariate normal distribution, 9 Multivariate normal IID, 362 Multivariate POPs, 138 Multivariate random variable, 23, 24 Multivariate spectrum matrix, 199 Multivariate t-distribution, 61 Mutual information, 64, 273–275, 278 Mutually exclusive, 469

N Narrow band, 103 Narrow band pass filter, 98 Narrow frequency, 103 National Centers for Environmental Prediction (NCEP), 383 National Oceanic and Atmospheric Administration (NOAA), 411 N-body problem, 460 NCEP-NCAR reanalysis, 68, 146, 233, 260, 446 NCEP/NCAR, 31, 184, 310 Negentropy, 245, 259, 274, 277, 282, 284 Neighborhood graph, 212 Nested period, 374 Nested sets, 448 Nested sigmoid architecture, 426 Networks, 67, 68 Neural-based algorithms, 283 Neural network-based, 284 Neural networks (NNs), 276, 278, 302, 415, 416 Neurons, 419 Newton algorithm, 283

Newton–Raphson, 529 Newton–Raphson method, 527 Noise, 118 Noise background, 149 Noise covariance, 139 Noise floor, 382 Noise-free dynamics, 124 Noise variability, 53 Non-alternating algorithm, 400 Nondegeneracy, 205 Nondifferentiable, 83 Non-Gaussian, 267 Non-Gaussian factor analysis, 269 Non-Gaussianity, 53, 264 Non-integer power, 389 Nonlinear, 296 Nonlinear association, 12 Nonlinear dynamical mode (NDM), 410 Nonlinear features, 38 Nonlinear interactions, 2 Nonlinearity, 3, 12 Nonlinear manifold, 213, 247, 295, 410 Nonlinear mapping, 299 Nonlinear MDS, 212 Nonlinear flow regimes, 49 Nonlinear PC analysis, 439 Nonlinear programme, 521 Nonlinear smoothing, 19 Nonlinear system of equations, 263 Nonlinear units, 422 Nonlocal, 10 Non-locality, 115 Non-metric MDS, 208 Non-metric multidimensional scaling, 430 Non-negative matrix factorisation (NMF), 403 Non-normality, 269, 288 Non-parametric approach, 280 Nonparametric estimation, 277 Non-parametric regression, 257, 453 Non-quadratic, 76, 83, 209, 275 Nonsingular affine transformation, 246 Normal, 502 Normal distribution, 245, 477 Normalisation constraint, 81 Normal matrix, 40 Normal modes, 119, 123, 130, 134, 138 North Atlantic Oscillation (NAO), 2, 9, 31, 33, 42, 49, 56, 68, 77, 83, 234, 259, 284, 288, 289, 293, 311, 312, 332, 381, 387, 391, 446 North Pacific Gyre Oscillation (NPGO), 293 North Pacific Oscillation, 233, 234 Null hypothesis, 48, 56, 57, 228, 288, 342

Null space, 137 Nyquist frequency, 114, 494

O Objective function, 84 Oblimax, 232 Oblimin, 231 Oblique, 74, 81, 230 Oblique manifold, 401 Oblique rotation, 76, 77, 231 Occam's razor, 13 Ocean circulation, 445 Ocean current forecasting, 445 Ocean current patterns, 446 Ocean currents, 101 Ocean fronts, 322 Ocean gyres, 170 Oceanic fronts, 323 Ocean temperature, 321 Offset, 419 OLR, 161, 162, 164 OLR anomalies, 160 One-mode component analysis, 64 One-step ahead prediction, 175, 185 Operator, 118 Optimal decorrelation time, 180 Optimal interpolation, 411 Optimal lag between two fields, 349–350 Optimal linear prediction, 179 Optimally interpolated pattern (OIP), 189, 191 Optimally persistent pattern (OPP), 176, 178, 180, 183, 185, 241 Optimisation algorithms, 521 Optimization criterion, 76 Order statistics, 271 Ordinal MDS, 208 Ordinal scaling, 210 Ordinary differential equations (ODEs), 85, 147, 263, 530, 543 Orthogonal, 74, 81, 230, 502 Orthogonal complement, 86, 206 Orthogonalization, 397 Orthogonal rotation, 74, 75, 77 Orthomax-based criterion, 284 Orthonormal eigenfunctions, 37 Oscillating phenomenon, 91 Oscillatory, 122 Outgoing long-wave radiation (OLR), 146, 345 Outlier, 17, 250, 260, 271 Out-of-bag (oob), 437 Overfitting, 395

P Pacific decadal oscillation (PDO), 293, 412 Pacific-North American (PNA), 2, 33, 68, 127, 259, 263 Pacific patterns, 83 Pairwise distances, 201 Pairwise similarities, 202 Parabolic density function, 248 Paradox, 10 Parafac model, 291 Parsimony, 13 Partial least squares (PLS) regression, 388 Partial phase transform, 390 Partial whitening, 388, 390 Parzen lagged window, 184 Parzen lag-window, 185 Parzen window, 182 Pattern recognition, 295, 416 Patterns, 3, 22 Pattern simplicity, 72 Patterns of variability, 91 Pdf estimation, 416 Pearson correlation, 281 Penalised, 354 Penalised likelihood, 457 Penalised objective function, 532 Penalized least squares, 344 Penalty function, 83, 532 Perceptron convergence theorem, 417 Perfect correlation, 354 Periodic signal, 149, 151, 155, 180 Periodogram, 180, 187, 192, 495 Permutation, 95, 159, 376 Permutation matrix, 160, 170 Persistence, 157, 171, 185, 350, 440 Persistent patterns, 142, 172 Petrie polygon, 403 Phase, 27, 96, 492 Phase changes, 113 Phase functions, 110 Phase propagation, 113 Phase randomization, 48 Phase relationships, 97, 100 Phase shift, 95, 150, 151 Phase space, 169 Phase speeds, 112, 157 Physical modes, 55, 56 Piece-wise, 19 Planar entropy index, 250 Planetary waves, 310 Platykurtic, 271, 473 Poisson distribution, 476 Polar decomposition, 217

Polar vortex, 93, 167, 259, 291 Polynomial equation, 153 Polynomial fitting, 19 Polynomial kernels, 302 Polynomially, 296 Polynomial transformation, 299 Polytope, 403, 404 POP analysis, 219 POP model, 179 Positive semi-definite, 177, 206, 207, 216, 502 Posterior distribution, 227 Potential vorticity, 309 Powell's algorithms, 525 Power law, 286 Power spectra, 196, 267 Power spectrum, 48, 100, 110, 156, 172–174, 180, 199, 487 Precipitation, 445 Precipitation predictability, 440 Predictability, 184, 199 Predictable relationships, 54 Predictand, 338, 363 Prediction, 3, 189, 416, 421 Prediction error, 174 Prediction error variance, 174, 185 Predictive data mining, 3 Predictive Oscillation Patterns (PrOPs), 185 Predictive skill, 170 Predictor, 338, 363 Pre-image, 394 Prewhitening, 268 Principal axes, 58, 63 Principal component, 39 Principal component analysis (PCA), 4, 13, 33 Principal component regression (PCR), 388 Principal coordinate analysis, 202 Principal coordinate matrix, 206 Principal coordinates, 206–208, 215 Principal fundamental matrix, 546 Principal interaction pattern (PIP), 119, 139 Principal oscillation pattern (POP), 15, 94, 95, 119, 126 Principal prediction patterns (PPP), 343 Principal predictors, 351 Principal predictors analysis (PPA), 338 Principal regression analysis (PRA), 338 Principal trend analysis (PTA), 199 Principal component transformation, 54 Prior, 223 Probabilistic archetype analysis, 402 Probabilistic concepts, 17 Probabilistic framework, 45 Probabilistic models, 11, 50 Probabilistic NNs, 424

Probability, 467 Probability-based approach, 149 Probability-based method, 46 Probability density function (pdf), 2, 11, 23, 58, 65, 196, 245, 471 Probability distribution, 219, 243 Probability distribution function, 255 Probability function, 470 Probability matrix, 404 Probability space, 213 Probability vector, 398 Product-moment correlation coefficient, 16 Product of spheres, 401 Profile likelihood, 363 Profiles, 321 Progressive waves, 114, 115 Projected data, 251 Projected gradient, 83, 86, 283, 533 Projected gradient algorithm, 284 Projected matrix, 369 Projected/reduced gradient, 85 Projection index, 242, 246 Projection methods, 367 Projection operators, 86, 206 Projection pursuit (PP), 242, 269, 272, 280, 293 Projection theorem, 538 Propagating, 91, 97 Propagating disturbances, 97, 107, 117 Propagating features, 168 Propagating patterns, 91, 95, 96, 122, 135, 166 Propagating planetary waves, 72 Propagating signal, 110 Propagating speed, 93 Propagating structures, 94, 95, 118, 145, 157 Propagating wave, 162 Propagation, 113, 145 Propagator, 130, 546 Prototype, 397, 430 Prototype vectors, 448 Proximity, 201, 430 Pruning, 436, 437 Pseudoinverse, 501

Q QR decomposition, 505–506 Quadratic equation, 154 Quadratic function, 82 Quadratic measure, 53 Quadratic nonlinearities, 296 Quadratic norm, 53 Quadratic optimisation problem, 38 Quadratic system of equations, 263

Quadratic trend, 379 Quadrature, 91, 110, 114, 149, 150, 155, 167, 192 Quadrature function, 107 Quadrature spectrum, 29 Quantile, 46 Quantile-quantile, 58 QUARTIMAX, 75 Quartimax, 230 QUARTIMIN, 76, 77, 81 Quartimin, 231 Quasi-biennial periodicity, 93 Quasi-biennial oscillation (QBO), 91, 101, 113, 145 Quasi-geostrophic, 315 Quasi-geostrophic model, 135, 309 Quasi-geostrophic vorticity, 135 Quasi-Newton, 422, 437 Quasi-Newton algorithm, 141, 425 Quasi-stationary signals, 367

R Radial basis function networks, 442 Radial basis functions (RBFs), 321, 325, 357 Radial coordinate, 375 Radial function, 457 Radiative forcing, 2, 118 Rainfall, 447 Rainfall extremes, 445 Rayleigh quotient, 177, 183, 395, 396 Random error, 219 Random experiment, 469 Random forest (RF), 433, 450 Random function, 190, 353 Randomness, 244 Random noise, 192, 220 Random projection (RP), 368, 369 Random projection matrix, 368 Random samples, 49 Random variable, 24, 224, 470 Random vector, 154, 474 Rank, 501 Rank correlation, 21 Rank order, 210 Rational eigenfunctions, 109 RBF networks, 424 Reanalysis, 2, 77, 131, 445 Reconstructed attractor, 148 Reconstructed variables, 164 Rectangular, 430 Recurrence networks, 68, 169 Recurrence relationships, 153

596 Recurrences matrix, 169 Recurrent, 422 Recurrent NNs, 421 Recursion, 259 Red-noise, 49, 51, 52, 152 Red spectrum, 125, 135 Reduced gradient, 242 Redundancy, 274, 278 Redundancy analysis (RDA), 337, 348 Redundancy index, 347 Redundancy reduction, 279 Regimes, 3 Regime shift, 315 Regime transitions, 169 Regression, 3, 38, 67, 82, 184, 221, 315, 337, 381, 395, 416, 422, 434 Regression analysis, 257 Regression curve, 209 Regression matrix, 66, 338, 347 Regression matrix A, 363 Regression models, 4 Regularisation, 325, 390, 395, 396 Regularisation parameter, 391 Regularisation problem, 330, 455 Regularised EOFs, 331 Regularised Lagrangian, 396 Regularization parameters, 344 Regularized CCA (RCCA), 343 Replicated MDS, 210 Reproducing kernels, 300 Resampling, 46 Residual, 6, 36, 44, 87 Residual sum of squares (RSS), 344, 398, 404 Resolvent, 129, 546 Response, 27 Response function, 102 Response variables, 351 RGB colours, 256 Ridge, 313, 395 Ridge functions, 257, 258 Ridge regression, 344, 391, 395 Riemannian manifold, 400 Robustness, 438 Rominet patterns, 3 Root node, 434 Rosenblatt perceptron, 417 Rossby radii, 310 Rossby wave, 115, 263 Rossby wave breaking, 146 Rotated EOF (REOF), 4, 13, 35, 73, 141 Rotated factors, 72, 230 Rotated factor scores, 234 Rotated principal components, 73 Rotation, 72, 73, 229

Rotationally invariant, 250, 251 Rotation criteria, 74 Rotation matrix, 73, 74, 115, 223, 284 Roughness measure, 454 R-square, 315, 348 Runge–Kutta scheme, 263 Running windows, 50

S Salinity, 321 Sample covariance matrix, 237 Sample-space noise model, 227 Sampling errors, 178 Sampling fluctuation, 342 Sampling problems, 71 Sampling with replacement, 47 Scalar product, 140 Scalar product matrix, 207 Scaled SVD, 365 Scaling, 25, 54, 62, 207, 396, 399 Scaling problem, 62, 238, 338 Scandinavian pattern, 234, 293, 330 Scandinavian teleconnection pattern, 288 Scores, 37 Scree plot, 404 Sea level, 374 Sea level pressure (SLP), 21, 31, 33, 41, 68, 115, 180, 194, 212, 284, 314, 315 Seasonal cycle, 42, 160 Seasonality, 17, 93 Sea surface height, 445 Sea surface temperature (SST), 11, 33, 55, 132, 180, 284, 293, 383, 391, 404, 445 Second kind, 359 Second order centered moment, 24 Second-order differential equations, 86 Second-order Markov model, 179 Second order moments, 259, 266 Second-order stationary, 132 Second-order stationary time series, 196 Second-order statistics, 375 See-saw, 13 Self-adjoint, 37, 299, 300 Self-interactions, 68 Self-organisation, 425, 429 Self-organising maps (SOMs), 416, 429 Semi-annual oscillation, 162 Semi-definite matrices, 63 Semi-definite positivity, 226 Semi-orthogonal matrix, 64 Sensitivity to initial as well as boundary conditions, 2 Sensitivity to initial conditions, 147 Sequentially, 386

Serial correlation, 50, 171 Shannon entropy, 37, 243, 244 Short-memory, 180 Siberian high, 381, 383 Sigmoid, 279, 421, 423 Sigmoid function, 278 Signal patterns, 178 Signal-to-noise maximisation, 177 Signal-to-noise ratio, 44, 254 Significance, 343 Significance level, 59 Similarity, 201 Similarity coefficient, 203 Similarity matrix, 207, 217 Simple structure, 72 Simplex, 260, 402, 403, 524 Simplex method, 524 Simplex projection, 408 Simplex vertices, 404 Simplicity, 72, 73 Simplicity criteria, 81 Simplicity criterion, 73, 74 Simplified Component Technique-LASSO (SCoTLASS), 82 Simplified EOFs (SEOFs), 82 Simulated, 3 Simulated annealing, 250 Simulations, 33 Singular, 175 Singular covariance matrices, 151 Singular system analysis (SSA), 146 Singular value, 26, 42, 74, 161, 345, 350 Singular value decomposition (SVD), 26, 40, 42, 96, 503–504 Singular vectors, 109, 122, 149, 160, 206, 341, 342, 389, 393 Skew orthogonal projection, 404 Skewness, 26, 264, 288 Skewness modes, 262 Skewness tensor, 263 Skew-symmetric, 217 Sliding window, 148 SLP anomalies, 42, 49, 213, 233, 285, 331, 381 S-mode, 22 Smooth EOFs, 319, 326, 332 Smooth functions, 319 Smoothing, 18, 19 Smoothing constraint, 258 Smoothing kernel, 19 Smoothing parameter, 303, 326, 362 Smoothing problem, 258 Smoothing spectral window, 192 Smoothing spline, 354

Smoothing condition, 355 Smooth maximum covariance analysis (SMCA), 319, 358 Smoothness, 256, 352 Smoothness condition, 453 Smooth time series, 352 Sneath's coefficient, 203 Southern Oscillation, 33 Southern Oscillation mode, 196 Spacetime orthogonality, 70 Sparse systems, 40 Spatial derivative, 111 Spatial weighting, 44 Spearman's rank correlation coefficient, 21 Spectral analysis, 344, 391 Spectral clustering, 302, 304, 306, 317 Spectral decomposition theorem, 300 Spectral density, 156, 196 Spectral density function, 26, 27, 98, 124, 186 Spectral density matrix, 175, 185, 187, 190–192 Spectral domain, 97 Spectral domain EOFs, 99 Spectral entropy, 197 Spectral EOFs, 98 Spectral methods, 328 Spectral peak, 125 Spectral radius, 217 Spectral space, 373 Spectral window, 187 Spectrum, 54, 58, 66, 137, 148, 160, 167, 180, 375, 380 Sphered, 256 Spherical cluster, 302 Spherical coordinates, 326, 327 Spherical geometry, 44, 361 Spherical harmonics, 118, 310 Spherical RBFs, 360 Sphering, 25, 54, 283 Splines, 19, 321, 326, 354, 423 Spline smoothing, 258 Split, 436 Splitting, 434 Splitting node, 434 Squared residuals, 210 Square integrable, 36 Square root of a symmetric matrix, 25 Square root of the sample covariance matrix, 256 Squashing, 279 Squashing functions, 421

SST anomalies, 60 Stability analysis, 135 Standard error, 45 Standard normal distribution, 10, 46 Standing mode, 169 Standing oscillation, 135 Standing waves, 124, 168, 169 State space, 38 Stationarity, 121, 370 Stationarity conditions, 121 Stationary, 26, 94, 154 Stationary patterns, 91 Stationary points, 283 Stationary solution, 88, 310 Stationary states, 311, 313 Stationary time series, 98, 156 Statistical downscaling, 365 Steepest descent, 425, 526 Steepness, 421 Stepwise procedure, 386 Stochastic climate model, 310 Stochastic integrals, 99 Stochasticity, 12 Stochastic matrix, 305, 398 Stochastic model, 122 Stochastic modelling, 222 Stochastic process, 36, 37, 190, 352, 480 Stochastic system, 118 Storm track, 61, 180 Stratosphere, 91, 448 Stratospheric activity, 291 Stratospheric flow, 93 Stratospheric warming, 146 Stratospheric westerlies, 93 Stratospheric westerly flow, 93 Stratospheric zonal wind, 125 Streamfunction, 127, 132, 141, 263, 310, 311 Stress function, 208, 209 Structure removal, 256 Student distribution, 478 Sub-Gaussian, 271, 280 Subantarctic mode water, 323 Subgrid processes, 44 Subgrid scales, 118 Subjective/Bayesian school, 467 Submatrix, 159 Subscale processes, 118 Substructures, 168 Subtropical/subpolar gyres, 317 Sudden stratospheric warming, 291 Summer monsoon, 34, 66, 212, 404 Sum-of-squares, 254 Sum of the squared correlations, 351 Super-Gaussian, 270, 473

Supervised, 416 Support vector machine (SVM), 422 Support vector regression, 442 Surface temperature, 62, 183, 369, 445 Surface wind, 33 Surrogate, 48 Surrogate data, 45, 48 Swiss-roll, 211 Symmetric, 502 Synaptic weight, 430 Synchronization, 68 Synoptic, 167 Synoptic eddies, 180 Synoptic patterns, 445 Synoptic weather, 444 System's memory, 171

T Tail, 9, 250 Tangent hyperbolic, 282 Tangent space, 86 t-distribution, 375 Teleconnection, 2, 33, 34, 49, 67, 284, 345, 446 Teleconnection pattern, 31 Teleconnectivity, 66, 67 Tendencies, 136, 302, 310 Terminal node, 434 Ternary plot, 404 Tetrahedron, 403 Thermohaline circulation, 321 Thin-plate, 463 Thin-plate spline, 424, 454 Third-order moment, 262 Thompson's factor score, 229 Three-way data, 291 Tikhonov regularization, 344 T-mode, 22 Toeplitz, 156, 159 Toeplitz covariance matrix, 152 Toeplitz matrix, 149 Topographic map, 429 Topological neighbourhood, 431 Topology, 431 Topology-preserving projection, 448 T-optimals, 179 Tori, 211 Trace, 500 Training, 46, 416, 425 Training dataset, 48 Training set, 47, 391 Trajectory, 147 Trajectory matrix, 149

Transfer entropy, 281 Transfer function, 27, 104, 105, 125, 419, 424 Transition probability, 305 Transpose, 500 Trapezoidal rule, 192 Travelling features, 124 Travelling waves, 117, 145 Tree, 448 Trend, 377 Trend EOF (TEOF), 375 Trend pattern, 381 Trends, 3, 180 Triangular matrix, 397 Triangular metric inequality, 205 Triangular truncation, 310 Tridiagonal, 153 Tropical cyclone forecast, 444 Tropical cyclone frequency, 442 Tropical Pacific SST, 440 Tucker decomposition, 293 Tukey's index, 248 Tukey two-dimensional index, 248 Tukey window, 183 Two-sample EOF, 383 Two-way data, 291

U Uncertainty, 45, 49, 244 Unconstrained problem, 76, 186 Uncorrelatedness, 270 Understanding-context independence, 9 Uniform distribution, 196, 244, 251, 271 Uniformly convergent, 37 Uniform random variables, 21 Unimodal, 259, 311 Uniqueness, 222 Unit, 419, 421 Unitary rotation matrix, 115 Unit circle, 135 Unit gain filter, 107 Unit-impulse response, 266 Unit sphere, 386 Unresolved waves, 310 Unstable modes, 135 Unstable normal modes, 135 Unsupervised, 416 Upwelling, 447

V Validation, 3 Validation set, 47

Variational problem, 455 VARIMAX, 73, 77, 81 Varimax, 115 Vector autoregressive, 138 Vertical modes, 322 Visual inspection, 241 Visualization, 11 Visualizing proximities, 201 Volcanoes, 2 Vortex, 167 Vortex area, 167 Vorticity, 135

W Water masses, 322 Wave amplitude, 113 Wavelet transform, 103 Wave life cycle, 113 Wavenumber, 71, 111, 115 Weather forecasting, 440 Weather predictability, 184 Weather prediction, 34, 451 Weighted covariance matrix, 281 Weighted Euclidean distances, 211 Weight matrix, 278 Welch, 196 Westerly flows, 34, 93 Westerly jets, 92 Western boundary currents, 55 Western current, 406 Westward phase tilt, 125 Westward propagating Rossby waves, 157 Whitened, 389 Whitening transformation, 364 White noise, 56, 131, 149, 197, 267 Wind-driven gyres, 322 Wind fields, 212 Window lag, 160, 167 Winning neuron, 430 Wishart distribution, 385, 480 Working set, 400 Wronskian, 545

X X11, 17

Y Young-Householder decomposition, 206 Yule-Walker equations, 175, 185

Z Zero frequency, 107, 177 Zero-skewness, 170 Zonally symmetric, 92 Zonal shift of the NAO, 49

Zonal velocity, 184 Zonal wavenumber, 126 Zonal wind, 92, 184 z-transform, 267