Kernel Methods for Machine Learning with Math and Python: 100 Exercises for Building Logic



Joe Suzuki

Kernel Methods for Machine Learning with Math and Python 100 Exercises for Building Logic


Joe Suzuki Graduate School of Engineering Science Osaka University Toyonaka, Osaka, Japan

ISBN 978-981-19-0400-4    ISBN 978-981-19-0401-1 (eBook)
https://doi.org/10.1007/978-981-19-0401-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

How to Overcome Your Kernel Weakness

Among machine learning methods, kernels have always been a particular weakness of mine. I tried to read "Introduction to Kernel Methods" by Kenji Fukumizu (in Japanese) but failed many times. I invited Prof. Fukumizu to give an intensive lecture at Osaka University and listened to the course for a week with the students, but I still could not grasp the book's essence. When I started writing this book, my goal was to rid myself of this sense of weakness. Now that the book is completed, I can tell readers how they can overcome their own kernel weaknesses. Most people, even machine learning researchers, use kernels without really understanding them. If you have opened this page, I believe you are motivated to overcome this weakness. The shortest path I can recommend is to learn the mathematics from the basics. Kernels work according to the mathematics behind them, and it is essential to think the concepts through until you understand them. The mathematics needed to understand kernels is called functional analysis (Chap. 2). Even if you know linear algebra and differential and integral calculus, you may be confused at first: vectors are finite dimensional, whereas a set of functions is infinite dimensional, and yet it can still be treated with the tools of linear algebra. If the concept of completeness is new to you, I hope you will take the time to learn it. Once you get through Chap. 2, I think you will understand everything about kernels.

This book is the third volume (of six) in the 100 Exercises for Building Logic series. Since books on kernels already exist, there must be a reason for publishing this one. The following are some of the features of this book.

1. The mathematical propositions of kernels are proven, and the correct conclusions are stated so that the reader can reach the essence of kernels.
2. As in the other books in the 100 Mathematical Problems in Machine Learning series, source programs and running examples are presented to promote understanding. It is not easy for readers to understand the results if only mathematical formulas are given, and this is especially true for kernels.
3. Once the reader understands the basic topics of functional analysis (Chap. 2), the applications in the subsequent chapters can be understood; no prior knowledge of mathematics beyond that is assumed.
4. This book considers both the kernel of an RKHS and the kernel of a Gaussian process, and a clear distinction is made between the two treatments; the two types of kernels are discussed in Chaps. 5 and 6, respectively.

We surveyed books on kernels both in Japan and overseas but found that none satisfied two or more of the above characteristics. I experienced many failures leading up to the publication of this book. Every year, I give a lecture at the graduate school of Osaka University in which one area of machine learning is studied by solving 100 mathematical and programming exercises. Sparse estimation (2018) and graphical models (2019) gained popularity, and the 2020 kernel lecture had more than 100 students enrolled. However, although I prepared for more than two days every week, the lectures did not go well, probably because of my own weakness regarding the subject; this was evident from the class questionnaires provided by the students. I analyzed each of these problems, made improvements, and this book was born. I hope that readers will learn about kernels efficiently without following the same path that I took (consuming much time and energy through trial and error). Reading this book does not mean that you will write a paper immediately, but it will give you a solid foundation: you will be able to read kernel papers, which may previously have seemed difficult, smoothly, and you will be able to see the whole kernel paradigm from a higher level. This book should also be enjoyable for researchers in machine learning. I hope that you will use this book to achieve success in your respective fields.

What Makes KMMP Unique?

I have summarized the features of this book as follows.

1. Developing logic. We mathematically formulate and solve each ML problem and build the corresponding programs to grasp the subject's essence. KMMP (Kernel Methods for Machine Learning with Math and Python) instills "logic" in the minds of the readers. The reader acquires both the knowledge and the ideas of ML, so that even if new technology emerges, they will be able to follow the changes smoothly. After solving the 100 problems, most students say, "I learned a lot".
2. Not just a story. If programming code is available, you can immediately take action. It is unfortunate when an ML book does not offer the source code: even if a package is available, if we cannot see the inner workings of the programs, all we can do is feed data into them. In KMMP, program code is available for most of the procedures, and in cases where the reader does not understand the math, the code helps them see what it means.
3. Not just a how-to book: an academic book written by a university professor. A how-to book explains how to use a package and provides execution examples for those unfamiliar with it, but because only the inputs and outputs are visible, the procedure remains a black box, and the reader obtains only limited satisfaction without reaching the subject's essence. KMMP intends to show the reader the heart of ML and is more of a full-fledged academic book.
4. Solve 100 exercises: problems are improved with feedback from university students. The exercises in this book have been used in university lectures and refined based on students' feedback, and the best 100 problems were selected. Each chapter (except the exercises) explains the solutions, and you can solve all of the exercises by reading the book.
5. Self-contained. All of us have been discouraged by phrases such as "for the details, please refer to the literature XX". Unless you are an enthusiastic reader or researcher, nobody seeks out those references. In this book, the material is presented so that consulting external references is not required. The simple proofs are given as derivations in the text, and the complicated proofs are given in the appendices at the end of each chapter, so KMMP completes all discussions, including the appendices.
6. Readers' pages: questions, discussion, and program files. The reader can ask any question about the book via https://bayesnet.org/books.

Osaka, Japan November 2021

Joe Suzuki

Acknowledgments

The author wishes to thank Mr. Bing Yuan Zhang, Mr. Tian Le Yang, Mr. Ryosuke Shimmura, Mr. Tomohiro Kamei, Ms. Rieko Tasaka, Mr. Keito Odajima, Mr. Daiki Fujii, Mr. Hongming Huang, and all the graduate students at Osaka University for pointing out logical errors in the mathematical expressions and programs. Furthermore, I would like to take this opportunity to thank Dr. Hidetoshi Matsui (Shiga University), Dr. Michio Yamamoto (Okayama University), and Dr. Yoshikazu Terada (Osaka University) for their advice on functional data analysis in seminars and workshops. This English book is based mainly on the Japanese book published by Kyoritsu Shuppan Co., Ltd. in 2021. The author would like to thank Kyoritsu Shuppan Co., Ltd., particularly its editorial members Mr. Tetsuya Ishii and Ms. Saki Otani. The author also thanks Ms. Mio Sugino of Springer for preparing the publication and providing advice on the manuscript.

Osaka, Japan
November 2021

Joe Suzuki

Contents

1 Positive Definite Kernels ..... 1
  1.1 Positive Definiteness of a Matrix ..... 1
  1.2 Kernels ..... 3
  1.3 Positive Definite Kernels ..... 5
  1.4 Probability ..... 12
  1.5 Bochner's Theorem ..... 14
  1.6 Kernels for Strings, Trees, and Graphs ..... 16
  Appendix ..... 22
  Exercises 1∼15 ..... 25

2 Hilbert Spaces ..... 29
  2.1 Metric Spaces and Their Completeness ..... 29
  2.2 Linear Spaces and Inner Product Spaces ..... 33
  2.3 Hilbert Spaces ..... 36
  2.4 Projection Theorem ..... 41
  2.5 Linear Operators ..... 43
  2.6 Compact Operators ..... 46
  Appendix: Proofs of Propositions ..... 50
  Exercises 16∼30 ..... 57

3 Reproducing Kernel Hilbert Space ..... 61
  3.1 RKHSs ..... 61
  3.2 Sobolev Space ..... 65
  3.3 Mercer's Theorem ..... 70
  Appendix ..... 81
  Exercises 31∼45 ..... 87

4 Kernel Computations ..... 91
  4.1 Kernel Ridge Regression ..... 91
  4.2 Kernel Principal Component Analysis ..... 97
  4.3 Kernel SVM ..... 101
  4.4 Spline Curves ..... 105
  4.5 Random Fourier Features ..... 110
  4.6 Nyström Approximation ..... 116
  4.7 Incomplete Cholesky Decomposition ..... 118
  Appendix ..... 123
  Exercises 46∼64 ..... 125

5 The MMD and HSIC ..... 129
  5.1 Random Variables in RKHSs ..... 129
  5.2 The MMD and Two-Sample Problem ..... 132
  5.3 The HSIC and Independence Test ..... 139
  5.4 Characteristic and Universal Kernels ..... 150
  5.5 Introduction to Empirical Processes ..... 153
  Appendix ..... 157
  Exercises 65∼83 ..... 161

6 Gaussian Processes and Functional Data Analyses ..... 167
  6.1 Regression ..... 167
  6.2 Classification ..... 175
  6.3 Gaussian Processes with Inducing Variables ..... 180
  6.4 Karhunen-Loève Expansion ..... 185
  6.5 Functional Data Analysis ..... 194
  Appendix ..... 203
  Exercises 83∼100 ..... 205

Bibliography ..... 207

Chapter 1

Positive Definite Kernels

In data analysis and various information processing tasks, we use kernels to evaluate the similarities between pairs of objects. In this book, we deal with mathematically defined kernels called positive definite kernels. Let the elements $x, y$ of a set $E$ correspond to the elements (functions) $\Phi(x), \Phi(y)$ of a linear space $H$ called the reproducing kernel Hilbert space. The kernel $k(x, y)$ corresponds to the inner product $\langle\Phi(x), \Phi(y)\rangle_H$ in the linear space $H$. Additionally, by choosing a nonlinear map $\Phi$, this kernel can be applied to various problems. The set $E$ may consist of strings, trees, or graphs rather than real-valued vectors, as long as the kernel satisfies positive definiteness. After defining probability and Lebesgue integrals in the second half of the chapter, we will learn about kernels by using characteristic functions (Bochner's theorem).

1.1 Positive Definiteness of a Matrix

Let $n \geq 1$. We say that a square matrix $A \in \mathbb{R}^{n\times n}$ is symmetric if it is equal to its transpose ($A^\top = A$)¹, and we say that $A$ is nonnegative definite if all of its eigenvalues are nonnegative.

Proposition 1 (nonnegative definite matrix) The following three conditions are equivalent for a symmetric matrix $A \in \mathbb{R}^{n\times n}$.
1. A matrix $B \in \mathbb{R}^{n\times n}$ exists such that $A = B^\top B$.
2. $x^\top A x \geq 0$ for any $x \in \mathbb{R}^n$.
3. The eigenvalues of $A$ are nonnegative.

Proof: 1.⇒2. holds because $A = B^\top B$ implies $x^\top A x = x^\top B^\top B x = \|Bx\|^2 \geq 0$. 2.⇒3. follows from the fact that $x^\top A x \geq 0$, $x \in \mathbb{R}^n$, implies $0 \leq y^\top A y = y^\top \lambda y = \lambda\|y\|^2$ for an eigenvalue $\lambda$ of $A$ and its eigenvector $y \in \mathbb{R}^n$. 3.⇒1. holds since $\lambda_1, \ldots, \lambda_n \geq 0$ implies $A = P D P^\top = P\sqrt{D}\sqrt{D}P^\top = (\sqrt{D}P^\top)^\top \sqrt{D}P^\top$, where $D$ and $\sqrt{D}$ are diagonal matrices with elements $\lambda_1, \ldots, \lambda_n$ and $\sqrt{\lambda_1}, \ldots, \sqrt{\lambda_n}$, and $P$ is the corresponding orthogonal matrix. □

¹ We write the transpose of a matrix $A$ as $A^\top$.

A nonnegative definite matrix $A$ is symmetric. In this book, we say that a nonnegative definite matrix is positive definite if all of its eigenvalues are positive. In addition, we assume that the elements of any matrix are real. However, the following fact is often useful when we deal with complex numbers and Fourier transformations.

Corollary 1 For a nonnegative definite matrix $A \in \mathbb{R}^{n\times n}$, we have that $\bar z^\top A z \geq 0$ for any $z \in \mathbb{C}^n$, where $i = \sqrt{-1}$ is the imaginary unit, and we write the conjugate $x - iy$ of $z = x + iy \in \mathbb{C}$ with $x, y \in \mathbb{R}$ as $\bar z$.

Proof: Since there exists a $B \in \mathbb{R}^{n\times n}$ such that $A = B^\top B$ for a nonnegative definite matrix $A \in \mathbb{R}^{n\times n}$, we have that $\bar z^\top A z = \overline{(Bz)}^\top Bz = |Bz|^2 \geq 0$ for any $z = [z_1, \ldots, z_n]^\top \in \mathbb{C}^n$. □

Example 1

# In this chapter, we assume that the following has been executed.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("seaborn-ticks")

n = 3
B = np.random.normal(size=n**2).reshape(3, 3)
A = np.dot(B.T, B)
values, vectors = np.linalg.eig(A)
print("values:\n", values, "\n\nvectors:\n", vectors, "\n")

values:
 [0.09337468 7.75678625 4.43554113]

vectors:
 [[ 0.49860775  0.84350568  0.199721  ]
  [ 0.39606374 -0.42663779  0.81308899]
  [-0.77105371  0.32631023  0.54680692]]

S = []
for i in range(10):
    z = np.random.normal(size=n)
    y = np.squeeze(z.T.dot(A.dot(z)))
    S.append(y)
    if (i+1) % 5 == 0:
        print("S[%d:%d]:" % (i-4, i), S[i-4:i])

S[0:4]: [23.24608872999895, 6.601263342526701, 5.334515801733688, 14.886876186736613] S[5:9]: [18.85503241886245, 34.30290091714191, 1.025291282540866, 29.59512428090335]
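As a complement to Example 1, the following minimal check is not part of the original text; it reuses the matrix A and the value of n defined above and illustrates Corollary 1 numerically: for random complex vectors, the quadratic form $\bar z^\top A z$ should be (essentially) real and nonnegative.

for _ in range(5):
    z = np.random.normal(size=n) + 1j * np.random.normal(size=n)   # random complex vector
    q = np.conj(z).T @ A @ z                                        # \bar z^T A z
    print(q.real >= -1e-10, abs(q.imag) < 1e-10)                    # both should print True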

1.2 Kernels

Let $E$ be a set. We often express the similarity between elements $x, y \in E$ by using a bivariate function $k : E \times E \to \mathbb{R}$, not just for data analysis but also for various information processing tasks. The larger $k(x, y)$ is, the more similar $x, y$ are. We call such a function $k : E \times E \to \mathbb{R}$ a kernel.

Example 2 (Epanechnikov kernel) We use the kernel $k : E \times E \to \mathbb{R}$ such that
$$
k(x, y) = D\left(\frac{|x - y|}{\lambda}\right), \qquad
D(t) = \begin{cases} \dfrac{3}{4}(1 - t^2), & |t| \leq 1 \\ 0, & \text{otherwise} \end{cases}
$$
for $\lambda > 0$, and we construct the following function (the Nadaraya-Watson estimator) from observations $(x_1, y_1), \ldots, (x_N, y_N) \in E \times \mathbb{R}$:
$$
\hat f(x) = \frac{\sum_{i=1}^N k(x, x_i) y_i}{\sum_{j=1}^N k(x, x_j)}.
$$
For a given input $x_* \in E$ that is different from the $N$ pairs of inputs, we return the weighted sum of $y_1, \ldots, y_N$ with weights
$$
\frac{k(x_*, x_1)}{\sum_{j=1}^N k(x_*, x_j)}, \ldots, \frac{k(x_*, x_N)}{\sum_{j=1}^N k(x_*, x_j)}
$$
as the output $\hat f(x_*)$. Because we assume that a larger $k(x, y)$ means a more similar $x, y \in E$, the more similar $x_*$ and $x_i$ are, the larger the weight of $y_i$. Given an input $x_* \in E$, for each $i = 1, \ldots, N$ with $x_i - \lambda \leq x_* \leq x_i + \lambda$, the weight of $y_i$ is proportional to $k(x_i, x_*)$. If we make the value of $\lambda$ smaller, we predict $y_*$ by using only the $(x_i, y_i)$ for which $x_i$ and $x_*$ are close. We display the output obtained when we execute the following code in Fig. 1.1.

Fig. 1.1 We use the Epanechnikov kernel and the Nadaraya-Watson estimator to draw the curves for λ = 0.05, 0.35, 0.5. Finally, we obtain the optimal λ value and present it in the same graph.

n = 250
x = 2 * np.random.normal(size=n)
y = np.sin(2 * np.pi * x) + np.random.normal(size=n) / 4    # Data Generation

def D(t):                       # Definition of the function D
    return np.maximum(0.75 * (1 - t**2), 0)

def k(x, y, lam):               # Definition of the kernel k
    return D(np.abs((x - y) / lam))

def f(z, lam):                  # Definition of the estimator f
    S = 0; T = 0
    for i in range(n):
        S = S + k(x[i], z, lam) * y[i]
        T = T + k(x[i], z, lam)
    return S / T

plt.figure(num=1, figsize=(15, 8), dpi=80)
plt.xlim(-3, 3); plt.ylim(-2, 3)
plt.xticks(fontsize=14); plt.yticks(fontsize=14)
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
xx = np.arange(-3, 3, 0.1)
yy = [[] for _ in range(3)]
lam = [0.05, 0.35, 0.50]
color = ["g", "b", "r"]
for i in range(3):
    for zz in xx:
        yy[i].append(f(zz, lam[i]))
    plt.plot(xx, yy[i], c=color[i], label=lam[i])
plt.legend(loc="upper left", frameon=True, prop={'size': 14})
plt.title("Nadaraya-Watson Estimator", fontsize=20)


1.3 Positive Definite Kernels

The kernels that we consider in this book satisfy the positive definiteness criterion defined below. Suppose that $k : E \times E \to \mathbb{R}$ is symmetric, i.e., $k(x, y) = k(y, x)$ for $x, y \in E$. For $x_1, \ldots, x_n \in E$ ($n \geq 1$), we say that the matrix
$$
\begin{bmatrix}
k(x_1, x_1) & \cdots & k(x_1, x_n) \\
\vdots & \ddots & \vdots \\
k(x_n, x_1) & \cdots & k(x_n, x_n)
\end{bmatrix} \in \mathbb{R}^{n\times n} \tag{1.1}
$$
is the Gram matrix w.r.t. $k$ of order $n$. We say that $k$ is a positive definite kernel² if the Gram matrix of order $n$ is nonnegative definite for any $n \geq 1$ and $x_1, \ldots, x_n \in E$.

² Although it seems appropriate to say "a nonnegative definite kernel", the custom of saying "a positive definite kernel" has been established.

Example 3 The kernel in Example 2 does not satisfy positive definiteness. In fact, when $\lambda = 2$, $n = 3$, and $x_1 = -1$, $x_2 = 0$, $x_3 = 1$, the matrix consisting of the $k(x_i, x_j)$ can be written as
$$
\begin{bmatrix}
k(x_1, x_1) & k(x_1, x_2) & k(x_1, x_3) \\
k(x_2, x_1) & k(x_2, x_2) & k(x_2, x_3) \\
k(x_3, x_1) & k(x_3, x_2) & k(x_3, x_3)
\end{bmatrix}
=
\begin{bmatrix}
3/4 & 9/16 & 0 \\
9/16 & 3/4 & 9/16 \\
0 & 9/16 & 3/4
\end{bmatrix},
$$
and the determinant is computed as $3^3/2^6 - 3^5/2^{10} - 3^5/2^{10} = -3^3/2^9$. In general, the determinant of a matrix is the product of its eigenvalues, so at least one of the three eigenvalues is negative.

Example 4 For random variables $\{X_i\}_{i=1}^\infty$ that are not necessarily independent, if $k(X_i, X_j)$ is the covariance between $X_i$ and $X_j$, the Gram matrix of any order is the covariance matrix among a finite number of the $X_j$, which means that $k$ is positive definite. We discuss Gaussian processes based on this fact in Chap. 6.
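The following quick numerical check of Example 3 is not part of the original text; it only assumes that numpy has been imported as np, as in Example 1. It computes the eigenvalues and determinant of the 3 × 3 Gram matrix above and confirms that one eigenvalue is negative.

K = np.array([[3/4, 9/16, 0],
              [9/16, 3/4, 9/16],
              [0, 9/16, 3/4]])
print(np.linalg.eigvalsh(K))   # one of the three eigenvalues is negative
print(np.linalg.det(K))        # equals -3**3 / 2**9 = -0.052734375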

The theory of kernels in this book is developed under the assumption of positive definiteness. Hereafter, when we refer to kernels, we mean positive definite kernels. Let $H$ be a linear space (vector space) equipped with an inner product $\langle\cdot,\cdot\rangle_H$. Then, we often construct a positive definite kernel as
$$
k(x, y) = \langle\Phi(x), \Phi(y)\rangle_H \tag{1.2}
$$
by using an arbitrary map $\Phi : E \to H$. We say that such a $\Phi$ is a feature map. In this chapter, we may assume that the linear space $H$ is the Euclidean space $H = \mathbb{R}^d$ of dimensionality $d$ with the standard inner product $\langle x, y\rangle_{\mathbb{R}^d} = x^\top y$, $x, y \in \mathbb{R}^d$. We define the linear space and inner product concepts in Chap. 2.

Proposition 2 The kernel $k : E \times E \to \mathbb{R}$ defined in (1.2) is positive definite.

Proof: We arbitrarily fix $n = 1, 2, \ldots$ and $x_1, \ldots, x_n \in E$ and denote the Gram matrix (1.1) by $K$. Then, from the definition of inner products, for an arbitrary $z = [z_1, \ldots, z_n]^\top \in \mathbb{R}^n$, we have
$$
z^\top K z = \sum_{i=1}^n \sum_{j=1}^n z_i z_j \langle\Phi(x_i), \Phi(x_j)\rangle_H
= \Big\langle \sum_{i=1}^n z_i \Phi(x_i),\ \sum_{j=1}^n z_j \Phi(x_j) \Big\rangle_H
= \Big\| \sum_{j=1}^n z_j \Phi(x_j) \Big\|_H^2 \geq 0,
$$
where we write $\|a\|_H := \langle a, a\rangle_H^{1/2}$ for $a \in H$. □

Proposition 3 If the matrices $A, B$ are nonnegative definite, then so is the Hadamard product $A \circ B$ (elementwise multiplication).

Proof: See the appendix at the end of this chapter. □

Proposition 3 is helpful for proving the second item of the following proposition.

Proposition 4 If the kernels $k_1, k_2, \ldots$ are positive definite, then so are the following $E \times E \to \mathbb{R}$:
1. $a k_1 + b k_2$ ($a, b \geq 0$),
2. $k_1 k_2$,
3. the limit³ of $\{k_i\}$ when it converges,
4. $k$ that takes only one value $a \geq 0$ (constant function), and
5. $f(x)k(x, y)f(y)$ ($x, y \in E$) for an arbitrary $f : E \to \mathbb{R}$,
where the third item claims that the limit $k_\infty(x, y) := \lim_{i\to\infty} k_i(x, y)$ satisfies positive definiteness for any $x, y \in E$.

³ The limit of $k_i(x, y)$ for each $(x, y) \in E \times E$.

Proof: $a k_1 + b k_2$ is positive definite because
$$
x^\top A x \geq 0,\ x^\top B x \geq 0 \;\Longrightarrow\; x^\top (aA + bB)x \geq 0
$$
for $A, B \in \mathbb{R}^{n\times n}$. The product $k_1 k_2$ is positive definite because if $A = (A_{i,j})$ and $B = (B_{i,j})$ are nonnegative definite, then so is the Hadamard product $A \circ B$ (Proposition 3). For the third item, suppose that there exist $n \geq 1$, $x_1, \ldots, x_n \in E$, $z_1, \ldots, z_n \in \mathbb{R}$, and $\epsilon > 0$ such that
$$
B_\infty = \sum_{j=1}^n \sum_{h=1}^n z_j z_h k_\infty(x_j, x_h) = -\epsilon.
$$
Then, the difference between $B_i := \sum_{j=1}^n \sum_{h=1}^n z_j z_h k_i(x_j, x_h) \geq 0$ and $B_\infty$ becomes arbitrarily close to zero as $i \to \infty$. However, the difference is at least $\epsilon > 0$, which is a contradiction and means that $B_\infty \geq 0$. If a kernel takes only a (nonnegative) constant value $a$, since all the values in (1.1) are $a \geq 0$, we have


$$
\begin{bmatrix}
a & \cdots & a \\
\vdots & \ddots & \vdots \\
a & \cdots & a
\end{bmatrix}
=
\begin{bmatrix}
\sqrt{a/n} & \cdots & \sqrt{a/n} \\
\vdots & \ddots & \vdots \\
\sqrt{a/n} & \cdots & \sqrt{a/n}
\end{bmatrix}
\begin{bmatrix}
\sqrt{a/n} & \cdots & \sqrt{a/n} \\
\vdots & \ddots & \vdots \\
\sqrt{a/n} & \cdots & \sqrt{a/n}
\end{bmatrix}.
$$

The last claim is due to the implication $x^\top A x \geq 0$, $x \in \mathbb{R}^n \Rightarrow x^\top D A D x \geq 0$, $x \in \mathbb{R}^n$, which we can examine by substituting $y = Dx$ into $y^\top A y \geq 0$. In particular, we may regard $A$ and $D$ as the matrix (1.1) and the diagonal matrix with the elements $f(x_1), \ldots, f(x_n)$, respectively.

In addition, the $f(x)f(y)$ obtained by substituting $k(x, y) = 1$ for $x, y \in E$ in the last item of Proposition 4 is positive definite. Moreover, the
$$
\frac{k(x, y)}{\sqrt{k(x, x)k(y, y)}} \tag{1.3}
$$
obtained by substituting $f(x) = \{k(x, x)\}^{-1/2}$ (assuming $k(x, x) > 0$ for $x \in E$) in the last item of Proposition 4 is positive definite. Furthermore, the matrix obtained by substituting $n = 2$, $x_1 = x$, and $x_2 = y$ into (1.1) is nonnegative definite, so its determinant $k(x, x)k(y, y) - k(x, y)^2$ is nonnegative and the absolute value of (1.3) does not exceed one. We say that (1.3) is the positive definite kernel obtained by normalizing $k(x, y)$.

Example 5 (Linear Kernel) Let $E := \mathbb{R}^d$. Then, the kernel $k(x, y) = x^\top A y = \langle Bx, By\rangle_H$, $x, y \in \mathbb{R}^d$, using the nonnegative definite matrix $A = B^\top B \in \mathbb{R}^{d\times d}$, $B \in \mathbb{R}^{d\times d}$, is positive definite because it corresponds to the case in which the map $\Phi$ in Proposition 2 is $E \ni x \mapsto Bx \in H$. In particular, if $A$ is the unit matrix, then the map $\Phi$ is the identity map. In this sense, the positive definite kernel is an extension of the inner product $k(x, y) = x^\top y$.

Example 6 (Exponential Type) Let $\beta > 0$ and $x, y \in \mathbb{R}^d$. Then,
$$
k_m(x, y) := 1 + \beta x^\top y + \frac{\beta^2}{2}(x^\top y)^2 + \cdots + \frac{\beta^m}{m!}(x^\top y)^m \tag{1.4}
$$
($m \geq 1$) is a polynomial of products of positive definite kernels, and its coefficients are nonnegative. From the first two items of Proposition 4, this kernel is positive definite. Additionally, because (1.4) is a Taylor expansion up to order $m$, from the third item of Proposition 4,
$$
k_\infty(x, y) := \exp(\beta x^\top y) = \lim_{m\to\infty} k_m(x, y)
$$
is a positive definite kernel as well.

Example 7 (Gaussian Kernel) The kernel
$$
k(x, y) := \exp\Big\{-\frac{1}{2\sigma^2}\|x - y\|^2\Big\}, \quad \sigma > 0, \tag{1.5}
$$
for $x, y \in \mathbb{R}^d$ can be written as
$$
\exp\Big\{-\frac{1}{2\sigma^2}\|x - y\|^2\Big\}
= \exp\Big\{-\frac{\|x\|^2}{2\sigma^2}\Big\}\exp\Big\{\frac{x^\top y}{\sigma^2}\Big\}\exp\Big\{-\frac{\|y\|^2}{2\sigma^2}\Big\}.
$$
Thus, from the fifth item of Proposition 4 and the fact that $\exp(\beta x^\top y)$ with $\beta = \sigma^{-2}$ is positive definite, we see that (1.5) is positive definite.

Example 8 (Polynomial Kernel) The kernel
$$
k_{m,d}(x, y) := (x^\top y + 1)^m \tag{1.6}
$$
for $x, y \in \mathbb{R}^d$, $d = 1, 2, \ldots$, is a polynomial of positive definite kernels (the linear kernels $x^\top y$), and its coefficients are nonnegative. From the first two items of Proposition 4, (1.6) is positive definite.

Example 9 If we normalize the linear kernel by (1.3), we obtain $x^\top y/(\|x\|\,\|y\|)$, where we denote $\|a\| := \langle a, a\rangle^{1/2}$ for $a \in \mathbb{R}^n$. The Gaussian kernel (1.5) remains the same even if we normalize it. The polynomial kernel becomes
$$
\left(\frac{x^\top y + 1}{\sqrt{x^\top x + 1}\,\sqrt{y^\top y + 1}}\right)^m
$$
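The following small numerical illustration of Examples 7 and 8 is not part of the original text; it assumes numpy imported as np, as in Example 1. It builds Gram matrices for a Gaussian kernel and a polynomial kernel at random points and checks that their smallest eigenvalues are nonnegative up to rounding error.

def gauss(x, y, sigma2=1.0):
    return np.exp(-np.sum((x - y)**2) / (2 * sigma2))

def poly(x, y, m=3):
    return (np.dot(x, y) + 1)**m

X = np.random.normal(size=(10, 2))          # ten points in R^2
for ker in (gauss, poly):
    K = np.array([[ker(X[i], X[j]) for j in range(10)] for i in range(10)])
    print(ker.__name__, np.min(np.linalg.eigvalsh(K)))   # smallest eigenvalue; nonnegative up to rounding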

if we normalize it. The converse of Proposition 2 is also true, which will be proven in Chap. 3: for any nonnegative definite kernel $k$, there exists a feature map $\Phi : E \to H$ such that $k(x, y) = \langle\Phi(x), \Phi(y)\rangle_H$.

Example 10 (Polynomial Kernel) Let $m, d \geq 1$. The feature map of the kernel $k_{m,d}(x, y) = (x^\top y + 1)^m$ with $x, y \in \mathbb{R}^d$ is
$$
\Phi_{m,d}(x_1, \ldots, x_d) = \left(\sqrt{\frac{m!}{m_0!\,m_1!\cdots m_d!}}\; x_1^{m_1} \cdots x_d^{m_d}\right)_{m_0, m_1, \ldots, m_d \geq 0},
$$
where the indices $(m_0, m_1, \ldots, m_d)$ range over $m_0, m_1, \ldots, m_d \geq 0$ with $m_0 + m_1 + \cdots + m_d = m$, and we assume that an order exists among the indices $(m_0, m_1, \ldots, m_d)$. If we use the multinomial theorem,
$$
\Big(\sum_{i=0}^d z_i\Big)^m = \sum_{m_0 + m_1 + \cdots + m_d = m} \frac{m!}{m_0!\,m_1!\cdots m_d!}\; z_0^{m_0} z_1^{m_1} \cdots z_d^{m_d}
$$
(with $z_0 = 1$), we see that
$$
(x^\top y + 1)^m = \langle\Phi_{m,d}(x), \Phi_{m,d}(y)\rangle_H
$$
with $x_0 = y_0 = 1$. For example, we have
$$
\Phi_{1,2}(x_1, x_2) = [1, x_1, x_2], \qquad
\Phi_{2,2}(x_1, x_2) = [1, x_1^2, x_2^2, \sqrt{2}x_1, \sqrt{2}x_2, \sqrt{2}x_1 x_2]
$$
because
$$
\langle\Phi_{1,2}(x_1, x_2), \Phi_{1,2}(y_1, y_2)\rangle_H = 1 + x_1 y_1 + x_2 y_2 = 1 + x^\top y = k(x, y)
$$
and
$$
\langle\Phi_{2,2}(x_1, x_2), \Phi_{2,2}(y_1, y_2)\rangle_H = 1 + x_1^2 y_1^2 + x_2^2 y_2^2 + 2x_1 y_1 + 2x_2 y_2 + 2x_1 x_2 y_1 y_2 = (1 + x_1 y_1 + x_2 y_2)^2 = (1 + x^\top y)^2 = k(x, y).
$$

Example 11 (Infinite-Dimensional Polynomial Kernel) Let $0 < r \leq \infty$, $d \geq 1$, and $E := \{x \in \mathbb{R}^d \mid \|x\|_2 < \sqrt{r}\}$. Let $f : (-r, r) \to \mathbb{R}$ be $C^\infty$. We assume that the function can be Taylor-expanded as
$$
f(x) = \sum_{n=0}^\infty a_n x^n, \quad x \in (-r, r).
$$

If $a_0 > 0$ and $a_1, a_2, \ldots \geq 0$, then $f(x^\top y)$ is a positive definite kernel for $x, y \in E$. The exponential type is an infinite-dimensional polynomial kernel and is positive definite.

Example 12 In Example 2, we replace the Epanechnikov kernel in the Nadaraya-Watson estimator with the Gaussian kernel (Figs. 1.2 and 1.3).

def K(x, y, sigma2):                 # Gaussian kernel
    return np.exp(-np.linalg.norm(x - y)**2 / 2 / sigma2)

def F(z, sigma2):                    # Definition of the estimator F
    S = 0; T = 0
    for i in range(n):
        S = S + K(x[i], z, sigma2) * y[i]
        T = T + K(x[i], z, sigma2)
    return S / T

We often obtain the optimal value of each kernel parameter via cross validation (CV)⁴. If a parameter takes continuous values, we select a finite number of candidates and obtain an evaluation value for each candidate as follows. Divide the $N$ samples into $k$ groups and conduct estimation with the samples belonging to $k - 1$ of the groups; perform testing with the samples belonging to the one remaining group and calculate the corresponding score. Repeat the procedure $k$ times (changing the test group each time) and find the sum of the obtained scores. In that way, we evaluate the performance of the kernel for one parameter value. Execute this process for all parameter candidates and use the parameter with the best evaluation value.

⁴ Joe Suzuki, "Statistical Learning with Math and Python", Chap. 4, Springer.

Fig. 1.2 Smoothing by predicting the values of x outside the N sample points via the Nadaraya-Watson estimator. We choose the best parameter for the Gaussian kernel via cross validation.

              Group 1      Group 2      ...   Group k-1    Group k
First         Test         Estimation   ...   Estimation   Estimation
Second        Estimation   Test         ...   Estimation   Estimation
...           ...          ...          ...   ...          ...
(k-1)-th      Estimation   Estimation   ...   Test         Estimation
k-th          Estimation   Estimation   ...   Estimation   Test

Fig. 1.3 A rotation employed for cross validation. Each group consists of N/k samples; we divide the samples into k groups based on their sample IDs: 1 ∼ N/k, N/k + 1 ∼ 2N/k, ..., (k-2)N/k + 1 ∼ (k-1)N/k, and (k-1)N/k + 1 ∼ N.

We obtain the optimal value of the parameter σ² via CV. We execute this procedure, setting σ² = 0.01, 0.001.

n = 100
x = 2 * np.random.normal(size=n)
y = np.sin(2 * np.pi * x) + np.random.normal(size=n) / 4    # Data Generation
# The curves for sigma2 = 0.01, 0.001
plt.figure(num=1, figsize=(15, 8), dpi=80)
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.xlim(-3, 3)
plt.ylim(-2, 3)
plt.xticks(fontsize=14); plt.yticks(fontsize=14)
xx = np.arange(-3, 3, 0.1)
yy = [[] for _ in range(2)]
sigma2 = [0.001, 0.01]
color = ["g", "b"]
for i in range(2):
    for zz in xx:
        yy[i].append(F(zz, sigma2[i]))
    plt.plot(xx, yy[i], c=color[i], label=sigma2[i])
plt.legend(loc="upper left", frameon=True, prop={'size': 20})
plt.title("Nadaraya-Watson Estimator", fontsize=20)

# Optimal sigma2 value via cross validation
m = int(n / 10)
sigma2_seq = np.arange(0.001, 0.01, 0.001)
SS_min = np.inf
for sigma2 in sigma2_seq:
    SS = 0
    for k in range(10):
        test = range(k*m, (k+1)*m)
        train = [x for x in range(n) if x not in test]
        for j in test:
            u, v = 0, 0
            for i in train:
                kk = K(x[i], x[j], sigma2)
                u = u + kk * y[i]
                v = v + kk
            if v != 0:
                z = u / v
                SS = SS + (y[j] - z)**2
    if SS < SS_min:
        SS_min = SS
        sigma2_best = sigma2
print("Best sigma2 = ", sigma2_best)

Best sigma2 =  0.003

plt.figure(num=1, figsize=(15, 8), dpi=80)
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.xlim(-3, 3)
plt.ylim(-2, 3)
plt.xticks(fontsize=14); plt.yticks(fontsize=14)
xx = np.arange(-3, 3, 0.1)
yy = [[] for _ in range(3)]
sigma2 = [0.001, 0.01, sigma2_best]
labels = [0.001, 0.01, "sigma2_best"]
color = ["g", "b", "r"]
for i in range(3):
    for zz in xx:
        yy[i].append(F(zz, sigma2[i]))
    plt.plot(xx, yy[i], c=color[i], label=labels[i])
plt.legend(loc="upper left", frameon=True, prop={'size': 20})
plt.title("Nadaraya-Watson Estimator", fontsize=20)


1.4 Probability

We call a set of subsets of the entire set $E$ a set of events when it is closed under the set operations (union, intersection, and complement).

Example 13 We consider a set consisting of subsets of $E = \{1, 2, 3, 4, 5, 6\}$ (dice eyes) that is closed under the set operations:
$$
\{E, \{\}, \{1, 3\}, \{5\}, \{2, 4, 6\}, \{1, 3, 5\}, \{2, 4, 5, 6\}, \{1, 2, 3, 4, 6\}\}.
$$
If any of these eight elements undergo the union, intersection, or complement operations, the result remains one of these eight elements; in that sense, these eight elements are closed under the set operations. The subsets $\{1, 3\}$ and $\{2, 4, 5, 6\}$ are events, but $\{2, 4\}$ is not. On the other hand, for the same entire set $E$, if we include $\{1\}, \{2\}, \{3\}, \{4\}, \{5\}, \{6\}$ as events, then $2^6$ events should be considered. Even if the entire set $E$ is identical, whether a subset is an event depends on the set $\mathcal{F}$ of events.

In the following, we start our discussion after defining the entire set $E$ and the set $\mathcal{F}$ of subsets (events) of $E$ that is closed under the set operations.

Any open interval $(a, b)$ with $a, b \in \mathbb{R}$ is a subset of the whole real line $\mathbb{R}$. Applying set operations (union, intersection, and complement) to multiple open intervals does not necessarily produce an open interval, but the result remains a subset of $\mathbb{R}$. We call any subset of $\mathbb{R}$ obtained from open sets by set operations a Borel set of $\mathbb{R}$, and we denote the set of Borel sets by $\mathcal{B}$. A set obtained by further applying set operations to Borel sets remains a Borel set.

Example 14 For $a, b \in \mathbb{R}$, the following are Borel sets: $\{a\} = \cap_{n=1}^\infty (a - 1/n, a + 1/n)$, $[a, b) = \{a\} \cup (a, b)$, $(a, b] = \{b\} \cup (a, b)$, $[a, b] = \{a\} \cup (a, b]$, $\mathbb{R} = \cup_{n=0}^\infty (-2^n, 2^n)$, $\mathbb{Z} = \cup_{n=0}^\infty \{-n, n\}$, and $[\sqrt{2}, 3) \cup \mathbb{Z}$.

As described above, we assume that we have defined the entire set $E$ and the set $\mathcal{F}$ of events. A map $\mu : \mathcal{F} \to [0, 1]$ that satisfies the following three conditions is called a probability:
1. $\mu(A) \geq 0$, $A \in \mathcal{F}$,
2. $A_i \cap A_j = \{\}$ ($i \neq j$) implies $\mu(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \mu(A_i)$, and
3. $\mu(E) = 1$.
We say that $\mu$ is a measure if $\mu$ satisfies the first two conditions, and we say that this measure is finite if $\mu(E)$ takes a finite value. We say that $(E, \mathcal{F}, \mu)$ is a probability space or a measure space, depending on whether $\mu(E) = 1$ or not. For probability and measure spaces, if $\{e \in E \mid X(e) \in B\}$ is an event for any Borel set $B$, which means that $\{e \in E \mid X(e) \in B\} \in \mathcal{F}$, we say that the function $X : E \to \mathbb{R}$ is measurable. In particular, if we have a probability space, $X$ is a random variable. Whether $X$ is measurable depends on $(E, \mathcal{F})$ rather than on $(E, \mathcal{F}, \mu)$.


The notion of measurability might be complex for a beginner to understand. However, it becomes smoother if we intuitively understand that the function $X : E \to \mathbb{R}$ depends on an element of $\mathcal{F}$ rather than an element of $E$.

Example 15 (Dice Eyes) Suppose that $X : E \to \mathbb{R}$ for $E = \{1, 2, 3, 4, 5, 6\}$ is given by
$$
X(e) = \begin{cases} 1, & e = 1, 3, 5 \\ 0, & e = 2, 4, 6 \end{cases}.
$$
Then, if $\mathcal{F} = \{\{1, 3, 5\}, \{2, 4, 6\}, \{\}, E\}$, $X$ is a random variable. In fact, $X$ is measurable: for the Borel sets $B = \{1\}$, $[-2, 3)$, and $[0, 1)$, we have
$$
\{e \in E \mid X(e) \in \{1\}\} = \{1, 3, 5\}, \quad
\{e \in E \mid X(e) \in [-2, 3)\} = E, \quad
\{e \in E \mid X(e) \in [0, 1)\} = \{2, 4, 6\},
$$
and whichever Borel set $B$ we choose, the set $\{e \in E \mid X(e) \in B\}$ is one of $\{1, 3, 5\}, \{2, 4, 6\}, \{\}, E$. On the other hand, if $\mathcal{F} = \{\{1, 2, 3\}, \{4, 5, 6\}, \{\}, E\}$, then $X$ is not a random variable.

In the following, assuming that the function $f : E \to \mathbb{R}$ is measurable, we define the Lebesgue integral $\int_E f\,d\mu$. We first assume that $f$ is nonnegative. For a sequence $\{B_k\}$ of exclusive subsets of $\mathcal{F}$, we define
$$
\sum_k \Big(\inf_{e \in B_k} f(e)\Big)\mu(B_k). \tag{1.7}
$$
If $\cup_k B_k = E$ and the supremum of (1.7) w.r.t. $\{B_k\}$,
$$
\sup_{\{B_k\}} \sum_k \Big(\inf_{e \in B_k} f(e)\Big)\mu(B_k),
$$
takes a finite value, we say that the supremum is the Lebesgue integral of the measurable function $f$ for $(E, \mathcal{F}, \mu)$, and we write $\int_E f\,d\mu$. When the function $f$ is not necessarily nonnegative, we divide $E$ into $E_+ := \{e \in E \mid f(e) \geq 0\}$ and $E_- := \{e \in E \mid f(e) \leq 0\}$, and we define the above quantity for each of $f_+ := f$ on $E_+$ and $f_- := -f$ on $E_-$. If both $\int f_+\,d\mu$ and $\int f_-\,d\mu$ take finite values, we say that $\int f\,d\mu := \int f_+\,d\mu - \int f_-\,d\mu$ is the Lebesgue integral of $f$ for $(E, \mathcal{F}, \mu)$.

If $X$ is a random variable, the associated Borel sets are the events for the probability $\mu(\cdot)$. We say that the probability of the event $X \leq x$ for $x \in \mathbb{R}$,
$$
F_X(x) := \mu((-\infty, x]) = \int_{(-\infty, x]} d\mu,
$$
is the distribution function of $X$, and that $f_X$ is the probability density function of $X$ if we can write $F_X$ as
$$
F_X(x) = \int_{-\infty}^x f_X(t)\,dt.
$$
We say that $\mu$ is absolutely continuous if the probability $\mu(B)$ diminishes as the total width of the intervals composing a Borel set $B$ approaches zero. The necessary and sufficient condition for a probability density function to exist for the probability $\mu$ is that $\mu$ is absolutely continuous. If $X$ takes only a finite number of values, the probability density function does not exist, which means that $\mu$ is not absolutely continuous. If $X$ takes the values $a_1 < \cdots < a_m$, then the distribution function can be written as
$$
F_X(x) = \sum_{j : a_j \leq x} \mu(\{a_j\}).
$$

Example 16 Suppose that $X$ follows the standard Gaussian distribution. If we make $\epsilon > 0$ close to zero, $F_X(x + \epsilon) - F_X(x - \epsilon)$ (for any $x \in \mathbb{R}$) approaches zero, which means that the probability is absolutely continuous. On the other hand, suppose that $X$ takes the values $0, 1$; even if we make $\epsilon > 0$ close to zero, $F_X(1 + \epsilon) - F_X(1 - \epsilon)$ does not approach zero, which means that the probability is not absolutely continuous.

If we use the Lebesgue integral, we can express the probability without distinguishing between discrete and continuous variables.

Example 17 For $E = \mathbb{R}$, if the probability density function $f_X$ exists, the expectation of $X$ can be written as $\int_E x\,d\mu = \int_{-\infty}^\infty t f_X(t)\,dt$. On the other hand, if $X$ takes the values $a_1 < \cdots < a_m$, we have $\int_E x\,d\mu = \sum_{j=1}^m a_j \mu(\{a_j\})$.
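The following small illustration of the discrete case is not part of the original text; it is plain Python with no extra libraries. It evaluates F_X for the dice random variable of Example 15 via F_X(x) = Σ_{a_j ≤ x} μ({a_j}), where μ({0}) = μ({1}) = 1/2 for a fair die.

# X takes the value 1 on {1, 3, 5} and 0 on {2, 4, 6}, each with probability 1/2
values = [0, 1]
prob = {0: 1/2, 1: 1/2}

def F_X(x):
    # distribution function of a discrete random variable
    return sum(prob[a] for a in values if a <= x)

print([F_X(x) for x in (-1.0, 0.0, 0.5, 1.0, 2.0)])   # [0, 0.5, 0.5, 1.0, 1.0]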

1.5 Bochner's Theorem

We consider the case in which a kernel is a function of the difference between $x, y \in E$. According to Bochner's theorem, which is the main topic of this section and will be used in later chapters, such a kernel should coincide (up to a constant) with a characteristic function in the sense of probability and statistics. Using a univariate function $\phi : E \to \mathbb{R}$, we often use kernels of the form $k(x, y) = \phi(x - y)$, such as the Gaussian kernel. The kernel $k$ being positive definite is equivalent to the inequality
$$
\sum_{i=1}^n \sum_{j=1}^n z_i z_j \phi(x_i - x_j) \geq 0, \quad z = [z_1, \ldots, z_n]^\top \in \mathbb{R}^n, \tag{1.8}
$$
for arbitrary $n \geq 1$ and $x_1, \ldots, x_n \in E$.

Let $i = \sqrt{-1}$ be the imaginary unit. We define the characteristic function $\varphi : \mathbb{R}^d \to \mathbb{C}$ of a random variable $X$ by
$$
\varphi(t) := E[\exp(i t^\top X)] = \int_E \exp(i t^\top x)\,d\mu(x), \quad t \in \mathbb{R}^d,
$$
where $E[\cdot]$ denotes the expectation. If $\mu$ is absolutely continuous (i.e., the probability density function $f_X(x) = d\mu(x)/dx$ exists), then $\varphi(t) := E[\exp(i t^\top X)] = \int_{-\infty}^\infty \exp(i t^\top x) f_X(x)\,dx$ is the Fourier transformation of $f_X$, and $f_X(x)$ can be recovered from $\varphi$ via the inverse Fourier transformation
$$
f_X(x) = \frac{1}{2\pi} \int \varphi(t) e^{-i t^\top x}\,dt.
$$

Example 18 The characteristic function of the Gaussian distribution with a mean of $\mu$ and a variance of $\sigma^2$, $f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\{-\frac{(x-\mu)^2}{2\sigma^2}\}$, is
$$
\varphi(t) = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^\infty \exp\{itx\}\exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}dx
= \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^\infty \exp\Big[-\frac{\{x - (\mu + it\sigma^2)\}^2}{2\sigma^2}\Big]dx \cdot \exp\Big\{i\mu t - \frac{t^2\sigma^2}{2}\Big\}
= \exp\Big\{i\mu t - \frac{t^2\sigma^2}{2}\Big\}.
$$
The characteristic function of the Laplace distribution $f(x) = \frac{\alpha}{2}\exp\{-\alpha|x|\}$ with a parameter $\alpha > 0$ is
$$
\int_{-\infty}^\infty \exp\{itx\}\frac{\alpha}{2}\exp\{-\alpha|x|\}dx
= \frac{\alpha}{2}\Big\{\int_{-\infty}^0 \exp[(it + \alpha)x]dx + \int_0^\infty \exp[(it - \alpha)x]dx\Big\}
= \frac{\alpha}{2}\Big\{\Big[\frac{e^{(it+\alpha)x}}{it + \alpha}\Big]_{-\infty}^0 - \Big[\frac{e^{(it-\alpha)x}}{it - \alpha}\Big]_0^\infty\Big\}
= \frac{\alpha^2}{t^2 + \alpha^2}.
$$
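The following numerical sanity check of the Laplace case in Example 18 is not part of the original text; it assumes numpy imported as np. It approximates the integral on a fine grid by a Riemann sum and compares the result with α²/(t² + α²).

alpha, t = 2.0, 1.5
x = np.linspace(-40, 40, 400001)                 # wide grid; the density is negligible outside
f = 0.5 * alpha * np.exp(-alpha * np.abs(x))     # Laplace density
dx = x[1] - x[0]
phi = np.sum(np.cos(t * x) * f) * dx             # the imaginary part vanishes by symmetry
print(phi, alpha**2 / (t**2 + alpha**2))         # both approximately 0.64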

Proposition 5 (Bochner) Suppose that $\phi : \mathbb{R}^n \to \mathbb{R}$ is continuous. Then, condition (1.8) holds for arbitrary $n \geq 1$, $x = [x_1, \ldots, x_n] \in \mathbb{R}^n$, and $z = [z_1, \ldots, z_n] \in \mathbb{R}^n$ if and only if $\phi$ coincides with the characteristic function w.r.t. a probability $\mu$ up to a constant, i.e., there exists a finite measure $\eta$ such that
$$
\phi(t) = \int_E \exp(i t^\top x)\,d\eta(x), \quad t \in \mathbb{R}^n. \tag{1.9}
$$
Proof: See the appendix at the end of this chapter.

Because a kernel evaluates a similarity between two elements of $E$, we do not care much about multiplication by constants. In the following, we say that a probability $\mu$ is the probability of the kernel $k$ if $\mu$ is the finite measure $\eta$ obtained when dividing the kernel $k$ in Proposition 5 by a constant. Note that we only consider kernels $k(\cdot,\cdot)$ whose range is real in this book, although the range of the characteristic function is generally $\mathbb{C}^n$. In the following, we denote $\|t\|_2^2 := \sum_{j=1}^d t_j^2$ for $t = [t_1, \ldots, t_d] \in \mathbb{R}^d$.

Example 19 (Gaussian Kernel) $k(x, y) = \exp\{-\frac{1}{2\sigma^2}\|x - y\|_2^2\}$, $x, y \in \mathbb{R}^d$, coincides with the characteristic function $\exp\{-\frac{\|t\|_2^2}{2\sigma^2}\}$, $t = x - y \in \mathbb{R}^d$, of the Gaussian distribution with a mean of $0$ and a covariance matrix $(\sigma^2)^{-1} I \in \mathbb{R}^{d\times d}$.

Example 20 (Laplacian Kernel) $k(x, y) = \frac{1}{2\pi}\frac{1}{\|x - y\|_2^2 + \beta^2}$, $x, y \in \mathbb{R}^n$, coincides with the characteristic function $\frac{\beta^2}{\|t\|_2^2 + \beta^2}$, $t = x - y \in \mathbb{R}^n$, of the Laplace distribution with a parameter $\alpha = \beta > 0$ up to the constant multiplication $[2\pi\beta^2]^{-1}$.

We can construct a kernel in this way for any distribution whose probability density function exists. However, if we restrict our search to kernels whose ranges are real, we need to choose the parameters so that the characteristic function takes real values. For example, the Gaussian kernel obtained by setting the mean to zero takes real values.

d 

ki (xi , yi ) ,

(1.10)

R −1 (x) R −1 (y) i=1

 where R −1 (x) is the sum over (x1 , . . . , xd ) ∈ E 1 × · · · E d such that R(x1 , . . . , xd ) = x. A kernel in the form of (1.10) is called a convolutional kernel [13]. Since each ki (xi , yi ) is positive definite, k(x, y) is also positive definite (according to the first two items of Proposition 4).

1.6 Kernels for Strings, Trees, and Graphs

17

Example 21 (String Kernel) Let  p be a set of strings consisting of p ≥ 0 characters in a finite set , and let  ∗ := ∪i  i . For example, if  = {A, T, G, C}, we have AGGC GT G ∈  7 . Then, we define the kernel cu (x)cu (y) k(x, y) := u∈ p

for x, y ∈  ∗ , where cu (x) denotes the number of occurrences of u ∈  p in x ∈  ∗ . The following represents sample code for defining this string kernel. def string_kernel (x, y, p) : m, n = len(x) , len(y) S= 0 for i in range(m) : for j in range( i , n) : i f x[ i : ( i+p) ] == y[ j : ( j+p) ] : S=S+ 1 return S

Then, we execute the procedure. C = ["a", "b", "c"] m = 10 w = np.random.choice(C, m, replace = True) x = "" for i in range(m): x = x + w[i] n = 12 w = np.random.choice(C, n, replace = True) y = "" for i in range(n): y = y + w[i]

x

’ababbcaaac’

y

’ccbcbcaaacaa’

string_kernel (x,y,2)

58

18

1 Positive Definite Kernels

Suppose that d = 3, E 1 = E 3 =  ∗ , and E 2 =  p . Then, if we concatenate (x1 , x2 , x3 ) ∈ E 1 × E 2 × E 3 , then we may state that R(x1 , x2 , x3 ) = x ∈ E. If x2 = u and y2 = u appear cu (x) times in x and cu (y) times in y, respectively, then by setting k1 (x1 , y1 ) = k3 (x3 , y3 ) = 1 and k2 (x2 , y2 ) = I (x2 = y2 = u), we have cu (x)cu (y) =





1 · I (x2 = y2 = u) · 1

R(x1 ,x2 ,x3 )=x R(y1 ,y2 ,y3 )=y

k(x, y) =





cu (x)cu (y) =



1 · I (x2 = y2 ) · 1 .

R(x1 ,x2 ,x3 )=x R(y1 ,y2 ,y3 )=y

u

Thus, we observe that the string kernel can be expressed by (1.10), where I (A) takes values of one and zero depending on whether condition A is satisfied. Example 22 (Tree Kernel) Suppose that we assign a label to each vertex of trees x, y. We wish to evaluate the similarity between x, y based on how many subtrees are shared. We denote by ct (x), ct (y) the numbers of occurrences of subtree t in x, y, respectively. Then, the kernel k(x, y) :=



ct (x)ct (y)

(1.11)

t

is positive definite. In fact, for $x_1, \ldots, x_n \in E$ and arbitrary $z_1, \ldots, z_n \in \mathbb{R}$, we have
$$
\sum_{i=1}^n \sum_{j=1}^n z_i z_j k(x_i, x_j) = \sum_t \Big\{\sum_{i=1}^n z_i c_t(x_i)\Big\}^2 \geq 0.
$$
Let $V_x, V_y$ be the sets of vertices of the trees $x, y$, respectively; we write $I(u, t) = 1$ or $I(u, t) = 0$ depending on whether $t$ has $u$ as a vertex or not. Since (1.11) can be written with $c_t(x) = \sum_{u \in V_x} I(u, t)$ and $c_t(y) = \sum_{v \in V_y} I(v, t)$, we have
$$
k(x, y) = \sum_{u \in V_x} \sum_{v \in V_y} \sum_t I(u, t) I(v, t) = \sum_{u \in V_x} \sum_{v \in V_y} c(u, v),
$$
where $c(u, v) = \sum_t I(u, t) I(v, t)$ is the number of common subtrees of $x$ and $y$ such that the vertices $u \in V_x$ and $v \in V_y$ are their roots. We assume that a label $l(v)$ is assigned to each vertex $v \in V$ and determine whether the labels coincide.
1. For the descendants $u_1, \ldots, u_m$ of $u$ and $v_1, \ldots, v_n$ of $v$, if any of the following hold, then we define $c(u, v) := 0$:
   (a) $l(u) \neq l(v)$,
   (b) $m \neq n$,
   (c) there exists $i = 1, \ldots, m$ such that $l(u_i) \neq l(v_i)$;
2. otherwise, we define
$$
c(u, v) := \prod_{i=1}^m \{1 + c(u_i, v_i)\}.
$$

For example, suppose that we assign one of the labels A, T, G, C to each vertex, as in Fig. 1.4. We may write this in a Python function as follows, where we assume that no identical labels are assigned to vertices at the same level of a tree. Note that the function calls itself (it is a recursive function). For example, the function requires the value C(4, 2) when it computes C(1, 1).

def C(i, j):
    S, T = s[i], t[j]
    # Return zero when the labels of vertices i and j of the trees s and t do not coincide
    if S[0] != T[0]:
        return 0
    # Return zero when either vertex i or j has no descendant
    if S[1] is None:
        return 0
    if T[1] is None:
        return 0
    if len(S[1]) != len(T[1]):
        return 0
    U = []
    for x in S[1]:
        U.append(s[x][0])
    U1 = sorted(U)
    V = []
    for y in T[1]:
        V.append(t[y][0])
    V1 = sorted(V)
    m = len(U)
    # Return zero when the labels of the descendants do not coincide
    for h in range(m):
        if U1[h] != V1[h]:
            return 0
    U2 = np.array(S[1])[np.argsort(U)]
    V2 = np.array(T[1])[np.argsort(V)]
    W = 1
    for h in range(m):
        W = W * (1 + C(U2[h], V2[h]))
    return W

def k(s, t):
    m, n = len(s), len(t)
    kernel = 0
    for i in range(m):
        for j in range(n):
            if C(i, j) > 0:
                kernel = kernel + C(i, j)
    return kernel

s = [[] for _ in range(6)]
s[0] = ["G", [1, 3]]; s[1] = ["T", [2]];    s[2] = ["C", None]
s[3] = ["A", [4, 5]]; s[4] = ["C", None];   s[5] = ["T", None]
t = [[] for _ in range(9)]
t[0] = ["G", [1, 4]]; t[1] = ["A", [2, 3]]; t[2] = ["C", None]
t[3] = ["T", None];   t[4] = ["T", [5, 6]]; t[5] = ["C", None]


Fig. 1.4 A tree kernel evaluates similarity in terms of the labels A, G, C, T assigned to the vertices of the two trees.

t[6] = ["A", [7, 8]]; t[7] = ["C", None];  t[8] = ["T", None]
for i in range(6):
    for j in range(9):
        if C(i, j) > 0:
            print(i, j, C(i, j))

0 0 2
3 1 1
3 6 1

k(s, t)

4

Thus, the sum 4 is the kernel value.

Let $X$ and $Y$ be discrete random variables that take values in $E_X$ and $E_Y$, respectively, and let $P(y|x)$ be the conditional probability of $Y = y \in E_Y$ given $X = x \in E_X$. Suppose that we are given a positive definite kernel $k_{XY} : E_{XY} \times E_{XY} \to \mathbb{R}$ for $E_{XY} := E_X \times E_Y$. We define the marginalized kernel by
$$
k(x, x') := \sum_{y \in E_Y} \sum_{y' \in E_Y} k_{XY}((x, y), (x', y')) P(y|x) P(y'|x') \tag{1.12}
$$
for $x, x' \in E_X$ (Tsuda et al. [32]). We claim that the marginalized kernel is positive definite. In fact, $k_{XY}$ being positive definite implies the existence of a feature map $\Phi : E_{XY} \ni (x, y) \mapsto \Phi(x, y)$ such that $k_{XY}((x, y), (x', y')) = \langle\Phi((x, y)), \Phi((x', y'))\rangle$.


Thus, there exists another feature map $E_X \ni x \mapsto \sum_{y \in E_Y} P(y|x)\Phi((x, y))$ such that
$$
k(x, x') = \sum_{y \in E_Y} \sum_{y' \in E_Y} P(y|x) P(y'|x') \langle\Phi((x, y)), \Phi((x', y'))\rangle
= \Big\langle \sum_{y \in E_Y} P(y|x)\Phi((x, y)),\ \sum_{y' \in E_Y} P(y'|x')\Phi((x', y')) \Big\rangle.
$$
We may define the analogue of (1.12) for the conditional density function $f$ of $Y$ given $X$ as follows:
$$
k(x, x') := \int_{E_Y} \int_{E_Y} k_{XY}((x, y), (x', y')) f(y|x) f(y'|x')\,dy\,dy'
$$

for $x, x' \in E_X$.

Example 23 (Graph Kernel, Kashima et al. [19]) We construct a kernel that expresses the similarity between two (directed) graphs $G_1, G_2$ that may contain loops, based on the sets of paths connecting two vertices. Let $V, E$ be the sets of vertices and (directed) edges, respectively. We express each path of length $m$ by a sequence consisting of vertices and edges: $(v_0, e_1, \ldots, e_m, v_m)$ with $v_0, v_1, \ldots, v_m \in V$ and $e_1, \ldots, e_m \in E$. We assume that a label is assigned to each of the vertices and edges of the two graphs, and we define the probability of the sequence $\pi = (v_0, e_1, \ldots, e_m, v_m)$ by the product of the associated conditional probabilities $p(\pi) := p(v_0) p(v_1|v_0) \cdots p(v_m|v_{m-1})$. To this end, we consider a random walk in which we first choose $v_0 \in V$ with probability $p(v_0) = 1/|V|$ ($|V|$: the cardinality of $V$) and then repeatedly choose either to stop at the current vertex with probability $p$ or to move to a neighboring vertex via one of the outgoing directed edges with equal probability times $1 - p$, where the stopping probability $0 < p < 1$ is determined a priori. For example, if the random walk arrives at a vertex $v$ that connects to $|V(v)|$ vertices, then the probability of moving to one of the neighboring vertices is $(1 - p)/|V(v)|$. For example, for the path $1 \to 4 \to 3 \to 5 \to 3$ in Fig. 1.5, the labels are A, e, A, d, D, a, B, c, D. If $p = 1/3$, then the probability of the directed path can be obtained via the following code.

def k(s, p):
    return prob(s, p) / len(node)

def prob(s, p):
    if len(node[s[0]]) == 0:
        return 0
    if len(s) == 1:
        return p
    m = len(s)
    S = (1 - p) / len(node[s[0]]) * prob(s[1:m], p)
    return S


Fig. 1.5 Evaluating similarity via a graph kernel


We demonstrate the execution of the code below:

node = [[] for _ in range(5)]
node[0] = [2, 4]; node[1] = [4]; node[2] = [1, 5]
node[3] = [1, 5]; node[4] = [3]
k([0, 3, 2, 4, 2], 1 / 3)

0.0016460905349794243

Because five vertices exist, we multiply by $1/5$; then at each step we continue with probability $2/3$, choose one of the outgoing edges uniformly, and finally stop with probability $1/3$:
$$
\frac{1}{5} \cdot \Big(\frac{2}{3} \cdot \frac{1}{2}\Big) \cdot \Big(\frac{2}{3} \cdot \frac{1}{2}\Big) \cdot \Big(\frac{2}{3} \cdot \frac{1}{2}\Big) \cdot \Big(\frac{2}{3} \cdot 1\Big) \cdot \frac{1}{3} = \frac{2}{5 \times 3^5}.
$$

Because these probabilities differ between the directed graphs $G_1$ and $G_2$, we denote them by $p(\pi|G_1)$ and $p(\pi|G_2)$. We express the label sequence of the path $\pi$ (of length $2m + 1$) by $L(\pi)$ and define the graph kernel by
$$
k(G_1, G_2) := \sum_{\pi_1} \sum_{\pi_2} p(\pi_1|G_1)\, p(\pi_2|G_2)\, I[L(\pi_1) = L(\pi_2)].
$$
We find that this kernel is a marginalized kernel with $k_{XY}((G_1, \pi_1), (G_2, \pi_2)) = I[L(\pi_1) = L(\pi_2)]$.
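To make the definition concrete, the following sketch is not part of the original text. It approximates k(G1, G2) by enumerating all walks up to a chosen maximum length in two toy graphs with vertex labels only (edge labels are ignored for simplicity), weighting each walk by the random-walk probability used above with stopping probability p; the graphs, labels, and the max_len cutoff are hypothetical choices for illustration.

def walks(adj, max_len):
    # all directed walks (vertex sequences) of length 1..max_len
    paths = [[v] for v in range(len(adj))]
    result = list(paths)
    for _ in range(max_len - 1):
        paths = [path + [w] for path in paths for w in adj[path[-1]]]
        result += paths
    return result

def walk_prob(adj, path, p):
    # start uniformly, move along an outgoing edge with prob. (1-p)/out-degree, stop with prob. p
    q = 1.0 / len(adj)
    for v in path[:-1]:
        q *= (1 - p) / len(adj[v])
    return q * p

def graph_kernel(adj1, lab1, adj2, lab2, p=1/3, max_len=4):
    k = 0.0
    for path1 in walks(adj1, max_len):
        for path2 in walks(adj2, max_len):
            if [lab1[v] for v in path1] == [lab2[v] for v in path2]:
                k += walk_prob(adj1, path1, p) * walk_prob(adj2, path2, p)
    return k

# two tiny labeled directed graphs (every vertex has at least one outgoing edge)
adj1 = [[1], [0]];       lab1 = ["A", "B"]
adj2 = [[1], [2], [0]];  lab2 = ["A", "B", "A"]
print(graph_kernel(adj1, lab1, adj2, lab2))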

Appendix

Fubini's theorem, Lebesgue's dominated convergence theorem, and Levy's convergence theorem are general theorems whose statements and proofs can be found in many books, so we omit them here. The proof of Proposition 5 follows Ito [15].


Proof of Proposition 3

Let $D$ be a diagonal matrix whose components are the eigenvalues $\lambda_i \geq 0$ of the nonnegative definite matrix $A$, and let $U$ be an orthogonal matrix whose column vectors are the corresponding unit eigenvectors $u_i$, which are orthogonal to each other. Then, we can write $A = U D U^\top = \sum_{i=1}^n \lambda_i u_i u_i^\top$. Similarly, if $\mu_i, v_i$, $i = 1, \ldots, n$, are the eigenvalues and eigenvectors of the matrix $B$, then we can write $B = \sum_{i=1}^n \mu_i v_i v_i^\top$. Now we have
$$
(u_i u_i^\top) \circ (v_j v_j^\top) = (u_{i,k} u_{i,l} \cdot v_{j,k} v_{j,l})_{k,l} = (u_{i,k} v_{j,k} \cdot u_{i,l} v_{j,l})_{k,l} = (u_i \circ v_j)(u_i \circ v_j)^\top.
$$
Note that this matrix is nonnegative definite. In fact, if we write $u_i \circ v_j = [y_1, \ldots, y_n]^\top \in \mathbb{R}^n$, then component $(h, l)$ of $(u_i \circ v_j)(u_i \circ v_j)^\top$ is $y_h y_l$, which means that $\sum_{h=1}^n \sum_{l=1}^n z_h z_l y_h y_l = (\sum_{h=1}^n z_h y_h)^2 \geq 0$ for any $z_1, \ldots, z_n$. Since the matrices $A$ and $B$ are nonnegative definite, we have that $\lambda_i, \mu_j \geq 0$ for each $i, j = 1, \ldots, n$, which means that
$$
A \circ B = \sum_{i=1}^n \sum_{j=1}^n \lambda_i \mu_j (u_i u_i^\top) \circ (v_j v_j^\top) = \sum_{i=1}^n \sum_{j=1}^n \lambda_i \mu_j (u_i \circ v_j)(u_i \circ v_j)^\top
$$
is nonnegative definite. □
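The following quick numerical check of Proposition 3 is not part of the original text; it assumes numpy imported as np. It generates two random nonnegative definite matrices and confirms that the smallest eigenvalue of their elementwise (Hadamard) product is nonnegative up to rounding error.

n = 6
A = np.random.normal(size=(n, n)); A = A.T @ A     # nonnegative definite
B = np.random.normal(size=(n, n)); B = B.T @ B     # nonnegative definite
print(np.min(np.linalg.eigvalsh(A * B)))           # Hadamard product; smallest eigenvalue is >= 0 up to rounding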

Proof of Proposition 5 We only show the case in which $\phi(0) = \eta(E) = 1$ because the extension is straightforward. Suppose that (1.9) holds. Then, we have
$$\sum_{j=1}^n\sum_{k=1}^n z_j z_k\,\phi(x_j - x_k) = \int_E \sum_{j=1}^n z_j e^{ix_j t}\sum_{k=1}^n z_k e^{-ix_k t}\,d\eta(t) = \int_E\Big|\sum_{j=1}^n z_j e^{ix_j t}\Big|^2 d\eta(t) \ge 0,$$
and (1.8) follows. Conversely, suppose that (1.8) holds. Since the matrix consisting of $\phi(x_i - x_j)$ for the $(i,j)$th element is nonnegative definite and symmetric, we have that $\phi(x) = \phi(-x)$, $x \in \mathbb{R}$. If we substitute $n = 2$, $x_1 = u$, and $x_2 = 0$, then we obtain
$$[z_1, z_2]\begin{bmatrix} 1 & \phi(u)\\ \phi(u) & 1 \end{bmatrix}\begin{bmatrix} z_1\\ z_2 \end{bmatrix} \ge 0$$
and $\phi(u)^2 \le 1$ because the determinant is nonnegative. Since $\phi$ is bounded and continuous, it is uniformly continuous. On the other hand, $e^{-t^2/n}e^{-ixt}$ is uniformly continuous as well. In the following, we show that
$$f_n(x) := \frac{1}{2\pi}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}e^{-ix^\top t}\,dt$$


is a probability density function and that its characteristic function $\phi_n$ approaches $\phi$ as $n \to \infty$. If we verify the claim, by Lévy's convergence theorem [15], $\phi$ is a characteristic function. We show the $d = 1$ case first:
$$f_n(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}e^{-ixt}\,dt.$$
For $a > 0$, we have
$$\int_{-a}^{a} f_n(x)\,dx = \frac{1}{2\pi}\int_{-a}^{a}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}e^{-ixt}\,dt\,dx = \frac{1}{2\pi}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}\frac{2\sin at}{t}\,dt,$$
where we use Fubini's theorem for the last equality. Then, for $b > 0$, from $\int_0^b \frac{\sin at}{t}\,da = \frac{1-\cos bt}{t^2} \ge 0$, $\int_{-\infty}^{\infty}\frac{1-\cos t}{t^2}\,dt = \pi$, and $\phi(0) = 1$, as $b \to \infty$, we have
$$\frac{1}{b}\int_0^b\Big\{\int_{-a}^{a}f_n(x)\,dx\Big\}\,da = \frac{1}{b}\int_0^b\frac{1}{2\pi}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}\frac{2\sin at}{t}\,dt\,da$$
$$= \frac{1}{2\pi}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}\frac{2(1-\cos tb)}{t^2 b}\,dt = \frac{1}{2\pi}\int_{-\infty}^{\infty}\phi\Big(\frac{u}{b}\Big)e^{-(u/b)^2/n}\frac{2(1-\cos u)}{u^2}\,du \to 1,$$
where we use the dominated convergence theorem for the limit. In general, for a $g: \mathbb{R}\to\mathbb{R}$ that is monotonically increasing and bounded from above, we have
$$\lim_{y\to\infty}\frac{1}{y}\int_0^y g(x)\,dx = \lim_{x\to\infty} g(x).$$
Thus, we have $\displaystyle\int_{-\infty}^{\infty} f_n(x)\,dx = 1$.

Finally, we show that $\phi_n \to \phi$ $(n \to \infty)$:
$$\phi_n(z) := \lim_{a\to\infty}\int_{-a}^{a} e^{izx} f_n(x)\,dx = \lim_{a\to\infty}\frac{1}{2\pi}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}\frac{2\sin a(t-z)}{t-z}\,dt$$
$$= \lim_{b\to\infty}\frac{1}{b}\int_0^b\frac{1}{2\pi}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}\frac{2\sin a(t-z)}{t-z}\,dt\,da = \lim_{b\to\infty}\frac{1}{2\pi}\int_{-\infty}^{\infty}\phi(t)\,e^{-t^2/n}\frac{2(1-\cos b(t-z))}{b(t-z)^2}\,dt$$
$$= \lim_{b\to\infty}\frac{1}{2\pi}\int_{-\infty}^{\infty}\phi\Big(z+\frac{s}{b}\Big)e^{-(z+s/b)^2/n}\frac{2(1-\cos s)}{s^2}\,ds = \phi(z)\,e^{-z^2/n} \to \phi(z).$$


For a general $d \ge 1$, if we use $\|t\|_2^2 = t_1^2 + \cdots + t_d^2$,
$$\int_{-a_1}^{a_1}\cdots\int_{-a_d}^{a_d} e^{-i(x_1 t_1 + \cdots + x_d t_d)}\,dx_1\cdots dx_d = \frac{2\sin a_1 t_1}{t_1}\cdots\frac{2\sin a_d t_d}{t_d},$$
and
$$\int_0^{b_i}\frac{2\sin a_i t_i}{t_i}\,da_i = \frac{2(1-\cos t_i b_i)}{t_i^2 b_i}\quad (i = 1,\ldots,d),$$
then the same claim can be obtained. $\square$
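As an added numerical illustration of Proposition 5 (not part of the original argument): $\phi(x) = e^{-x^2/2}$ is the characteristic function of the standard Gaussian distribution, so the matrix with entries $\phi(x_i - x_j)$ should be nonnegative definite for any choice of points, which we can check directly.

import numpy as np

# Added check related to Proposition 5 (Bochner's theorem): phi(x) = exp(-x^2/2)
# is the characteristic function of N(0, 1), so (phi(x_i - x_j))_{i,j} should be
# nonnegative definite for any points x_1, ..., x_n.
np.random.seed(1)
x = np.random.uniform(-3, 3, size=20)
Phi = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)
print(np.min(np.linalg.eigvalsh(Phi)))   # >= 0 up to rounding error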



Exercises 1∼15

1. Show that the following three conditions are equivalent for a symmetric matrix $A \in \mathbb{R}^{n\times n}$.
(a) There exists a square matrix $B$ such that $A = B^\top B$.
(b) $x^\top A x \ge 0$ for an arbitrary $x \in \mathbb{R}^n$.
(c) All the eigenvalues of $A$ are nonnegative.
In addition, using Python, generate a square matrix $B \in \mathbb{R}^{n\times n}$ with real elements by generating random numbers to obtain $A = B^\top B$. Then, randomly generate five vectors $x \in \mathbb{R}^n$ ($n = 5$) to examine whether $x^\top A x$ is nonnegative for each value.
2. Consider the Epanechnikov kernel $k: E\times E\to\mathbb{R}$ defined by
$$k(x, y) = D\Big(\frac{|x-y|}{\lambda}\Big),\qquad D(t) = \begin{cases}\dfrac{3}{4}(1-t^2), & |t|\le 1\\ 0, & \text{otherwise}\end{cases}$$
for $\lambda > 0$. Suppose that we write the kernel for $\lambda > 0$ and $(x, y) \in E\times E$ in Python as shown below:

def k(x, y, lam):
    return D(np.abs((x - y) / lam))

Specify the function D using Python. Moreover, define the function f that makes a prediction at $z \in E$ based on the Nadaraya-Watson estimator by utilizing the function k such that $z, \lambda$ are the inputs of f and k, respectively, and $(x_1, y_1), \ldots, (x_N, y_N)$ are global. Then, execute the following to examine whether the functions D, f work properly (a possible sketch of D and f is given after the plotting code below).


n = 250
x = 2 * np.random.normal(size=n)
y = np.sin(2 * np.pi * x) + np.random.normal(size=n) / 4
plt.figure(num=1, figsize=(15, 8), dpi=80)
plt.xlim(-3, 3); plt.ylim(-2, 3)
plt.xticks(fontsize=14); plt.yticks(fontsize=14)
plt.scatter(x, y, facecolors="none", edgecolors="k", marker="o")
xx = np.arange(-3, 3, 0.1)
yy = [[] for _ in range(3)]
lam = [0.05, 0.35, 0.50]
color = ["g", "b", "r"]
for i in range(3):
    for zz in xx:
        yy[i].append(f(zz, lam[i]))
    plt.plot(xx, yy[i], c=color[i], label=lam[i])
plt.legend(loc="upper left", frameon=True, prop={"size": 14})
plt.title("Nadaraya-Watson Estimator", fontsize=20)
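The following is one possible sketch of D and f for Exercise 2 (added here; it is not the book's official solution). It assumes that the arrays x, y generated above are the global samples, and it must be run before the plotting cell above.

import numpy as np

# A possible answer sketch for Exercise 2 (added illustration).
# D is the Epanechnikov function; f is the Nadaraya-Watson estimate at z,
# using the global samples x, y defined above.
def D(t):
    return np.where(np.abs(t) <= 1, 3 / 4 * (1 - t ** 2), 0)

def k(x, y, lam):
    return D(np.abs((x - y) / lam))

def f(z, lam):
    w = k(x, z, lam)                 # kernel weights of all samples
    s = np.sum(w)
    return np.sum(w * y) / s if s > 0 else 0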

Replace the Epanechnikov kernel with the Gaussian kernel, the exponential type, and the polynomial kernel and execute them.
3. Show that the determinant of $A \in \mathbb{R}^{3\times 3}$ coincides with the product of the three eigenvalues. In addition, show that if the determinant is negative, at least one of the eigenvalues is negative.
4. Show that the Hadamard product of nonnegative definite matrices of the same size is nonnegative definite. Show also that the kernel obtained by multiplying positive definite kernels is positive definite.
5. Show that a square matrix whose elements consist of the same nonnegative value is nonnegative definite. Show further that a kernel that outputs a nonnegative constant is positive definite.
6. Find the feature map $\Phi_{3,2}(x_1, x_2)$ of the polynomial kernel $k_{3,2}(x, y) = (x^\top y + 1)^3$ for $x, y \in \mathbb{R}^2$ to derive $k_{3,2}(x, y) = \Phi_{3,2}(x_1, x_2)^\top\,\Phi_{3,2}(y_1, y_2)$.

7. Use Proposition 4 to show that the Gaussian and polynomial kernels and exponential types are positive definite. Show also that the kernel obtained by normalizing a positive definite kernel is positive definite. What kernel do we obtain when we normalize the exponential type and the Gaussian kernel?
8. The following procedure chooses the optimal parameter $\sigma^2$ of the Gaussian kernel via 10-fold CV when applying the Nadaraya-Watson estimator to the samples. Change the 10-fold CV procedure to the $N$-fold (leave-one-out) CV process to find the optimal $\sigma^2$, and draw the curve by executing the procedure below (a possible leave-one-out modification is sketched right after the procedure):

def K(x, y, sigma2):
    return np.exp(-np.linalg.norm(x - y) ** 2 / 2 / sigma2)

n = 100
x = 2 * np.random.normal(size=n)
y = np.sin(2 * np.pi * x) + np.random.normal(size=n) / 4


m = int(n / 10)
sigma2_seq = np.arange(0.001, 0.01, 0.001)
SS_min = np.inf
for sigma2 in sigma2_seq:
    SS = 0
    for k in range(10):
        test = range(k * m, (k + 1) * m)
        train = [i for i in range(n) if i not in test]
        for j in test:
            u, v = 0, 0
            for i in train:
                kk = K(x[i], x[j], sigma2)
                u = u + kk * y[i]
                v = v + kk
            if not v == 0:
                z = u / v
                SS = SS + (y[j] - z) ** 2
    if SS < SS_min:
        SS_min = SS
        sigma2_best = sigma2
print("Best sigma2 =", sigma2_best)
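One possible way to change the 10-fold CV above to leave-one-out ($N$-fold) CV is sketched below (an added illustration, not the book's official solution); it reuses K, x, y, and n defined above.

# A possible leave-one-out (N-fold) modification for Exercise 8 (sketch).
sigma2_seq = np.arange(0.001, 0.01, 0.001)
SS_min = np.inf
for sigma2 in sigma2_seq:
    SS = 0
    for j in range(n):              # each sample is its own test fold
        u, v = 0, 0
        for i in range(n):
            if i == j:
                continue            # leave sample j out
            kk = K(x[i], x[j], sigma2)
            u = u + kk * y[i]
            v = v + kk
        if v != 0:
            SS = SS + (y[j] - u / v) ** 2
    if SS < SS_min:
        SS_min = SS
        sigma2_best = sigma2
print("Best sigma2 =", sigma2_best)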

9. For a probability space $(E, \mathcal{F}, \mu)$ with $E = \{1, 2, 3, 4, 5, 6\}$ and a map $X: E\to\mathbb{R}$, show that if
$$X(e) = \begin{cases}1, & e = 1, 3, 5\\ 0, & e = 2, 4, 6\end{cases}$$
and $\mathcal{F} = \{\{1,2,3\}, \{4,5,6\}, \{\}, E\}$, then $X$ is not a random variable (not measurable).
10. Derive the characteristic function of the Gaussian distribution $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\{-\frac{(x-\mu)^2}{2\sigma^2}\}$ with a mean of $\mu$ and a variance of $\sigma^2$ and find the condition for the characteristic function to be a real function. Do the same for the Laplace distribution $f(x) = \frac{\alpha}{2}\exp\{-\alpha|x|\}$ with a parameter $\alpha > 0$.
11. Obtain the kernel value between the left tree and itself in Fig. 1.4. Construct and execute a program to find this value.
12. Randomly generate binary sequences x, y of length 10 to obtain the string kernel value k(x, y).

def string_kernel(x, y):
    m, n = len(x), len(y)
    S = 0
    for i in range(m):
        for j in range(i, m):
            for k in range(n):
                if x[(i - 1):j] == y[(k - 1):(k + j - i)]:
                    S = S + 1
    return S

13. Show that the string, tree, and marginalized kernels are positive definite. Show also that the string and graph kernels are convolutional and marginalized kernels, respectively. 14. How can we compute the path probabilities below when we consider a random walk in the directed graph of Fig. 1.5 if the stopping probability is p = 1/3?


(a) 3 → 1 → 4 → 3 → 5, (b) 1 → 2 → 4 → 1 → 2, (c) 3 → 5 → 3 → 5.
(One way to evaluate these probabilities with the functions defined earlier is sketched after Exercise 15 below.)
15. What inconvenience occurs when we execute the procedure below to compute a graph kernel? Illustrate this inconvenience with an example.

def k(s, p):
    return prob(s, p) / len(node)

def prob(s, p):
    if len(node[s[0]]) == 0:
        return 0
    if len(s) == 1:
        return p
    m = len(s)
    S = (1 - p) / len(node[s[0]]) * prob(s[1:m], p)
    return S
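The following is a possible way to evaluate the three path probabilities of Exercise 14 (an added illustration, not the book's answer). It assumes the adjacency list node and the functions k, prob defined above; the paths are written with 1-based vertex numbers as in Fig. 1.5, so each vertex is shifted to the 0-based indexing used by node.

# Added sketch for Exercise 14: evaluate the probabilities of the three paths
# with p = 1/3, converting the 1-based vertex numbers of Fig. 1.5 to 0-based.
paths = [[3, 1, 4, 3, 5], [1, 2, 4, 1, 2], [3, 5, 3, 5]]
for path in paths:
    s = [v - 1 for v in path]
    print(path, k(s, 1 / 3))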

Chapter 2

Hilbert Spaces

When considering machine learning and data science issues, in many cases, the calculus and linear algebra courses taken during the first year of university provide sufficient background. However, kernels additionally require knowledge of metric spaces and their completeness, as well as linear algebra in spaces that are not finite-dimensional. If your major is not mathematics, you may have had few opportunities to study these topics, and it may be challenging to learn them in a short period. This chapter covers Hilbert spaces, the projection theorem, linear operators, and (some of) the compact operators necessary for understanding kernels. Unlike finite-dimensional linear spaces, general Hilbert spaces require scrutiny of their completeness.

2.1 Metric Spaces and Their Completeness

Let $M$ be a set. We say that a bivariate function $d: M\times M\to\mathbb{R}$ is a distance if
1. $d(x, y)\ge 0$;
2. $d(x, y) = 0 \iff x = y$;
3. $d(x, y) = d(y, x)$; and
4. $d(x, z)\le d(x, y) + d(y, z)$
for $x, y, z\in M$, and we call the pair $(M, d)$ a metric space.¹ Let $E$ be a subset of the metric space $M$. We say that $E$ is an open set if a positive constant $\epsilon$ exists such that $U(x, \epsilon) := \{y\in M \mid d(x, y) < \epsilon\}\subseteq E$ for each $x\in E$. Moreover, we say that $y\in M$ is a convergence point of $E$ if $U(y, \epsilon)\cap E \ne \{\}$ for an arbitrary $\epsilon > 0$, and $E$ is a closed set if $E$ contains all the convergence points of $E$.

¹ We call $M$ a metric space rather than $(M, d)$ when we do not stress $d$ or when $d$ is apparent.
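As a quick added illustration (not in the original text), the Euclidean distance satisfies these four conditions; the sketch below spot-checks the triangle inequality for random points in $\mathbb{R}^3$.

import numpy as np

# Added spot-check: the Euclidean distance d(x, y) = ||x - y|| satisfies the
# triangle inequality d(x, z) <= d(x, y) + d(y, z).
np.random.seed(0)
d = lambda u, v: np.linalg.norm(u - v)
for _ in range(1000):
    x, y, z = np.random.normal(size=(3, 3))
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-12
print("triangle inequality holds for all sampled triples")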


Example 24 The set $M = [0, 1]$ is a closed set because the neighborhood $U(y, \epsilon)$ of $y\notin M$ has no intersection with $M$ if we make the radius $\epsilon > 0$ small enough, which means that $M$ contains all the convergence points of $M$. On the other hand, $M = (0, 1)$ is an open set because $M$ contains the neighborhood $U(y, \epsilon)$ of $y\in M$ if we make the radius $\epsilon > 0$ small enough. If we add $\{0\}, \{1\}$ to the intervals $(0, 1), (0, 1], [0, 1)$, we obtain the closed set $[0, 1]$.

We say that the minimum closed set in $M$ that contains $E$ is the closure of $E$, and we write this as $\overline{E}$. If $E$ is not a closed set, then $E$ does not contain all of its convergence points. Thus, the closure is the set of convergence points of $E$. Moreover, we say that $E$ is dense in $M$ if $\overline{E} = M$, which is equivalent to each of the following conditions: "for an arbitrary $\epsilon > 0$ and $x\in M$, there exists a $y\in E$ such that $d(x, y) < \epsilon$", and "each point in $M$ is a convergence point of $E$". Furthermore, we say that $M$ is separable if it contains a dense subset that consists of countably many points.

Example 25 For the distance $d(x, y) := |x - y|$ with $x, y\in\mathbb{R}$ and the metric space $(\mathbb{R}, d)$, each irrational number $a\in\mathbb{R}\setminus\mathbb{Q}$ is a convergence point of $\mathbb{Q}$. In fact, for an arbitrary $\epsilon > 0$, the interval $(a-\epsilon, a+\epsilon)$ contains a rational number $b\in\mathbb{Q}$. Thus, $\mathbb{Q}$ does not contain the convergence point $a\notin\mathbb{Q}$ and is not a closed set in $\mathbb{R}$. Moreover, the closure of $\mathbb{Q}$ is $\mathbb{R}$ ($\mathbb{Q}$ is dense in $\mathbb{R}$). Furthermore, since $\mathbb{Q}$ is a countable set, we find that $\mathbb{R}$ is separable.

Let $(M, d)$ be a metric space. We say that a sequence $\{x_n\}$ in² $M$ converges to $x\in M$ if $d(x_n, x)\to 0$ as $n\to\infty$, and we write this as $x_n\to x$. On the other hand, we say that a sequence $\{x_n\}$ in $M$ is Cauchy if $d(x_m, x_n)\to 0$ as $m, n\to\infty$, i.e., if $\sup_{m,n\ge N} d(x_m, x_n)\to 0$ as $N\to\infty$. If $\{x_n\}$ converges to some $x\in M$, then it is a Cauchy sequence. However, the converse does not hold. We say that a metric space $(M, d)$ is complete if each Cauchy sequence $\{x_n\}$ in $M$ converges to an element in $M$. We say that $(M, d)$ is bounded if there exists a $C > 0$ such that $d(x, y) < C$ for arbitrary $x, y\in M$; the minimum of the upper bounds and the maximum of the lower bounds are the upper and lower limits when $M$ is bounded from above and below, respectively.

Example 26 An arbitrary Cauchy sequence is bounded. In fact, for any $\epsilon > 0$, we can choose $N := N(\epsilon)$ such that $m, n\ge N \implies d(x_m, x_n) < \epsilon$, and we have
$$\min\{x_1,\ldots,x_{N-1}, x_N - \epsilon\}\le x_n\le\max\{x_1,\ldots,x_{N-1}, x_N + \epsilon\}.$$

Example 27 ($\mathbb{Q}$ is Not Complete) The sequence $\{a_n\}$ defined by $a_1 = 1$, $a_{n+1} = \dfrac{a_n}{2} + \dfrac{1}{a_n}$ $(n\ge 1)$ is in $\mathbb{Q}$. However, we can prove that $\{a_n\}$ is a Cauchy sequence in $\mathbb{Q}$ but $a_n\to\sqrt{2}\notin\mathbb{Q}$ (Exercise 17).

Proposition 6 $\mathbb{R}$ is complete.

² $\{x_n\}$ with $x_n\in M$ for each $n$.


Proof: If $\{x_n\}$ is a Cauchy sequence in $\mathbb{R}$, then $\{x_n\}$ is bounded (Example 26). If we write the upper and lower limits of $\{x_n\}_{n=s}^{\infty}$ as $l_s, m_s$, respectively, the monotone sequences $\{m_s\}, \{l_s\}$ in $\mathbb{R}$ share the same limit. In fact, from the above assumption, we can make $l_s - m_s = \sup\{|x_p - x_q| : p, q \ge s\}$ as small as possible. Thus, $\mathbb{R}$ is complete. $\square$

If the number of dimensions is finite, we may check completeness for each dimension, and we see that $\mathbb{R}^p$ is complete for any $p \ge 1$.

Suppose that we arbitrarily set a neighborhood $U(P)$ for each $P\in M$ beforehand. We say that a set $M$ is compact if there exist finite $m$ and $P_1,\ldots,P_m\in M$ such that $M\subseteq\cup_{i=1}^m U(P_i)$.

Example 28 Let $M = (0, 1)$, and suppose that we define the neighborhood $U(x) := (\frac{1}{2}x, \frac{3}{2}x)$ for each $x\in M$ beforehand. Then, for any $n$ and $x_1,\ldots,x_n\in M$, we have
$$(0, 1)\not\subseteq\cup_{i=1}^n\Big(\frac{1}{2}x_i, \frac{3}{2}x_i\Big),$$
which means that $M$ is not compact.

Proposition 7 (Heine-Borel) For $\mathbb{R}^p$, any bounded closed set $M$ is compact.

Proof: Suppose that we have set a neighborhood $U(P)$ for each $P\in M$ and that $M\subseteq\cup_{i=1}^m U(P_i)$ cannot be realized by any $m$ and $P_1,\ldots,P_m$. If we divide the closed set (rectangle) that contains $M\subseteq\mathbb{R}^p$ into two components for each dimension, then at least one of the $2^p$ rectangles cannot be covered by a finite number of neighborhoods. If we repeat this procedure, then the volume of the rectangle that a finite number of neighborhoods cannot cover becomes sufficiently small for its center to converge to a $P^*\in M$; furthermore, we can cover the rectangle with $U(P^*)$, which is a contradiction. $\square$

Let $(M_1, d_1), (M_2, d_2)$ be metric spaces. We say that the map $f: M_1\to M_2$ is continuous at $x\in M_1$ if for any $\epsilon > 0$, there exists $\delta(x, \epsilon)$ such that for $y\in M_1$,
$$d_1(x, y) < \delta(x, \epsilon) \implies d_2(f(x), f(y)) < \epsilon. \qquad (2.1)$$

In particular, if there exists a $\delta(x, \epsilon)$ that does not depend on $x\in M_1$ in (2.1), we say that $f$ is uniformly continuous.

Example 29 The function $f(x) = 1/x$ defined on the interval $(0, 1]$ is continuous but is not uniformly continuous. In fact, if we make $x$ approach $y$ after fixing $y$, we can make $d_2(f(x), f(y)) = |\frac{1}{x} - \frac{1}{y}|$ as small as possible, which means that $f$ is continuous in $(0, 1]$. However, when we make $x$ approach $y$ to make $d_2(f(x), f(y))$ smaller than a constant, we observe that for each $\epsilon > 0$, the smaller $y$ is, the smaller $d_1(x, y) = |x - y|$ should be. Thus, no $\delta(\epsilon)$ for which $d_1(x, y) < \delta(\epsilon)\implies d_2(f(x), f(y)) < \epsilon$ exists if $\delta(\epsilon)$ does not depend on $x, y\in M$ (Fig. 2.1).
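As an added numerical illustration of Example 29 (not in the original text): for a fixed $\epsilon$, the largest admissible $\delta$ at a point $y$ works out to $\delta(y) = \epsilon y^2/(1+\epsilon y)$ (a short calculation made for this sketch), which shrinks to zero as $y$ approaches 0.

# Added illustration of Example 29: for f(x) = 1/x and fixed eps, the largest
# delta that keeps |f(x) - f(y)| < eps for all |x - y| < delta is
# delta(y) = eps * y**2 / (1 + eps * y), which vanishes as y -> 0.
eps = 0.1
for y in [0.5, 0.1, 0.01, 0.001]:
    delta = eps * y ** 2 / (1 + eps * y)
    print(f"y = {y:6}: largest delta = {delta:.2e}")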



Fig. 2.1 The function f (x) = 1/x is not uniformly continuous over (0, 1]. To make the | f (x) − f (y)| value smaller than a constant, we need to make the |x − y| value smaller when x, y are close to 0 (red lines) than that when x, y are far away from 0 (blue lines). Thus, δ > 0 depends on the locations of x, y


Proposition 8 A continuous function over a bounded closed set is uniformly continuous.

Proof: Let $f: M\to\mathbb{R}$ be a continuous function defined over a bounded closed set $M$. Because the function $f$ is continuous, for an arbitrary $\epsilon > 0$, there exists an $\epsilon(z)$ for each $z\in M$ such that
$$d_1(x, z) < \epsilon(z)\implies d_2(f(x), f(z)) < \epsilon. \qquad (2.2)$$
From Proposition 7, we can prepare a finite number of neighborhoods to cover $M$. Let $U_1,\ldots,U_m$ be neighborhoods with centers $z_1,\ldots,z_m$ and radii $\epsilon(z_1)/2,\ldots,\epsilon(z_m)/2$. Suppose that we choose $x, y\in M$ such that $d_1(x, y) < \delta := \frac{1}{2}\min_{1\le i\le m}\epsilon(z_i)$. Because $x$ belongs to one of the $U_1,\ldots,U_m$, without loss of generality, we assume that $x\in U_i$. From the distance property, we have $d_1(x, z_i) < \epsilon(z_i)/2 < \epsilon(z_i)$ and $d_1(y, z_i)\le d_1(x, y) + d_1(x, z_i) < \epsilon(z_i)$, so that $d_2(f(x), f(y))\le d_2(f(x), f(z_i)) + d_2(f(y), f(z_i)) < 2\epsilon$. Since $\epsilon$ >
0 is arbitrary and δ does not depend on x, y, f is uniformly continuous.  Example 30 We can prove that “a definite integral exists for a continuous function defined over a closed interval [a, b]” by virtue of Proposition 8. If we divide a < b

2.1 Metric Spaces and Their Completeness

33

into n equal-length segments as a = x0 < . . . < xn = b, then for an arbitrary  > 0, we require  n  b−a n

i=1

 sup

xi−1 0 such that x1 ≤ δx  ≤ δ, we have δ= ⇒ T x2 ≤ 1. Since  x T x2 = T (

x1 x1 δx ≤ )2 x1 δ δ

for any x = 0. On the other hand, if T is bounded, there exists a constant C that does not depend on x ∈ X 1 such that T (xn − x)2 ≤ Cxn − x1 for any {xn } and an x ∈ X 1 such that xn → x as n → ∞. Hereafter, we define the operator norm of T ∈ B(X 1 , X 2 ) by T  :=

sup

x∈X 1 ,x1 =1

T x2 .



(2.12)

44

2 Hilbert Spaces

Thus, for an arbitrary x ∈ X 1 , we have T x2 ≤ T  x1 . Example 41 Let X 1 := R p and X 2 := Rq . If the norm is the Euclidean norm, then we can write the linear operator T : R p → Rq as T : x → Bx by using some B ∈ Rq× p . If the matrix B is square, the norm T  is the square root of the maximum eigenvalue of the nonnegative definite matrix A := B  B. T 2 = max x  B  Bx = max Bx2 . x=1

x=1

Example 42 For K : [0, 1]2 → R, let  0

1



1

K 2 (x, y)d xd y

0

be finite. We define the integral operator by the linear operator T in L 2 [0, 1] such that  1 K (·, x) f (x)d x (2.13) (T f )(·) = 0

for f ∈ L 2 [0, 1]. Note that (2.13) belongs to L 2 [0, 1] and that T is bounded: From 

1

|(T f )(x)|2 ≤ 0



1

K 2 (x, y)dy 0

 f 2 (y)dy =  f 22

1

K 2 (x, y)dy ,

0

we have  T f 22 =

1 0

 |(T f )(x)|2 d x ≤  f 22

1



0

1

K 2 (x, y)d xd y .

0

We call such a K an integral operator kernel and distinguish between the positive definite kernels we deal with in this book. In particular, we call any linear operator with X 2 = R a linear functional. Proposition 22 (Riesz’s Representation Theorem) Let H be a Hilbert space with an inner product ·, · and a norm of  · , and let T ∈ B(H, R). Then, there exists a unique eT ∈ H such that T f =  f, eT  , f ∈ H and T  = eT .

(2.14)

2.5 Linear Operators

45

Proof: See the appendix at the end of this chapter. Example 43 (RKHS) Let x ∈ E, and let Tx : H → R be the map from f ∈ H to f (x). Then, Tx is linear because Tx (a f + bg) = (a f + bg)(x) = a f (x) + bg(x) = aTx ( f ) + bTx (g) . We assume that Tx is bounded for each x ∈ E. Then, from Proposition 22, there exists a k x ∈ H such that f (x) = Tx ( f ) =  f, k x  x ∈ E, and Tx  = k x . Proposition 23 (Adjoint Operator) Let Hi be a Hilbert space with an inner product ·, ·i for i = 1, 2 and T ∈ B(H1 , H2 ). Then, there exists a T ∗ ∈ B(H2 , H1 ) such that T x1 , x2 2 = x1 , T ∗ x2 1 , x1 ∈ H1 , x2 ∈ H2 . Proof: If we fix x2 ∈ H2 and regard T x1 , x2 2 as a function of x1 ∈ H1 , then from x1 → T x1 , x2 2 ≤ x1 2 x2 2 , T is a bounded operator w.r.t. x1 ∈ H1 . From Proposition 22, for each x2 ∈ H2 , there exists y2 (x2 ) ∈ H1 such that T x1 , x2 2 = x1 , y(x2 )1 . If we define T ∗ x2 = y2 (x2 ), then T ∗ is a bounded linear map. The boundness property is due to T ∗ x2 21 = |x2 , T T ∗ x2 |2 ≤ T  T ∗ x2 1 x2 2 .  We call the T ∗ in Proposition 23 the adjoint operator of T . In particular, if T ∗ = T , we call such an operator T self-adjoint. Example 44 Let H = R p . We can express any T ∈ B(H ) by a square matrix T ∈ R p× p . From T x, y = x  T  y = x, T  y , we see that the adjoint T ∗ is the transpose matrix of T  and that T can be written as a symmetric matrix if and only if T is self-adjoint. Example 45 For the integral operator of L 2 [0, 1] in Example 42, from Fubini’s theorem, we have that  T f, g = 0

1



1 0



1

K (x, y) f (x)g(y)d xd y =  f,

K (y, ·)g(y)dy ,

0

1 and y → (T ∗ g)(y) = 0 K (x, y)g(x)d x is an adjoint operator. If the integral operator kernel K is symmetric, the operator T is self-adjoint.

46

2 Hilbert Spaces

2.6 Compact Operators Let (M, d) and E be a metric space and a subset of M, respectively. If any infinite sequence in E contains a subsequence that converges to an element in E, then we say that E is sequentially compact. If {xn } has a subsequence that converges to x, then x is a convergence point of {xn }. Example 46 Let E := R and d(x, y) := |x − y| for x, y ∈ R. Then, E is not sequentially compact. In fact, the sequence xn = n has no convergence points. For / (0, 1] as n → ∞, and the conE = (0, 1], the sequence xn = 1/n converges to 0 ∈ vergence point of any subsequence is only 0. Therefore, E = (0, 1] is not sequentially compact. Proposition 24 Let (M, d) and E be a metric space and a subset of M, respectively. Then, E is sequentially compact if and only if E is compact. Proof: Many books on geometry deal with the proof of equivalence. See such books for the details of this proof. In this section, we explain compactness by using the terminology of sequential compactness. Let X 1 , X 2 be linear spaces equipped with norms, and let T ∈ B(X 1 , X 2 ). We say that T is compact if {T xn } contains a convergence subsequence for any bounded sequence {xn } in X 1 . Example 47 The orthonormal basis {e j } in a Hilbert space H is √ bounded because e j  = 1. However, for an identity map, we have that ei − e j  = 2 for any i = j. Thus, the sequence e1 , e2 , . . . does not have any convergence points in H . Hence, the identity operator for any infinite-dimensional Hilbert space is not compact. Proposition 25 For any bounded linear operator T , the following hold. 1. If the rank is finite, then the operator T is compact. 2. If a sequence of finite-rank operators {Tn } exists such that Tn − T  → 0 as n → ∞, then T is compact8 . Proof: See the appendix at the end of this chapter. Let H and T ∈ B(H ) be a Hilbert space and its bounded linear operator, respectively. If λ ∈ R and 0 = e ∈ H exist such that T e = λe ,

(2.15)

then we say that λ and e are an eigenvalue and an eigenvector of T , respectively. Proposition 26 Let T ∈ B(H ) and e j ∈ Ker(T − λ j I ) for j = 1, 2, . . .. If the eigenvalues λ j = 0 have different values, then 8

It is known that the converse is true.

2.6 Compact Operators

47

1. e j is linearly independent. 2. If T is self-adjoint, then {e j } are orthogonal. Proof: See the appendix at the end of this chapter. Example 48 Let T ∈ B(H ) be a compact operator. For each eigenvalue λ = 0, the eigenspace Ker(T − λI ) has a finite dimensionality. In fact, if Ker(T − λI ) is of infinite dimensionality for an eigenvalue λ = 0, then λ contains infinitely many eigenvectors e j , and if we apply the operator T to them, then as in Example 47, {λe j } does not have any convergence subsequence. Thus, T is not compact, which is a contradiction. Example 49 For any C > 0, the absolute values of a finite number of eigenvalues λi for a compact operator T exceed C. Suppose that the absolute values of an infinite number of eigenvalues λ1 , λ2 , . . . exceed C. Let M0 := {0}, Mi := span{e1 , . . . , ei }, e j ∈ Ker(T − λ j I ), j = 1, 2, . . ., i = 1, 2, . . .. Since the {e1 , . . . , ei } are linearly ⊥ is one dimensional for i = 1, 2, . . .. Thus, if we independent, each Mi ∩ Mi−1 ⊥ , i = 1, 2, . . . via Gramdefine the orthonormal sequence xi ∈ Ker(T − λi I ) ∩ Mi−1 Schmidt, then we have T xi − T xk 2 = T xi 2 + T xk 2 ≥ 2C 2 for i > k. Thus, {T xi } has no convergence subsequence. Example 49 implies that the set of nonzero eigenvalues of T is countable. We summarize the above discussion and its implications below. Proposition 27 Let T be a self-adjoint compact operator of a Hilbert space H . Then, the set of nonzero eigenvalues of T is finite, or the sequence of eigenvalues converges to zero. Each eigenvalue has a finite multitude, and any pair of eigenvectors corresponding to different eigenvalues is an orthogonal pair. Let λ1 , λ2 , . . . be a sequence of eigenvalues such that |λ1 | ≥ |λ2 | ≥ · · · , and let e1 , e2 , . . . be any corresponding eigenvectors that are orthogonal (orthogonalized eigenvectors via Gram-Schmidt) if they possess the same eigenvalue. Then, {e j } is the orthonormal basis of Im(T ), and we can express T by Tx =

∞ 

λ j x, e j e j

(2.16)

j=1

for each x ∈ H . Proof: We utilize the following steps, where the second item is equivalent to (Ker(T ))⊥ = Im(T ) because T = T ∗ . 1. Show that H = Ker(T ) ⊕ (Ker(T ))⊥ . 2. Show that (Ker(T ))⊥ = Im(T ∗ ).

48

2 Hilbert Spaces

3. Show that span{e j | j ≥ 1} ⊆ Im(T ). 4. Show that span{e j | j ≥ 1} ⊇ Im(T ). See the appendix at the end of this chapter. We say that an operator T is nonnegative definite if ∞ ∞ ∞    T x, x =  λi x, ei ei , x, e j e j  = λi x, ei 2 ≥ 0 i=1

for arbitrary H  x = λ2 ≥ 0, . . ..

j=1



i=1 x, ei ei ;



i=1

this condition is equivalent to λ1 ≥ 0,

Proposition 28 If T is nonnegative definite, we have λk =

max

e∈span{e1 ,...,ek−1

}⊥

T e, e e2

(2.17)

which expresses the maximum value over the Hilbert space H when k = 1. Proof: The claim follows from (2.16) and λ j ≥ 0: max

e∈{e1 ,...,ek−1 }⊥ e=1

T e, e = max

e=1

∞ 

λ j e, e j 2 = λk .

j=k

 Let H1 , H2 be Hilbert spaces, {ei } an orthonormal basis of H1 , and T ∈ B(H1 , H2 ). If ∞ 

T ei 2

i=1

takes a finite value, we say that T is a Hilbert-Schmidt (HS) operator, and we write the set of HS operators in B(H1 , H2 ) as B H S (H1 , H2 ). We define the inner product of T1 , T2 ∈ B H S (H1 , H2 ) and the HS norm of T ∈ B H S (H1 , H2 ) by T1 , T2  H S := ∞ j=1 T1 e j , T2 e j 2 and  T  H S :=

1/2 T, T  H S

=

∞ 

1/2 T ei 22

,

i=1

respectively. Proposition 29 The HS norm value of T ∈ B(H1 , H2 ) does not depend on the choice of orthonormal basis {ei }. bases of Hilbert spaces H1 , H2 , Proof: Let {e1,i }, {e2, j } be arbitrary orthonormal ∗ and let T1 , T2 ∈ B(H1 , H2 ). Then, for Tk e1,i = ∞ j=1 Tk e1,i , e2, j 2 e2, j , Tk e2, j = ∞ ∗ i=1 Tk e2, j , e1,i 1 e1,i , and k = 1, 2, we have

2.6 Compact Operators

49

∞ ∞ ∞    T1 e1,i , T2 e1,i 2 = T1 e1,i , e2, j 2 T2 e1,i , e2, j 2 i=1

=

∞  ∞ 

i=1 j=1

e1,i , T1∗ e2, j 1 e1,i , T2∗ e2, j 1 =

i=1 j=1

∞  T1∗ e2, j , T2∗ e2, j 1 , i=1

which means that both sides do not depend on the choices of {e1,i }, {e2, j }. In particular, if T1 = T2 = T , we see that T 2H S does not depend on the choices of {e1,i }, {e2, j }.  Proposition 30 An HS operator is compact. Proof: Let T ∈ B(H1 , H2 ) be an HS operator, x ∈ H1 , and Tn x :=

n  T x, e2i 2 e2i , i=1

where {e2i } is an orthonormal basis of H2 . Since the image of Tn is of finite dimensionality, Tn is compact. Thus, from the second item of Proposition 25, it is sufficient to show that T − Tn  → 0 as n → ∞. However, since (T − Tn )x = ∞ T x, e2,i 2 e2,i , we have that when x1 ≤ 1, i=n+1 ∞ 

(T − Tn )x22 =

T x, e2,i 22 =

i=n+1

∞ 

x, T ∗ e2,i 21 ≤

i=n+1

∞ 

T ∗ e2,i 2 .

i=n+1

Because T ∗ is an HS operator, the right-hand side converges to zero, where T is an HS operator if and only if T ∗ is an HS operator due to the derivation in Proposition 29.  Example 50 When an operator is expressed by a matrix T = (Ti, j ) such that T ∈ B(Rm , Rn ), m, n ≥ 1, the HS norm becomes the squared sum of the mn elements of this matrix. In fact, if T is expressed by a matrix Rn×m , then the HS norm is the Frobenius norm: T 2H S =

n 

T e X,i 2 =

n 

i=1

T ∗ eY, j 2 =

j=1

m  n 

Ti,2j ,

i=1 j=1

where e X,i ∈ Rm is a column vector such that the ith element is one and the other elements are zeros, and eY, j ∈ Rn is a column vector such that the jth element is one and the other elements are zeros. Let T ∈ B(H ) be nonnegative definite and {ei } be an orthonormal basis of H . If T T R :=

∞  T e j , e j  j=1

50

2 Hilbert Spaces

is finite, we say that T T R is the trace norm of T and that T is a trace class. Similar to an HS norm value, a trace norm value does not depend on the choice of orthonormal basis {e j }. If we substitute x = e j into (2.16) in Proposition 27, then we have T x = λe j and obtain that ∞ ∞   T T R := T e j , e j  = λj . j=1

j=1

On the other hand, from T 2H S =

∞ ∞  

T ei,1 , e j,2 2 =

i=1 j=1

we have

T  H S ≤ λ1

∞ 

∞ 

λ2j ,

j=1

1/2 λi

=



λ1 T T R .

i=1

Thus, we have established the following proposition. Proposition 31 If T ∈ B(H ) is a trace class, it is a compact HS class.

Appendix: Proofs of Propositions Proof of Proposition 13 We show that a simple function approximates an arbitrary f ∈ L 22 and that a continuous function approximates an arbitrary simple function. Hereafter, we denote the L 2 norm by  · . Since f ∈ L 2 is measurable, if f is nonnegative, the sequence { f n } of simple functions defined by f n (ω) =

(k − 1)2−n , (k − 1)2−n ≤ f (ω) < k2−n , 1 ≤ k ≤ n2n n, n ≤ f (ω) ≤ ∞

satisfies 0 ≤ f 1 (ω) ≤ f 2 (ω) ≤ · · · ≤ f (ω) and | f n (ω) − f (ω)|2 → 0 almost surely. Since the right-hand side of | f n (ω) − f (ω)|2 ≤ 4{ f (ω)}2 is finite when integrated, from the dominant convergence theorem, we have  f n − f 2 → 0 . We can show a similar derivation for a general f that is not necessarily nonnegative, as derived in Chap. 1.

Appendix: Proofs of Propositions

51

On the other hand, let A be a closed subset of [a, b], and let K A be the indicator function (K A (e) = 1 if e ∈ A; K A (e) = 0 otherwise). If we define h(x) := 1 , then gnA is continuous, gnA (x) ≤ 1 for inf y∈A {|x − y|} and gnA (x) := 1 + nh(x) x ∈ [a, b], gnA (x) = 1 for x ∈ A, and lim gnA (x) = 0 for x ∈ B := [a, b]\A. Thus, n→∞ we have  lim gnA n→∞

1/2

− K A  = lim

n→∞

B

gnA (x)2 d x



1/2

= B

lim g A (x)2 d x n→∞ n

=0,

where the second equality follows from the dominant convergence theorem. More over, if A, A are disjoint, then αgnA + α gnA with α, α > 0 approximates αK A +  α K A . In fact, we have 



αgnA + α gnA − (αK A + α K A ) ≤ αgnA − K A  + α gnA − K A  . Hence, a sequence of continuous functions can approximate an arbitrary simple function. 

Proof of Proposition 14 Suppose that { f n } is a Cauchy sequence in L 2 , which means that lim sup  f m − f n 2 = 0 .

N →∞ m,n≥N

(2.18)

Then, there exists a sequence {n k } such that ∞  ∞      | f n k+1 − f n k | ≤  f n k+1 − f n k 2 < ∞ .    k=1

2

k=1

Thus, almost surely, we have ∞ 

| f n k+1 (x) − f n k (x)| < ∞ .

(2.19)

k=1

For arbitrary r < t and x ∈ E, from the triangle inequality, we have | f nr (x) − f n t (x)| ≤

t−1 

| f n k+1 (x) − f n k (x)| .

k=r

Combined with (2.19), the real sequence { f n k (x)}∞ k=1 is almost surely Cauchy. Since the entire real system is complete (Proposition 6), we define f (x) := limk→∞ f n k (x)

52

2 Hilbert Spaces

for x ∈ E such that { f n k (x)}∞ k=1 is Cauchy, and we define f (x) := 0 for the other x ∈ E. From (2.18), for an arbitrary  > 0, we have   f − f n 2 =

 E

| f n − f |2 dμ =

 lim inf | f n − f n k |2 dμ ≤ lim inf

E k→∞

k→∞

E

| f n − f n k |2 dμ < 

as n → ∞, where the first inequality is due to Fatou’s lemma. Furthermore, since  f n , f − f n ∈ L 2 and L 2 is a linear space, we have f ∈ L 2 .

Proof of Proposition 15 The first item holds because 0 ≤ x −

n n   x, ei ei 2 = x2 − x, ei 2 i=1

i=1

for all n. For the second item, letting n > m, sn := sn − sm  =  2

n 

x, ek ek ,

k=m+1

n 

n

k=1 x, ek ek , n 

x, ek ek  =

k=m+1

we have

|x, ek |2 ,

k=m+1

which diminishes as n, m → ∞ according to the first item. For the third item, we have n n n    sn − sm 2 =  αk ek , αk ek  = αk2 = Sn − Sm k=m+1

k=m+1

k=m+1

n n for sn := i=1 αi ei , Sn := i=1 αi2 , and n > m. Thus, the third item follows from the equivalence: {sn } is Cauchy ⇐⇒ {Sn } is Cauchy. n  The last item holds because y, ei = lim  α j e j , ei  = αi for y = ∞ j=1 α j e j , n→∞

j=1

which follows from the continuity of inner products (Proposition 9).



Proof of Proposition 16 For 1.= ⇒6., since ∞{ei } is an orthonormal basis of H , we may write an arbitrary αi ei , αi ∈ R. From the fourth item of Proposition we have x ∈ H as x = i=1 15, ∞ x, ei  and obtain 6. 6.= ⇒5. is obtained by substituting x = i=1 x, ei ei , αi = ∞ y, ei ei into x, y. 5.= ⇒4. is obtained by substituting x = y in 5. 4.= ⇒3. y = i=1 is due to n n   2 2 x, ek ek  = x − |x, ek |2 → 0 x − k=1

k=1

as n → ∞ for each x ∈ H . For 3.= ⇒2., note the implication x, ek  = 0, k = 1, 2, . . . = ⇒ x ⊥ span{ek }, which implies that x ⊥ span{ek } from the continuity of inner prod-

Appendix: Proofs of Propositions

53

ucts (Proposition 9). Thus, we have ⇒1., from the sec ∞x, x = 0 and x = 0. For 2.= z, ei ei converges for each z ∈ H . Therefore, ond item of Proposition 15, y = i=1 for each j, we have n  z − y, e j  = z, e j  − lim  z, ei , e j  = z.e j  − z.e j  = 0 . n→∞

i=1

From ∞ the assumption of 2., we have that z − y = 0 and that z can be written as  i=1 z, ei ei .

Proof of Proposition 19 Let M be a closed subset of H . We show that for each x ∈ H , there exists a unique y ∈ M that minimizes x − y and that we have x − y, z − y ≤ 0

(2.20)

for z ∈ M. To this end, we first show that any sequence {yn } in M for which lim x − yn 2 = inf x − y2

n→∞

y∈M

(2.21)

is Cauchy. Since M is a linear space, we have (yn + ym )/2 ∈ M and yn + ym 2  2 2 2 2 ≤ 2x − yn  + 2x − ym  − 4 inf x − y → 0 .

yn − ym 2 = 2x − yn 2 + 2x − ym 2 − 4x − y∈M

Hence, {yn } is Cauchy. Then, suppose that more than one lower limit y exists, and let u = v be such a y. For example, for {yn }, let y2m−1 → u, and let y2m → v satisfy (2.21). However, this limit is not Cauchy and contradicts the discussion shown thus far. Hence, the y that achieves the limit in (2.21) is unique. In the following, we assume that y gives the lower limit. Moreover, note that x − {az + (1 − a)y}2 ≥ x − y2 ⇐⇒ 2ax − y, z − y ≤ a 2 z − y2 for arbitrary 0 < a < 1 and z ∈ M, and if x − y, z − y > 0, the inequality flips for small a > 0. Thus, we have x − y, z − y ≤ 0. Finally, if we substitute z = 0, 2y into (2.20), we have x − y, y = 0. Therefore, (2.20) implies that x − y, z ≤ 0 for z ∈ M. We obtain the proposition by replacing z with −z. 

54

2 Hilbert Spaces

Proof of Proposition 22 If the operator T maps to zero for any element, then eT = 0 satisfies the desired condition. Thus, we assume that T outputs a nonzero value for at least one input. From the first item of Proposition 20, Ker(T )⊥ is a closed subset of H and contains a y such that T y = 1. Thus, for an arbitrary x ∈ H , we have T (x − (T x)y) = T x − T x T y = 0 and x − (T x)y ∈ Ker(T ). Since y ∈ Ker(T )⊥ , we have x − (T x)y, y = 0 and x, y = T xy, y = T xy2 . Thus, eT = y/y2 satisfies the desired condition. To demonstrate uniqueness, if eT satisfies the same condition, then x, eT − eT  = 0 for any x ∈ H , which means that eT = eT . Furthermore, since T x = x, eT  ≤ xeT  for x ∈ H , we have that T  ≤ 1 = eT  when x = 1. Additionally, we obtain the inverse inequality eT  = y T y y

≤ T .



Proof of Proposition 25 For the first item, note that if {xn } is bounded, so is {T xn }. Moreover, if the image of T is of finite dimensionality, then {T xn } is also compact (Proposition 7)9 . For the second item, we use the so-called diagonal argument. In the following, we denote the norms of H1 , H2 by  · 1 ,  · 2 . Let {xk } be a bounded sequence in X 1 . From the compactness of T1 , there exists {x1,k } ⊆ {x0,k } := {xk } such that {T1 x1,k } converges to a y1 ∈ H2 as k → ∞. Then, there exists {x2,k } ⊆ {x1,k } such that {T2 x2,k } converges to a y2 ∈ H2 as k → ∞. If we repeat this process, the sequence {yn } in H2 converges. In fact, for each n, there exists a large kn such that Tn xn,k − yn 2
0, either TN  or −TN  is an eigenvalue of T . The existence of an eigenvalue on N contradicts the chosen orthonormal basis {e j }∞ j=1 . Therefore, when x ∈ N , we have T x = 0, which means that N ⊆ Im(T ) ∩ Ker(T ) = {0}. Thus, we have established (2.16). 

Exercises 16∼30 16. Choose the closed sets among the sets below. For the nonclosed sets, find their closures. 1 1 ∪∞ n=1 [n − n , n + n ]; {2, 3, 5, 7, 11, 13, . . .}; R ∩ Z; {(x, y) ∈ R2 | x 2 + y 2 < 1 when x ≥ 0, x 2 + y 2 ≤ 1 when x < 0 }. √ 1 1 converges to 2 as n → ∞. 17. Show that the sequence a1 = 1, an+1 = an + 2 an 18. Let f : M → R be a function defined over a bounded closed set M, and we define (z 1 ), . . . , (z m ) for some m ≥ 1 and z 1 , . . . , z m such that

(a) (b) (c) (d)

d(x, z) < (z)= ⇒d( f (x), f (z)) <  for z ∈ M.

58

2 Hilbert Spaces

(a) Why can the neighborhoods cover M ? 1 min (z i ). Without loss of generality, 2 1≤i≤m we assume that x ∈ Ui with a center at z i and a radius of (z i )/2. Prove the following.

Let x, y ∈ M satisfy d1 (x, y) < δ :=

(a) (b) (c) (d)

d1 (x, z i ) < 21 (z i ) < (z i ). d1 (y, z i ) ≤ d1 (x, y) + d1 (x, z i ) < (z i ). d2 ( f (x), f (y)) ≤ d2 ( f (x), f (z i )) + d2 ( f (y), f (z i )) <  +  = 2. f is uniformly continuous.

19. Using the fact that any continuous function over a bounded closed set is uniformly continuous, show that a continuous function over [0, 1] is a Riemann integral. 20. that the Cauchy-Schwarz inequality (2.5) holds if and only if one of x, y is a constant multiplied by the other. 21. Show that a one-indeterminate polynomial ring A is an algebra. In addition, show that the set of functions f ∈ A over E := [0, 1] is dense in C(E). 22. Derive Riesz-Fischer’s theorem stating that “L 2 is complete” (Proposition 14) according to the following steps in the appendix. (a) Let { f n } be an arbitrary Cauchy sequence. (b) There exists a sequence {n k } such that  ∞ k=1 | f n k+1 − f n k |2 < ∞. (c) Prove the existence of an f : E → R such that μ{x ∈ E| limk→∞ f n k (x) = f (x)} = μ(E). (d) Show that  f n − f  → 0 and f ∈ L 2 [a, b]. 23. Show that the basis of the Fourier series expansion 1 cos x sin x cos 2x sin 2x {√ , √ , √ , √ , √ , · · · } π π π π 2π is orthonormal. 24. Derive Proposition 19 according to the following steps in the appendix. What are the derivations of (a) through (e)? (a) Show that a sequence {yn } in M for which lim x − yn 2 = inf x − y2

n→∞

(b) (c) (d) (e)

y∈M

converges in M. Hereafter, let y satisfy yn → y ∈ M. Show that 2ax − y, z − y ≤ a 2 z − y2 for 0 < a < 1 and z ∈ M. Show that the inequality x − y, z − y > 0 contains a contradiction. Show that x − y, z ≤ 0. Obtain the proposition by replacing z with −z.

25. Show that the linear operator norm (2.12) satisfies the triangle inequality.

Exercises 16∼30

59

26. Show that the integral operator (2.13) is a bounded linear operator and that it is self-adjoint when K is symmetric. 27. Let (M, d) be a metric space with M := R and a Euclidean distance d. Show that each of the following E ⊆ M is not sequentially compact. Furthermore, show that they are not compact without using the equivalence between compactness and sequential compactness. (a) E = [0, 1)and (b) E = Q. 28. Proposition 27 is derived according to the following steps in the appendix. What are the derivations of (a) through (c)? (a) Show that H1 = Ker(T ) ⊕ Im(T ). (b) Show that span{e j | j ≥ 1} ⊆ Im(T ). (c) Show that span{e j | j ≥ 1} ⊇ Im(T ). Why do we need to show (2.25)? 29. Show that the HS and trace norms satisfy the triangle inequality. 30. Show that if T ∈ B(H ) is a trace class, then it is also an HS class, and show that if T ∈ B(H ) is a trace class, it is also compact.

Chapter 3

Reproducing Kernel Hilbert Space

Thus far, we have learned that a feature map $\Phi: E\ni x\mapsto k(x,\cdot)$ is obtained from the positive definite kernel $k: E\times E\to\mathbb{R}$. In this chapter, we generate a linear space $H_0$ based on its image $k(x,\cdot)$ $(x\in E)$ and construct a Hilbert space $H$ by completing this linear space, where $H$ is called the reproducing kernel Hilbert space (RKHS), which satisfies the reproducing property of the kernel $k$ ($k$ is the reproducing kernel of $H$). In this chapter, we first understand that there is a one-to-one correspondence between the kernel $k$ and the RKHS $H$ and that $H_0$ is dense in $H$ (via the Moore-Aronszajn theorem). Furthermore, we introduce the RKHS represented by the sum of RKHSs and apply it to Sobolev spaces. We prove Mercer's theorem regarding integral operators in the second half of this chapter and compute their eigenvalues and eigenfunctions. This chapter is the core of the theory contained in this book, and the later chapters correspond to its applications.

3.1 RKHSs Let H be a Hilbert space whose elements are functions f : E → R. A function k : E × E → R is said to be a reproducing kernel of a Hilbert space H with an inner product ·, · H if it satisfies the following two conditions. 1. For each x ∈ E, we have

k(x, ·) ∈ H.

(3.1)

2. Reproducing property: for each f ∈ H and x ∈ E, f (x) =  f, k(x, ·) H .

(3.2)

When H has a reproducing kernel, we say that H is a reproducing kernel Hilbert space (RKHS). The reproducing property (3.2) is called a kernel trick.


61

62

3 Reproducing Kernel Hilbert Space

Example 51 Let {e1 , . . . , e p } be an orthonormal basis of a finite-dimensional Hilbert space H . If we define k(x, y) :=

p 

ei (x)ei (y)

(3.3)

i=1

for x, y ∈ E, then we have k(x, ·) ∈ H and p  e j (·), k(x, ·) H = e j , ei  H ei (x) = e j (x) i=1

p for each 1 ≤ j ≤ p. Thus, for any f (·) = i=1 f i ei (·) ∈ H , f i ∈ R, we have  f (·), k(x, ·) H = f (x) (reproducing property). Therefore, H is an RKHS, and (3.3) is a reproducing kernel. Proposition 32 The reproducing kernel k of the RKHS H is unique, symmetric k(x, y) = k(y, x), and nonnegative definite. Proof: If k1 , k2 are RKHSs of H , then by the reproducing property, we have that f (x) =  f, k1 (x, ·) H =  f, k2 (x, ·) H . In other words,  f, k1 (x, ·) − k2 (x, ·) H = 0 holds for all f ∈ H , x ∈ E for which k1 = k2 (Proposition 16). Additionally, the symmetry of a reproducing kernel follows from that of its inner product: k(x, y) = k(x, ·), k(y, ·) H = k(y, ·), k(x, ·) H = k(y, x) . The nonnegative definiteness of the reproducing kernel can be shown as follows. n  n 

z i z j k(xi , x j ) =

i=1 j=1

n  n  i=1 j=1

n n   z i z j k(xi , ·), k(x j , ·) H =  z i k(xi , ·), z j k(x j , ·) H ≥ 0 i=1

j=1

. Proposition 33 A Hilbert space H is an RKHS if and only if Tx ( f ) = f (x) ( f ∈ H ) is bounded at each x ∈ E for the linear functional Tx : H  f → f (x) ∈ R. Proof: If H has a reproducing kernel k, then at each x ∈ E, we have  f (·), k(x, ·) H = Tx ( f ) , f ∈ H . Thus, we have

3.1 RKHSs

63

 |Tx ( f )| = | f (·), k(x, ·) H | ≤ f · k(x, ·) = f k(x, x) . Conversely, if the linear functional Tx ( f ) = f (x) is bounded for x ∈ E, from Proposition 22, there exists a k x : E → R such that  f (·), k x (·) H = f (x) , f ∈ H . In other words, a reproducing kernel exists.  In Proposition 32, we showed that a reproducing kernel is unique once its RKHS is determined, but the following proposition asserts the converse. Proposition 34 (Aronszajn [1]) Let k : E × E → R be a positive definite kernel. Then, the Hilbert space H with the reproducing kernel k is unique. Moreover, for k(x, ·) ∈ H , x ∈ E holds, and the generated linear space is dense in H . The proof is given by the following procedure. 1. Define the inner product ·, · H0 of H0 := span{k(x, ·)|x ∈ E}. 2. For any Cauchy sequence { f n } in H0 and each x ∈ E, the real sequence { f n (x)} is a Cauchy sequence, and we have the convergence value f (x) := lim f n (x) n→∞ (Proposition 6). Let H be such a set of f . 3. Define the inner product ·, · H of the linear space H . 4. Show that H0 is dense in H . 5. Show that any Cauchy sequence { f n } in H converges to some element of H as n → ∞ (completeness of H ). 6. Show that k is a reproducing kernel of H . 7. Show that such an H is unique. See the appendix at the end of the chapter for details1 .



Example 52 (Linear Kernel) Let ·, · E be the inner product of E := Rd . Then, the linear space H := {x, · E |x ∈ E} is complete since it has finite dimensions (Proposition 6). Moreover, H is an RKHS with the reproducing kernel k(x, y) = x, y E , x, y ∈ E. Example 53 Let E be a finite set {x1 , . . . , xn }, and let k : E × E → R be a positive definite kernel; then, the linear space H := {

n 

αi k(xi , ·)|α1 , . . . , αn ∈ R}

i=1

is a reproducing kernel Hilbert space. We define the inner product by 1

The proof is due to [33].

64

3 Reproducing Kernel Hilbert Space

 f (·), g(·) H = a K b  for f (·), g(·) ∈ H , where f (·) = nj=1 a j k(x j , ·) ∈ H , a = [a1 , . . . , an ] ∈ Rn and  g(·) = nj=1 b j k(x j , ·) ∈ H , b = [b1 , . . . , bn ] ∈ Rn via the Gram matrix ⎡

k(x1 , x1 ) ⎢ .. K := ⎣. . k(xn , x1 )

⎤ · · · k(x1 , xn ) ⎥ .. .. ⎦ . . . · · · k(xn , xn )

Then, for each xi , i = 1, 2, . . ., we have  f (·), k(xi , ·) H = [a1 , . . . , an ]K ei =

n 

a j k(x j , xi ) = f (xi )

j=1

(reproducing property), where ei is an n-dimensional column vector in which we set component i and the other components to 1 and 0, respectively. Example 54 (Polynomial Kernel) Let ·, · E be the inner product between the elements in E. The Hilbert space H obtained by completing the linear space H0 generated by (x, · E + 1)d ∈ R (x ∈ E) is an RKHS with the reproducing kernel k(x, y) = (x, y E + 1)d for x, y ∈ E. Example 55 Let k(x, y) be the kernel expressed by a function φ(x − y) as considered in Sect. 1.5. If we require k(x, y) to take real values, the associated probability density functions must be even functions such as those of the Gaussian and Laplace distributions. Otherwise, since the imaginary part of t → ei(x−y)t is odd, the kernel k might take imaginary values. Now, using L 2 (E, η)  F : E = R → C whose real and imaginary parts are even and odd, we consider the linear respectively, i xt F(t)e dη(t). The function F(t) → space consisting of f : E → R with f (x) = E f (x) = E F(t)ei xt dη(t) is injective (if E F(t)ei xt dη(t) = 0, then the inverse Fourier transform F(t) = 0). If its inner product is  f, g H = E F(t)G(t)dη(t) for F, G ∈ L 2 (E, η), then L 2 (E, η) and F(t)ei xt dη(t) ∈ R|F ∈ L 2 (E, η)} H = {E  x → E

are isomorphic as an inner product space. Note that H has a reproducing kernel E × E → R with e−i(x−y)t dη(t) . k(x, y) = E

In fact, we have k(x, y) =

E

e−i xt eiyt dη(t). Thus, if we set G(t) = e−i xt , we obtain





 f (·), k(x, ·) H =

F(t)G(t)dη(t) = E

F(t)ei xt dη(t) = f (x) E

3.1 RKHSs

65





for f (y) =

F(t)eiyt dη(t) and k(x, y) = E

G(t)eiyt dη(t). For different kernels E

k(x, y), such as the Gaussian and Laplacian kernels, the measure η(t) will be different, and the corresponding RKHS H will be different. 1 Example 56 Let E := [0, 1]. Using the real-valued function F with 0 F(u)2 du < 1 ∞, we consider the set H of functions f : E → R, f (t) = 0 F(u)(t − u)0+ du, where we denote (z)0+ = 1 and (z)0+ = 0 when z ≥ 0 and when z < 0, respectively. 1 The linear space H is complete for the norm f 2 = 0 F(u)2 du (Proposition 14) 1 1 if the inner product is  f, g H = 0 F(u)G(u)du for f (t) = 0 F(u)(t − u)0+ du 1 and g(t) = 0 G(u)(t − u)0+ du. This Hilbert space H is the RKHS for k(x, y) = min{x, y}. In fact, for each z ∈ E, we see that  f (z), k(x, z) H =

1 0

F(u)(z − u)0+ du,

1 0

(x − u)0+ (z − u)0+ du H =

1 0

F(u)(x − u)0+ du = f (x).

Thus far, we have obtained the RKHS corresponding to each positive definite kernel, but a necessary condition exists for a Hilbert space H to be an RKHS. If that condition is not satisfied, we can claim that it is not an RKHS. Proposition 35 Let H be an RKHS consisting of functions on E. If limn→∞ | f n − f H = 0 f, f 1 , f 2 , . . . ∈ H , then for each x ∈ E, limn→∞ | f n (x) − f (x)| = 0 holds. Proof: In fact, we have that for each x ∈ E,  | f n (x) − f (x)| ≤ f n − f k(x, x) .  Example 57 H := L 2 [0, 1] is not an RKHS. In fact, for a sequence { f n } with 1 1 f n (x) = x n , the norm converges to f n 2H = 0 f n2 (x)d x = 2n+1 → 0. However, for f (x) = 0 with x ∈ E, we have f n − f H → 0, and | f n (1) − f (1)| = 1  0. This contradicts the fact that H is an RKHS (Proposition 35). Example 57 illustrates that L 2 [0, 1] is too large, and as we will see in the next section, the Sobolev space restricted to L 2 [0, 1] is an RKHS.

3.2 Sobolev Space We first show that if k1 , k2 are reproducing kernels, the sum k1 + k2 is also a reproducing kernel. To this end, we show the following. Proposition 36 If H1 , H2 are Hilbert spaces, so is the direct product F := H1 × H2 under the inner product

66

3 Reproducing Kernel Hilbert Space

( f 1 , f 2 ), (g1 , g2 ) F :=  f 1 , g1  H1 +  f 2 , g2  H2

(3.4)

for f 1 , g1 ∈ H1 , f 2 , g2 ∈ H2 . Proof: From ( f 1 , f 2 ) 2F = f 1 2H1 + f 2 H2 , we have

f 1,n − f 1,m H1 , f 2,n − f 2,m H2

≤ f 1,n − f 1,m 2H1 + f 2,n − f 2,n 2H2 = ( f 1,n , f 2,n ) − ( f 1,m , f 2,m ) F . Thus, we have = ⇒ = ⇒ = ⇒

{( f 1,n , f 2,n )} is Cauchy { f 1,n }, { f 2,n } is Cauchy f 1 ∈ H1 , f 2 ∈ H2 exists such that f 1,n → f 1 , f 2,n → f 2

(  f 1,n , f 2,n ) − ( f 1 , f 2 ) F = ( f 1,n − f 1 , f 2,n − f 2 ) F = f 1,n − f 1 2 + f 2,n − f 2 2 → 0 ,

which means that F is complete. Let H := H1 + H2 := { f 1 + f 2 | f 1 ∈ H1 , f 2 ∈ H2 }



be the direct sum of H1 , H2 , and define the linear map from F to H by u : F  ( f 1 , f 2 ) → f 1 + f 2 ∈ H . Then, we can decompose F into N := u −1 (0) and its orthogonal complement N ⊥ . If we restrict u to N ⊥ to obtain the injection v : N ⊥ → H , then the bivariate function  f, g H := v −1 ( f ), v −1 (g) F

(3.5)

for f, g ∈ H forms an inner product. Note that N ⊥ is a closed subspace of the Hilbert space F. Proposition 37 If the direct sum H of Hilbert spaces H1 , H2 has the inner product (3.5), then H is complete (a Hilbert space). Proof: Since F is a Hilbert space (Proposition 36) and N ⊥ is its closed subset, N ⊥ is complete. Thus, we have = ⇒ = ⇒

f n − f m H → 0= ⇒ v −1 ( f n − f m ) F → 0 g ∈ F exists such that v −1 ( f n ) − g F → 0

f n − v(g) H → 0, v(g) ∈ H .



Proposition 38 (Aronszajn [1]) Let k1 , k2 be the reproducing kernels of RKHSs H1 , H2 , respectively. Then, k = k1 + k2 is the reproducing kernel of the Hilbert space H := H1 ⊕ H2 := { f 1 + f 2 | f 1 ∈ H1 , f 2 ∈ H2 } such that the inner product is (3.5) and the norm is

3.2 Sobolev Space

67

f 2H =

min

f = f 1 + f 2 , f 1 ∈H1 , f 2 ∈H2

{ f 1 2H1 + f 2 2H2 }

(3.6)

for f ∈ H . The proof proceeds as follows. 1. Let f ∈ H and N ⊥  ( f 1 , f 2 ) := v −1 ( f ). We define k(x, ·):=k1 (x, ·)+k2 (x, ·) and (h 1 (x, ·), h 2 (x, ·)) := v −1 (k(x, ·)), and we show that  f 1 , h 1 (x, ·)1 +  f 2 , h 2 (x, ·)2 =  f 1 , k1 (x, ·)1 +  f 2 , k2 (x, ·)2 . 2. Using the above, we present the reproducing property  f, k(x, ·) H = f (x) of k. 3. We show that the norm of H is (3.6). For details, see the Appendix at the end of this chapter.  In the following, we construct the Sobolev space as an example of an RKHS and obtain its kernel. Let W1 [0, 1] be the set of f ’s defined over [0, 1] such that f is differentiable almost everywhere and f  ∈ L 2 [0, 1]. Then, we can write each f ∈ W1 [0, 1] as

x

f (x) = f (0) +

f  (y)dy .

(3.7)

0

Similarly, let Wq [0, 1] be the set of f ’s defined over [0, 1] such that f is differentiable q − 1 times and q times almost everywhere and f (q) ∈ L 2 [0, 1]. If we define φi (x) :=

xi , i = 0, 1, . . . i!

and

q−1

G q (x, y) :=

(x − y)+ , (q − 1)!

then we can Taylor-expand each f ∈ Wq [0, 1] as follows. f (x) =

q−1 

f (i) (0)φi (x) +



1

G q (x, y) f (q) (y)dy.

(3.8)

0

i=0

In fact, we have the partial integral

1

G q (x, y) f

(q)



(y)dy = G q (x, y) f

0

x q−1 =− (q − 1)!

(q−1)

(y)

1



− 0 f (q−1) (0) +

0 1 0

1

{

d G q (x, y)} f (q−1) (y)dy dy

G q−1 (x, y) f (q−1) (y)dy

68

3 Reproducing Kernel Hilbert Space

and obtain (3.7) by repeatedly applying this integral to the right-hand side of (3.8). For the transformation, we use

1

G q (x, y)h(y)dy =

0

=

 q−1

1 (q − 1)!

i=0

q −1 i



0

1



q−1

(x − y)+ h(y)dy (q − 1)! x

xi

(−y)q−1−i h(y)dy .

0

and the differentiation

1 0

 x q−2   d 1 q −2 {G q (x, y)h(y)}dy = {−x i (−y)q−2−i h(y)dy} i dy (q − 2)! i=0 0 1 =− G q−1 (x, y)h(y)dy . 0

Hereafter, we write each element of Wq [0, 1] as q−1 

αi φi (x) +

1

G q (x, y)h(y)dy

(3.9)

0

i=0

α0 = f (0), . . . , αq−1 = f (q−1) (0) ∈ R, h ∈ L 2 [0, 1]. Although more than one Hilbert space Wq [0, 1] exists with different definitions of inner products, we consider the Hilbert space H that can be written as the direct sum of H0 and H1 , which is defined below. Let H0 := span{φ0 , . . . , φq−1 }, and define its inner product by  f, g H0 =

q−1 

f (i) (0)g (i) (0)

i=0

for f, g ∈ H0 . We find that the inner product ·, · H0 satisfies the requirement of inner products and that {φ0 , . . . , φq−1 } is an orthonormal basis. Since the inner product space H0 is of finite dimensionality, it is apparently a Hilbert space. We define another inner product space H1 as H1 := {

1

G q (x, y)h(y)dy|h ∈ L 2 [0, 1]} .

0

Since h ∈ L 2 [0, 1], if we define the inner product as

3.2 Sobolev Space

69



1

 f, g H1 =

f (q) (y)g (q) (y)dy

0

for f, g ∈ H , then we have

f m − f n H1 → 0 ⇐⇒ f m(q) − f n(q) L 2 [0,1] → 0, and there exists an f ∈ H1 such that

f n − f H1 → 0 ⇐⇒ f n(q) − f (q) L 2 [0,1] → 0 . ⇒ f n − f H1 → 0 (completeFrom Proposition 14, we have f m − f n H1 → 0= ness), and H1 is a Hilbert space. Moreover, from f (x) =

q−1 

αi φi (x) ∈ H1 = ⇒h = f (q) = 0

i=0

and

1

f (x) =

G q (x, y)h(y)dy ∈ H0 = ⇒α0 = f (0) = 0, . . . , αq−1 = f (q−1) (0) = 0 ,

0

we have that H0 ∩ H1 = {0}. From Proposition 38, for f = f 0 + f 1 , g = g0 + g1 , f 0 , g0 ∈ H0 , and f 1 , g1 ∈ H1 , the inner product is  f, gWq [0,1] =  f 0 + f 1 , g0 + g1 Wq [0,1] =  f 0 , g0  H0 +  f 1 , g1  H1 . The reproducing kernels of H0 , H1 are respectively k0 (x, y) :=

q−1 

φi (x)φi (y)

i=0

and



1

k1 (x, y) :=

G q (x, z)G q (y, z)dz ,

0

where k0 is derived from Example 3.2, and k1 is derived from 1 1 G q (·, z)h(z)dz, G q (x, z)G q (·, z)dz H1  f (·), k1 (x, ·) H1 =  0 0 1 = G q (x, z)h(z)dz = f (x) 0

70

3 Reproducing Kernel Hilbert Space

for arbitrary f (·) =

1

G q (·, z)h(z)dz ∈ H and x ∈ E (the uniqueness is due to

0

Proposition 32). Furthermore, we can construct Wq [0, 1] such that its kernel is k(x, y) = k0 (x, y) + k1 (x, y) for x, y ∈ E.

3.3 Mercer’s Theorem Let (E, F, μ) be a measure space. We assume that the integral operator kernel K : E × E → R is a measurable function and is not necessarily nonnegative definite. 2 Suppose that E×E K (x, y)dμ(x)dμ(y) takes finite values. Then, we define the integral operator TK by (TK f )(·) :=

K (x, ·) f (x)dμ(x)

(3.10)

E

for f ∈ L 2 (E, B, μ). Since

{(TK f )(x)}2 dμ(x) ≤ {K (x, y)}2 dμ(x)dμ(y) { f (z)}2 dμ(z) E E×E E 2 2 = f

{K (x, y)} dμ(x)dμ(y) ,

TK f 2 =

E×E

we have TK ∈ B(L 2 (E, B, μ)) and 

1/2

TK ≤

K (x, y)dμ(x)dμ(y) 2

.

E×E

In the following, we assume that K : E × E → R is continuous and that the entire set E is compact (such as E = [0, 1]). Thus, we assume that the integral operator kernel K is uniformly continuous (Proposition 8). Lemma 1 For each f ∈ L 2 (E, F, μ), TK f (·) is uniformly continuous. Proof: Since E × E→R is uniformly continuous, we achieve |K (x, y)−K (x, z)| <  by making |y − z| smaller for arbitrary x ∈ E and  > 0. Thus, we have      K (x, y) f (x)dμ(x) −  ≤  f . K (x, z) f (x)dμ(x)   E

E



3.3 Mercer’s Theorem

71

Proposition 39 TK is a compact operator. Proof: By Proposition 12, for an arbitrary  > 0, there exist n() ≥ 1 and an Rn() coefficient bivariate polynomial K n() (x, y) := i=1 gi (x)y i whose order of y is at most n() such that sup |K (x, y) − K n() (x, y)| <  , x,y∈E

where g1 , . . . , gn() are R-coefficient univariable polynomials. If we abbreviate n() as n and write the integral operator corresponding to K n as TK n , then we may regard n  i TK n f (·) = y f (x)gi (x)dμ(x) E

i=0

as



TK n f : H  f  → [

f (x)g0 (x)dμ(x), . . . , E

f (x)gn (x)dμ(x)] ∈ Rn+1 . E

Since the rank of TK n is finite, from the first item of Proposition 25, TK n is a compact operator. Moreover, since

(TK n − TK ) f 2 =

E

(

E

[K n (x, y) − K (x, y)] f (y)dμ(y))2 dμ(x) ≤  2 f 2 μ2 (E) ,

from Proposition 25, TK is a compact operator.  In the following, we assume that K is symmetric. Then, from Example 45, TK is self-adjoint. Thus, from Proposition 39, we have that TK x =

∞ 

λ j e j , xe j

j=1

using {λ j } and {e j } that satisfy Proposition 27. Moreover, Lemma 1 implies the following: Lemma 2 e j (y) =

λ−1 j

K (x, y)e j (x)dμ(x) E

is uniformly continuous w.r.t. y. Example 58 (Brown Motion) We obtain the eigenvalues and eigenfunctions {(λ j , e j )} when the integral operator kernel in L 2 [0, 1] is K (x, y) = min{x, y}, x, y ∈ E = [0, 1], (the subspace H1 of the Sobolev space W1 [0, 1]). Since TK f (x) = 0

1



x

K (x, y) f (y)dy = 0



1

y f (y)dy + x x

f (y)dy ,

72

3 Reproducing Kernel Hilbert Space

the eigen equation is
$$\int_0^1 \min(x,y)\, e(y)\, dy = \lambda e(x), \qquad (3.11)$$
i.e.,
$$\int_0^x y\, e(y)\, dy + x \int_x^1 e(y)\, dy = \lambda e(x).$$
If we differentiate the both sides by $x$, we obtain
$$x e(x) + \int_x^1 e(y)\, dy - x e(x) = \lambda e'(x),$$
i.e.,
$$\int_x^1 e(y)\, dy = \lambda e'(x). \qquad (3.12)$$
If we further differentiate both sides by $x$, then we obtain $e(x) = -\lambda e''(x)$ and
$$e(y) = \alpha \sin(y/\sqrt{\lambda}) + \beta \cos(y/\sqrt{\lambda}).$$
If we substitute $x = 0$ into (3.11), then we have $e(0) = 0$, which is equivalent to $\beta = 0$. From (3.12), we have $e'(1) = 0$, i.e., $\alpha \cos(1/\sqrt{\lambda}) = 0$. Thus, we obtain
$$1/\sqrt{\lambda} = (2j-1)\pi/2, \quad j = 1, 2, \ldots.$$
Therefore, the eigenvalues are
$$\lambda_j = \frac{4}{\{(2j-1)\pi\}^2}, \qquad (3.13)$$
and the orthonormal eigenfunctions are
$$e_j(x) = \sqrt{2}\, \sin\left(\frac{(2j-1)\pi x}{2}\right),$$
where to derive $\alpha = \sqrt{2}$, we use
$$\int_0^1 \sin^2\!\left(\frac{y}{\sqrt{\lambda}}\right) dy = \int_0^1 \frac{1 - \cos(\frac{2y}{\sqrt{\lambda}})}{2}\, dy = \frac{1}{2} - \frac{1}{2}\left[\frac{\sqrt{\lambda}}{2}\sin\frac{2y}{\sqrt{\lambda}}\right]_0^1 = \frac{1}{2}.$$
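These eigenpairs can be verified numerically. The sketch below (not from the text) discretizes $K(x,y) = \min(x,y)$ on a uniform grid, so that the eigenvalues of the scaled Gram matrix approximate (3.13).

# Minimal numerical check (assumption: a uniform grid on [0, 1] approximates the integral operator).
import numpy as np

n = 1000
x = (np.arange(n) + 0.5) / n                  # grid points in (0, 1)
K = np.minimum.outer(x, x)                    # K(x, y) = min(x, y)
vals = np.sort(np.linalg.eigvalsh(K / n))[::-1]
theory = 4 / ((2 * np.arange(1, 6) - 1) * np.pi) ** 2
print(vals[:5])                               # approximately equal to ...
print(theory)                                 # ... 4 / ((2j - 1) pi)^2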

Example 59 (Zhu et al. [36]) For a Gaussian kernel,
$$K(x,y) = \exp\left(\frac{-(x-y)^2}{2\sigma^2}\right), \qquad (3.14)$$


if we regard the finite measure $\mu$ in (3.10) of the integral operator kernel as a Gaussian distribution with a mean of 0 and a variance of $\hat\sigma^2$, then the eigenvalues and eigenfunctions are
$$\lambda_j = \sqrt{\frac{2a}{A}}\, B^j$$
and
$$e_j(x) = \exp(-(c-a)x^2)\, H_j(\sqrt{2c}\, x),$$
where $H_j$ is a Hermite polynomial of order $j$:
$$H_j(x) := (-1)^j \exp(x^2)\, \frac{d^j}{dx^j}\exp(-x^2),$$
$a^{-1} := 4\hat\sigma^2$, $b^{-1} := 2\sigma^2$, $c := \sqrt{a^2 + 2ab}$, $A := a + b + c$, and $B := b/A$. The proof is not difficult but rather monotonous and long. See the Appendix at the end of this chapter for details. Note that for a Gaussian kernel with a parameter $\sigma^2$, if the measure is also a Gaussian distribution with a mean of 0 and a variance of $\hat\sigma^2$, we can compute the eigenvalues from $\beta := \dfrac{\hat\sigma^2}{\sigma^2} = \dfrac{b}{2a}$:
$$\sqrt{\frac{2a}{A}}\, B^j = \sqrt{\frac{2a}{a+b+\sqrt{a^2+2ab}}}\left(\frac{b}{a+b+\sqrt{a^2+2ab}}\right)^j = \left[1/2 + \beta + \sqrt{1/4+\beta}\right]^{-1/2}\left(\frac{\beta}{1/2+\beta+\sqrt{1/4+\beta}}\right)^j,$$
which forms a geometric sequence. For example, if $\sigma^2 = \hat\sigma^2 = 1$, then the eigenvalue is
$$\lambda_j = \left(\frac{3-\sqrt{5}}{2}\right)^{j+1/2}.$$
The Hermite polynomials are $H_1(x) = 2x$, $H_2(x) = -2 + 4x^2$, and $H_3(x) = -12x + 8x^3$ ($H_0(x) = 1$, $H_j(x) = 2xH_{j-1}(x) - H_{j-1}'(x)$), and the other quantities are
$$c = \sqrt{a^2 + 2ab} = (4\hat\sigma^2)^{-1}\sqrt{1 + 4\hat\sigma^2/\sigma^2} = \frac{\sqrt{5}}{4}, \quad a = (4\hat\sigma^2)^{-1} = \frac{1}{4}.$$

We show the eigenfunction $\phi_j$ for $j = 1, 2, 3$ in Fig. 3.1. The code is as follows.

# In this chapter, we assume that the following has been executed.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
style.use("seaborn-ticks")


def Hermite(j):
    if j == 0:
        return [1]
    a = [0] * (j + 2)
    b = [0] * (j + 2)
    a[0] = 1
    for i in range(1, j + 1):
        b[0] = -a[1]
        for k in range(i + 1):
            b[k] = 2 * a[k - 1] - (k + 1) * a[k + 1]
        for h in range(j + 2):
            a[h] = b[h]
    return b[:(j + 1)]

Hermite(1)  # 1st-order Hermite polynomial
[0, 2]

Hermite(2)  # 2nd-order Hermite polynomial
[-2, 0, 4]

Hermite(3)  # 3rd-order Hermite polynomial
[0, -12, 0, 8]

def H(j, x):
    coef = Hermite(j)
    S = 0
    for i in range(j + 1):
        S = S + np.array(coef[i]) * (x ** i)
    return S

cc = np.sqrt(5) / 4
a = 1 / 4

def phi(j, x):
    return np.exp(-(cc - a) * x ** 2) * H(j, np.sqrt(2 * cc) * x)

color = ["b", "g", "r", "k"]
p = [[] for _ in range(4)]
x = np.linspace(-2, 2, 100)
for i in range(4):
    for k in x:
        p[i].append(phi(i, k))
    plt.plot(x, p[i], c=color[i], label="j=%d" % i)


plt.ylim(-2, 8)
plt.ylabel("phi")
plt.title("Characteristic function of Gauss Kernel")
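The eigenvalues that pair with these eigenfunctions can be checked as well. The short sketch below is not part of the book's listing; it only evaluates the closed form derived above under the assumption $\sigma^2 = \hat\sigma^2 = 1$.

# Sketch (not in the original text): for sigma^2 = sigma_hat^2 = 1 the eigenvalues
# form a geometric sequence lambda_j = ((3 - sqrt(5)) / 2)^(j + 1/2).
import numpy as np

ratio = (3 - np.sqrt(5)) / 2
lam = [ratio ** (j + 0.5) for j in range(5)]
print(lam)                                        # about 0.62, 0.24, 0.091, 0.035, 0.013
print([lam[j + 1] / lam[j] for j in range(4)])    # constant ratio (3 - sqrt(5)) / 2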

Fig. 3.1 The eigenfunctions for the Gaussian kernel and Gaussian distribution, where $\sigma^2 = \hat\sigma^2 = 1$. For even and odd $j$, the eigenfunctions are even and odd functions, respectively.

In this section, we prove Mercer’s theorem for integral operators and illustrate some examples. Hereafter, we assume that $K$ and $T_K$ are nonnegative definite.

Proposition 40 An integral operator $T_K$ is nonnegative definite if and only if $K: E \times E \to \mathbb{R}$ is nonnegative definite, i.e., $K$ is a positive definite kernel.

Proof: See the Appendix at the end of this chapter.

Proposition 41 (Mercer [21]) Let $K: E \times E \to \mathbb{R}$ be a continuous positive definite kernel and $T_K$ be the corresponding integral operator. Let $\{(\lambda_j, e_j)\}_{j=1}^{\infty}$ be the sequence of eigenvalues and eigenvectors of $T_K$. Then, we can write
$$K(x,y) = \sum_{j=1}^{\infty} \lambda_j e_j(x) e_j(y),$$
and this sum absolutely and uniformly converges. By absolute convergence, we mean that the sum of the absolute values converges, and by uniform convergence, we mean that the upper bound of the error that does not depend on $x, y \in E$ converges to zero.

Proof: Note that $K_n(x,y) := K(x,y) - \sum_{j=1}^{n} \lambda_j e_j(x) e_j(y)$ is continuous and that the integral operator $T_{K_n}$ is nonnegative definite. In fact, for each $f \in L^2(E, \mathcal{F}, \mu)$, we have
$$\langle T_{K_n} f, f\rangle = \langle T_K f, f\rangle - \sum_{j=1}^{n} \lambda_j \langle f, e_j\rangle^2 = \sum_{j=n+1}^{\infty} \lambda_j \langle f, e_j\rangle^2 \ge 0.$$


Thus, from Proposition 40, $K_n$ is nonnegative definite, and $K_n(x,x) \ge 0$. Thus, for all $x \in E$, we have
$$\sum_{j=1}^{\infty} \lambda_j e_j^2(x) \le K(x,x). \qquad (3.15)$$
Moreover, for any set $J$ consisting of positive numbers, we have
$$\sum_{j\in J} |\lambda_j e_j(x) e_j(y)| \le \left(\sum_{j\in J} \lambda_j e_j^2(x)\right)^{1/2}\left(\sum_{j\in J} \lambda_j e_j^2(y)\right)^{1/2}, \qquad (3.16)$$
which means that from (3.15),
$$\sum_{j\in J} |\lambda_j e_j(x) e_j(y)| \le \{K(x,x) K(y,y)\}^{1/2}$$
for $x, y \in E$. From (3.16), we have
$$\sum_{j=n+1}^{\infty} |\lambda_j e_j(x) e_j(y)| \le \left(\sum_{j=n+1}^{\infty} \lambda_j e_j^2(x)\right)^{1/2}\left(\sum_{j=n+1}^{\infty} \lambda_j e_j^2(y)\right)^{1/2}$$
and the right-hand side monotonically converges to 0 as $n$ grows. Since $E$ is compact, the left-hand side uniformly converges according to the lemma below.

Lemma 3 (Dini) Let $E$ be a compact set. For a continuous function $f_n: E \to \mathbb{R}$, if $f_n(x)$ monotonically converges to $f(x)$ for a continuous $f$ and each $x \in E$, then the convergence is uniform.

Proof: See the Appendix at the end of this chapter.

Thus, for an arbitrary $\epsilon > 0$, there exists an $n$ such that
$$\sup_{x,y\in E} \sum_{j=n+1}^{\infty} |\lambda_j e_j(x) e_j(y)| < \epsilon, \qquad (3.17)$$
and this sum absolutely and uniformly converges. $\square$

Example 60 (The Kernel Expressed by the Difference Between Two Variables) Let $E = [-1,1]$. An integral operator for which $K: E \times E \to \mathbb{R}$ can be expressed by $K(x,z) = \phi(x-z)$ ($\phi: E \to \mathbb{R}$) is $T_K f(x) = \int_E \phi(x-y) f(y)\, dy$, which can be expressed by $(\phi * f)(x)$ using convolution: $(g*h)(u) = \int_E g(u-v)h(v)\, dv$. Hereafter, we assume that the cycle of $\phi$ is two, i.e., $\phi(x) = \phi(x + 2\mathbb{Z})$. In this case, $e_j(x) = \cos(\pi j x)$ is the eigenfunction of $T_K$. In fact, since $\phi$ is an even function and is cyclic, we have
$$T_K e_j(x) = \int_E \phi(x-y)\cos(\pi j y)\, dy = \int_{-1-x}^{1-x} \phi(-u)\cos(\pi j(x+u))\, du = \int_E \phi(u)\cos(\pi j(x+u))\, du$$
and
$$T_K e_j(x) = \left\{\int_E \phi(u)\cos(\pi j u)\, du\right\}\cos(\pi j x) - \left\{\int_E \phi(u)\sin(\pi j u)\, du\right\}\sin(\pi j x) = \lambda_j \cos(\pi j x)$$
from the addition theorem $\cos(\pi j(x+u)) = \cos(\pi j x)\cos(\pi j u) - \sin(\pi j x)\sin(\pi j u)$, where $\lambda_j = \int_E \phi(u)\cos(\pi j u)\, du$. Similarly, $\sin(\pi j x)$ is an eigenfunction, and $\lambda_j$ is the corresponding eigenvalue. Thus, from Mercer’s theorem, we have
$$K(x,y) = \sum_{j=0}^{\infty} \lambda_j \{\cos(\pi j x)\cos(\pi j y) + \sin(\pi j x)\sin(\pi j y)\} = \sum_{j=0}^{\infty} \lambda_j \cos\{\pi j(x-y)\}.$$
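Numerically, the $\lambda_j$ above are simply the Fourier cosine coefficients of $\phi$. The following sketch (not from the text) chooses $\phi(u) = \exp(\cos(\pi u))$, which is even and 2-periodic, and reconstructs $\phi(x-y)$ from the coefficients, using the usual $1/2$ factor on the $j = 0$ term.

# Sketch: lambda_j as Fourier cosine coefficients of a 2-periodic even phi.
import numpy as np
from scipy import integrate

phi = lambda u: np.exp(np.cos(np.pi * u))          # even and 2-periodic
lam = [integrate.quad(lambda u: phi(u) * np.cos(np.pi * j * u), -1, 1)[0]
       for j in range(20)]
x, y = 0.3, -0.8
approx = lam[0] / 2 + sum(lam[j] * np.cos(np.pi * j * (x - y)) for j in range(1, 20))
print(phi(x - y), approx)                          # nearly identical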

Example 61 (Polynomial Kernel) For the polynomial kernel in Example 8, let $m = 2$, $d = 1$. We compute the eigenfunction of $K(x,y) = (1+xy)^2$ over $x, y \in E = [-1,1]$ by setting $e(x) := a_0 + a_1 x + a_2 x^2$. By comparing
$$\int_E K(x,y) e(y)\, dy = \int_E (1+xy)^2 e(y)\, dy = \int_E e(y)\, dy + \left\{2\int_E y\, e(y)\, dy\right\} x + \left\{\int_E y^2 e(y)\, dy\right\} x^2$$
with $\lambda e(x)$, we obtain
$$\begin{cases} \displaystyle \int_E (a_0 + a_1 y + a_2 y^2)\, dy = \lambda a_0 \\[2mm] \displaystyle 2\int_E y(a_0 + a_1 y + a_2 y^2)\, dy = \lambda a_1 \\[2mm] \displaystyle \int_E y^2(a_0 + a_1 y + a_2 y^2)\, dy = \lambda a_2. \end{cases}$$
We solve the eigenequation w.r.t. the following matrix:
$$\begin{bmatrix} \int_E dy & \int_E y\, dy & \int_E y^2 dy \\ 2\int_E y\, dy & 2\int_E y^2 dy & 2\int_E y^3 dy \\ \int_E y^2 dy & \int_E y^3 dy & \int_E y^4 dy \end{bmatrix}\begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix} = \lambda \begin{bmatrix} a_0 \\ a_1 \\ a_2 \end{bmatrix}.$$
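Since $E = [-1,1]$, the moments are $\int_E dy = 2$, $\int_E y\,dy = \int_E y^3 dy = 0$, $\int_E y^2 dy = 2/3$, and $\int_E y^4 dy = 2/5$, and the eigenequation can be solved numerically. A short sketch (not from the text):

# Sketch (assumption: E = [-1, 1], so the moments are 2, 0, 2/3, 0, 2/5).
import numpy as np

M = np.array([[2,     0,     2 / 3],
              [0,     4 / 3, 0    ],
              [2 / 3, 0,     2 / 5]])
lam, vecs = np.linalg.eig(M)
print(lam)      # the three eigenvalues lambda
print(vecs)     # columns give (a0, a1, a2) up to scaling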

Now, we consider the general method for approximately obtaining eigenvalues and eigenvectors in Mercer’s theorem. Let $X$ be a random variable in $E$. Then, for the integral operator $T_K \in B(L^2)$ defined by
$$T_K: L^2 \ni \phi \mapsto \int_E K(\cdot,x)\,\phi(x)\, d\mu(x) \in L^2,$$
there exist $\lambda_1 \ge \lambda_2 \ge \ldots$ and $\phi_1, \phi_2, \ldots \in L^2$ such that $T_K \phi_j = \lambda_j \phi_j$ and
$$\int_E \phi_j \phi_k\, d\mu = \delta_{j,k}.$$
We say that the probability $\mu$ has generated $x_1, \ldots, x_m \in E$ with $m \ge 1$, and we approximate the eigenequation as
$$\frac{1}{m}\sum_{j=1}^{m} K(x_j, y)\,\phi_i(x_j) = \lambda_i \phi_i(y), \quad y \in E, \qquad (3.18)$$
$i = 1, 2, \ldots$. Since we have
$$\frac{1}{m}\sum_{i=1}^{m} \phi_j(x_i)\phi_k(x_i) = \delta_{j,k},$$
if we substitute $x_1, \ldots, x_m$ into $y$ in (3.18), we find that there exists an orthogonal matrix $U \in \mathbb{R}^{m\times m}$ such that $K_m U = U\Lambda$, where $K_m \in \mathbb{R}^{m\times m}$ is the Gram matrix and $\Lambda$ is the diagonal matrix with the elements $\lambda_1^{(m)} = m\lambda_1, \ldots, \lambda_m^{(m)} = m\lambda_m$. If we substitute $\phi_i(x_j) = \sqrt{m}\, U_{j,i}$, $\lambda_i = \lambda_i^{(m)}/m$ into (3.18), we obtain
$$\phi_i(\cdot) = \frac{\sqrt{m}}{\lambda_i^{(m)}}\sum_{j=1}^{m} K(x_j, \cdot)\, U_{j,i}. \qquad (3.19)$$
We require that the distribution of $x_1, \ldots, x_m \in E$ coincide with the measure $\mu$ of the integral operator. It is known that if we make $m$ larger in $\lambda_i^{(m)}/m$, the term converges to the eigenvalue $\lambda_i$. For the proof and the convergence process, consult Baker (Theorem 3.4 [3]). We write the procedure in Python as below.

Example 62 We obtain the eigenvalues and eigenfunctions by using the following program with a Gaussian kernel, where the measure required for the definition of the integral kernel should be the same as the measure used when providing random numbers. Even with the same Gaussian kernel, if $x_1, \ldots, x_N$ follows a different distribution, we obtain different eigenvalues and eigenfunctions. We compare the cases in which $N = 300$ and $N = 1000$ to find that the eigenvalues and eigenfunctions coincide (Figs. 3.2 and 3.3).



Fig. 3.2 The eigenvalues obtained in Example 62. We compare the cases involving m = 1000 samples and the first m = 300 samples. The largest eigenvalues for both cases coincide

# Kernel Definition
sigma = 1
def k(x, y):
    return np.exp(-(x - y)**2 / sigma**2)

# Generate Samples and Define the Gram Matrix
m = 300
x = np.random.randn(m) - 2 * np.random.randn(m)**2 + 3 * np.random.randn(m)**3

# Eigenvalues and Eigenvectors
K = np.zeros((m, m))
for i in range(m):
    for j in range(m):
        K[i, j] = k(x[i], x[j])
values, vectors = np.linalg.eig(K)
lam = values / m
alpha = np.zeros((m, m))
for i in range(m):
    alpha[:, i] = vectors[:, i] * np.sqrt(m) / (values[i] + 10e-16)

# Display Graph
def F(y, i):
    S = 0
    for j in range(m):
        S = S + alpha[j, i] * k(x[j], y)
    return S

i = 1  # Execute it while changing i
def G(y):
    return F(y, i)

w = np.linspace(-2, 2, 100)
plt.plot(w, G(w))
plt.title("Eigenvalues and their Eigenfunctions")

Finally, we present the RKHS obtained from Mercer’s theorem (Proposition 41). In Example 57, we pointed out that the condition was too loose for the L 2 -space to be an RKHS. The following proposition suggests the restrictions that we should add.



Fig. 3.3 The eigenfunctions obtained in Example 62. We show a comparison between the functions of the m = 1000 samples and the first m = 300 samples. The eigenfunctions coincide for the first largest three eigenvalues, but they are far from each other for the fourth eigenvalue. However, the fourth eigenvalues coincide

Proposition 42 Let $\{(\lambda_j, e_j)\}$ be the eigenvalues of an integral operator with a positive definite kernel $k$ and the orthonormal eigenfunctions. In this case,
$$H = \left\{\sum_{j=1}^{\infty} \beta_j e_j \,\Big|\, \sum_{j=1}^{\infty} \frac{\beta_j^2}{\lambda_j} < \infty\right\}$$
with the inner product
$$\langle f, g\rangle_H := \sum_{j=1}^{\infty} \frac{\int_E f(x) e_j(x)\, d\eta(x)\, \int_E g(x) e_j(x)\, d\eta(x)}{\lambda_j} \qquad (3.20)$$
gives the RKHS.
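As a quick numerical illustration (not from the text) of the norm $\sum_j \beta_j^2/\lambda_j$, take $k(x,y) = \min(x,y)$ from Example 58 and $f = k(x_0,\cdot)$, whose coefficients are $\beta_j = \lambda_j e_j(x_0)$; the norm should then equal $k(x_0,x_0) = x_0$.

# Sketch (assumption: x0 = 0.6, eigenpairs from Example 58, truncated at 10^4 terms).
import numpy as np

x0 = 0.6
j = np.arange(1, 10001)
lam = 4 / ((2 * j - 1) * np.pi) ** 2
e = np.sqrt(2) * np.sin((2 * j - 1) * np.pi * x0 / 2)
beta = lam * e
print(np.sum(beta ** 2 / lam), x0)   # both approximately 0.6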

The proposition claims that if we restrict the elements $\sum_{j=1}^{\infty} \beta_j e_j$ for which $\sum_{j=1}^{\infty} \beta_j^2 < \infty$ to those for which $\sum_{j=1}^{\infty} \beta_j^2/\lambda_j < \infty$, the $L^2$ space becomes an RKHS.

Proof: From the definition of the inner product (3.20), we can write $\langle e_i, e_j\rangle_H = \frac{1}{\lambda_i}\delta_{i,j}$. Thus, we have
$$\int_E \Big\{\sum_{j=1}^{\infty} \beta_j e_j(x)\Big\}^2 d\eta(x) < \infty \iff \sum_{j=1}^{\infty} \beta_j^2 < \infty.$$

Appendix

Proof of Proposition 34

Lemma 4 If a Cauchy sequence $\{f_n\}$ in $H_0$ satisfies $f_n(x) \to 0$ for each $x \in E$, then $\|f_n\|_{H_0} \to 0$ as $n \to \infty$.

Proof: Since $\{f_n\}$ is a Cauchy sequence, there exists $B > 0$ such that $\|f_n\| < B$, $n = 1, 2, \ldots$. Moreover, since the above sequence is a Cauchy sequence, for an arbitrary $\epsilon > 0$, there exists an $N$ such that $n > N \Rightarrow \|f_n - f_N\| < \epsilon/B$. Thus, for $f_N(x) = \sum_{i=1}^{p} \alpha_i k(x_i, x) \in H_0$, $\alpha_i \in \mathbb{R}$, $x_i \in E$, and $i = 1, 2, \ldots$, we have that when $n > N$
$$\|f_n\|_{H_0}^2 = \langle f_n - f_N, f_n\rangle_{H_0} + \langle f_N, f_n\rangle_{H_0} \le \|f_n - f_N\|_{H_0}\|f_n\|_{H_0} + \sum_{i=1}^{p} \alpha_i |f_n(x_i)|.$$
Each of the first and second terms is at most $\epsilon$ since we have $f_n(x_i) \to 0$ as $n \to \infty$ for each $i = 1, \ldots, p$. Hence, we have Lemma 4. $\square$

For Cauchy sequences $\{f_n\}, \{g_n\}$ in $H_0$, we define $f, g \in H$ such that $\{f_n(x)\}, \{g_n(x)\}$ converge to $f(x), g(x)$, respectively, for each $x \in E$. Then, $\{\langle f_n, g_n\rangle_{H_0}\}$ is Cauchy:
$$|\langle f_n, g_n\rangle_{H_0} - \langle f_m, g_m\rangle_{H_0}| = |\langle f_n, g_n - g_m\rangle_{H_0} + \langle f_n - f_m, g_m\rangle_{H_0}| \le \|f_n\|_{H_0}\|g_n - g_m\|_{H_0} + \|f_n - f_m\|_{H_0}\|g_m\|_{H_0}.$$


Since $\{\langle f_n, g_n\rangle_{H_0}\}$ is real and Cauchy, it converges (Proposition 6). The inner product obtained by convergence depends only on $f(x), g(x)$ ($x \in E$). Let $\{f_n'\}, \{g_n'\}$ be other Cauchy sequences in $H_0$ that converge to $f, g$ for each $x \in E$. Then, $\{f_n - f_n'\}, \{g_n - g_n'\}$ are Cauchy sequences that converge to 0 for each $x \in E$, and from Lemma 4, we have $\|f_n - f_n'\|_{H_0}, \|g_n - g_n'\|_{H_0} \to 0$ as $n \to \infty$, which means that
$$|\langle f_n, g_n\rangle_{H_0} - \langle f_n', g_n'\rangle_{H_0}| = |\langle f_n, g_n - g_n'\rangle_{H_0} + \langle f_n - f_n', g_n'\rangle_{H_0}| \le \|f_n\|_{H_0}\|g_n - g_n'\|_{H_0} + \|f_n - f_n'\|_{H_0}\|g_n'\|_{H_0} \to 0.$$
Thus, the convergence point of $\{\langle f_n, g_n\rangle_{H_0}\}$ does not depend on $\{f_n\}, \{g_n\}$ but on $f, g \in H$. We define the inner product of $H$ by
$$\langle f, g\rangle_H := \lim_{n\to\infty} \langle f_n, g_n\rangle_{H_0}.$$
To show that this expression satisfies the definition of an inner product, we assume that $\|f\|_H = \langle f, f\rangle_H = 0$. Then, for each $x \in E$, as $n \to \infty$, from $|f_n(x)| = |\langle f_n(\cdot), k(x,\cdot)\rangle| \le \sqrt{k(x,x)}\,\|f_n\|_{H_0} \to 0$, we have $|f(x)| = \lim_{n\to\infty} |f_n(x)| = 0$. Moreover, since we have defined $f \in H$ according to $\lim_{n\to\infty} f_n(x)$ ($x \in E$) for any Cauchy sequence $\{f_n\}$ in $H_0$ that converges to $f$, from the definition of inner products, we have
$$\|f - f_n\|_H = \lim_{m\to\infty} \|f_m - f_n\|_{H_0} \to 0 \qquad (3.22)$$
as $n \to \infty$, and $H_0$ is dense in $H$.

We show that $H$ is complete. Let $\{f_n\}$ be a Cauchy sequence in $H$. From denseness, there exists a sequence $\{f_n'\}$ in $H_0$ such that
$$\|f_n - f_n'\|_H \to 0 \qquad (3.23)$$
as $n \to \infty$. Therefore, given an arbitrary $\epsilon > 0$, for $m, n > N$, we have
$$\|f_n - f_n'\|_H,\ \|f_m - f_m'\|_H,\ \|f_n - f_m\|_H < \epsilon/3$$
and
$$\|f_n' - f_m'\|_{H_0} = \|f_n' - f_m'\|_H \le \|f_n' - f_n\|_H + \|f_n - f_m\|_H + \|f_m - f_m'\|_H \le \epsilon$$
for $f_n', f_m' \in H_0 \subseteq H$. Thus, $\{f_n'\}$ is a Cauchy sequence in $H_0$, and we define $f \in H$ by the convergence of $f_n'(x)$ for each $x \in E$. Moreover, from (3.22), we have $\|f - f_n'\|_H \to 0$. Combining this with (3.23), we obtain
$$\|f - f_n\|_H \le \|f - f_n'\|_H + \|f_n' - f_n\|_H \to 0$$
as $n \to \infty$. Hence, $H$ is complete.

Next, we show that $k$ is the corresponding reproducing kernel of the Hilbert space $H$. Property (3.1) holds immediately because $k(x,\cdot) \in H_0 \subseteq H$, $x \in E$. For the other property (3.2), since $f \in H$ is the limit of a Cauchy sequence $\{f_n\}$ in $H_0$ at each $x \in E$, we have
$$f(x) = \lim_{n\to\infty} f_n(x) = \lim_{n\to\infty} \langle f_n(\cdot), k(x,\cdot)\rangle_{H_0} = \langle f, k(x,\cdot)\rangle_H.$$
Finally, we show that such an $H$ uniquely exists. Suppose that $G$ exists and shares the same properties possessed by $H$. Since $H$ is a closure of $H_0$, $G$ should contain $H$ as a subspace. Since $H$ is closed, from (2.11), we write $G = H \oplus H^{\perp}$. However, since $k(x,\cdot) \in H$, $x \in E$, and $\langle f(\cdot), k(x,\cdot)\rangle_G = 0$ for $f \in H^{\perp}$, we have $f(x) = 0$, $x \in E$, which means that $H^{\perp} = \{0\}$. $\square$

Proof of Proposition 38

From our assumption, we have $k(x,\cdot) = k_1(x,\cdot) + k_2(x,\cdot) \in H$ for each $x \in E$. We define $N^{\perp} \ni (h_1(x,\cdot), h_2(x,\cdot)) := v^{-1}(k(x,\cdot))$ for each $x \in E$, where $h_1(x,\cdot), h_2(x,\cdot)$ are elements in $H_1, H_2$ for $x \in E$, but $h_1, h_2$ are not necessarily the reproducing kernels $k_1, k_2$ of $H_1, H_2$, respectively. Since $k(x,\cdot) = k_1(x,\cdot) + k_2(x,\cdot)$, we have
$$h_1(x,\cdot) - k_1(x,\cdot) + h_2(x,\cdot) - k_2(x,\cdot) = k(x,\cdot) - k(x,\cdot) = 0$$
and $z := (h_1(x,\cdot) - k_1(x,\cdot),\ h_2(x,\cdot) - k_2(x,\cdot)) \in N$, so $0 = \langle 0, f\rangle_H = \langle z, (f_1, f_2)\rangle_F$ for $f \in H$ and $N^{\perp} \ni (f_1, f_2) := v^{-1}(f)$. Thus, we have
$$\langle f_1, h_1(x,\cdot)\rangle_1 + \langle f_2, h_2(x,\cdot)\rangle_2 = \langle f_1, k_1(x,\cdot)\rangle_1 + \langle f_2, k_2(x,\cdot)\rangle_2,$$
which implies the reproducing property:
$$\langle f, k(x,\cdot)\rangle_H = \langle v^{-1}(f), v^{-1}(k(x,\cdot))\rangle_F = \langle (f_1,f_2), (h_1(x,\cdot), h_2(x,\cdot))\rangle_F = \langle (f_1,f_2), (k_1(x,\cdot), k_2(x,\cdot))\rangle_F = f_1(x) + f_2(x) = f(x).$$
Furthermore, let $(f_1, f_2) \in F$, $f := f_1 + f_2$, and $(g_1, g_2) := (f_1, f_2) - v^{-1}(f)$. Then, from $(g_1, g_2) \in N$ and $v^{-1}(f) \in N^{\perp}$, we have
$$\|(f_1, f_2)\|_F^2 = \|v^{-1}(f)\|_F^2 + \|(g_1, g_2)\|_F^2.$$
Combining this with (3.4) and (3.5), we have
$$\|f\|_H^2 = \|v^{-1}(f)\|_F^2 \le \|(f_1, f_2)\|_F^2 = \|f_1\|_{H_1}^2 + \|f_2\|_{H_2}^2,$$
where the equality holds when $(f_1, f_2) = v^{-1}(f)$. $\square$


Proof of Example 59

We use the equality [10]
$$\int_{-\infty}^{\infty} \exp(-(x-y)^2)\, H_j(\alpha x)\, dx = \sqrt{\pi}\,(1-\alpha^2)^{j/2}\, H_j\!\left(\frac{\alpha y}{(1-\alpha^2)^{1/2}}\right).$$
Suppose that $\int_E p(y)\, dy = 1$. If we have
$$\int_E k(x,y)\,\phi_j(y)\, p(y)\, dy = \lambda \phi_j(x),$$
then
$$\int_E \tilde{k}(x,y)\,\tilde{\phi}_j(y)\, dy = \lambda \tilde{\phi}_j(x)$$
for $\tilde{k}(x,y) := p(x)^{1/2} k(x,y)\, p(y)^{1/2}$, $\tilde{\phi}_j(x) := p(x)^{1/2}\phi_j(x)$. Thus, it is sufficient to show that we obtain the right-hand side by substituting
$$p(x) := \sqrt{\frac{2a}{\pi}}\exp(-2ax^2)$$
$$\tilde{k}(x,y) := \sqrt{\frac{2a}{\pi}}\exp(-ax^2)\exp(-b(x-y)^2)\exp(-ay^2)$$
$$\tilde{\phi}_j(x) := \left(\frac{2a}{\pi}\right)^{1/4}\exp(-cx^2)\, H_j(\sqrt{2c}\, x)$$
into the left-hand side for $E = (-\infty, \infty)$. The left-hand side becomes
$$\begin{aligned}
&\left(\frac{2a}{\pi}\right)^{3/4}\int_{-\infty}^{\infty}\exp(-ax^2)\exp(-b(x-y)^2)\exp(-ay^2)\exp(-cy^2)H_j(\sqrt{2c}\, y)\, dy\\
&= \left(\frac{2a}{\pi}\right)^{3/4}\int_{-\infty}^{\infty}\exp\left\{-(a+b+c)\left(y - \frac{b}{a+b+c}x\right)^2 + \left[\frac{b^2}{a+b+c} - (a+b)\right]x^2\right\}H_j(\sqrt{2c}\, y)\, dy\\
&= \left(\frac{2a}{\pi}\right)^{3/4}\exp(-cx^2)\int_{-\infty}^{\infty}\exp\left\{-\left(z - \frac{b}{\sqrt{a+b+c}}x\right)^2\right\}H_j\!\left(\frac{\sqrt{2c}}{\sqrt{a+b+c}}z\right)\frac{dz}{\sqrt{a+b+c}}\\
&= \left(\frac{2a}{\pi}\right)^{1/4}\sqrt{\frac{2a}{\pi(a+b+c)}}\exp(-cx^2)\,\sqrt{\pi}\left(1 - \frac{2c}{a+b+c}\right)^{j/2}H_j(\sqrt{2c}\, x)\\
&= \sqrt{\frac{2a}{a+b+c}}\left(\frac{b}{a+b+c}\right)^j\left(\frac{2a}{\pi}\right)^{1/4}\exp(-cx^2)H_j(\sqrt{2c}\, x) = \sqrt{\frac{2a}{A}}\, B^j\,\tilde{\phi}_j(x),
\end{aligned}$$
where we define $z := y\sqrt{a+b+c}$, $\alpha := \dfrac{\sqrt{2c}}{\sqrt{a+b+c}}$ and use
$$(1-\alpha^2)^{1/2} = \sqrt{1 - \frac{2c}{a+b+c}} = \sqrt{\frac{a+b-c}{a+b+c}} = \sqrt{\frac{(a+b)^2 - c^2}{(a+b+c)^2}} = \frac{b}{a+b+c}. \qquad \square$$
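The Hermite identity used at the beginning of this proof can be spot-checked numerically. The sketch below is not from the text; it uses SciPy's physicists' Hermite polynomials and assumes $\alpha^2 < 1$.

# Sketch: verify int exp(-(x - y)^2) H_j(alpha x) dx
#                = sqrt(pi) (1 - alpha^2)^(j/2) H_j(alpha y / sqrt(1 - alpha^2)).
import numpy as np
from scipy import integrate
from scipy.special import eval_hermite   # physicists' Hermite polynomials

j, alpha, y = 3, 0.6, 0.8
lhs, _ = integrate.quad(lambda x: np.exp(-(x - y)**2) * eval_hermite(j, alpha * x),
                        -np.inf, np.inf)
rhs = np.sqrt(np.pi) * (1 - alpha**2)**(j / 2) * eval_hermite(j, alpha * y / np.sqrt(1 - alpha**2))
print(lhs, rhs)   # the two values agree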

Proof of Proposition 40

Since $K$ is uniformly continuous, if $d$ is the distance on $E \times E$, there exists a $\delta_n$ such that
$$d((x_1,y_1),(x_2,y_2)) < \delta_n \Longrightarrow |K(x_1,y_1) - K(x_2,y_2)| < n^{-1}$$
for $n = 1, 2, \ldots$ and arbitrary $x_1, x_2, y_1, y_2 \in E$. Since $E$ is compact, we can cover it with a finite number of balls $\{E_{n,i}\}_{i=1}^{m}$ of diameter $\delta_n$. If we arbitrarily choose $v_i \in E_{n,i}$ and define $K_n(x,y) := K(v_i, v_j)$ for $(x,y) \in E_{n,i} \times E_{n,j}$, from the uniform continuity of $K$, we obtain
$$\max_{(x,y)\in E\times E} |K(x,y) - K_n(x,y)| < n^{-1} \to 0.$$
Suppose that $K$ is not nonnegative definite, so that there exist $z_1, \ldots, z_m \in \mathbb{R}$ and $v_1, \ldots, v_m \in E$ with $\sum_{i=1}^{m}\sum_{j=1}^{m} z_i z_j K(v_i, v_j) < 0$. However, from the mean value theorem, we have
$$\langle T_K f, f\rangle := \sum_{i=1}^{m}\sum_{j=1}^{m} z_i z_j \{\mu(E_i)\mu(E_j)\}^{-1}\int_{E_i}\int_{E_j} k(x,y)\, d\mu(x)\, d\mu(y) < 0$$
for $f = \sum_{i=1}^{m} z_i \{\mu(E_i)\}^{-1} I_{E_i}$, which contradicts the fact that $T_K$ is positive definite. $\square$
Proof of Lemma 3

We assume that $f_n(x)$ monotonically increases as $n$ grows for each $x \in E$. Let $\epsilon > 0$ be arbitrary. For each $x \in E$, let $n(x)$ be the minimum $n$ such that $|f_n(x) - f(x)| < \epsilon$. From continuity, for each $x \in E$, we set $U(x)$ so that
$$y \in U(x) \Longrightarrow |f(x) - f(y)| < \epsilon,\ |f_{n(x)}(x) - f_{n(x)}(y)| < \epsilon.$$
Then, we have
$$f(y) - f_{n(x)}(y) \le f(x) + \epsilon - f_{n(x)}(y) \le f_{n(x)}(x) + 2\epsilon - f_{n(x)}(y) \le |f_{n(x)}(x) - f_{n(x)}(y)| + 2\epsilon < 3\epsilon.$$
Moreover, since $E$ is compact, we may suppose that $E \subseteq \cup_{i=1}^{m} U(x_i)$. If $N$ is the maximum value of $n(x_1), \ldots, n(x_m)$, for $n \ge N$, we have
$$f(y) - f_n(y) \le f(y) - f_{n(x_i)}(y) \le 3\epsilon$$
for each $y \in E$ and each $i$ for which $y \in U(x_i)$. $\square$
Exercises 31∼45 31. Proposition 34 can be derived according to the following steps. Which part of the proof in the appendix does each step correspond to? (a) Define the inner product ·, · H0 of H0 := span{k(x, ·) : x ∈ E}. (b) For any Cauchy sequence { f n } in H0 and each x ∈ E, the real sequence { f n (x)} is Cauchy, so it converges to a f (x) := lim f n (x) (Proposition 6). n→∞ Let H be such a set of f s. (c) Define the inner product ·, · H of the linear space H . (d) Show that H0 is dense in H . (e) Show that any Cauchy sequence { f n } in H converges to some element of H as n → ∞ (completeness of H ). (f) Show that k is a reproducing kernel of H . (g) Show that such an H is unique. 1 32. In Examples 55 and 56, the inner product is  f, g H = 0 F(u)G(u)du, and the RKHS is H = {E  x → F(t)J (x, t)dη(t) ∈ R|F ∈ L 2 (E, η)} . E

What are the J (x, t) in Examples 55 and 56? Also, how is the kernel k(x, y) represented in general by using J (x, t)? 33. Proposition 38 can be derived according to the following steps. Which part of the proof in the appendix does each step correspond to?

88

3 Reproducing Kernel Hilbert Space

(a) Fix f ∈ H arbitrarily define N ⊥  ( f 1 , f 2 ) := v −1 ( f ), k(x, ·) := k1 (x, ·) + k2 (x, ·), and (h 1 (x, ·), h 2 (x, ·)) := v −1 (k(x, ·)), and show that  f 1 , h 1 (x, ·)1 +  f 2 , h 2 (x, ·)2 =  f 1 , k1 (x, ·)1 +  f 2 , k2 (x, ·)2 (b) Using (a), prove the reproducing property of k:  f, k(x, ·) H = f (x). (c) Show that the norm of H is (3.6) 34. Show that each f ∈ Wq [0, 1] can be the Taylor series expanded by f (x) =

q−1 

f (i) (0)φi (x) +



1

G q (x, y) f (q) (y)dy

0

i=0

using φi (x) :=

xi , i = 0, 1, . . . i!

and

q−1

G q (x, y) :=

(x − y)+ . (q − 1)!

35. Show that Wq [0, 1] = H0 ⊕ H1 , where H0 = {

q−1 

αi φi (x)|α0 , . . . , αq−1 ∈ R}

i=0

H1 = {

1

G q (x, y)h(y)dy|h ∈ L 2 [0, 1]}

0

(You need to show the inclusion relation on both sides of the set). In addition, show that H0 ∩ H1 = {0}. 36. We consider the integral operator Tk of k(x, y) = min{x, y}, in L 2 [0, 1], where x, y ∈ E = [0, 1]. Substitute λj =

4 {(2 j − 1)π }2

  √ (2 j − 1)π x e j (x) = 2 sin 2 into Tk e j = λ j e j to examine the equality. 37. Show that the eigenvalues in Example 59 form a geometric sequence with the initial values and ratio that are determined by β := σˆ 2 /σ 2 .

Exercises 31∼45

89

38. In Example 59, the following program obtains eigenvalues and eigenfunctions under the assumption that σ 2 = σˆ 2 = 1. We can change the program to set the values of σ 2 , σˆ 2 in ## and add σ 2 , σˆ 2 as an argument to the function phi in ### and run it to output a graph. d e f H( j , x ) : i f j == 0 : return 1 e l i f j == 1 : return 2 ∗ x e l i f j == 2 : r e t u r n −2 + 4 ∗ x ∗∗2 else : r e t u r n 4 ∗ x − 8 ∗ x ∗∗3

c c = np . s q r t ( 5 ) / 4 a = 1/4 ## def phi ( j , x ) : # ## r e t u r n np . exp ( − ( c c − a ) ∗ x ∗ ∗ 2 ) ∗ H( j , np . s q r t ( 2 ∗ c c ) ∗ x ) c o l o r = [ "b" , "g" , "r" , "k" ] x = np . l i n s p a c e ( − 2 , 2 , 1 0 0 ) p l t . plot (x , phi (0 , x ) , c = color [0] , l a b e l = p l t . y l i m ( −2 , 8 ) p l t . y l a b e l ( "phi" ) f o r i i n range ( 0 , 3 ) :

"j=0" )

p l t . p lot (x , phi ( i , x ) , c = color [ i + 1] , l a b e l = p l t . t i t l e ( "CharacteristicfunctionofGaussKernel" )

"j=%d"%i )

39. Show the following: (a) The function f n (x) = n 2 (1 − x)x n+1 defined over [0, 1] converges at each x ∈ [0, 1], but its upper bound does not converge (it is not uniformly convergent). (b) The function f n (x) = (1 − x)x n+1 defined over [0, 1] converges uniformly (using Lemma  3).(−1)n √ (c) The series ∞ n=0 n+1 converges absolutely. 40. In Example 58, suppose that the period of φ is 2π instead of 2. What are the eigenvalues and eigenfunctions of Tk ? Additionally, derive the kernel k. 41. What eigenequations should be solved in Example 61 when m = 3, d = 1? 42. Define and execute the following part of the program in Example 62 as a function. The input for this includes data x, a kernel k, and the i of the ith eigenvalue. The output is a function F. K = np . z e r o s ( ( m, m) ) f o r i i n range (m) : f o r j i n range (m) : K[ i , j ] = k ( x [ i ] , x [ j ] ) v a l u e s , v e c t o r s = np . l i n a l g . e i g (K)

90

3 Reproducing Kernel Hilbert Space lam = v a l u e s / m a l p h a = np . z e r o s ( ( m, m) ) f o r i i n range (m) : a l p h a [ : , i ] = v e c t o r s [ : , i ] ∗ np . s q r t (m) / ( v a l u e s [ i ] + 10 e − 16) def F ( y , i ) : S = 0 f o r j i n range (m) : S = S + alpha [ j , i ] ∗ k ( x [ j ] , y ) return S

43. In Example 62, for the Gaussian kernel, random numbers are generated according to the normal distribution, and we obtain the corresponding eigenvalues and eigenfunctions. When the number of samples is large, theoretically, the eigenvalues are reduced exponentially (Example 59). What happens with the polynomial kernel k(x, y) = (1 + xy)2 when m = 2 and d = 1? Output the eigenvalues and eigenfunctions as the Gaussian kernel. 44. If we construct (3.19) using the solution of K m U = U , show that the result is a solution of (3.18) and that it is orthogonal with of 1. a magnitude 2 β < ∞. However, this is 45. In Proposition 42, β j should originally satisfy ∞ j=1 j not stated in the assertion of Proposition 42. Why is this the case?

Chapter 4

Kernel Computations

In Chap. 1, we learned that the kernel k(x, y) ∈ R represents the similarity between two elements x, y in a set E. Chapter 3 described the relationships between a kernel k, its feature map E  x → k(x, ·) ∈ H , and its reproducing kernel Hilbert space H . In this chapter, we consider k(x, ·) to be a function of E → R for each x ∈ E, and we perform data processing for N actual data pairs (x1 , y1 ), . . . , (x N , y N ) of covariates and responses. The xi , i = 1, . . . , N (row vectors) are p-dimensional and given by the matrix X ∈ R N × p . The responses yi (i = 1, . . . , N ) may be real or binary. This chapter discusses kernel ridge regression, principal component analysis, support vector machines (SVMs), and splines, and we find the f ∈ H that minimizes the objective function under  N various constraints. It is known that we can write the αi k(xi , ·) (representation theorem), and the problem optimal f in the form i=1 reduces to finding the optimal α1 , . . . , α N . In the second half, we address the problem of computational complexity. The computation of a kernel takes more than O(N 3 ), and real-time calculation is hard when N is greater than 1000. In particular, we consider how to reduce the rank of the Gram matrix K . Specifically, we learn actual procedures for random Fourier features, Nyström approximation, and incomplete Cholesky decomposition.

4.1 Kernel Ridge Regression N We say that finding the β ∈ R p (column vector) that minimizes i=1 (yi − xi β)2 is the least-squares problem. If we assume that we have executed the centralN 1  yi and ization process such that yi ← yi − y¯ and xi, j ← xi, j − x¯ j for y¯ = N i=1

© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2022 J. Suzuki, Kernel Methods for Machine Learning with Math and Python, https://doi.org/10.1007/978-981-19-0401-1_4

91

92

x¯ j =

4 Kernel Computations N 1  xi, j and that the matrix X  X is nonsingular, we can obtain the solution N i=1

as βˆ = (X  X )−1 X  y from X = (xi, j ) and y = (yi ). In the following, we prepare a kernel k : E × E → R and consider the problem of finding the f ∈ H that minimizes N  (yi − f (xi ))2 . L := i=1

As we considered in Example 40, we express the RKHS H as the sum of N ) M := span({k(xi , ·)}i=1

and

M ⊥ = { f ∈ H | f, k(xi , ·) H = 0, i = 1, . . . , N } .

If we set f = f 1 + f 2 , f 1 ∈ M, f 2 ∈ M ⊥ , then we have N N N N     (yi − f (xi ))2 = (yi − f 1 (xi ))2 = (yi − αk(x j , xi ))2 i=1

i=1

i=1

(4.1)

j=1

and E := R p ; we then obtain f (xi ) = f 1 (·) + f 2 (·), k(xi , ·) H = f 1 (·), k(xi , ·) H = f 1 (xi ) for i = 1, . . . , N . Thus, the minimization of L reduces to that of L=

N  i=1

{yi −

N 

α j k(x j , xi )}2 = y − K α 2 ,

(4.2)

j=1

where K = (k(xi  , x j ))i, j=1,...,N is a Gram matrix, and the norm z of z = [z 1 , . . . , N 2 z N ] ∈ R denotes i=1 z i . The above principle is the representation theorem. If we differentiate L by α, we have −K (y − K α) = 0. If K is positive definite rather than nonnegative definite, then the solution becomes αˆ = K −1 y. If we use the fˆ ∈ H obtained as above that minimizes (4.2), then we can predict the value of y given a new x ∈ R p via fˆ(x) =

n  i=1

αˆ i k(xi , x).

4.1 Kernel Ridge Regression

93

We can construct a procedure to compute α as follows: # We i n s t a l l s k f d a module b e f o r e h a n d pip i n s t a l l cvxopt

# I n t h i s c h a p t e r , we assume t h a t t h e f o l l o w i n g h a v e b e e n e x e c u t e d . import numpy a s np import p a n d a s a s pd from s k l e a r n . d e c o m p o s i t i o n import PCA import c v x o p t from c v x o p t import s o l v e r s from c v x o p t import m a t r i x import m a t p l o t l i b . p y p l o t a s p l t from m a t p l o t l i b import s t y l e s t y l e . u s e ( "seaborn−ticks" ) from numpy . random import r a n d n # G a u s s i a n random numbers from s c i p y . s t a t s import norm

def alpha ( k , x , y ) : n = len ( x ) K = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : K[ i , j ] = k ( x [ i ] , x [ j ] ) r e t u r n np . l i n a l g . i n v (K + 10 e −5 ∗ np . i d e n t i t y ( n ) ) . d o t ( y ) # Add 10^( − 5) I t o K f o r making i t i n v e r t i b l e

Example 63 Utilizing the function alpha, we execute kernel regression via polynomial and Gaussian kernels for n = 50 data (λ = 0.1). We present the output in Fig. 4.1. d e f k_p ( x , return d e f k_g ( x , return

y) : # ( np . d o t ( x . T , y) : # np . exp ( − ( x −

Kernel D e f i n i t i o n y ) + 1 ) ∗∗3 Kernel D e f i n i t i o n y ) ∗∗2 / 2 )

lam = 0 . 1 n = 5 0 ; x = np . random . r a n d n ( n ) ; y = 1 + x + x ∗∗2 + np . random . r a n d n ( n ) # Data Generation a l p h a _ p = a l p h a ( k_p , x , y ) a l p h a _ g = a l p h a ( k_g , x , y ) z = np . s o r t ( x ) ; u = [ ] ; v = [ ] f o r j i n range ( n ) : S = 0 f o r i i n range ( n ) : S = S + a l p h a _ p [ i ] ∗ k_p ( x [ i ] , z [ j ] ) u . append ( S ) S = 0 f o r i i n range ( n ) : S = S + a l p h a _ g [ i ] ∗ k_g ( x [ i ] , z [ j ] ) v . append ( S ) p l t . s c a t t e r ( x , y , f a c e c o l o r s =’none’ , e d g e c o l o r s =

"k" ,

marker =

"o" )

94

4 Kernel Computations

5

Kernel Regression

2 -1

0

1

y

3

4

Polynomial Kernel Gaussian Kernel

-1.0

-0.5

0.0

0.5

1.0

x Fig. 4.1 We execute kernel regression by using polynomial and Gaussian kernels

plt plt plt plt plt plt plt plt

. p l o t ( z , u , c = "r" , l a b e l = "PolynomialKernel" ) . p l o t ( z , v , c = "b" , l a b e l = "GaussKernel" ) . x l i m ( −1 , 1 ) . y l i m ( −1 , 5 ) . x l a b e l ( "x" ) . y l a b e l ( "y" ) . t i t l e ( "KernelRegression" ) . l e g e n d ( l o c = "upperleft" , f r a m e o n = True , p r o p ={’size’ : 1 4 } )

We cannot obtain the solution of a linear regression problem when the rank of X is smaller than p, i.e., N < p. Thus, we often minimize N 

(yi − xi β)2 + λ β 22

i=1

for cases in which λ > 0. We call such a modification of linear regression a ridge. The β to be minimized is given by (X  X + λI )−1 X  y. In fact, we derive the formula by differentiating y − Xβ 2 + λβ  β by β and equating it to zero; we obtain −X  (y − Xβ) + λβ = 0 . We consider extending ridge regression to the problem of finding the f ∈ H that minimizes

4.1 Kernel Ridge Regression

95

L :=

N  (yi − f (xi ))2 + λ f 2H .

(4.3)

i=1

Since f 1 and f 2 are orthogonal, we have f 2H = f 1 2H + f 2 2H + 2 f 1 , f 2 H = f 1 2H + f 2 2H ≥ f 1 2H .

(4.4)

From (4.1), (4.3), and (4.4), we also have L ≥

N  (yi − f 1 (xi ))2 + λ f 1 2H . i=1

If we note that the second term can be expressed by N N N  N    f 1 2H = αi k(xi , ·), α j k(x j , ·) H = αi α j k(xi , ·), k(x j , ·) H = α  K α i=1

j=1

i=1 j=1

for α = [α1 , . . . , α N ] , then the minimization of L reduces to that of y − K α 2 + λα  K α .

(4.5)

If we differentiate the equation by α and set it equal to zero, we obtain −K (y − K α) + λK α = 0 . If K is nonsingular, we have αˆ = (K + λI )−1 y .

(4.6)

Finally, if we use the fˆ ∈ H that minimizes the (4.3) obtained thus far, we can predict the value of y given a new x ∈ R p via fˆ(x) =

n 

αˆ i k(xi , x) .

i=1

For example, we can construct a procedure that finds α as follows: def alpha ( k , x , y ) : n = len ( x ) K = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : K[ i , j ] = k ( x [ i ] , x [ j ] ) r e t u r n np . l i n a l g . i n v (K + lam ∗ np . i d e n t i t y ( n ) ) . d o t ( y )

96

4 Kernel Computations

Fig. 4.2 We execute kernel ridge regression using polynomial and Gaussian kernels

2 -1

0

1

y

3

4

5

Kernel Ridge

-1.0

-0.5

0.0

0.5

1.0

x

Example 64 Using the function alpha, we execute kernel ridge regression for polynomial and Gaussian kernels and n = 50 data(λ = 0.1). We show the outputs in Fig. 4.2. d e f k_p ( x , return d e f k_g ( x , return

y) : # Kernel D e f i n i t i o n ( np . d o t ( x . T , y ) + 1 ) ∗∗3 y) : # Kernel D e f i n i t i o n np . exp ( − ( x − y ) ∗∗2 / 2 )

lam = 0 . 1 n = 5 0 ; x = np . random . r a n d n ( n ) ; y = 1 + x + x ∗∗2 + np . random . r a n d n ( n ) Data G e n e r a t i o n a l p h a _ p = a l p h a ( k_p , x , y ) a l p h a _ g = a l p h a ( k_g , x , y ) z = np . s o r t ( x ) ; u = [ ] ; v = [ ] f o r j i n range ( n ) : S = 0 f o r i i n range ( n ) : S = S + a l p h a _ p [ i ] ∗ k_p ( x [ i ] , z [ j ] ) u . append ( S ) S = 0 f o r i i n range ( n ) : S = S + a l p h a _ g [ i ] ∗ k_g ( x [ i ] , z [ j ] ) v . append ( S ) p l t . s c a t t e r ( x , y , f a c e c o l o r s =’none’ , e d g e c o l o r s = "k" , m a r k e r = "o" ) p l t . p l o t ( z , u , c = "r" , l a b e l = "PolynomialKernel" ) p l t . p l o t ( z , v , c = "b" , l a b e l = "GaussKernel" ) p l t . x l i m ( −1 , 1 ) p l t . y l i m ( −1 , 5 ) p l t . x l a b e l ( "x" ) p l t . y l a b e l ( "y" ) p l t . t i t l e ( "KernelRidge" ) p l t . l e g e n d ( l o c = "upperleft" , f r a m e o n = True , p r o p ={’size’ : 1 4 } )

#

4.2 Kernel Principle Component Analysis

97

4.2 Kernel Principle Component Analysis We review the procedure of principal component analysis (PCA) when we do not use any kernel. We centralize each of the columns in the matrix X and vector y. We first compute the v1 := v ∈ R p that maximizes v  X  X v under v  v = 1. Similarly, for i = 2, . . . , p, we repeatedly compute vi with the v  v = 1 that maximizes v  X  X v and is orthogonal to v1 , · · · , vi−1 ∈ R p . In the actual cases, we do not use all of the v1 , · · · , v p but compress R p to the v1 , · · · , vm (1 ≤ m ≤ p) with the largest eigenvalues. We compute the v ∈ R p that maximizes v  X  X v − μ(v  v − 1)

(4.7)

with a μ > 0 Lagrange coefficient to find v ∈ R p with the v  v = 1 that maximizes v  X  X v. In PCA, we often compute ⎤ xv1 ⎢ .. ⎥ m ⎣ . ⎦∈R ⎡

xvm for each row vector x ∈ R p using the obtained v1 , . . . , vm ∈ R p . We call such a value the score of x, which is the vector obtained by projecting x onto the m elements. We may apply a problem that is similar to PCA for an RKHS H via the feature map  : E  xi → k(xi , ·) ∈ H rather than the PCA in R p . To this end, we consider the problem of finding the f ∈ H that maximizes N 

f (xi )2 − μ( f 2H − 1)

(4.8)

j=1

with an μ > 0 Lagrange coefficient. If we use the linear kernel (the standard inner product), we can express f ∈ H by f (·) = w, · E with w ∈ E. Thus, (4.7) and (4.8) coincide. The centralization in the kernel PCA is for the Gram matrix K rather than the matrix X . For the other part, the extension follows in the same manner. As discussed in the previous section, we apply the representation theorem. Thus, N ) and f 2 ∈ M ⊥ , we have for f 1 ∈ M := span({k(xi , ·)}i=1 N 

f (xi )2 =

i=1

N N N    f 1 (·)+ f 2 (·), k(xi , ·) 2H = f 1 (·), k(xi , ·) 2H = f 1 (xi )2 i=1

i=1

i=1

N  N N  N N    2 = { α j k(x j , xi )} = αr αs k(xr , xi )k(xs , xi ) = α  K 2 α i=1 j=1

i=1 r =1 s=1

98

4 Kernel Computations

f 1 + f 2 2H = f 1 2H + f 2 2H ≥ f 1 2H =

N 

α j k(x j , ·) 2H =

N  N 

αr αs k(xr , xs ) = α  K α .

r =1 s=1

j=1

Hence, we can formulate (4.8) as the maximization of α  K  K α − μ(α  K α − 1) . If we substitute β = K 1/2 α, then since K is symmetric, we have β  Kβ − μ(β  β − 1) . Let λ1 , . . . , λ N and u 1 , . . . , u N be the eigenvalues and eigenvectors of the eigenequation Kβ = λβ, respectively. Then, we have [26] 1 uN u1 . α = K −1/2 β = √ β = √ , . . . , √ λ λ1 λN If we centralize the Gram matrix K = (k(xi , x j )), then the (i, j)th element of the modified Gram matrix is k(xi , ·) − = k(xi , x j ) − +

N N 1  1  k(x h , ·), k(x j , ·) − k(x h , ·) H N h=1 N h=1 N N 1  1  k(xi , x h ) − k(x j , xl ) N h=1 N l=1

N N 1  k(x h , xl ) . N 2 h=1 l=1

(4.9)

To obtain the score (size 1 ≤ m ≤ p) of x ∈ R p (row vector), we use the first m columns of A = [α1 , . . . , α N ] ∈ R N × p . Let xi ∈ R p and αi ∈ Rm be a row vector of X and the ith column of A ∈ R N ×m , respectively. Then, N 

αi k(xi , x) ∈ Rm

i=1

is the score of x ∈ R p . Compared to ordinary PCA, kernel PCA requires a computational time of O(N 3 ). Therefore, when N is large compared to p, the computational complexity may be enormous. In the Python, we can write the procedure as follows:

4.2 Kernel Principle Component Analysis

99

def k e r n e l _ p c a _ t r a i n ( x , k ) : n = x . shape [ 0 ] K = np . z e r o s ( ( n , n ) ) S = [0] ∗ n ; T = [0] ∗ n f o r i i n range ( n ) : f o r j i n range ( n ) : K[ i , j ] = k ( x [ i , : ] , x [ j , : ] ) f o r i i n range ( n ) : S [ i ] = np . sum (K[ i , : ] ) f o r j i n range ( n ) : T [ j ] = np . sum (K [ : , j ] ) U = np . sum (K) f o r i i n range ( n ) : f o r j i n range ( n ) : K[ i , j ] = K[ i , j ] − S [ i ] / n − T [ j ] / n + U / n ∗∗2 v a l , v e c = np . l i n a l g . e i g (K) idx = v a l . a r g s o r t ( ) [ : : − 1 ] # d e c r e a s i n g order as R val = val [ idx ] vec = vec [ : , idx ] a l p h a = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : a l p h a [ : , i ] = vec [ : , i ] / v a l [ i ] ∗ ∗ 0 . 5 return alpha

d e f k e r n e l _ p c a _ t e s t ( x , k , a l p h a , m, z ) : n = x . shape [ 0 ] p c a = np . z e r o s (m) f o r i i n range ( n ) : p c a = p c a + a l p h a [ i , 0 :m] ∗ k ( x [ i , : ] , z ) return pca

In kernel PCA, when we use the linear kernel, the scores are consistent with those of PCA without any kernel. For simplicity, we assume that X is normalized. If we do not use the kernel, then by the singular value decomposition of X = U V  (U ∈ R N × p ,  ∈ R p× p , V ∈ R p× p ), the multiplication of N 1−1 X  X = N 1−1 V  2 V  and V  is N 1−1 X  X V = N 1−1 V  2 . Thus, each column of V is a principal component vector, and the scores of x1 , . . . , x N ∈ R p (row vector) are the first m columns of X V = U V  · V = U  . On the other hand, for the linear kernel, we may write the Gram matrix as K = X X  = U  2 U  and have K U = X X  U = U  2 . That is, each column of U is β1 , . . . , β N , and the columns α1 , . . . , α N of K −1/2 U are the principal component vectors. Therefore, the scores of x1 , . . . , x N ∈ R p (row vectors) are the first m columns of K · K −1/2 U = U  2 U  · (U  2 U  )−1/2 · U = U  . Furthermore, we compare the results in terms of centralization. Equation (4.9) is xi x j −

N N N N 1  1  1  xi x h − x j xl + xl x h = (xi − x)(x ¯ j − x) ¯ N h=1 N l=1 N h=1 l=1

100

4 Kernel Computations

for the linear kernel, which is consistent with that of the ordinary PCA approach. Therefore, the obtained score is the same. Example 65 We performed kernel PCA on a data set called US Arrests in Python. We wished to project the ratio of the resident population living in urban areas and the incidence rates of homicide, violent crime, and assaults on women (number of arrests per 100,000 people) for all 50 states of the U.S. onto the axes of two variables using PCA. We performed kernel PCA with a Gaussian kernel (σ 2 = 0.01, 0.08), kernel PCA with a linear kernel, and ordinary PCA. We observed that the differences in the features of the 50 states were not evident in the results of ordinary PCA and kernel PCA with the linear kernel (Fig. 4.3a,b). With the Gaussian kernel (σ 2 = 0.08), the 50 states were divided into four categories (Fig. 4.3c). As far as the data were concerned, California’s figures (fewer homicides for a higher urban population) differed from those of the other states. Nevertheless, when we set σ = 0.01, the differences between (b) Kernel PCA (Linear)

(a) Standard PCA

8

46 50

42

26 17 12

-30 -40

3

22

25

31 18

20

10

4

24 33

-100

10

20

18 31

-50

0

50

100

47 39

23 49

38 35 7 44

-250

-200

-150

11

-100

-50

First

(c) Kernel PCA (σ 2 = 0.08) 0.4

(d) Kernel PCA (σ 2 = 0.01)

0.4

15

33 32

29

0.2

3

0.0

Second

4

-0.4

6

-0.4

-0.2

0.0

First

44

20 21

23 13

11

3 48 30 281 7 39 49 9 35 19 612 10 2817 36 22 14 26 24 4645 1815 31 41 27 37 25 34

0.4

5

40

2

0.2

50 42 4 43 47

38 2916

-0.2

0.0

5 7 42 41 40 43 39 44 113 33 38 37 36 35 28 34 30 24 45 27 32 31 23 25 22 26 50 49 48 47 46 18 17 19 16 21 20 11 14 10 12 89

-0.2

0.2

21

14 16

30

First

Second

29 15 27

3736

643

5 -300

50 46

25

13 28 32

-350

150

42

34

19

26 12 17

8 22

3

40

45

41

1

9 2

48

4

9

1

41

45 48

2

-80

-10 -30

3637

24 40

32 28 13

436

-50

0

27

19

39

Second

10

16 14

15 29

-20

Second

49 23

21 47

33

-60

20

44 7 35 38

34

5

30

-70

11

-0.2

0.0

0.2

0.4

0.6

First

Fig. 4.3 For the US Arrests data, we ran the ordinary PCA and kernel PCA methods (linear; Gaussian with σ 2 = 0.08, 0.01), and we display the scores here. In the figure, 1 to 50 are the IDs given to the states, and California’s ID is 5 (written in red). The results of the kernel PCA approach differ greatly depending on what kernel we choose. Additionally, since kernel PCA is unsupervised, it is not possible to use CV to select the optimal parameters. The scores of ordinary PCA and PCA with the linear kernel should be identical. Although the directions of both axes are opposite, which is common in PCA, we can conclude that they match

4.3 Kernel SVM

101

California and the other 49 states became clear (Fig. 4.3d). We used the following code for the execution of the compared approaches: # def k (x , y ) : # r e t u r n np . d o t ( x . T , y ) sigma2 = 0.01 def k ( x , y ) : r e t u r n np . exp ( − np . l i n a l g . norm ( x − y ) ∗∗2 / 2 / s i g m a 2 ) X = pd . r e a d _ c s v ( ’https://raw.githubusercontent.com/selva86/datasets/master/ USArrests.csv’ ) x = X. v a l u e s [ : , : − 1] n = x . shape [ 0 ] ; p = x . shape [ 1 ] alpha = kernel_pca_train (x , k ) z = np . z e r o s ( ( n , 2 ) ) f o r i i n range ( n ) : z [ i , : ] = k e r n e l _ p c a _ t e s t ( x , k , alpha , 2 , x [ i , : ] ) min1 = np . min ( z [ : , 0 ] ) ; min2 = np . min ( z [ : , 1 ] ) max1 = np . max ( z [ : , 0 ] ) ; max2 = np . max ( z [ : , 1 ] ) plt plt plt plt plt for

. x l i m ( min1 , max1 ) . y l i m ( min2 , max2 ) . x l a b e l ( "First" ) . y l a b e l ( "Second" ) . t i t l e ( "KernelPCA(Gauss0.01)" ) i i n range ( n ) : i f i != 4 : pl t . text (x = z [ i , 0] , y = z [ i , 1] , s = i ) p l t . t e x t ( z [ 4 , 0 ] , z [ 4 , 1 ] , 5 , c = "r" )

4.3 Kernel SVM Consider binary discrimination using support vector machines (SVMs). Given X ∈ R N × p and y ∈ {1, −1} N , we find the boundary Y = Xβ + β0 with the β ∈ R p and β0 ∈ R that maximize the margin. Let γ ≥ 0. We wish to maximize the margin M by ranging (β0 , β) ∈ R × R p and i ≥ 0, i = 1, . . . , N to satisfy N 

i ≤ γ

i=1

and yi (β0 + xi β) ≥ M(1 − i ) , i = 1, . . . , N . We often formulate this as the problem of minimizing  1 β 2 + C i 2 i=1 N

(4.10)

102

4 Kernel Computations

under yi (xi β + β0 ) ≥ 1 − i , i ≥ 0 for i = 1, . . . , N by using a constant C > 0 (the prime problem). We further transform it into the problem of finding 0 ≤ αi ≤ C, i = 1, 2, . . . , N that maximizes N 

1  αi − αi α j yi y j xi x j 2 i=1 i=1 j=1 N

N

(4.11)

N under i=1 αi yi = 0, where xi is the ith row vector of X (the dual problem)1 . The constant C > 0 is a parameter that represents the flexibility of the boundary surface. The higher the value is, the more samples are used to determine the boundary (samples with αi = 0, i.e., support vectors). Although we sacrifice the fit of the data, we reduce the boundary variation caused by sample data to prevent overtraining. Then, from the support vectors, we can calculate the slope of the boundary with the following formula: N  β= αi yi xi ∈ R p . i=1

Then, suppose that we replace the boundary surface with a curved surface by replacing the inner product xi x j with a general nonlinear kernel k(xi , x j ). Then, we can obtain complicated boundary surfaces rather than planes. However, the theoretical basis for replacing the product with a kernel is not clear. Therefore, in the following, we derive the same results by formulating the optimization using k : E × E → R. As in to the previous application of the representation theorem, we find the f ∈ H that minimizes    1 f 2H + C i − αi [yi { f (xi ) + β0 } − (1 − i )] − μi i . 2 i=1 i=1 i=1 N

N

N

(4.12)

Noting that f (xi ) = f 1 (xi ), i = 1, . . . , N and f H ≥ f 1 H , we find γ1 , . . . , γ N N such that f (·) = i=1 γi k(xi , ·). The Karush-Kuhn-Tucker (KKT) condition results in the following nine equations: yi { f (xi ) + β0 } − (1 − i ) ≥ 0 i ≥ 0 αi [yi { f (xi ) + β0 } − (1 − i )] = 0 μi i = 0 1

We see this derivation in several references, such as Joe Suzuki, “Statistical Learning with Math and Python” (Springer); C. M. Bishop, “Pattern Recognition and Machine Learning” (Springer); Hastie, Tibshirani, and Fridman, “Elements of Statistical Learning” (Springer); and other primary machine learning books.

4.3 Kernel SVM

103



γ j k(xi , x j ) −



j

α j y j k(xi , x j ) = 0

(4.13)

j



αi yi = 0

i

C − αi − μi = 0

(4.14)

μi ≥ 0 , 0 ≤ αi ≤ C. Next, suppose that f 0 , f 1 , . . . , f m : R p → R are convex and differentiable at β = β . In general, Eqs. (4.15, 4.16, and 4.17) are called the KKT condition2 . ∗

Proposition 43 (KKT Condition) Suppose that f 1 (β) ≤ 0, . . . , f m (β) ≤ 0. Then, β = β ∗ ∈ R p minimizes f 0 (β) if and only if f 1 (β ∗ ), . . . , f m (β ∗ ) ≤ 0

(4.15)

and α1 , . . . , αm ≥ 0 exist such that α1 f 1 (β ∗ ) = · · · = αm f m (β ∗ ) = 0 ∇ f 0 (β ∗ ) +

m 

αi ∇ f i (β ∗ ) = 0 .

(4.16) (4.17)

i=1

Utilizing these nine equations, from (4.13)(4.14), we can express (4.12) as N  i=1

1  αi α j yi y j k(xi .x j ) . 2 i=1 j=1 N

αi −

N

(4.18)

Comparing (4.11) and (4.18), we observe that the dual problem replaces xi x j with k(xi , x j ) for the formulation without any kernel. In fact, if we set f (·) = β, · H , β ∈ R p , k(x, y) = x  y (x, y ∈ R p ), then we obtain the dual problem for a linear kernel (4.11). Example 66 By using the following function svm_2, we can compare how the bounds differ between a linear kernel (the standard inner product) and a nonlinear kernel (a polynomial kernel), as shown in Fig. 4.4. cvxopt is a Python module for solving quadratic programming problems. The function cvxopt calculates α. def K_linear ( x , y ) : r e t u r n x . T@y d e f K_poly ( x , y ) : r e t u r n ( 1 + x . T@y) ∗∗2

2

For the proof, see Chap. 9 of Joe Suzuki “Statistical Learning with R/Python” (Springer).

104

4 Kernel Computations

d e f svm_2 (X, y , C , K) : eps =0.0001 n=X . s h a p e [ 0 ] P=np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : P [ i , j ] =K(X[ i , : ] , X[ j , : ] ) ∗y [ i ] ∗ y [ j ] # S p e c i f y i t v i a t h e m a t r i x f u n c t i o n i n t h e package m a t r i x P= m a t r i x ( P+np . e y e ( n ) ∗ e p s ) A= m a t r i x ( − y . T . a s t y p e ( np . f l o a t ) ) b= m a t r i x ( np . a r r a y ( [ 0 ] ) . a s t y p e ( np . f l o a t ) ) h= m a t r i x ( np . a r r a y ( [ C] ∗ n + [ 0 ] ∗ n ) . r e s h a p e ( − 1 , 1 ) . a s t y p e ( np . f l o a t ) ) G= m a t r i x ( np . c o n c a t e n a t e ( [ np . d i a g ( np . o n e s ( n ) ) , np . d i a g ( − np . o n e s ( n ) ) ] ) ) q= m a t r i x ( np . a r r a y ( [ − 1 ] ∗ n ) . a s t y p e ( np . f l o a t ) ) r e s = c v x o p t . s o l v e r s . qp ( P , q , A=A, b=b , G=G, h=h ) a l p h a =np . a r r a y ( r e s [ ’x’ ] ) # x i s t h e a l p h a i n t h e t e x t b e t a = ( ( a l p h a ∗y ) .T@X) . r e s h a p e ( 2 , 1 ) index = ( eps < alpha [ : , 0 ] ) & ( alpha [ : , 0] < C − eps ) b e t a _ 0 =np . mean ( y [ i n d e x ] −X[ i n d e x , : ] @beta ) r e t u r n {’alpha’ : a l p h a , ’beta’ : b e t a , ’beta_0’ : b e t a _ 0 } d e f p l o t _ k e r n e l (K, l i n e ) : # S p e c i f y t h e l i n e s v i a t h e l i n e a r g u m e n t r e s =svm_2 (X, y , 1 ,K) a l p h a = r e s [ ’alpha’ ] [ : , 0 ] b e t a _ 0 = r e s [ ’beta_0’ ] def f ( u , v ) : S= b e t a _ 0 f o r i i n range (X . s h a p e [ 0 ] ) : S=S+ a l p h a [ i ] ∗ y [ i ] ∗K(X[ i , : ] , [ u , v ] ) return S [ 0 ] # ww i s t h e h e i g h t o f f ( x , y ) . We can draw t h e c o n t o u r . uu=np . a r a n g e ( − 2 , 2 , 0 . 1 ) ; vv=np . a r a n g e ( − 2 , 2 , 0 . 1 ) ; ww= [ ] f o r v i n vv : w= [ ] f o r u i n uu : w. append ( f ( u , v ) ) ww. a p p e n d (w) p l t . c o n t o u r ( uu , vv , ww, l e v e l s =0 , l i n e s t y l e s = l i n e )

2

3

Fig. 4.4 After generating samples, we draw linear (planar) and nonlinear (curved) boundaries with support vector machines

0 -1 -2 -3

X[,2]

1

0 0

-3

-2

-1

0 X[,1]

1

2

3

4.4 Spline Curves

105

a = 3 ; b=−1 n =200 X= r a n d n ( n , 2 ) y=np . s i g n ( a ∗X [ : , 0 ] + b∗X[ : , 1 ] ∗ ∗ 2 + 0 . 3 ∗ r a n d n ( n ) ) y=y . r e s h a p e ( − 1 , 1 ) f o r i i n range ( n ) : i f y [ i ]==1: p l t . s c a t t e r (X[ i , 0 ] , X[ i , 1 ] , c="red" ) else : p l t . s c a t t e r (X[ i , 0 ] , X[ i , 1 ] , c="blue" ) p l o t _ k e r n e l ( K_poly , l i n e ="dashed" ) p l o t _ k e r n e l ( K _ l i n e a r , l i n e ="solid" )

pcost dcost 0: -6.6927e+01 -4.6679e+02 1: -4.2949e+01 -2.9229e+02 2: -2.8717e+01 -1.0653e+02 3: -2.5767e+01 -4.7367e+01 4: -2.6165e+01 -3.1836e+01 5: -2.6940e+01 -2.8267e+01 6: -2.7243e+01 -2.7483e+01 7: -2.7325e+01 -2.7330e+01 8: -2.7327e+01 -2.7328e+01 9: -2.7328e+01 -2.7328e+01 Optimal solution found. pcost dcost 0: -8.1804e+01 -4.7816e+02 1: -5.3586e+01 -3.0647e+02 2: -4.1406e+01 -8.6880e+01 3: -4.7360e+01 -5.9604e+01 4: -4.9819e+01 -5.5157e+01 5: -5.0999e+01 -5.3276e+01 6: -5.1869e+01 -5.2122e+01 7: -5.1966e+01 -5.2010e+01 8: -5.1986e+01 -5.1988e+01 9: -5.1987e+01 -5.1987e+01 10: -5.1987e+01 -5.1987e+01 Optimal solution found.

4.4 Spline Curves Let J ≥ 1. We say that the function

gap 2e+03 5e+02 1e+02 3e+01 8e+00 2e+00 3e-01 6e-03 9e-05 9e-07

pres 3e+00 4e-01 1e-01 2e-02 5e-03 7e-04 1e-04 1e-06 2e-08 2e-10

dres 1e-14 8e-15 6e-15 4e-15 4e-15 3e-15 3e-15 3e-15 3e-15 3e-15

gap 2e+03 4e+02 6e+01 1e+01 6e+00 2e+00 3e-01 4e-02 3e-03 8e-05 2e-06

pres 3e+00 4e-01 3e-02 6e-03 2e-03 7e-04 9e-06 1e-06 5e-15 1e-15 3e-15

dres 3e-15 3e-15 5e-15 2e-15 2e-15 2e-15 3e-15 2e-15 3e-15 2e-15 3e-15

106

4 Kernel Computations

g(x) = β1 + β2 x + β3 x 2 + β4 x 3 +

J 

β j+4 (x − ξ j )3+

(4.19)

j=1

⎧ 2 3 ⎪ x < ξ1 ⎨ g0 (x) = β1 + β2 x + β3 x + β4 x , ξ j ≤ x < ξ j+1 = g j (x) = g j−1 (x) + β j+4 (x − ξ j )3 , ⎪ 3 ⎩ g (x) = β + β x + β x 2 + β x 3 +  J β 1 2 3 4 J j=1 j+4 (x − ξ j ) , x ≥ ξ J

with the constants β1 , . . . , β J +4 ∈ R is a spline function of order three with knots 0 < ξ1 < · · · < ξ J < 1. We may define the spline function of order three by the function g, which is a piecewise polynomial for each of the J + 1 intervals whose g, g , g

are continuous at the J knots. The spline expressed by (4.19) consists of a linear space, and (4.20) 1, x, x 2 , x 3 , (x − ξ1 )3+ , . . . , (x − ξ J )3+ can be its basis. In particular, we consider the natural spline of order three in which we pose more conditions such as (4.21) g

(ξ1 ) = g

(ξ1 ) = 0 and

g

(ξ J ) = g

(ξ J ) = 0 .

(4.22)

The resulting curve is not of order three in x ≤ ξ1 , ξ J ≤ x, and we approximate it by a line. The linear space of natural splines possesses J dimensions. In fact, from (4.21), we have g

(ξ1 ) = 6β4 = 0 g

(ξ1 ) = 2β3 + 6β4 ξ1 = 0 ⇐⇒ β3 = β4 = 0 . Additionally, from (4.22), we have g

(ξ J ) = 6β4 + 6

J 

β j+4 = 0

j=1

g

(ξ J ) = 2β3 + 6β4 ξ J + 6

J 

β j+4 (ξ J − ξ j ) = 0

j=1

⇐⇒

J  j=1

β j+4 =

J 

β j+4 ξ j = 0 .

j=1

Thus, the β J +3 , β J +4 values are determined by the other β j ; j = 1, 2, 5, . . . , J + 2. In the following, we consider the problem of finding the f : [0, 1] → R that minimizes

4.4 Spline Curves

107

 1 N  {yi − f (xi )}2 + λ { f

(x)}2 d x i=1

(4.23)

0

given samples (x1 , y1 ), . . . , (x N , y N ) ∈ R × R. The second term is zero if the function is a straight line, but it becomes a significant value if the function deviates from a straight line. In other words, this term represents the complexity of the function f . The constant λ ≥ 0 balances the two terms, and if it is large, the curve is smooth; if the constant is small, the curve follows the sample closely. Note that, in general, the bounds ξ1 , . . . , ξ J and x1 , . . . , x N are defined separately. In this case, it is known that the f that minimizes (4.23) is a natural spline of order three such that f (xi ) = yi , i = 1, . . . , N at the N boundaries ξ1 = x1 , . . . , ξ N = x N 3 . However, f is once differentiable everywhere and twice differentiable almost 1 everywhere with 0 { f

(x)}2 d x < ∞, which implies that f is an element of W2 [0, 1]. A similar proposition holds for the general Wq [0, 1]. Example 67 In the case of a natural spline with q = 2, if we choose the basis  1 (q)  (q) g1 , . . . , g N appropriately, such as g(·) = Nj=1 β j g j (·), and G = ( 0 gi (x)g j (x)d x) ∈ R N ×N , y = [y1 , . . . , y N ], then we obtain the optimal [β1 , . . . , β N ] = (X  X + λG)−1 X  y . Figure 4.5 shows the graphs obtained for λ = 1, 30, 80. # d , h define the function that obtains the basis def d ( j , x , knots ) : K = len ( knots ) r e t u r n ( np . maximum ( ( x− k n o t s [ j ] ) ∗ ∗ 3 , 0 ) − np . maximum ( ( x− k n o t s [K− 1]) ∗ ∗ 3 , 0 ) ) / ( k n o t s [K−1]− k n o t s [ j ] ) def h ( j , x , knots ) : K = len ( knots ) i f j == 0 : return 1 e l i f j == 1 : return x else : r e t u r n d ( j − 1 , x , k n o t s )−d (K− 2 , x , k n o t s ) # G g i v e s values i n t e g r a t i n g the f u n c t i o n s t h a t are d i f f e r e n t i a t e d twice d e f G( x ) : # The x v a l u e s a r e o r d e r e d i n a s c e n d i n g n = len ( x ) g = np . z e r o s ( ( n , n ) ) f o r i i n range ( 2 , n − 1) : f o r j i n range ( i , n ) : g [ i , j ] = 1 2 ∗ ( x [ n −1]− x [ n − 2]) ∗ ( x [ n −2]− x [ j − 2]) \ ∗ ( x [ n −2]− x [ i − 2]) / ( x [ n −1]− x [ i − 2]) / \ ( x [ n −1]− x [ j − 2]) +(12∗ x [ n − 2]+6∗ x [ j − 2] − 18∗x [ i − 2]) \ ∗ ( x [ n −2]− x [ j − 2]) ∗ ∗ 2 / ( x [ n −1]− x [ i − 2]) / ( x [ n −1]− x [ j − 2]) g[ j , i ] = g[ i , j ] return g

3

See Chap. 7 of this series (“Statistical Learning with R/Python” (Springer)) for the proof.

108

4 Kernel Computations

Smoothing Spline (n = 100) λ=1 λ = 30 λ = 80

g(x)

Fig. 4.5 Instead of giving knots or the number of knots in the smoothing spline, we specify the λ value, which indicates smoothness. Comparing λ = 1, 30, 80, as we increase the value of λ, the spline does not follow the observed data, but it becomes smoother

-5

# MAIN n = 100 x = np . random . u n i f o r m ( − 5 , 5 , n ) y = x + np . s i n ( x ) ∗2 + np . random . r a n d n ( n ) i n d e x = np . a r g s o r t ( x ) x = x [ index ] ; y = y [ index ] X = np . z e r o s ( ( n , n ) ) X[ : , 0] = 1 f o r j i n range ( 1 , n ) : f o r i i n range ( n ) : X[ i , j ] = h ( j , x [ i ] , x ) GG = G( x ) lam_set = [ 1 , 30 , 80] c o l _ s e t = [ "red" , "blue" , "green" ] plt plt plt plt

0 x

5

# Data G e n e r a t i o n

# Generation of Matrix X # Generation of Matrix G

. figure () . y l i m ( −8 , 8 ) . x l a b e l ( "x" ) . y l a b e l ( "g(x)" )

f o r i i n range ( 3 ) : lam = l a m _ s e t [ i ] gamma = np . d o t ( np . d o t ( np . l i n a l g . i n v ( np . d o t (X . T , X) +lam ∗GG) ,X . T ) , y ) def g ( u ) : S = gamma [ 0 ] f o r j i n range ( 1 , n ) : S = S + gamma [ j ] ∗ h ( j , u , x ) return S u _ s e q = np . a r a n g e ( − 8 , 8 , 0 . 0 2 ) v_seq = [ ] for u in u_seq : v_seq . append ( g ( u ) ) p l t . p l o t ( u_seq , v_seq , c = c o l _ s e t [ i ] , l a b e l = "$\lambda=%d$"%l a m _ s e t [ i ]) p l t . legend ( ) p l t . s c a t t e r ( x , y , f a c e c o l o r s =’none’ , e d g e c o l o r s = "k" , m a r k e r = "o" ) p l t . t i t l e ( "smoothspline(n=100)" ) T e x t ( 0 . 5 , 1 . 0 , ’smoothspline(n=100)’ )

Generalizing (4.23), we consider minimizing

\sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda \int_0^1 \{f^{(q)}(x)\}^2 dx .   (4.24)

First, each f = f_0 + f_1 (f_0 ∈ H_0 and f_1 ∈ H_1 in W_q[0,1]) can be written with the appropriate linear operators P_0 ∈ B(H, H_0) and P_1 ∈ B(H, H_1) as f_0 = P_0 f ∈ H_0 and f_1 = P_1 f ∈ H_1. Since ⟨f_0, f_1⟩_H = 0, f_0 and f_1 are the elements of H_0 and H_1 that minimize ‖f − g‖_H over g ∈ H_0 and over g ∈ H_1, respectively. Furthermore, P_0, P_1 are self-adjoint. In fact, from Proposition 19, for each i = 0, 1, we have

⟨P_i f, g⟩_H = ⟨f_i, g_0 + g_1⟩_H = ⟨f_i, g_i⟩_H = ⟨f_0 + f_1, g_i⟩_H = ⟨f, P_i g⟩_H

for f_0, g_0 ∈ H_0, f_1, g_1 ∈ H_1, f = f_0 + f_1, g = g_0 + g_1. Moreover, we have P_i f ∈ H_i and P_i^2 f = P_i f. Thus, we can write the norm of the second term in (4.24) as

\int_0^1 |f^{(q)}(x)|^2 dx = ‖P_1 f‖_{H_1}^2 = ⟨P_1 f, P_1 f⟩_{H_1} = ⟨f, P_1^2 f⟩_H = ⟨f, P_1 f⟩_H

and can express (4.24) as

\sum_{i=1}^N \{y_i - f(x_i)\}^2 + \lambda ⟨f, P_1 f⟩_H   (4.25)

for f ∈ W_q[0,1]. Let f = g + h ∈ H, g ∈ M := span{φ_0(·), ..., φ_{q−1}(·), k(x_1,·), ..., k(x_N,·)}, and h ∈ M^⊥. Then, for i = 1, ..., N, we have f(x_i) = ⟨g + h, k(x_i,·)⟩_H = g(x_i) and ‖P_1 f‖_{H_1} ≥ ‖P_1 g‖_{H_1} (the representer theorem). Thus, we may restrict the range of f to M when searching for the optimum, i.e., find α_1, ..., α_N and β_0, ..., β_{q−1} in

g(·) = \sum_{i=0}^{q-1} β_i φ_i(·) + \sum_{i=1}^N α_i k(x_i, ·) .   (4.26)

In natural spline functions, we regard the derivatives of orders q through 2q − 1 at x = x_N as zero, which means that

g^{(q)}(x_N) = \cdots = g^{(2q-1)}(x_N) = 0 ,   (4.27)

and the dimensionality of span{k(x_i, ·) | i = 1, ..., N} is N − q. For spline functions of order three (q = 2), (4.27) corresponds to (4.22). The basis {1, x} of the lines in x ≤ x_1 corresponds to {φ_0(x), ..., φ_{q−1}(x)}. Thus, we find the optimal solution in an N-dimensional subspace of W_q[0,1].


Proposition 44 Let r ∈ W_q[0,1] be a natural spline with knots x_1, ..., x_N and a maximum order of 2q − 1, and suppose that g ∈ W_q[0,1] satisfies g(x_i) = r(x_i) for i = 1, 2, ..., N. Then, we have

\int_0^1 \{r^{(q)}(x)\}^2 dx \leq \int_0^1 \{g^{(q)}(x)\}^2 dx .

Moreover, when the equality holds, s := g − r is a polynomial of order at most q − 1 with s(x_i) = 0 for i = 1, 2, ..., N, and if N ≥ q, then the function s is identically zero.

Proof: See the appendix at the end of this chapter.

Since the natural splines of the highest order 2q − 1 form an N-dimensional space, there exists an r ∈ W_q[0,1] that shares the values r(x_1) = g(x_1), ..., r(x_N) = g(x_N) at the N knots x_1, ..., x_N. Among all such functions, Proposition 44 shows that the natural spline of the highest order 2q − 1 makes the second term in (4.25) smallest, so the minimizer can be taken to be such a natural spline. To summarize the above, the problem of finding the f that minimizes (4.25) in W_q[0,1] reduces to finding the solution over the range of (4.26) subject to (4.27); in other words, we can think of the problem in a subspace with N dimensions. Moreover, the basis consists of N elements regardless of q ≥ 1, and if we set g(·) = \sum_{j=1}^N β_j g_j(·), the problem is to find the β_1, ..., β_N that minimize

\sum_{i=1}^N \Big\{y_i - \sum_{j=1}^N β_j g_j(x_i)\Big\}^2 + \lambda \sum_{i=1}^N \sum_{j=1}^N β_i β_j \int_0^1 g_i^{(q)}(x) g_j^{(q)}(x) dx .

Let X = (g_j(x_i)) ∈ R^{N×N}, G = (\int_0^1 g_i^{(q)}(x) g_j^{(q)}(x) dx) ∈ R^{N×N}, and y = [y_1, ..., y_N]^T. The optimal solution β = [β_1, ..., β_N]^T is given by

β = (X^T X + λG)^{-1} X^T y .
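As a minimal sketch of this closed form (assuming the matrices X and GG built in the listing for Fig. 4.5), np.linalg.solve can be used instead of forming the inverse explicitly:

lam = 30
beta = np.linalg.solve(np.dot(X.T, X) + lam * GG, np.dot(X.T, y))  # spline coefficients
y_hat = np.dot(X, beta)                                            # fitted values at the sample points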

4.5 Random Fourier Features

In the following, we examine computational cost reduction methods. In particular, in this section, we learn about random Fourier features, which we can apply to the case where the kernel k(x, y) (x, y ∈ E) is a function of x − y.

Proposition 45 (Rahimi and Recht [23]) Suppose that k : E × E ∋ (x, y) ↦ k(x, y) ∈ R is a function of x − y. Then, we have

k(x, y) = 2 E_{ω,b}[\cos(ω^T x + b) \cos(ω^T y + b)] ,   (4.28)

where the expectation E_{ω,b} is taken over ω ∼ μ (the probability measure of k in Proposition 5) and b ∈ [0, 2π) (the uniform distribution).


Proof: The claim is due to Bochner's theorem (Proposition 5). See the appendix at the end of this chapter for details.

Based on Proposition 45, we generate (ω_i, b_i), i = 1, ..., m, for some m ≥ 1 and construct the functions

z_i(x) = \sqrt{2} \cos(ω_i^T x + b_i) , i = 1, ..., m .

From the law of large numbers, the constructed

k̂(x, y) := \frac{1}{m} \sum_{i=1}^m z_i(x) z_i(y)

approaches k(x, y). Utilizing this fact when m is small compared to N, the method that reduces the complexity of the kernel computation in this way is called random Fourier features (RFF). We claim that the RFF possess the following property:

P(|k(x, y) - k̂(x, y)| \geq \varepsilon) \leq 2 \exp(-m\varepsilon^2/8) .   (4.29)

Proposition 46 (Hoeffding's Inequality) For independent random variables X_i, i = 1, ..., n, each of which takes values in [a_i, b_i], and an arbitrary ε > 0, we have

P(|\bar{X} - E[\bar{X}]| \geq \varepsilon) \leq 2 \exp\Big(-\frac{2n^2\varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\Big) ,   (4.30)

where \bar{X} denotes the sample mean (X_1 + ... + X_n)/n.

Proof: We use the Chernoff bound and Hoeffding's lemma, which are shown below.

Lemma 5 (Chernoff Bound) For a random variable X and an arbitrary ε > 0, we have

P(X \geq \varepsilon) \leq \inf_{s>0} e^{-s\varepsilon} E[e^{sX}] .   (4.31)

To prove this lemma, we use the following lemma.

Lemma 6 (Markov's Inequality) For a random variable X that takes nonnegative values, we have P(X ≥ ε) ≤ E[X]/ε.

Lemma 6 is due to

E[X] = E[X \cdot I(X \geq \varepsilon)] + E[X \cdot I(X < \varepsilon)] \geq E[X \cdot I(X \geq \varepsilon)] \geq \varepsilon P(X \geq \varepsilon) .

Lemma 5 follows from Lemma 6 and the fact that


P(X \geq \varepsilon) = P(sX \geq s\varepsilon) = P(\exp(sX) \geq \exp(s\varepsilon)) \leq e^{-s\varepsilon} E[e^{sX}]

for s > 0.

To prove Proposition 46, we use the following lemma:

Lemma 7 (Hoeffding) Suppose that a random variable X satisfies E[X] = 0 and a ≤ X ≤ b. Then, for an arbitrary ε > 0, we have

E[e^{\varepsilon X}] \leq e^{\varepsilon^2 (b-a)^2/8} .   (4.32)

Proof: See the appendix at the end of this chapter.

Returning to the proof of Proposition 46, let S_n := \sum_{i=1}^n X_i, and apply Lemma 5 to obtain

P(S_n - E[S_n] \geq \varepsilon) \leq \inf_{s>0} e^{-s\varepsilon} E[\exp\{s(S_n - E[S_n])\}] .

In particular, since X_1, ..., X_n are independent, we have

e^{-s\varepsilon} E[\exp\{s(S_n - E[S_n])\}] = e^{-s\varepsilon} \prod_{i=1}^n E[e^{s(X_i - E[X_i])}] .

Moreover, by applying Lemma 7, we obtain

P(S_n - E[S_n] \geq \varepsilon) \leq \inf_{s>0} \exp\Big\{-s\varepsilon + \frac{s^2}{8} \sum_{i=1}^n (b_i - a_i)^2\Big\} ,

in which the minimum value is attained when s := 4\varepsilon / \sum_{i=1}^n (b_i - a_i)^2, and we have

P(S_n - E[S_n] \geq \varepsilon) \leq \exp\Big\{-2\varepsilon^2 / \sum_{i=1}^n (b_i - a_i)^2\Big\} .

Furthermore, if we replace X_1, ..., X_n with -X_1, ..., -X_n, we obtain

P(S_n - E[S_n] \leq -\varepsilon) \leq \exp\Big\{-2\varepsilon^2 / \sum_{i=1}^n (b_i - a_i)^2\Big\} .

Hence, we have

P(|S_n - E[S_n]| \geq \varepsilon) \leq P(S_n - E[S_n] \geq \varepsilon) + P(S_n - E[S_n] \leq -\varepsilon) \leq 2 \exp\Big\{-2\varepsilon^2 / \sum_{i=1}^n (b_i - a_i)^2\Big\} .

If we substitute \bar{X} = S_n/n, we obtain Proposition 46.



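As a numerical illustration of Proposition 46 (a minimal sketch, not from the book's code): for i.i.d. variables bounded in [0, 1], the empirical tail probability of the sample mean should lie below the bound 2 exp(−2nε²).

import numpy as np

n, trials, eps = 50, 10000, 0.1
X = np.random.rand(trials, n)              # uniform variables in [0, 1], so a_i = 0, b_i = 1
dev = np.abs(X.mean(axis=1) - 0.5)         # |sample mean - expectation|
print(np.mean(dev >= eps))                 # empirical tail probability
print(2 * np.exp(-2 * n * eps**2))         # Hoeffding bound (4.30)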

Fig. 4.6 In the RFF approximation, we generated k̂(x, y) 1000 times by changing m. We observe that they all have zero centers, and the larger m is, the smaller the estimation error is

Since E[k̂(x, y)] = k(x, y) and −2 ≤ z_i(x) z_i(y) ≤ 2, using Proposition 46, we obtain (4.29)⁴.

Example 68 From Example 19, since the probability measure of a Gaussian kernel has a mean of 0 and a covariance matrix σ^{-2} I ∈ R^{d×d}, we generate the d-dimensional Gaussian random numbers and the uniform random numbers independently and construct the m functions z_i(x) = \sqrt{2} \cos(ω_i^T x + b_i), i = 1, ..., m. We draw a boxplot of k̂(x, y) − k(x, y) by generating (x, y) 1000 times with d = 1 and m = 20, 100, 400 in Fig. 4.6. We observe that k̂(x, y) − k(x, y) has a mean of 0 (k̂(x, y) is an unbiased estimator), and the larger m is, the smaller the variance is. The program is written as follows:

sigma = 10
sigma2 = sigma**2

def k(x, y):
    return np.exp(-(x - y)**2 / (2 * sigma2))

def z(x):
    return np.sqrt(2 / m) * np.cos(w * x + b)

def zz(x, y):
    return np.sum(z(x) * z(y))

u = np.zeros((1000, 3))
m_seq = [20, 100, 400]
for i in range(1000):
    x = randn(1)
    y = randn(1)
    for j in range(3):
        m = m_seq[j]
        w = randn(m) / sigma
        b = np.random.rand(m) * 2 * np.pi
        u[i, j] = zz(x, y) - k(x, y)

⁴ The original paper by Rahimi and Recht (2007) and subsequent work proved more rigorous upper and lower bounds than these [2].


fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.boxplot([u[:, 0], u[:, 1], u[:, 2]], labels=['20', '100', '400'])
ax.set_xlabel('m')
ax.set_ylim(-0.5, 0.6)
plt.show()

The solution α = [α_1, ..., α_N]^T with f(·) = \sum_{i=1}^N α_i k(x_i, ·) for kernel ridge regression with the Gram matrix K is given by (4.6) (Sect. 4.1). If we approximate the Gram matrix K via RFF as K̂ = Z Z^T with Z = (z_j(x_i)) ∈ R^{N×m}, then we obtain the approximation f̂(·) = \sum_{i=1}^N α̂_i k̂(x_i, ·), where α̂ ∈ R^N satisfies (K̂ + λ I_N) α̂ = y and I_N ∈ R^{N×N} is the unit matrix. Using Woodbury's formula

U (I_s + V U)^{-1} = (I_r + U V)^{-1} U for U ∈ R^{r×s}, V ∈ R^{s×r}, r, s ≥ 1,

we have

Z^T (Z Z^T + λ I_N)^{-1} = (Z^T Z + λ I_m)^{-1} Z^T .

Let x ∈ E be a value other than the x_1, ..., x_N used for estimation, and let z(x) := [z_1(x), ..., z_m(x)] (a row vector). Then, for

β̂ := (Z^T Z + λ I_m)^{-1} Z^T y ,   (4.33)

we have

f̂(x) = \sum_{i=1}^N α̂_i k̂(x, x_i) = z(x) \sum_{i=1}^N z(x_i)^T α̂_i = z(x) Z^T α̂ = z(x) Z^T (K̂ + λ I_N)^{-1} y = z(x)(Z^T Z + λ I_m)^{-1} Z^T y = z(x) β̂ .

Thus, for a new x ∈ E, we can find its value from f̂(x) = z(x)β̂. The computational complexity of (4.33) is O(m²N) for the multiplication Z^T Z, O(m³) for finding the inverse of Z^T Z + λ I_m ∈ R^{m×m}, O(Nm) for the multiplication Z^T y, and O(m²) for multiplying (Z^T Z + λ I_m)^{-1} and Z^T y. Thus, overall, the process requires only O(N m²) complexity at most. On the other hand, the process takes O(N³) time when using the kernel without approximation; if m = N/10, the computational time becomes 1/100. Obtaining f̂(x) for a new x ∈ E also takes only O(m) time.

Example 69 We applied RFF to kernel ridge regression. For N = 200 data points, we used m = 20 for the approximation and plotted the curves for λ = 10^{-6}, 10^{-4} (Fig. 4.7). The program is as follows:



Fig. 4.7 We applied RFF to kernel ridge regression. On the left and right are λ = 10^{-6} and λ = 10^{-4}, respectively

sigma = 10
sigma2 = sigma**2

# Function z
m = 20
w = randn(m) / sigma
b = np.random.rand(m) * 2 * np.pi

def z(u, m):
    return np.sqrt(2 / m) * np.cos(w * u + b)

# Gaussian Kernel
def k(x, y):
    return np.exp(-(x - y)**2 / (2 * sigma2))

# Data Generation
n = 200
x = randn(n) / 2
y = 1 + 5 * np.sin(x / 10) + 5 * x**2 + randn(n)
x_min = np.min(x); x_max = np.max(x); y_min = np.min(y); y_max = np.max(y)

lam = 0.001
# lam = 0.9

# Low-Rank Approximated Function
def alpha_rff(x, y, m):
    n = len(x)
    Z = np.zeros((n, m))
    for i in range(n):
        Z[i, :] = z(x[i], m)
    beta = np.dot(np.linalg.inv(np.dot(Z.T, Z) + lam * np.eye(m)), np.dot(Z.T, y))
    return beta

# Usual Function
def alpha(k, x, y):
    n = len(x)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i], x[j])
    alpha = np.dot(np.linalg.inv(K + lam * np.eye(n)), y)
    return alpha

# Numerical Comparison
alpha_hat = alpha(k, x, y)
beta_hat = alpha_rff(x, y, m)
r = np.sort(x)
u = np.zeros(n)
v = np.zeros(n)
for j in range(n):
    S = 0
    for i in range(n):
        S = S + alpha_hat[i] * k(x[i], r[j])
    u[j] = S
    v[j] = np.sum(beta_hat * z(r[j], m))

plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.plot(r, u, c="r", label="w/o Approx")
plt.plot(r, v, c="b", label="with Approx")
plt.xlim(-1.5, 2)
plt.ylim(-2, 8)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Kernel Regression")
plt.legend(loc="upper left", frameon=True, prop={'size': 14})

In practice, the RFF approximation is said to cause no significant degradation; still, providing theoretical guarantees for it remains an issue.
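As a quick check of the prediction rule f̂(x) = z(x)β̂ at a new input (a minimal sketch; beta_hat, z, and m are the objects defined in the listing above, and the point x_new is an arbitrary choice):

x_new = 0.5                                # an arbitrary new input
f_new = np.sum(beta_hat * z(x_new, m))     # f_hat(x_new) = z(x_new) beta_hat
print(f_new)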

4.6 Nyström Approximation

We consider finding the coefficient estimates (K + λI)^{-1} y in kernel ridge regression. Suppose that we can realize a low-rank matrix decomposition K = R R^T with R ∈ R^{N×m} in a computationally inexpensive way. In this case, we can complete the estimation task quickly. Note that we have

(R R^T + λ I_N)^{-1} = \frac{1}{λ} \{I_N - R (R^T R + λ I_m)^{-1} R^T\} ,   (4.34)

which is due to the Sherman-Morrison-Woodbury formula⁵: for r, s ≥ 1, A ∈ R^{s×s}, U ∈ R^{s×r}, C ∈ R^{r×r}, V ∈ R^{r×s},

(A + U C V)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}   (4.35)

with r = m, s = N, A = λ I_N, U = R, C = I_r, and V = R^T. Computing the left side of (4.34) requires an inverse matrix operation of size N, while computing the right side involves the product of N×m and m×m matrices and an inverse matrix operation of size m. The computations on the left- and right-hand sides require O(N³) and O(N²m) complexity, respectively. In the remainder of this section, we show that, with some approximation, the decomposition K = R R^T is completed in O(N m²) time, i.e., the calculation of the ridge regression is performed in O(N m²). In other words, if N/m = 10, the computational time is only 1/100.

⁵ Joe Suzuki, "Statistical Learning with Math and R/Python".
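The identity (4.34) can be verified numerically; the following is a minimal sketch (the sizes N, m, the value of λ, and the random factor R are arbitrary choices):

import numpy as np

N, m, lam = 100, 10, 0.1
R = np.random.randn(N, m)
lhs = np.linalg.inv(R @ R.T + lam * np.eye(N))
rhs = (np.eye(N) - R @ np.linalg.inv(R.T @ R + lam * np.eye(m)) @ R.T) / lam
print(np.allclose(lhs, rhs))               # True: both sides of (4.34) agree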


In Sect. 3.3, based on (3.18), we considered approximating the eigenfunctions from x_1, ..., x_m ∈ E by

φ_i(·) = \frac{\sqrt{m}}{λ_i^{(m)}} \sum_{j=1}^m k(x_j, ·) U_{j,i} .

Let m ≤ N; from the first m samples x_1, ..., x_m of x_1, ..., x_m, x_{m+1}, ..., x_N, we construct the φ_i and λ_i. Then, via

v_i := [φ_i(x_1)/\sqrt{N}, ..., φ_i(x_N)/\sqrt{N}]^T ∈ R^N ,  λ_i^{(N)} := N λ_i ,  K_N = \sum_{i=1}^m λ_i^{(N)} v_i v_i^T ,

we approximate the Gram matrix K_N w.r.t. x_1, ..., x_N. In order to obtain the decomposition R R^T, we may set

R = [\sqrt{λ_1^{(N)}} v_1, ..., \sqrt{λ_m^{(N)}} v_m] .

To compute R, we require O(m³) and O(N m²) time complexities for obtaining the eigenvalues and eigenvectors of K_m and for computing v_1, ..., v_m ∈ R^N, respectively. Thus, the computation completes in O(N m²) time in total.

Example 70 We compared the results of kernel ridge regression with N = 300, m = 10, 20, and λ = 10^{-5}, 10^{-3} (Fig. 4.8). For these data, when λ ≥ 1, the graphs obtained with and without approximation were consistent. For m = 10, 20, the curves were almost identical. We observed that the approximation error was smaller when λ was small for RFF, while the error was smaller when λ was large for the Nyström approximation.

sigma2 = 1

def k(x, y):
    return np.exp(-(x - y)**2 / (2 * sigma2))

n = 300
x = randn(n) / 2
y = 3 - 2 * x**2 + 3 * x**3 + 2 * randn(n)
lam = 10**(-5)
m = 10
K = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        K[i, j] = k(x[i], x[j])

# Low-Rank Approximated Function


def alpha_m(K, x, y, m):
    n = len(x)
    U, D, V = np.linalg.svd(K[:m, :m])
    u = np.zeros((n, m))
    for i in range(m):
        for j in range(n):
            u[j, i] = np.sqrt(m / n) * np.sum(K[j, :m] * U[:m, i] / D[i])
    mu = D * n / m
    R = np.zeros((n, m))
    for i in range(m):
        R[:, i] = np.sqrt(mu[i]) * u[:, i]
    Z = np.linalg.inv(np.dot(R.T, R) + lam * np.eye(m))
    alpha = np.dot((np.eye(n) - np.dot(np.dot(R, Z), R.T)), y) / lam
    return alpha

# Usual Function
def alpha(K, x, y):
    alpha = np.dot(np.linalg.inv(K + lam * np.eye(n)), y)
    return alpha

# Numerical Comparison
alpha_1 = alpha(K, x, y)
alpha_2 = alpha_m(K, x, y, m)
r = np.sort(x)
w = np.zeros(n)
v = np.zeros(n)
for j in range(n):
    S_1 = 0
    S_2 = 0
    for i in range(n):
        S_1 = S_1 + alpha_1[i] * k(x[i], r[j])
        S_2 = S_2 + alpha_2[i] * k(x[i], r[j])
    w[j] = S_1
    v[j] = S_2
plt.scatter(x, y, facecolors='none', edgecolors="k", marker="o")
plt.plot(r, w, c="r", label="w/o Approx")
plt.plot(r, v, c="b", label="with Approx")
plt.xlim(-1.5, 2)
plt.ylim(-2, 8)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Kernel Regression")
plt.legend(loc="upper left", frameon=True, prop={'size': 14})
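To see how well the rank-m factor approximates the Gram matrix, R can be rebuilt outside alpha_m and compared with K; a minimal sketch assuming K, n, and m from the listing above:

U, D, V = np.linalg.svd(K[:m, :m])
u = np.zeros((n, m))
for i in range(m):
    for j in range(n):
        u[j, i] = np.sqrt(m / n) * np.sum(K[j, :m] * U[:m, i] / D[i])
R = u * np.sqrt(D * n / m)                                     # columns scaled by sqrt(lambda_i^(N))
print(np.linalg.norm(K - np.dot(R, R.T)) / np.linalg.norm(K))  # relative approximation error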

4.7 Incomplete Cholesky Decomposition

In general, we can decompose a positive definite matrix A ∈ R^{N×N} into A = R R^T by using a lower triangular matrix R with nonnegative diagonal components. Such a decomposition is called the Cholesky decomposition of A.

Proposition 47 For a matrix A ∈ R^{n×n}, a Cholesky decomposition A = R R^T exists and is unique if and only if A is positive definite.

Many books cover this material; for a proof, see, for example, [9]. The following is the Cholesky decomposition procedure. We construct the process so that we can stop at any time to obtain an approximation R R^T of rank r ≤ N.

1. In the initial stage, B = A, and R is a zero matrix.



Fig. 4.8 We approximated data with N = 300 and ranks m = 10, 20. The upper and lower subfigures display the results obtained when running λ = 10^{-5} and λ = 10^{-3}, respectively. The red and blue lines are the results obtained without approximation and with approximation, respectively. The accuracy is almost the same as that in the case without approximation when m = 20. The larger the value of λ is, the smaller the approximation error becomes

2. For each i = 1, ..., r, the first i columns of R are set so that B_{j,i} = \sum_{h=1}^N R_{j,h} R_{i,h} for j = 1, ..., N. In other words, the setup is complete through the i-th column of B:

R = \begin{pmatrix}
R_{1,1} & 0 & \cdots & \cdots & \cdots & 0 \\
\vdots & \ddots & \ddots & & & 0 \\
R_{i,1} & \cdots & R_{i,i} & 0 & \cdots & 0 \\
R_{i+1,1} & \cdots & R_{i+1,i} & 0 & \cdots & 0 \\
\vdots & & \vdots & \vdots & \ddots & \vdots \\
R_{N,1} & \cdots & R_{N,i} & 0 & \cdots & 0
\end{pmatrix} .

In this case, we swap two subscripts of B by multiplying a matrix Q from the front and rear of B.

3. The final result is that R R^T = B = P^T A P with P = Q_1 \cdots Q_N. Therefore, A = P R R^T P^T, and we have that P R (P R)^T is the Cholesky decomposition.
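Before turning to the pivoting details, the complete decomposition itself can be checked with NumPy; this is a minimal sketch (note that np.linalg.cholesky returns the lower-triangular factor):

import numpy as np

B = np.random.randn(5, 5)
A = B @ B.T + 0.1 * np.eye(5)       # a positive definite matrix
R = np.linalg.cholesky(A)           # lower triangular, A = R R^T
print(np.allclose(A, R @ R.T))      # True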


Here, to swap the i-th and j-th rows and the i-th and j-th columns of the symmetric matrix B, let Q be the matrix obtained by replacing the (i,j), (j,i) components and the (i,i), (j,j) components of the unit matrix with 1 and 0, respectively, and multiply B by the symmetric matrix Q from the front and rear. For example,

Q B Q = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} b_{11} & b_{12} & b_{13} \\ b_{21} & b_{22} & b_{23} \\ b_{31} & b_{32} & b_{33} \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
= \begin{pmatrix} b_{11} & b_{13} & b_{12} \\ b_{31} & b_{33} & b_{32} \\ b_{21} & b_{23} & b_{22} \end{pmatrix} .

Specifically, for i = 1, 2, ..., r, we perform the following steps. Assume that ε > 0.

1. Let k be the j (i ≤ j ≤ N) that maximizes R_{j,j}^2 = B_{j,j} - \sum_{h=1}^{i-1} R_{j,h}^2.
   (a) Swap the i-th and k-th rows and the i-th and k-th columns of B.
   (b) Let Q_{i,k} := 1, Q_{k,i} := 1, Q_{i,i} := 0, Q_{k,k} := 0.
   (c) Swap R_{i,1}, ..., R_{i,i-1} and R_{k,1}, ..., R_{k,i-1}.
   (d) R_{i,i} = \sqrt{B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2}.
2. End if R_{i,i} < ε.
3. R_{j,i} = \frac{1}{R_{i,i}} \Big(B_{j,i} - \sum_{h=1}^{i-1} R_{j,h} R_{i,h}\Big) for each j = i+1, ..., N.

Once the i-th column is completed, B_{j,i} = \sum_{h=1}^N R_{j,h} R_{i,h} follows, and R_{j,i} remains the same after that for each j = 1, ..., N. Then, B = R R^T follows if the procedure completes up to r = N.

At the beginning of each i = 1, 2, ..., r, we select the j that maximizes R_{j,j}^2 = B_{j,j} - \sum_{h=1}^{i-1} R_{j,h}^2 ≥ 0. In step 3, the components of the j-th (j = i+1, ..., N) rows of the i-th column are filled in, and we divide them by R_{i,i}. Compared to the case where another value is selected as R_{i,i} in step 1, the absolute value of R_{j,i} after dividing by R_{i,i} becomes smaller, and B_{j,j} - \sum_{h=1}^{i} R_{j,h}^2 in the next step becomes larger for each j. If R_{r,r}^2 took a negative value, then, regardless of the selection order, there would be no solution to the Cholesky decomposition, contradicting Proposition 47 (the uniqueness of the solution is also guaranteed). In the case of an incomplete Cholesky decomposition, we run the same procedure and simply stop after the first r columns. We show the code for executing the incomplete Cholesky decomposition below:

def im_ch(A, m):
    n = A.shape[1]
    R = np.zeros((n, n))
    P = np.eye(n)
    for i in range(m):
        max_R = -np.inf
        for j in range(i, n):
            RR = A[j, j]
            for h in range(i):
                RR = RR - R[j, h]**2
            if RR > max_R:
                k = j

                max_R = RR
        R[i, i] = np.sqrt(max_R)
        if k != i:
            for j in range(i):
                w = R[i, j]; R[i, j] = R[k, j]; R[k, j] = w
            for j in range(n):
                w = A[j, k]; A[j, k] = A[j, i]; A[j, i] = w
            for j in range(n):
                w = A[k, j]; A[k, j] = A[i, j]; A[i, j] = w
            Q = np.eye(n); Q[i, i] = 0; Q[k, k] = 0; Q[i, k] = 1; Q[k, i] = 1
            P = np.dot(P, Q)
        # step 3 of the procedure (filling in R[j, i] for j = i+1, ..., N) follows
        # here in the full listing

Appendix

Proof of Lemma 7: Since e^x is convex w.r.t. x, if we take the expectation on both sides of

e^{\varepsilon X} \leq \frac{b - X}{b - a} e^{\varepsilon a} + \frac{X - a}{b - a} e^{\varepsilon b}

for b > a, then

E[e^{\varepsilon X}] \leq \frac{b}{b-a} e^{\varepsilon a} + \frac{-a}{b-a} e^{\varepsilon b} = \theta e^{(1-\theta)\varepsilon(b-a)} + (1 - \theta) e^{-\theta\varepsilon(b-a)} = \exp\{-\theta s + \log(1 - \theta + \theta e^s)\}

for s = \varepsilon(b-a) and \theta = \frac{-a}{b-a}. Therefore, it is sufficient for the exponent f(s) := -\theta s + \log(1 - \theta + \theta e^s) to be at most s^2/8. Since

f'(s) = -\theta + \frac{\theta e^s}{1 - \theta + \theta e^s}

and f(0) = f'(0) = 0, we have

f''(s) = \frac{(1 - \theta)\,\theta e^s}{(1 - \theta + \theta e^s)^2} = \phi(1 - \phi) \leq \frac{1}{4}

for \phi = \frac{\theta e^s}{1 - \theta + \theta e^s}. Hence, a \mu \in R exists such that


f(s) = f(0) + f'(0)(s - 0) + \frac{1}{2} f''(\mu)(s - 0)^2 \leq \frac{s^2}{8} ,

which implies (4.32).



Exercises 46∼64

46. Let k be a kernel and (x_1, y_1), ..., (x_N, y_N) be samples, and let f(·) := \sum_{i=1}^N α_i k(x_i, ·). If we minimize \sum_{i=1}^N \{y_i - f(x_i)\}^2 + λ‖f‖_H^2, λ > 0 (kernel ridge regression), why does this mean that we have minimized over f ∈ H? In addition, express the optimal value of α = [α_1, ..., α_N]^T using the Gram matrix K ∈ R^{N×N} and y = [y_1, ..., y_N]^T.
47. In kernel PCA, let k be a kernel and x_1, ..., x_N be samples, and let f(·) := \sum_{i=1}^N α_i k(x_i, ·). If we maximize (4.8), why does this mean that we have maximized it over f ∈ H? Additionally, express the eigenequations obtained when β = K^{1/2} α by using the Gram matrix K ∈ R^{N×N}.
48. In kernel PCA, we wish to find α for a centered Gram matrix, as in (4.9). Complete the function kernel_pca_train by filling in the space below.

def kernel_pca_train(x, k):
    n = x.shape[0]
    K = np.zeros((n, n))
    S = [0] * n; T = [0] * n
    for i in range(n):
        for j in range(n):
            K[i, j] = k(x[i, :], x[j, :])
    for i in range(n):
        S[i] = np.sum(K[i, :])
    for j in range(n):
        T[j] = np.sum(K[:, j])
    U = np.sum(K)
    for i in range(n):
        for j in range(n):
            K[i, j] = K[i, j] - S[i] / n - T[j] / n + U / n**2
    val, vec = np.linalg.eig(K)
    idx = val.argsort()[::-1]   # decreasing order, as in R
    val = val[idx]
    vec = vec[:, idx]
    alpha = np.zeros((n, n))
    for i in range(n):
        alpha[:, i] = vec[:, i] / val[i]**0.5
    return alpha

Based on the α obtained from the data X, the kernel k, and the function kernel_pca_train, we wish to calculate the score of z ∈ R^{N×p} (any of x_1, ..., x_N) (up to 1 ≤ m ≤ p dimensions). Complete the function below:

def kernel_pca_test(x, k, alpha, m, z):
    n = x.shape[0]
    pca = np.zeros(m)
    for i in range(n):
        pca = pca + alpha[i, 0:m] * k(x[i, :], z)
    return pca

Check whether the constructed function works with the following program:

sigma2 = 0.01
def k(x, y):
    return np.exp(-np.linalg.norm(x - y)**2 / 2 / sigma2)

X = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/USArrests.csv')
x = X.values[:, :-1]
n = x.shape[0]; p = x.shape[1]
alpha = kernel_pca_train(x, k)
z = np.zeros((n, 2))
for i in range(n):
    z[i, :] = kernel_pca_test(x, k, alpha, 2, x[i, :])
min1 = np.min(z[:, 0]); min2 = np.min(z[:, 1])
max1 = np.max(z[:, 0]); max2 = np.max(z[:, 1])
plt.xlim(min1, max1)
plt.ylim(min2, max2)
plt.xlabel("First")
plt.ylabel("Second")
plt.title("Kernel PCA (Gauss 0.01)")
for i in range(n):
    if i != 4:
        plt.text(x=z[i, 0], y=z[i, 1], s=i)
plt.text(z[4, 0], z[4, 1], 5, c="r")

49. Show that the ordinary PCA and kernel PCA with a linear kernel output the same score.
50. Derive the KKT conditions for the kernel SVM (4.12).
51. In Example 66, instead of linear and polynomial kernels, use a Gaussian kernel with different values of σ² (three different types), and draw the boundary curves in the same graph.
52. From (4.21) and (4.22), derive \sum_{j=1}^J β_{j+4} = 0 and \sum_{j=1}^J β_{j+4} ξ_j = 0.
53. Prove Proposition 44 according to the following steps:
   (a) Show that \int_0^1 r^{(q)}(x) s^{(q)}(x) dx = 0.
   (b) Show that \int_0^1 \{g^{(q)}(x)\}^2 dx ≥ \int_0^1 \{r^{(q)}(x)\}^2 dx.
   (c) When the equality in (b) holds, show that s(x) = \sum_{i=0}^{q-1} \frac{s^{(i)}(0)}{i!} x^i.
   (d) Show that the function s is zero when the equality in (b) holds and N exceeds the degree q − 1 of the polynomial.
54. In RFF, instead of finding the kernel k(x, y), we find its unbiased estimator k̂(x, y). Show that the average of k̂(x, y) is k(x, y). Moreover, construct a function that outputs k̂(x, y) from (x, y) ∈ E for m = 100 by using the constants and functions in the program below. Furthermore, compare the result with the value output by the Gaussian kernel and confirm that it is correct.


sigma = 10
sigma2 = sigma**2

def z(x):
    return np.sqrt(2 / m) * np.cos(w * x + b)

def zz(x, y):
    return np.sum(z(x) * z(y))

55. Derive the Chernoff bound.
56. Show that Proposition 46 implies (4.29).
57. The RFF are based on Bochner's theorem (Proposition 5). What relationship exists between them?
58. In RFF, after randomly generating (ω_1, b_1), ..., (ω_m, b_m), we obtain Z = (z_j(x_i)) ∈ R^{N×m} for i = 1, ..., N and j = 1, ..., m. If we use K̂ = Z Z^T rather than K = (k(x_i, x_j)) ∈ R^{N×N}, show that f̂(x) = \sum_{i=1}^N α̂_i k̂(x, x_i) (x ∈ E) can be expressed as f̂(x) = z(x)β̂ using the β̂ in (4.33). Moreover, prove Woodbury's formula: U (I_s + V U)^{-1} = (I_r + U V)^{-1} U for U ∈ R^{r×s}, V ∈ R^{s×r}, r, s ≥ 1.
59. Evaluate the number of computations required to obtain (4.33) for the RFF. In addition, evaluate the computational complexity of finding f̂(x) for a new x ∈ E.
60. To find the coefficient estimates (K + λI)^{-1} y in kernel ridge regression, we wish to use a low-rank decomposition K = R R^T with R ∈ R^{N×m}. If we can decompose K = R R^T, evaluate the computations on the left- and right-hand sides of (4.34), where we assume that finding the inverse of a matrix A ∈ R^{n×n} takes O(n³).
61. We wish to find the coefficient α̂ of the kernel ridge regression by using the Nyström approximation. If we use the left-hand side of (4.34) instead of the right-hand side, what changes would be necessary in the following code?

def alpha_m(K, x, y, m):
    n = len(x)
    U, D, V = np.linalg.svd(K[:m, :m])
    u = np.zeros((n, m))
    for i in range(m):
        for j in range(n):
            u[j, i] = np.sqrt(m / n) * np.sum(K[j, :m] * U[:m, i] / D[i])
    mu = D * n / m
    R = np.zeros((n, m))
    for i in range(m):
        R[:, i] = np.sqrt(mu[i]) * u[:, i]
    Z = np.linalg.inv(np.dot(R.T, R) + lam * np.eye(m))
    alpha = np.dot((np.eye(n) - np.dot(np.dot(R, Z), R.T)), y) / lam
    return alpha


sigma = 10; sigma2 = sigma^2
z = function(x) sqrt(2/m) * cos(w*x + b)
zz = function(x, y) sum(z(x) * z(y))

alpha.m = function(k, x, y, m) {
  n = length(x); K = matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) K[i, j] = k(x[i], x[j])
  A = svd(K[1:m, 1:m])
  u = array(dim = c(n, m))
  for (i in 1:m) for (j in 1:n)
    u[j, i] = sqrt(m/n) * sum(K[j, 1:m] * A$u[1:m, i]) / A$d[i]
  mu = A$d * n / m
  R = sqrt(mu[1]) * u[, 1]
  for (i in 2:m) R = cbind(R, sqrt(mu[i]) * u[, i])
  alpha = (diag(n) - R %*% solve(t(R) %*% R + lambda * diag(m)) %*% t(R)) %*% y / lambda
  return(as.vector(alpha))
}

62. In Step 1 of the procedure for the incomplete Cholesky decomposition, each time we choose as k the j (i ≤ j ≤ N) that maximizes R_{j,j}^2 = B_{j,j} - \sum_{h=1}^{i-1} R_{j,h}^2. Show that B_{k,k} - \sum_{h=1}^{i-1} R_{k,h}^2 in Step 1(d) is nonnegative.
63. Show that when the incomplete Cholesky decomposition process is completed up to the r-th column, we have

B_{j,i} = \sum_{h=1}^{i} R_{j,h} R_{i,h}

for each i = 1, ..., r and j = i+1, ..., N.
64. Generate a nonnegative definite matrix of size 5 × 5 and run im_ch to perform the incomplete Cholesky decomposition of rank three.

Chapter 5

The MMD and HSIC

In this chapter, we introduce the concept of random variables X : E → R in an RKHS and discuss testing problems in RKHSs. In particular, we define a statistic and its null hypothesis for the two-sample problem and the corresponding independence test. We do not know the distribution under the null hypothesis for a finite sample in either case. Therefore, we introduce a permutation test and a U-statistic with which we construct the procedure and run the program. Then, we study the notions of characteristic and universal kernels to learn what kernels are valid for such tests. Finally, we learn about empirical processes, which are often used in the mathematical analyses of machine learning and deep learning methods.

5.1 Random Variables in RKHSs

In Chap. 1, we proved that a function X : E → R is measurable if {ω ∈ E | X(ω) ∈ B} is an event (element) of F for any Borel set B, and we call such an X a random variable. In the following, we say that a kernel k is measurable if the set of (x, y) such that k(x, y) ∈ B is an event in E × E, and we assume that every kernel k is measurable. Moreover, in this chapter, we assume that the expectation E[k(X, X)] of k(x, x) ∈ R, x ∈ E, is bounded, which means that E[\sqrt{k(X, X)}] ≤ \sqrt{E[k(X, X)]} is bounded as well.

Proposition 48 Let k : E × E → R be measurable. Then, the map E ∋ x ↦ k(x, ·) ∈ H is measurable. Thus, k(X, ·) is a random variable in H for any random variable X that takes values in E.

Proof: See the appendix at the end of this chapter.

Let X : E → R be a random variable. The linear functional T : H → R with T(f) := E[f(X)]

satisfies |T(f)| = |E[⟨f(·), k(X, ·)⟩_H]| ≤ E[‖f‖_H \sqrt{k(X, X)}] = ‖f‖_H E[\sqrt{k(X, X)}] < ∞, i.e., T is bounded. From Proposition 22, there exists an m_X ∈ H such that

E[f(X)] = ⟨f(·), m_X(·)⟩_H

for any f ∈ H. We call such an m_X the expectation of k(X, ·), and we write m_X(·) = E[k(X, ·)]. Then, we have E[⟨f(·), k(X, ·)⟩_H] = ⟨f(·), E[k(X, ·)]⟩_H, which means that we can change the order of the inner product and expectation operations.

Let E_X, E_Y be sets. We define the tensor product H_0 of the RKHSs H_X and H_Y, with kernels k_X : E_X × E_X → R and k_Y : E_Y × E_Y → R, respectively, as the set of functions E_X × E_Y → R of the form f(x, y) = \sum_{i=1}^m f_{X,i}(x) f_{Y,i}(y), f_{X,i} ∈ H_X, f_{Y,i} ∈ H_Y, for (x, y) ∈ E_X × E_Y, and we define the inner product and norm by

⟨f, g⟩_{H_0} = \sum_{i=1}^m \sum_{j=1}^n ⟨f_{X,i}, g_{X,j}⟩_{H_X} ⟨f_{Y,i}, g_{Y,j}⟩_{H_Y}

and ‖f‖_{H_0}^2 = ⟨f, f⟩_{H_0}, respectively, for f = \sum_{i=1}^m f_{X,i} f_{Y,i}, f_{X,i} ∈ H_X, f_{Y,i} ∈ H_Y, and g = \sum_{j=1}^n g_{X,j} g_{Y,j}, g_{X,j} ∈ H_X, g_{Y,j} ∈ H_Y. In fact, we have

⟨f, g⟩_{H_0} = \sum_{i=1}^m \sum_{j=1}^n \Big(\sum_r \sum_t α_{i,r} γ_{j,t} k_X(x_r, x_t)\Big) \Big(\sum_s \sum_u β_{i,s} δ_{j,u} k_Y(y_s, y_u)\Big)
 = \sum_{i=1}^m \sum_r \sum_s α_{i,r} β_{i,s} g(x_r, y_s) = \sum_{j=1}^n \sum_t \sum_u γ_{j,t} δ_{j,u} f(x_t, y_u)

for f_{X,i}(·) = \sum_r α_{i,r} k_X(x_r, ·), f_{Y,i}(·) = \sum_s β_{i,s} k_Y(y_s, ·), g_{X,j}(·) = \sum_t γ_{j,t} k_X(x_t, ·), and g_{Y,j}(·) = \sum_u δ_{j,u} k_Y(y_u, ·), which means that the value of ⟨f, g⟩_{H_0} does not depend on the particular expressions of f, g.

If we complete H_0, we can construct a linear space H consisting of the functions f = \sum_{i=1}^∞ \sum_{j=1}^∞ a_{i,j} e_{X,i} e_{Y,j} such that ‖f‖^2 := \sum_{i=1}^∞ \sum_{j=1}^∞ a_{i,j}^2 < ∞, where the inner product is ⟨f, g⟩_H = \sum_{i=1}^∞ \sum_{j=1}^∞ a_{i,j} b_{i,j} for g = \sum_{i=1}^∞ \sum_{j=1}^∞ b_{i,j} e_{X,i} e_{Y,j} (\sum_{i=1}^∞ \sum_{j=1}^∞ b_{i,j}^2 < ∞), and {e_{X,i}}, {e_{Y,j}} are orthonormal bases of H_X, H_Y, respectively. Then, H_0 is a dense subspace of H, and H is a Hilbert space. We say that this H is the direct product (tensor product) of H_X, H_Y and write H_X ⊗ H_Y. H is the set of functions f such that f(x) := \lim_{n→∞} f_n(x) for any Cauchy sequence {f_n} in H_0 and x ∈ E. The claim follows from a discussion similar to that in Steps 1-5 of Proposition 34.

Proposition 49 (Neveu [22]) The direct product H_X ⊗ H_Y of RKHSs H_X, H_Y with reproducing kernels k_X, k_Y is an RKHS with reproducing kernel k_X k_Y.

Proof: The derivation utilizes the following steps [1].

f

:= a < ∞, and tions f = i=1 j j=1 i, j X.i Y, i=1 j=1 ∞i, j ∞ ∞ the inner product is  f, g H = i=1 j=1 ai, j bi, j , where g = j=1 bi, j e X,i eY, j ∞ ∞ 2 ( i=1 j=1 bi, j < ∞) and {e X,i }, {eY, j } are orthonormal bases of H X , HY , respectively. Then, H0 is a dense subspace in H , and H is a Hilbert space. We say that H0 is the direct product of H X , HY and write H X ⊗ HY . H is the set of functions f such that f (x) := limn→∞ f n (x) for any Cauchy sequence { f n } in H0 and x ∈ E. The claim follows from a similar discussion as that in Steps 1-5 of Proposition 34. Proposition 49 (Neveu [22]) The direct product H X ⊗ HY of RKHSs H X , HY with reproducing kernels k X , kY is an RKHS with a reproducing kernel k X kY . Proof: The derivation utilizes the following steps [1].

5.1 Random Variables in RKHSs

131

√ √ 1. Show that |g(x, y)| ≤ k X (x, x) kY (y, y) g for g ∈ H X ⊗ HY and x ∈ E X , y ∈ E Y , which means that H is an RKHS due to Proposition 33. 2. Show that k(x, ·, y, ) := k X (x, ·)kY (y, ) ∈ H when we fix x ∈ E X , y ∈ E Y . 3. Show that g(x, y) =  f (·, ), k(x, ·, y, ) H . For details, consult the proof at the end of this chapter.  Then, we introduce the notion of expectation w.r.t. the variables X, Y . If we assume that E[k X (X, X )] and E[kY (Y, Y )] are finite, then E X Y [k X (X, ·)kY (Y, ·)] is obtained by taking the expectation of k X (x, ·)kY (y, ·) ∈ H X ⊗ HY w.r.t. X Y : E X Y [ k X (X, ·)kY (Y, ·) HX ⊗HY ] = E X Y [ k X (X, ·) HX kY (Y, ·) HY ]   = E X Y [ k X (X, X )kY (Y, Y )] ≤ E X [k X (X, X )]EY [kY (Y, Y )] . Thus, the left-hand side takes a finite value, and we have E X Y [ f (X, Y )] = E X Y [ f, k X (X, ·)kY (Y, ·) ] ≤ f H X ⊗HY E X Y [ k X (X, ·)kY (Y, ·) H X ⊗HY ]

for f ∈ H X ⊗ HY . From Proposition 22 (Riesz’s representation theorem), there exists an m X Y ∈ H X ⊗ HY such that E X Y [ f (X, Y )] =  f, m X Y , and we write m X Y := E X Y [k X (X, ·)kY (Y, ·)] , which means that we can change the order of the inner product and expectation operations: E X Y [ f, k X (X, ·)kY (Y, ·) ] =  f, E X Y [k X (X, ·)kY (Y, ·)] . Moreover, for the m X , m Y of X, Y , the expectation m X m Y belongs to H X ⊗ HY , and we have  f g, m X m Y HX ⊗HY =  f, m X HX g, m Y HY = E X [ f (X )]EY [g(Y )] f ∈ H X , g ∈ HY , which means that we multiply the expectations of X, Y even if they are not independent. Thus, we call m XY − m X mY the covariate of (X, Y ) in H X ⊗ HY , which belongs to H X ⊗ HY . Proposition 50 For each f ∈ H X , g ∈ HY , there exist X Y ∈ B(HY , H X ) and Y X ∈ B(H X , HY ) such that

132

5 The MMD and HSIC

 f g, m X Y − m X m Y HX ⊗HY =  Y X f, g HY =  f, X Y g HX .

(5.1)

Proof: The operators Y X , X Y are conjugates of each other, and from Proposition 22, if one exists, so does the other. We prove the existence of X Y . The linear functional Tg : H X  f →  f g, m X Y − m X m Y HX ⊗HY ∈ R for an arbitrary g ∈ HY is bounded from  f g, m X Y − m X m Y HX ⊗HY ≤ f HX g HY m X Y − m X m Y HX ⊗HY , and there exists an h g ∈ H X such that Tg f =  f, h g HX from Proposition 22. Thus, there exists X Y : HY  g → h g ∈ H X such that  f g, m X Y − m X m Y HX ⊗HY =  f, X Y g HX . The boundness of X Y is due to

X Y g HX = h g HX = Tg ≤ g HY m X Y − m X m Y HX ⊗HY .  We call X Y , Y X the mutual covariance operators. Let H and k be an RKHS and its reproducing kernel respectively, and let P be the set of distributions that X follows. Then, we can define the map  P  μ →

k(x, ·)dμ(x) ∈ H ,

which we call the embedding of probabilities in the RKHS. Suppose that the map is  injective, i.e., if the expectations k(x, ·)dμ1 (x) and k(x, ·)dμ2 (x) have the same value, then the probabilities μ1 , μ2 coincide. We call such a reproducing kernel k of an RKHS H characteristic. We learn some applications by using characteristic kernels, such as two-sample problems and independence tests, and we consider the associated theory in later sections of this chapter.

5.2 The MMD and Two-Sample Problem Gretton et al. (2008), [11] proposed a statistical testing approach for testing whether two distributions share given independent sequences x1 , . . . , xm ∈ R and y1 , . . . , yn ∈ R. We write the two distributions as P, Q and regard P = Q as the null hypothesis. Let H and k be an RKHS and its reproducing kernel respectively; we define m P :=  E P [k(X, ·)] = E k(x, ·)d P(x), m Q := E Q [k(X, ·)] = E k(x, ·)d Q(x) ∈ H . We

5.2 The MMD and Two-Sample Problem

133

note that the random variable X : E → R is measurable, and either P or Q is the probability distribution that X follows. Let F be a set of functions that satisfies a condition. In general, the quantity defined by sup {E P [ f (X )] − E Q [ f (X )]} f ∈F

is called the MMD (maximum mean discrepancy), and we assume that F := { f ∈ H | f H ≤ 1} , which means that we regard the MMD as MMD2 = sup {E P [ f (X )] − E Q [ f (X )]}2 = sup {m P , f − m Q , f }2 f ∈F

f ∈F

= sup {m P − m Q , f } = m P − m Q 2H . 2

f ∈F

If the kernel k is characteristic, then we have MMD = 0 ⇐⇒ m P = m Q ⇐⇒ P = Q

(5.2)

and MMD2 = m P , m P + m Q , m Q − 2m P , m Q = E X [k(X, ·)], E X  [k(X  , ·)] + EY [k(Y, ·)], EY  [k(Y  , ·)] − 2E X [k(X, ·)], EY [k(Y, ·)] = E X X  [k(X, X  )] + EY Y  [k(Y, Y  )] − 2E X Y [k(X, Y )] ,

where X  and X (Y  and Y ) are independent and follow the same distribution. However, we do not know m X , m Y from the two-sample data. Thus, we execute the test using their estimates: 2

 M MD B :=

m n m m n n 1  1  2  k(x , x ) + k(y , y ) − k(xi , y j ) i j i j m 2 i=1 j=1 n 2 i=1 j=1 mn i=1 j=1

(5.3) m  n  m n   1 1 2  k(xi , x j ) + k(yi , y j ) − k(xi , y j ) . m(m − 1) i=1 j=i n(n − 1) i=1 j=i mn i=1 j=1 (5.4) Then, the estimate (5.4) is unbiased while (5.3) is biased: E[

m  m  1 1  1  k(X i , X j )] = EXi [ E X j [k(X i , X j )]] = E X X  [k(X, X  )]. m(m − 1) m m−1 i=1 j=i

i=1

j=i

134

5 The MMD and HSIC Different Dist. (Permutation)

30 0 10

Density

50

20 40 60 0

Density

The Same Dist. (Permutation)

-0.01

0.00

0.01

2

0.02

0.03

0.00 0.02 0.04 0.06 0.08

MMDU

2

MMDU

Fig. 5.1 Permutation test for the two-sample problem. The distributions of X, Y are the same (left) and different (right). The blue and red dotted lines show the statistics and the borders of the rejection region, respectively.

However, similar to the HSIC in the next section, we do not know the distribution of the MMD estimate under P = Q. We consider executing one of the following processes. 1. Construct a histogram of the MMD estimate values randomly by changing the values of x1 , . . . , xm and y1 , . . . , yn (permutation test). 2. Compute an asymptotic distribution from the distribution of U statistics. For the former, for example, we may construct the following procedure. Example 71 We perform a permutation test on two sets of 100 samples that follow the standard Gaussian distribution (Fig. 5.1 Left). For the unbiased estimator of 2  M M D 2 , we use M M DU in (5.6) instead of (5.4) for a later comparison. We also double the standard deviation of one set of samples and perform the permutation 2  test again (Fig. 5.1 Right). The reason why M MDU also takes negative values is that when the true value of the M M D is close to zero, the value can also be negative since it is an unbiased estimator. # I n t h i s c h a p t e r , we assume t h a t t h e f o l l o w i n g h a s b e e n e x e c u t e d . import numpy a s np from s c i p y . s t a t s import kde import i t e r t o o l s import math import m a t p l o t l i b . p y p l o t a s p l t from m a t p l o t l i b import s t y l e s t y l e . use ( " seaborn −t i c k s " )

sigma = 1 def k ( x , y ) : r e t u r n np . exp ( − ( x − y ) ∗∗2 / s i g m a ∗ ∗ 2 ) # Data G e n e r a t i o n n = 100 xx = np . random . r a n d n ( n ) yy = np . random . r a n d n ( n ) # The d i s t r i b u t i o n s a r e e q u a l

5.2 The MMD and Two-Sample Problem

135

# y y = 2 ∗ np . random . r a n d n ( n ) # The d i s t r i b u t i o n s a r e n o t e q u a l x = xx ; y = yy # Distribution of the null hypothesis T = [] f o r h i n range ( 1 0 0 ) : i n d e x 1 = np . random . c h o i c e ( n , s i z e = i n t ( n / 2 ) , r e p l a c e = F a l s e ) i n d e x 2 = [ x f o r x i n range ( n ) i f x n o t i n i n d e x 1 ] x = l i s t ( xx [ i n d e x 2 ] ) + l i s t ( yy [ i n d e x 1 ] ) y = l i s t ( xx [ i n d e x 1 ] ) + l i s t ( yy [ i n d e x 2 ] ) S = 0 f o r i i n range ( n ) : f o r j i n range ( n ) : i f i != j : S = S + k(x[ i ] , x[ j ]) + k(y[ i ] , y[ j ]) \ − k(x[ i ] , y[ j ]) − k(x[ j ] , y[ i ]) T . append ( S / n / ( n − 1) ) v = np . q u a n t i l e ( T , 0 . 9 5 ) # Statistics S = 0 f o r i i n range ( n ) : f o r j i n range ( n ) : i f i != j : S = S + k(x[ i ] , x[ j ]) + k(y[ i ] , y[ j ]) \ − k(x[ i ] , y[ j ]) − k(x[ j ] , y[ i ]) u = S / n / ( n − 1) # D i s p l a y o f t h e graph x = np . l i n s p a c e ( min ( min ( T ) , u , v ) , max ( max ( T ) , u , v ) , 2 0 0 ) d e n s i t y = kde . g a u s s i a n _ k d e ( T ) plt . plot (x , density (x) ) p l t . a x v l i n e ( x = u , c = " r " , l i n e s t y l e = "−− " ) plt . axvline (x = v , c = "b" )

For the latter approach, we construct the following quantities. For m ≥ 1 symmetric variables and h : E m → R, we call the quantity U N := 

 1  h(xi1 , . . . , xim ) N 1≤i ,...,i ≤N 1 m m

(5.5) 

 N the U-statistic w.r.t. h of order m, where i1 ,...,im ranges over (i 1 , . . . , i m ) ∈ m m {1, . . . , N } ’s. We use this quantity for estimating the expectation E[h(X 1 , . . . , X m )] given samples x1 , . . . , x N . Note that any U statistic is unbiased. In fact, we have  1  h(X i1 , . . . , X im )] N i 0 from the origin and E is a compact set that includes some points outside the support, then its kernel is not universal.

5.5 Introduction to Empirical Processes

153

5.5 Introduction to Empirical Processes In this section, we study a mathematical approach to machine learning called the empirical process. We analyze the accuracy of the MMD estimators by using the Rademacher complexity and concentration inequalities. Through this example, we learn the concept of empirical processes. The derivation performed in this section is based on Gretton et al.’s [11] proof of a proposition regarding the accuracy of the two-sample problem. In this section, we prove the following proposition. We define the MMD by sup E P [ f (X )] − E Q [ f (X )], f ∈F

where F is a class of functions. This chapter also deals with the case in which F := { f ∈ H | f H ≤ 1}. Proposition 58 Suppose that a kmax exists such that 0 ≤ k(x, y) ≤ kmax for each x, y ∈ E. Then, for any  > 0, we have 

2 4kmax  P |M M D B − M M D2| > + N



  2 N , ≤ 2 exp − 4kmax

2

 where the estimator M M D B of M M D 2 is given by (5.3), and we assume that the number of samples for x, y is equal to N and that P = Q. For the proof of Proposition 58, we use an inequality that slightly generalizes Proposition 46. Proposition 59 (McDiarmid) Let f : E m → R imply that a ci < ∞ (i = 1, · · · , m) exists satisfying sup

x,x1 ,...,xm

| f (x1 , . . . , xm ) − f (x1 , . . . .xi−1 , x, xi+1 , . . . , xm )| ≤ ci .

For any probability measure P,  > 0 and X 1 , . . . , X m , we have P ( f (x1 , . . . , xm ) − E X 1 ···X m

 2 2 f (X 1 , . . . , X m ) > ) < exp − m



2 i=1 ci

(5.12)

and  2 2 P (| f (x1 , . . . , xm ) − E X 1 ···X m f (X 1 , . . . , X m )| > ) < 2 exp − m

2 i=1 ci

 . (5.13)

Proof: Hereafter, we denote f (X 1 , · · · , X N ) and E[ f (X 1 , · · · , X N )] by f and E[ f ], respectively. If we define

154

5 The MMD and HSIC

V1 := E X 2 ···X N [ f |X 1 ] − E X 1 ···X N [ f ] .. . Vi := E X i+1 ···X N [ f |X 1 , · · · , X i ] − E X i ···X N [ f |X 1 , · · · , X i−1 ] .. . VN := f − E X N [ f |X 1 , · · · , X N −1 ] for i = 1, i = 2, · · · , N − 1, and i = N , then we have f − E X 1 ···X N [ f ] =

N 

Vi .

(5.14)

i=1

From E X i {E X i+1 ···X N [ f |X 1 , · · · , X i ]|X 1 , · · · , X i−1 } = E X i ··· ,X N [ f |X 1 , · · · , X i−1 ] , we have E X i [Vi |X 1 , · · · , X i−1 ] = 0 .

(5.15)

From (5.14), we have f − E[ f ] >  ⇐⇒ exp{t

N 

Vi } > et for arbitrary t > 0 .

i=1

If we apply Markov’s inequality (Lemma 6) to the latter equation, then we have P( f − E[ f ] ≥ ) ≤ inf e t>0

−t

E[exp{t

N 

Vi }] .

(5.16)

i=1

Moreover, from (5.15), we apply Lemma 7 to obtain E[exp{t

N 

Vi }] = E X 1 ···X N −1 [exp{t

i=1

N −1 

Vi }E X N [exp{t VN }|X 1 , · · · , X N −1 ]]

i=1

≤ E X 1 ···X N −1 [exp{t

N −1  i=1

= exp{

2

t 8

N  i=1

ci2 } .

Vi }] exp{t 2 c2N /8}

5.5 Introduction to Empirical Processes

155

Therefore, from (5.16), we have P( f − E[ f ] ≥ ) ≤ inf exp{−t + t>0

N t2  2 c }. 8 i=1 i

N 2 The right-hand side is minimized when t = 4/ i=1 ci , and we obtain (5.12). Replacing f with − f , we obtain the other inequality. From both inequalities, we have (5.13).  In the following, we denote by F := { f ∈ H | f H ≤ 1} the unit ball in the universal (see Sect. 5.4 for the definition of universality) RKHS H w.r.t. a compact E and assume that the kernel of H is less than or equal to kmax . Hereafter, let X 1 , . . . .X m be independent random variables that follow probability P, and let σ1 , . . . , σm be independent random variables, each of which takes a value of ±1 equiprobably. Then, we say that the quantity R N (F) := Eσ sup | f ∈F

m 1  σi f (xi )| m i=1

(5.17)

is an empirical Rademacher complexity, where Eσ is the operation that takes the expectation w.r.t. σ1 , . . . , σm . If we further take the expectation of (5.17) w.r.t. the probability P, then we call the obtained value R(F, P) the Rademacher complexity. Proposition 60 (Bartlett-Mendelson [4]) Let kmax := maxx,y∈E k(x, y). Then, we have the following inequality: R N (F) ≤

kmax . N

In particular, for an arbitrary probability P, we have R(F, P) ≤

kmax . N

Proof: From f H ≤ 1 and k(x, x) ≤ kmax , we have N N 1  1  R N (F) = Eσ [sup | σi f (xi )|] = Eσ [sup | σi k(xi , ·), f (·) H |] f ∈F N i=1 f ∈F N i=1

= Eσ [sup | f, f ∈F

N 1  σi k(xi , ·) H |] N i=1

156

5 The MMD and HSIC



N N

1  1  ≤ Eσ [sup f H  σi k(xi , ·), σi k(xi , ·) H ] N i=1 N i=1 f ∈F

N N N  N  

1 

 Eσ [ 1 ≤ Eσ [ σ σ k(x , x )] ≤ σi σ j k(xi , x j )] i j i j N 2 i=1 j=1 N 2 i=1 j=1

N  N

1  kmax  , δi, j k(xi , x j ) ≤ = 2 N i=1 j=1 N where we use E[σi σ j ] = σi2 δi, j = δi, j in the derivation. We obtain the other inequality by taking the expectation w.r.t. the probability P.  Propositions 59 and 60 are inequalities used for mathematical analysis in machine learning as well as for the proof of Proposition 58. Proof of Proposition 58: If we define f (x1 , . . . , x N , y1 , . . . , y N ) 1 1 1 1 := k(x1 , ·) + . . . + k(x N , ·) − k(y1 , ·) − . . . − k(y N , ·) , N N N N then from the triangular inequality, we obtain | f (x1 , . . . , x N , y1 , . . . , y N ) − f (x1 , . . . , x j−1 , x, x j+1 , . . . , x N , y1 , . . . , y N )| 2 1 kmax . (5.18) ≤ k(x j , ·) − k(x, ·) ≤ N N Next, we obtain the upper bound of the expectation of N N 2 1  1   |M M D 2 − M M D B | = | sup {E P ( f ) − E Q ( f )} − sup { f (xi ) − f (y j )}| N f ∈F f ∈F N i=1

≤ sup |E P ( f ) − E Q ( f ) − { f ∈F

1 N

N 

f (xi ) −

i=1

j=1

1 N

N 

f (y j )}| .

j=1

Then, we perform the following derivation: E X,Y sup |E P ( f ) − E Q ( f ) − { f ∈F

= E X,Y

N N 1  1  f (X i ) − f (Yi )}| N N i=1

i=1

N N N N 1  1  1  1  sup |E X  { f (X i ) − f (X i )} − EY  { f (Y j )) − f (Y j )}| N N N N f ∈F i=1

i=1

j=1

j=1

5.5 Introduction to Empirical Processes ≤ E X,Y,X  ,Y  sup | f ∈F

157

N N N N 1  1  1  1  f (X i ) − f (X i ) − f (Yi ) + f (Yi )| N N N N i=1

= E X,Y,X  ,Y  ,σ,σ  sup | f ∈F

1 N

i=1

N 

σi { f (X i ) − f (X i )} +

i=1

i=1

1 N

N 

i=1

σi { f (Yi ) − f (Yi )}|

i=1

N n 1  1   σi { f (X i ) − f (X i )}| + EY,Y  ,σ  sup | σ j { f (Y j ) − f (Y j )}| f ∈F N i=1 f ∈F N j=1 kmax ≤ 2[R(F , P) + R(F , Q)] ≤ 2[(kmax /N )1/2 + (kmax /N )1/2 ] = 4 (5.19) , N

≤ E X,X  ,σ sup |

where the first inequality is due to Jensen’s inequality, the second stems from the triangular inequality, the third is derived from the definition of Rademacher complexity, and the fourth due is obtained from the inequality of Rademacher complexity (Propo2 √  MD , sition 60). From (5.18) and (5.19), for ci = N2 kmax and f = M M D 2 − M we have kmax . E X 1 ...X N f ≤ 4 N Finally, we obtain Proposition 58 from Proposition 59. Hence, Proposition 60 follows from (5.18) and Proposition 59.



Appendix The essential part of the proof of Proposition 54 was given by Fukumizu [7] but has been rewritten as a concise derivation to make it easier for beginners to understand.

Proof of Proposition 48 The fact that E  x → k(x, ·) ∈ H is measurable means that E[k(X, ·)] can be treated as a random variable. However, the events in E × E are the direct products of the events generated by each E (the elements of F × F). Therefore, if the function E × E  (x, y) → k(x, y) ∈ R is measurable, then the function E  y → k(x, y) ∈ R is measurable for each x ∈ E (even if y ∈ E is fixed, (x, y) → k(x, y) is still measurable). In the following, we show that any function belonging to H is measurable. First, we note that H0 = span{k(x, ·)|x ∈ E} is dense in H . Additionally, we note that for the sequence { f n } in H0 , f − f n H → 0 (n → ∞) means that | f (x) − f n (x)| → 0 for each x ∈ E (Proposition 35). The following lemma implies that f is measurable. Lemma 8 If f n : E → R is measurable and f n (x) converges to f (x) for each x ∈ E, then f : E → R is also measurable. Proof: The proof follows after the proof of this proposition.

158

5 The MMD and HSIC

We assume that Lemma 8 is valid. We define the measurability of  : E  x → k(x, ·) ∈ H by {x ∈ E | f − k(x, ·) H < δ} ∈ F for any f ∈ H and δ > 0 (this is an extension to the case where H = R). Moreover, we have

f − k(x, ·) H < δ ⇐⇒ k(x, x) − 2 f (x) < δ 2 − f 2H . In addition, since k(·, ·) is measurable, E  x → k(x, x) ∈ R is also measurable. Moreover, since f (x) is measurable, so is k(x, x) − 2 f (x). Thus,  is measurable. 

Proof of Lemma 8 It is sufficient to show that f −1 (B) ∈ F for any open set B. We fix B ⊆ R arbitrarily and let Fm := {y ∈ B|U (y, 1/m) ⊆ B}, where U (y, r ) := {x ∈ R | d(x, y) < r }. From the definition, we have the following two equations. f (x) ∈ B ⇐⇒ for some m, f (x) ∈ Fm f (x) ∈ Fm ⇐⇒ for some k, f n (x) ∈ Fm , n ≥ k . In other words, we have f −1 (B) = ∪m f −1 (Fm ) = ∪m ∪k ∩n≥k f n−1 (Fm ) ∈ F . 

Proof of Proposition 49 The evaluation is finite for arbitrary g = (x, y) ∈ E. In fact, we have |g(x, y)| ≤

∞  ∞ 

|ai, j | · |e X,i (x)| · |eY, j (y)|≤

i=1 j=1

∞ 

∞ ∞ i=1

j=1

⎛ |e X,i (x)| · ⎝

i=1

∞  j=1

e X,i eY, j ∈ H X ⊗ HY and ⎞1/2 ⎛

2 (y)⎠ eY, j



∞ 

⎞1/2 2 (y)⎠ ai, j

,

j=1

(5.20) ∞ . If we set k (y, ·) = where we apply Cauchy-Schwarz’s inequality (2.5) to Y j=1  j (·), then from eY,i (·), k Y (y, ·) = eY,i (y), we have h i (y) = eY,i (y) and j h j (y)eY,  kY (y, ·) = ∞ j=1 eY, j (y)eY, j (·). Thus, we obtain ∞  j=1

2 eY, j (y) = k Y (y, y)

(5.21)

Appendix

159

and ∞ 

⎛ |e X,i (x)| · ⎝

i=1

∞ 

⎞1/2 ai,2 j ⎠

j=1

⎛ ⎞1/2 ⎛ ⎞1/2 ∞ ∞  ∞    2 2 ≤⎝ e X,i (x)⎠ ⎝ ai, j ⎠ = k X (x, x) g , i=1

i=1 j=1

(5.22) ∞ . Note that (5.20), (5.21), where we apply Cauchy-Schwarz’s√ inequality √ (2.5) to i=1 and (5.22) imply that |g(x, y)| ≤ k X (x, x) kY (y, y) g . Thus, H X ⊗ HY is an RKHS. From k X (x, ·) ∈ H X , kY (y, ·) ∈ HY , we have that k(x, ·, y, ) := k X (x, ·) kY (y, ·) ∈ H X ⊗ HY for k(x, x  , y, y  ) := k X (x, x  )kY (y, y  ). From g(x, y) =

∞ ∞  

ai, j e X,i (x)eY, j (y) =

i=1 j=1

=

∞  ∞ 

∞ ∞  

ai, j e X,i (·), k X (x, ·) H X eY,i (), kY (y, ) HY

i=1 j=1

ai, j e X,i (·)eY, j (), k(x, ·, y, ) H = 

i=1 j=1

∞  ∞ 

ai, j e X,i (·)eY, j (·), k(x, ·, y, ) H

i=1 j=1

= g(·, ), k(x, ·, y, ) ,

k is the reproducing kernel of H X ⊗ HY .



Proof of Proposition 54 (Necessity) Let W be an open set centered at the origin with a radii of  > 0 and w0 ∈ E. We assume that w0 + W has a measure of 0 and show that this contradicts another assumption, i.e., that k(x, y) = φ(x − y) is a characteristic kernel. In this case, η is an even function, and −w0 + W is also of measure 0 (±w0 + W ⊆ E\E(η)), (d+1)/2 is nonnegative definite when where we use the fact that g(w) := ( − w 2 )+ d 5 (Bochner’s Theorem), E = R (d ≥ 1) (see [8] for the proof). From Proposition   there exists a finite measure μ such that g(w) = E eiw x μ(x). Moreover, the closure of ±w0 + W is the support of 



h(w) = g(w − w0 ) + g(w + w0 ) = E

eiw x 2 cos(w0 x)dμ(x) .

/ W , we have h(0) = Since the support of h has no intersection with E(η) and ±w0 ∈ 0. Therefore, we obtain ν(E) = 0 for  2 cos(w0 x)dμ(x) , B ∈ F . ν(B) := B

Since g is not zero, ν is not the zero measure. Thus, using the total variation |ν|(B) := sup

n 

∪Bi =B i=1

|ν(Bi )| , B ∈ F ,

160

5 The MMD and HSIC 1 fn 0

F

f1

U

Fig. 5.5 Proof of Proposition 57.   As n grows, the slope of f n rapidly increases at the border of F, U . Therefore, if E f n d P = E f n d Q for all { f n }, we require P = Q (Dudley “Real Analysis and Probability” [6])

where sup is the supremum when dividing F into Bi ∈ F, we define the constant c := |ν|(E) and the finite measures μ1 := 1c |ν| and μ2 := 1c {|ν| − ν}. From ν(E) = 0, we observe that μ1 and μ2 are both probabilities and that μ1 = μ2 . Additionally, we have c(dμ1 − dμ2 ) = dν = 2 cos(w0 x)dμ . From Fubini’s theorem, we can write the difference between the expectations w.r.t. probabilities μ1 , μ2 as   1 φ(x − y)dμ1 (y) − φ(x − y)dμ2 (y) = φ(x − y)2 cos(w0 y)dμ(y) c E E E    1 1  i(x−y) w i xw = dηdμ(y) = e h(w)dη(w) . 2 cos(w0 y) e c c E 

However, since the supports of h and η do not intersect, the value is zero, which contradicts the assumption that φ(x − y) is a characteristic kernel. 

Proof of Proposition 57

  For any bounded continuous f , if E f d P = E f d Q holds, this implies that P = Q (Fig. 5.5). In fact, let U be an open subset of E, and let V be its complement. Furthermore, let d(x, V ) := inf y∈V d(x, y) and f n (x) := min(1, nd(x, V )). Then, f n is a bounded continuous function on E, and f n (x) ≤ I (x ∈ U ) and f n (x) → I (x ∈ U ) as n → ∞ foreach x ∈ R; Thus, by the monotonic convergence  theorem, f d P → P(U ) and f d Q → Q(U ) hold. By our assumption, E n E fn d P = E n 2 f d Q and P(U ) = Q(U ), i.e., P(V ) = Q(V ) holds In other words, every event E n is guaranteed to be a closure event. Let E be a compact set. For each element g ∈ H in the RKHS H of the universal kernel, the same argument follows  since supx∈E | f (x) − g(x)| can be arbitrarily small for any f ∈ C(E). That is, if gd P = gd Q holds for any g ∈ H , then P = Q, so the universal kernel is characteristic. 

If E is compact, then for any A ∈ F , P(A) = {P(V )|V is a closed set, V ⊆ A, V ∈ V } (Theorem 7.1.3, Dudley [6]).

2

Exercises 65∼83

161

Exercises 65∼83

65. Proposition 49 can be derived according to the following steps. Which part of the proof in the Appendix does each step correspond to?
(a) Show that $|g(x, y)| \leq \sqrt{k_X(x, x)}\sqrt{k_Y(y, y)}\,\|g\|$ for $g \in H_X \otimes H_Y$ and $x \in E_X$, $y \in E_Y$ (from Proposition 33, this implies that $H$ is some RKHS).
(b) Show that $k_X(x, \cdot)k_Y(y, \cdot) \in H$ when $x \in E_X$, $y \in E_Y$ are fixed.
(c) Show that $f(x, y) = \langle f(\cdot, \cdot), k_X(x, \cdot)k_Y(y, \cdot)\rangle_H$.
66. How can we define the average $m_{XY} = E_{XY}[k_X(X, \cdot)k_Y(Y, \cdot)]$ of the elements of $H_X \otimes H_Y$? Define the average in the same way that we defined $m_X$ using Riesz's lemma (Proposition 22).
67. Show that a $\Sigma_{YX} \in B(H_X, H_Y)$ exists such that $\langle fg, m_{XY} - m_X m_Y\rangle_{H_X \otimes H_Y} = \langle \Sigma_{YX} f, g\rangle_{H_Y}$ for each $f \in H_X$, $g \in H_Y$.
68. The MMD is generally defined as $\sup_{f \in \mathcal{F}}\{E_P[f(X)] - E_Q[f(X)]\}$ for some set $\mathcal{F}$ of functions. Assuming that $\mathcal{F} := \{f \in H \mid \|f\|_H \leq 1\}$, show that the MMD is $\|m_P - m_Q\|_H$. Furthermore, show that we can transform the MMD as follows:
$$\mathrm{MMD}^2 = E_{XX'}[k(X, X')] + E_{YY'}[k(Y, Y')] - 2E_{XY}[k(X, Y)]\,,$$

where $X'$ and $X$ ($Y'$ and $Y$) are independent random variables that follow the same distribution.
69. Show that the squared MMD estimator (5.4) is unbiased.
70. In the two-sample problem solved by a permutation test in Example 71, modify the entire program in Example 71 for the case when the sample sizes are $m, n$ (possibly different, both even) instead of the same $n$, and examine whether it works correctly ($m = n$ in Example 71).
71. For the function $h$ in (5.6), show that $h_1$ is a function that always takes the value zero and that $\tilde{h}_2$ and $h$ coincide as functions.
72. Show that two random variables $X, Y$ that jointly follow a Gaussian distribution are independent if and only if their correlation coefficient is zero. Additionally, give an example of two variables whose correlation coefficient is zero but that are not independent.
73. Prove the following equation:
$$\|m_{XY} - m_X m_Y\|^2 = E_{XX'YY'}[k_X(X, X')k_Y(Y, Y')] - 2E_{XY}\{E_{X'}[k_X(X, X')]E_{Y'}[k_Y(Y, Y')]\} + E_{XX'}[k_X(X, X')]E_{YY'}[k_Y(Y, Y')]\,.$$

74. Show that the HSIC estimator
$$\widehat{\mathrm{HSIC}} := \frac{1}{N^2}\sum_i\sum_j k_X(x_i, x_j)k_Y(y_i, y_j) - \frac{2}{N^3}\sum_i\sum_j k_X(x_i, x_j)\sum_h k_Y(y_i, y_h) + \frac{1}{N^4}\sum_i\sum_j k_X(x_i, x_j)\sum_h\sum_r k_Y(y_h, y_r)$$
can be written as $\widehat{\mathrm{HSIC}} = \frac{1}{N^2}\,\mathrm{trace}(K_X H K_Y H)$ using $K_X = (k_X(x_i, x_j))_{i,j}$, $K_Y = (k_Y(y_i, y_j))_{i,j}$, and $H = I - \frac{1}{N}E$, where $I \in \mathbb{R}^{N \times N}$ is the unit matrix and $E \in \mathbb{R}^{N \times N}$ is the matrix whose elements are all ones (a short numerical sketch of this identity is given after Exercise 83). Additionally, construct Python programs for each computation. Moreover, examine whether both output the same results for the Gaussian kernels k_x and k_y with $\sigma^2 = 1$; generate random numbers for the standard Gaussian variables $X$ and $Y$ whose correlations are $a = 0, 0.1, 0.2, 0.4, 0.6, 0.8$.
75. When we test the independence $X \perp\!\!\!\perp \{Y, Z\}$ of $X$ and $\{Y, Z\}$, the HSIC is extended as $\|m_{XYZ} - m_X m_{YZ}\|^2$. That is, we can transform $k_Y(y, \cdot)$ into $k_Y(y, \cdot)k_Z(z, \cdot)$. Construct the function HSIC_2 by adding arguments to the function HSIC_1; generate random numbers according to $X \perp\!\!\!\perp \{Y, Z\}$, and verify that the obtained value is sufficiently small.
76. Utilizing the class of LiNGAM and the functions

def cc(x, y):
    return np.sum(np.dot(x.T, y)) / len(x)

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v

we wish to estimate whether each variable X, Y, Z is upstream, midstream, or downstream. Fill in the blanks by generating random numbers X, Y, Z that do not follow the Gaussian distribution, and estimate which variables among X, Y, Z are upstream, midstream, and downstream from the random numbers alone.

# Data generation
n = 30
x = np.random.randn(n)**2 - np.random.randn(n)**2
y = 2 * x + np.random.randn(n)**2 - np.random.randn(n)**2
z = x + y + np.random.randn(n)**2 - np.random.randn(n)**2
x = x - np.mean(x)
y = y - np.mean(y)
z = z - np.mean(z)

## Estimate UpStream ##
def cc(x, y):
    return np.sum(np.dot(x.T, y) / len(x))

def f(u, v):
    return u - cc(u, v) / cc(v, v) * v

x_y = f(x, y); y_z = f(y, z); z_x = f(z, x)
x_z = f(x, z); z_y = f(z, y); y_x = f(y, x)
v1 = HSIC_2(x, y_x, z_x, k_x, k_y, k_z)
v2 = HSIC_2(y, z_y, x_y, k_y, k_z, k_x)
v3 = HSIC_2(z, x_z, y_z, k_z, k_x, k_y)
if v1 < v2:
    if v1 < v3:
        top = 1
    else:
        top = 3
else:
    if v2 < v3:
        top = 2
    else:
        top = 3

## Estimate MidStream ##
x_yz = f(x_y, z_y)
y_zx = f(y_z, x_z)
z_xy = f(z_x, y_x)
if top == 1:
    v1 = ## Blank (1) ##
    v2 = ## Blank (2) ##
    if v1 < v2:
        middle = 2
        bottom = 3
    else:
        middle = 3
        bottom = 2
if top == 2:
    v1 = ## Blank (3) ##
    v2 = ## Blank (4) ##
    if v1 < v2:
        middle = 3
        bottom = 1
    else:
        middle = 1
        bottom = 3
if top == 3:
    v1 = ## Blank (5) ##
    v2 = ## Blank (6) ##
    if v1 < v2:
        middle = 1
        bottom = 2
    else:
        middle = 2
        bottom = 1

## Output the Results ##
print("top = ", top)
print("middle = ", middle)
print("bottom = ", bottom)

77. We wish to make two sequences independent by shifting one of $x_1, \ldots, x_N$ or $y_1, \ldots, y_N$, and then we want to repeat the process of calculating $\widehat{\mathrm{HSIC}}$. We wish to create a histogram that expresses a distribution that follows the null hypothesis. For this purpose, we constructed the following program. Why can we obtain the null hypothesis ($X, Y$ are independent) by permutation? Where in the program do we obtain the HSIC statistic, and where do we obtain the multiple HSIC values that follow the null hypothesis?

# Data generation
x = np.random.randn(n)
y = np.random.randn(n)
u = HSIC_1(x, y, k_x, k_y)
m = 100
w = []
for i in range(m):
    x = x[np.random.choice(n, n, replace=False)]
    w.append(HSIC_1(x, y, k_x, k_y))
v = np.quantile(w, 0.95)
x = np.linspace(min(min(w), u, v), max(max(w), u, v), 200)
density = kde.gaussian_kde(w)
plt.plot(x, density(x))
plt.axvline(x=v, c="r", linestyle="--")
plt.axvline(x=u, c="b")

78. In the MMD (Sect. 5.2) and HSIC (Sect. 5.3), we cannot apply Mercer's theorem because the kernel of the integral operator is not nonnegative definite. However, in both cases, the integral operator possesses eigenvalues and eigenfunctions. Why?
79. Show that $k(x, y) = \phi(x - y)$, $\phi(t) = e^{-|t|}$ is a characteristic kernel.
80. In the proof of Proposition 54 (necessity, Appendix), we used the fact that $g(w) := (\epsilon - \|w\|_2)_+^{(d+1)/2}$ is nonnegative definite [8]. Verify that this fact is correct for $d = 1$ by proving the following equality:
$$\frac{1}{2\pi}\int_{-\epsilon}^{\epsilon} g(w)e^{-iwx}\, dw = \frac{1 - \cos(\epsilon x)}{\pi x^2}\,.$$

81. Why is the exponential type a universal kernel? Why is the characteristic kernel based on a triangular distribution not a universal kernel?
82. Explain why the three equalities and four inequalities hold in the following derivation of the upper bound on the Rademacher complexity:
$$R_N(\mathcal{F}) = E_\sigma\Bigl[\sup_{f \in \mathcal{F}}\Bigl|\frac{1}{N}\sum_{i=1}^N \sigma_i f(x_i)\Bigr|\Bigr] = E_\sigma\Bigl[\sup_{f \in \mathcal{F}}\Bigl|\frac{1}{N}\sum_{i=1}^N \sigma_i \langle k(x_i, \cdot), f(\cdot)\rangle_H\Bigr|\Bigr] = E_\sigma\Bigl[\sup_{f \in \mathcal{F}}\Bigl|\Bigl\langle f, \frac{1}{N}\sum_{i=1}^N \sigma_i k(x_i, \cdot)\Bigr\rangle_H\Bigr|\Bigr]$$
$$\leq E_\sigma\Bigl[\sup_{f \in \mathcal{F}} \|f\|_H \sqrt{\Bigl\langle \frac{1}{N}\sum_{i=1}^N \sigma_i k(x_i, \cdot), \frac{1}{N}\sum_{i=1}^N \sigma_i k(x_i, \cdot)\Bigr\rangle_H}\Bigr] \leq E_\sigma\Bigl[\sqrt{\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N \sigma_i\sigma_j k(x_i, x_j)}\Bigr] \leq \sqrt{E_\sigma\Bigl[\frac{1}{N^2}\sum_{i=1}^N\sum_{j=1}^N \sigma_i\sigma_j k(x_i, x_j)\Bigr]} \leq \sqrt{\frac{k_{\max}}{N}}\,.$$


83. Explain why the one equality and four inequalities hold in the following derivation of the upper bound of $|\mathrm{MMD}^2 - \widehat{\mathrm{MMD}}_B^2|$:
$$E_{X,Y}\sup_{f\in\mathcal{F}}\Bigl|E_{X'}\Bigl\{\frac{1}{N}\sum_{i=1}^N f(x_i') - \frac{1}{N}\sum_{i=1}^N f(x_i)\Bigr\} - E_{Y'}\Bigl\{\frac{1}{N}\sum_{j=1}^N f(y_j') - \frac{1}{N}\sum_{j=1}^N f(y_j)\Bigr\}\Bigr|$$
$$\leq E_{X,Y,X',Y'}\sup_{f\in\mathcal{F}}\Bigl|\frac{1}{N}\sum_{i=1}^N f(x_i') - \frac{1}{N}\sum_{i=1}^N f(x_i) - \frac{1}{N}\sum_{i=1}^N f(y_i') + \frac{1}{N}\sum_{i=1}^N f(y_i)\Bigr|$$
$$= E_{X,Y,X',Y',\sigma,\sigma'}\sup_{f\in\mathcal{F}}\Bigl|\frac{1}{N}\sum_{i=1}^N \sigma_i\{f(x_i') - f(x_i)\} + \frac{1}{N}\sum_{i=1}^N \sigma_i'\{f(y_i') - f(y_i)\}\Bigr|$$
$$\leq E_{X,X',\sigma}\sup_{f\in\mathcal{F}}\Bigl|\frac{1}{N}\sum_{i=1}^N \sigma_i\{f(x_i') - f(x_i)\}\Bigr| + E_{Y,Y',\sigma'}\sup_{f\in\mathcal{F}}\Bigl|\frac{1}{N}\sum_{j=1}^N \sigma_j'\{f(y_j') - f(y_j)\}\Bigr|$$
$$\leq 2[R(\mathcal{F}, P) + R(\mathcal{F}, Q)] \leq 2[(k_{\max}/N)^{1/2} + (k_{\max}/N)^{1/2}]\,.$$
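The trace identity in Exercise 74 can also be checked numerically. The following is a minimal sketch (an illustration, not a model solution); it assumes that the Gram matrices K_X and K_Y have already been computed from samples, e.g., with Gaussian kernels.

import numpy as np

def hsic_double_sum(K_X, K_Y):
    # Direct evaluation of the three-term estimator in Exercise 74
    N = K_X.shape[0]
    t1 = np.sum(K_X * K_Y) / N**2
    t2 = 2 * np.sum(K_X.sum(axis=1) * K_Y.sum(axis=1)) / N**3
    t3 = K_X.sum() * K_Y.sum() / N**4
    return t1 - t2 + t3

def hsic_trace(K_X, K_Y):
    # Centered-trace form: (1/N^2) trace(K_X H K_Y H) with H = I - E/N
    N = K_X.shape[0]
    H = np.identity(N) - np.ones((N, N)) / N
    return np.trace(K_X @ H @ K_Y @ H) / N**2

# For any symmetric K_X, K_Y of the same size, the two functions return the same value.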

Chapter 6

Gaussian Processes and Functional Data Analyses

A stochastic process may be defined either as a sequence of random variables $\{X_t\}_{t \in T}$, where $T$ is a set of times, or as a function $X_t(\omega): T \to \mathbb{R}$ for each $\omega \in \Omega$. We define a Gaussian process as a stochastic process $\{X_t\}$ such that $(X_t)_{t \in T'}$ follows a multivariate Gaussian distribution for any finite subset $T'$ of $T$. In this chapter, we generalize the one-dimensional $T$ to a multidimensional set $E$ for the consideration of Gaussian processes. We mainly deal with the variations of $\omega \in \Omega$ in $f(\omega, x)$, whereas thus far we have dealt with the variations of $x \in E$ in $f(\omega, x)$. The Gaussian process has been applied to various aspects of machine learning. We examine the relation between Gaussian processes and kernels. The chapter's first half consists of regression, classification, and computational reduction treatments, and the last part studies the Karhunen-Loève expansion and its surrounding theory. Finally, we study functional data analyses, which are closely related to stochastic processes.

6.1 Regression

Let $E$ and $(\Omega, \mathcal{F}, \mu)$ be a set and a probability space, respectively. If the correspondence $\Omega \ni \omega \mapsto f(\omega, x) \in \mathbb{R}$ is measurable for each $x \in E$, i.e., if $f(\omega, x)$ is a random variable at each $x \in E$, then we say that $f: \Omega \times E \to \mathbb{R}$ is a stochastic process. Moreover, if the random variables $f(\omega, x_1), \ldots, f(\omega, x_N)$ follow an $N$-variate Gaussian distribution for any $N \geq 1$ and any finite number of elements $x_1, \ldots, x_N \in E$, then we call $f$ a Gaussian process. We define the covariance between $x_i, x_j \in E$ by
$$\int_\Omega \{f(\omega, x_i) - m(x_i)\}\{f(\omega, x_j) - m(x_j)\}\, d\mu(\omega)\,,$$
where $m(x) := \int_\Omega f(\omega, x)\, d\mu(\omega)$ is the expectation of $f(\omega, x)$ for $x \in E$. Then, no matter what $N$ and $x_1, \ldots, x_N$ we choose, their covariance matrices are nonnegative


definite. Thus, we can write the covariance matrix by using a positive definite kernel $k: E \times E \to \mathbb{R}$. Therefore, the Gaussian process can be uniquely expressed in terms of a pair $(m, k)$ containing the mean $m(x)$ of each $x \in E$ and the covariance $k(x, x')$ of each $(x, x') \in E \times E$. In general, a random variable is a map $\Omega \to \mathbb{R}$, and we should make $\omega$ explicit, i.e., $f(\omega, x)$, but for simplicity, for the time being, we make $\omega$ implicit, i.e., $f(x)$, even if it is a random variable.

Example 83 Let $m_X \in \mathbb{R}^N$ and $k_{XX} \in \mathbb{R}^{N \times N}$ be the mean and covariance matrix, respectively, of the Gaussian process $(m, k)$ at $x_1, \ldots, x_N \in E := \mathbb{R}$. In general, for a mean $\mu$ and a covariance matrix $\Sigma \in \mathbb{R}^{N \times N}$, $\Sigma$ is nonnegative definite, and there exists a lower triangular matrix $R \in \mathbb{R}^{N \times N}$ with $\Sigma = RR^\top$ (Cholesky decomposition). Therefore, to generate random numbers that follow $N(m_X, k_{XX})$ from $N$ independent random numbers $u_1, \ldots, u_N$ that follow the standard Gaussian distribution, we can calculate $f_X := R_X u + m_X \in \mathbb{R}^N$ for $k_{XX} := R_X R_X^\top$ with $u = [u_1, \ldots, u_N]^\top$. In fact, the expectation and the covariance matrix of $f_X$ are $m_X$ and
$$E[(f_X - m_X)(f_X - m_X)^\top] = E[R_X u u^\top R_X^\top] = R_X E[u u^\top] R_X^\top = R_X R_X^\top = k_{XX}\,,$$
respectively. This procedure can be described in Python as follows.

# Install the module skfda via pip install scikit-fda

# I n t h i s c h a p t e r , we assume t h a t t h e f o l l o w i n g h a s b e e n e x e c u t e d . import numpy a s np import m a t p l o t l i b . p y p l o t a s p l t from m a t p l o t l i b import s t y l e from s k l e a r n . d e c o m p o s i t i o n import PCA import s k f d a

# D e f i n i t i o n o f (m, k ) d e f m( x ) : return 0 def k ( x , y ) : r e t u r n np . exp ( − ( x−y ) ∗ ∗ 2 / 2 ) # D e f i n i t i o n o f gp_sample d e f g p _ s a m p l e ( x , m, k ) : n = len ( x ) m_x = m( x ) k_xx = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : k_xx [ i , j ] = k ( x [ i ] , x [ j ] ) R = np . l i n a l g . c h o l e s k y ( k_xx ) # l o w e r t r i a n g u l a r m a t r i x u = np . random . r a n d n ( n ) r e t u r n R . d o t ( u ) + m_x


# Generate the random numbers and construct the covariance matrix to compare it with k_xx
x = np.arange(-2, 3, 1)
n = len(x)
r = 100
z = np.zeros((r, n))
for i in range(r):
    z[i, :] = gp_sample(x, m, k)
k_xx = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        k_xx[i, j] = k(x[i], x[j])
print("cov(z):\n", np.cov(z), "\n")
print("k_xx:\n", k_xx)

cov(z):

[[ 2.37424382e-01 1.09924256e-03 4.32681142e-01 ... -1.92328002e-01 1.46703390e-03 -7.31927129e-01] [ 1.09924256e-03 6.17989727e-02 -8.41300879e-02 ... 1.37955079e-01 1.22155905e-01 3.53607212e-02] [ 4.32681142e-01 -8.41300879e-02 1.51374514e+00 ... -6.82936090e-01 -8.44371416e-02 -1.82595208e+00] ... [-1.92328002e-01 1.37955079e-01 -6.82936090e-01 ... 1.16493686e+00 4.07945653e-01 7.10299634e-01] [ 1.46703390e-03 1.22155905e-01 -8.44371416e-02 ... 4.07945653e-01 2.88425891e-01 -4.34847631e-03] [-7.31927129e-01 3.53607212e-02 -1.82595208e+00 ... 7.10299634e-01 -4.34847631e-03 2.60520510e+00]]

k_xx: [[1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02 3.35462628e-04] [6.06530660e-01 1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02] [1.35335283e-01 6.06530660e-01 1.00000000e+00 6.06530660e-01 1.35335283e-01] [1.11089965e-02 1.35335283e-01 6.06530660e-01 1.00000000e+00 6.06530660e-01] [3.35462628e-04 1.11089965e-02 1.35335283e-01 6.06530660e-01 1.00000000e+00]]

In general, E does not have to be R. Gaussian processes are a class of stochastic processes, and we might have the impression that the set E is the entire real number set or a subset of it, but in fact, there is no further restriction as long as we define the positive definite kernel k on E × E. Once we choose (m, k), we generate N -variate Gaussian random variables according to (m, k), regardless of the selected E.


Example 84 For E = R2 , we can similarly obtain random numbers that follow the N -variate multivariate Gaussian distribution. # D e f i n i t i o n o f (m, k ) d e f m( x ) : return 0 def k ( x , y ) : r e t u r n np . exp ( − np . sum ( ( x−y ) ∗ ∗ 2 ) / 2 ) # D e f i n i t i o n o f Function gp_sample d e f g p _ s a m p l e ( x , m, k ) : n = x . shape [ 0 ] m_x = m( x ) k_xx = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : k_xx [ i , j ] = k ( x [ i ] , x [ j ] ) R = np . l i n a l g . c h o l e s k y ( k_xx ) # l o w e r t r i a n g u l a r m a t r i x u = np . random . r a n d n ( n ) r e t u r n R . d o t ( u ) + m_x

# G e n e r a t e t h e random numbers and c o n s t r u c t t h e c o v a r i a n c e m a t r i x t o compare i t with k_xx n = 5 r = 100 z = np . z e r o s ( ( r , n ) ) f o r i i n range ( r ) : z [ i , : ] = g p _ s a m p l e ( x , m, k ) k_xx = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : k_xx [ i , j ] = k ( x [ i ] , x [ j ] ) p r i n t ( "cov(z):\n" , np . cov ( z ) , "\n" ) p r i n t ( "k_xx:\n" , k_xx ) cov ( z ) :

[[ 1.52140938 0.2585143 0.67840535 ... -0.57075902 -0.53404263 0.20940014] [ 0.2585143 0.47450778 0.36954164 ... -0.30224707 -0.47093046 -0.03341687] [ 0.67840535 0.36954164 0.77124364 ... -0.34846439 -0.40226404 0.37837525] ... [-0.57075902 -0.30224707 -0.34846439 ... 0.42199909 0.46715201 0.04255153] [-0.53404263 -0.47093046 -0.40226404 ... 0.46715201 0.59461419 0.09420804] [ 0.20940014 -0.03341687 0.37837525 ... 0.04255153 0.09420804 0.40676413]]


k_xx: [[1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02 3.35462628e-04] [6.06530660e-01 1.00000000e+00 6.06530660e-01 1.35335283e-01 1.11089965e-02] [1.35335283e-01 6.06530660e-01 1.00000000e+00 6.06530660e-01 1.35335283e-01] [1.11089965e-02 1.35335283e-01 6.06530660e-01 1.00000000e+00 6.06530660e-01] [3.35462628e-04 1.11089965e-02 1.35335283e-01 6.06530660e-01 1.00000000e+00]]

Then, as in the usual regression procedure, we assume that $x_1, \ldots, x_N \in E$ and $y_1, \ldots, y_N \in \mathbb{R}$ are generated according to
$$y_i = f(x_i) + \epsilon_i \tag{6.1}$$
through the use of an unknown function $f: E \to \mathbb{R}$, where $\epsilon_i$ follows a Gaussian distribution with a mean of $0$ and a variance of $\sigma^2$ and is independent for each $i = 1, \ldots, N$. The likelihood is
$$\prod_{i=1}^N \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Bigl\{-\frac{(y_i - f(x_i))^2}{2\sigma^2}\Bigr\}$$
when the function $f$ is known (fixed). In the following, we assume that the function $f$ randomly varies, and we regard the Gaussian process $(m, k)$ as its prior distribution. That is, we consider the model $f_X \sim N(m_X, k_{XX})$ with $y_i \mid f(x_i) \sim N(f(x_i), \sigma^2)$ for $f_X = (f(x_1), \ldots, f(x_N))$. Then, we calculate the posterior distribution of $f(z_1), \ldots, f(z_n)$ corresponding to $z_1, \ldots, z_n \in E$, which are different from $x_1, \ldots, x_N$. The variation in $y_1, \ldots, y_N$ is due to the variations in $f$ and $\epsilon_i$. Thus, the covariance matrix is $k_{XX} + \sigma^2 I = (k(x_i, x_j) + \sigma^2\delta_{i,j})_{i,j=1,\ldots,N} \in \mathbb{R}^{N \times N}$. On the other hand, the variation in $f(z_1), \ldots, f(z_n)$ is due only to the variation in $f$. Therefore, the covariance matrix is $k_{ZZ} = (k(z_i, z_j))_{i,j=1,\ldots,n} \in \mathbb{R}^{n \times n}$. Moreover, the covariance between $y_i$ and $f(z_j)$ equals that between $f(x_i)$ and $f(z_j)$, so the covariance matrix of $Y = [y_1, \ldots, y_N]^\top$ and $f_Z = [f(z_1), \cdots, f(z_n)]^\top$ is $k_{XZ} = (k(x_i, z_j))_{i=1,\ldots,N,\, j=1,\ldots,n}$. In summary, the joint distribution of $Y$ and $f_Z$ is
$$\begin{bmatrix} Y \\ f_Z \end{bmatrix} \sim N\left(\begin{bmatrix} m_X \\ m_Z \end{bmatrix}, \begin{bmatrix} k_{XX} + \sigma^2 I & k_{XZ} \\ k_{ZX} & k_{ZZ} \end{bmatrix}\right).$$


In the following, we show that the posterior probability of the function $f(\cdot)$ given the value of $Y$ is still a Gaussian process. To this end, we use the following proposition.

Proposition 61 Suppose that the joint distribution of random variables $a \in \mathbb{R}^N$, $b \in \mathbb{R}^n$ can be expressed by
$$\begin{bmatrix} a \\ b \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_a \\ \mu_b \end{bmatrix}, \begin{bmatrix} A & C \\ C^\top & B \end{bmatrix}\right),$$
where $\mu_a, \mu_b$ are the expectations, $A \in \mathbb{R}^{N \times N}$ and $B \in \mathbb{R}^{n \times n}$ are the covariance matrices ($A$: positive definite; $B$: nonnegative definite), and $C \in \mathbb{R}^{N \times n}$ is the covariance matrix between them. Then, the conditional distribution of $b$ given $a$ is
$$b \mid a \sim N(\mu_b + C^\top A^{-1}(a - \mu_a),\ B - C^\top A^{-1} C)\,. \tag{6.2}$$
Proof: Consult Lauritzen, "Graphical Models" [20], p. 256.

Hence, from Proposition 61, the posterior distribution of $f_Z \in \mathbb{R}^n$ given $Y \in \mathbb{R}^N$ is $N(\mu', \Sigma')$, where
$$\mu' := m_Z + k_{ZX}(k_{XX} + \sigma^2 I)^{-1}(Y - m_X) \in \mathbb{R}^n$$
and
$$\Sigma' := k_{ZZ} - k_{ZX}(k_{XX} + \sigma^2 I)^{-1} k_{XZ} \in \mathbb{R}^{n \times n}\,.$$
If we set $n = 1$ and $z_1 = x$, then the distribution of $f(x)$ is given by
$$m'(x) := m(x) + k_{xX}(k_{XX} + \sigma^2 I)^{-1}(Y - m_X) \tag{6.3}$$
$$k'(x, x) := k(x, x) - k_{xX}(k_{XX} + \sigma^2 I)^{-1} k_{Xx}\,. \tag{6.4}$$
We summarize the discussion as follows.

Proposition 62 Suppose that the prior distribution of $f(\cdot)$ is a Gaussian process $(m, k)$. If we obtain $x_1, \ldots, x_N$, $y_1, \ldots, y_N$ according to (6.1), the posterior distribution of $f(\cdot)$ is a Gaussian process $(m', k')$, where $m', k'$ are given by (6.3) and (6.4), respectively.

In the actual calculation, it takes $O(N^3)$ time to compute $(k_{XX} + \sigma^2 I)^{-1}$. To complete the whole process in $O(N^3/3)$, we use the following method. By Cholesky decomposition, we obtain an $L \in \mathbb{R}^{N \times N}$ such that
$$LL^\top = k_{XX} + \sigma^2 I\,,$$
which can be completed in $O(N^3/3)$ time. Then, let the solutions of $L\gamma = k_{Xx}$, $L\beta = Y - m_X$, and $L^\top\alpha = \beta$ be $\gamma \in \mathbb{R}^N$, $\beta \in \mathbb{R}^N$, and $\alpha \in \mathbb{R}^N$, respectively. Since $L$ is


a lower triangular matrix, these calculations take at most $O(N^2)$ time. Additionally, we have
$$(k_{XX} + \sigma^2 I)^{-1}(Y - m_X) = (LL^\top)^{-1} L\beta = (LL^\top)^{-1} LL^\top \alpha = \alpha$$
and
$$k_{xX}(k_{XX} + \sigma^2 I)^{-1} k_{Xx} = (L\gamma)^\top (LL^\top)^{-1} L\gamma = \gamma^\top \gamma\,.$$
Finally, from $\alpha, \beta, \gamma$, we have
$$m'(x) = m(x) + k_{xX}\alpha \quad \text{and} \quad k'(x, x) = k(x, x) - \gamma^\top\gamma\,.$$
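The same triangular solves can also be sketched with SciPy's Cholesky helpers. This is only an illustrative alternative (not the book's implementation, which appears below as gp_2); the names K, y, m_X, k_Xx, m_x, k_xx, and sigma_2 are placeholders for the quantities defined above.

import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

# K = k_XX (N x N), y and m_X (length N), k_Xx (length N), m_x and k_xx scalars
c, low = cho_factor(K + sigma_2 * np.identity(len(y)), lower=True)  # L with L L^T = k_XX + sigma^2 I
alpha = cho_solve((c, low), y - m_X)          # alpha = (k_XX + sigma^2 I)^{-1} (Y - m_X)
gamma = solve_triangular(c, k_Xx, lower=True) # solves L gamma = k_Xx
m_pred = m_x + k_Xx @ alpha                   # m'(x)
k_pred = k_xx - gamma @ gamma                 # k'(x, x)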

We can write the calculations of m(x), k(x, x) in forms that are completed in O(N 3 ) and O(N 3 /3) time in Python as follows. d e f gp_1 ( x _ p r e d ) : h = np . z e r o s ( n ) f o r i i n range ( n ) : h [ i ] = k ( x_pred , x [ i ] ) R = np . l i n a l g . i n v (K + s i g m a _ 2 ∗ np . i d e n t i t y ( n ) ) # O( n ^ 3 ) C o m p u t a t i o n mm = mu ( x _ p r e d ) + np . d o t ( np . d o t ( h . T , R ) , ( y−mu ( x ) ) ) s s = k ( x _ p r e d , x _ p r e d ) − np . d o t ( np . d o t ( h . T , R) , h ) r e t u r n {"mm" :mm, "ss" : s s } d e f gp_2 ( x _ p r e d ) : h = np . z e r o s ( n ) f o r i i n range ( n ) : h [ i ] = k ( x_pred , x [ i ] ) L = np . l i n a l g . c h o l e s k y (K + s i g m a _ 2 ∗ np . i d e n t i t y ( n ) ) # O( n ^ 3 / 3 ) Computation a l p h a = np . l i n a l g . s o l v e ( L , np . l i n a l g . s o l v e ( L . T , ( y − mu ( x ) ) ) ) # O( n ^ 2 ) Computation mm = mu ( x _ p r e d ) + np . sum ( np . d o t ( h . T , a l p h a ) ) gamma = np . l i n a l g . s o l v e ( L . T , h ) # O( n ^ 2 ) Computation s s = k ( x _ p r e d , x _ p r e d ) − np . sum ( gamma ∗ ∗ 2 ) r e t u r n {"mm" :mm, "ss" : s s }

Example 85 For comparison purposes, we executed the functions gp_1 and gp_2. We can see the difference achieved by the Cholesky decomposition, which reduced the computational complexity (Fig. 6.1).

sigma_2 = 0.2
def k(x, y):           # Covariance function
    return np.exp(-(x - y)**2 / 2 / sigma_2)
def mu(x):             # Mean function
    return x
n = 100
x = np.random.uniform(size=n) * 6 - 3
y = np.sin(x / 2) + np.random.randn(n)
K = np.zeros((n, n))

f o r i i n range ( n ) : f o r j i n range ( n ) : K[ i , j ] = k ( x [ i ] , x [ j ] ) # # Measure E x e c u t i o n Time import t i m e s t a r t 1 = time . time ( ) gp_1 ( 0 ) end1 = t i m e . t i m e ( ) p r i n t ( "time1=" , end1 − s t a r t 1 ) s t a r t 2 = time . time ( ) gp_2 ( 0 ) end2 = t i m e . t i m e ( ) p r i n t ( "time2=" , end2 − s t a r t 2 ) # The 3 s i g m a w i d t h a r o u n d t h e a v e r a g e u _ s e q = np . a r a n g e ( − 3 , 3 . 1 , 0 . 1 ) v _ s e q = [ ] ; w_seq = [ ] for u in u_seq : r e s = gp_1 ( u ) v _ s e q . a p p e n d ( r e s [ "mm" ] ) w_seq . a p p e n d ( r e s [ "ss" ] ) plt plt plt plt plt plt plt

. figure () . x l i m ( −3 , 3 ) . y l i m ( −3 , 3 ) . s c a t t e r ( x , y , f a c e c o l o r s =’none’ , e d g e c o l o r s = "k" , m a r k e r = "o" ) . p l o t ( u_seq , v _ s e q ) . p l o t ( u_seq , np . sum ( [ v_seq , [ i ∗ 3 f o r i i n w_seq ] ] , a x i s = 0 ) , c = "b" ) . p l o t ( u_seq , np . sum ( [ v_seq , [ i ∗ ( − 3) f o r i i n w_seq ] ] , a x i s = 0 ) , c = "b ") p l t . show ( ) n = 100 plt . figure () p l t . x l i m ( −3 , 3 ) p l t . y l i m ( −3 , 3 ) ## Five times , changing t h e samples c o l o r = [ "r" , "g" , "b" , "k" , "m" ] f o r h i n range ( 5 ) : x = np . random . u n i f o r m ( s i z e = n ) ∗ 6 − 3 y = np . s i n ( np . p i ∗ x / 2 ) + np . random . r a n d n ( n ) sigma_2 = 0 . 2 K = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : K[ i , j ] = k ( x [ i ] , x [ j ] ) u _ s e q = np . a r a n g e ( − 3 , 3 . 1 , 0 . 1 ) v_seq = [ ] for u in u_seq : r e s = gp_1 ( u ) v _ s e q . a p p e n d ( r e s [ "mm" ] ) p l t . p l o t ( u_seq , v_seq , c = c o l o r [ h ] ) time1 = 0.009966373443603516 time2 = 0.057814598083496094

If we compare the equation
$$m'(x) := k_{xX}(k_{XX} + \sigma^2 I)^{-1} Y$$

Fig. 6.1 We show the range of $3\sigma$ above and below the average (left) and the fits obtained from five different samples (right)

obtained by substituting $m_X = m(x) = 0$ into the posterior mean formula (6.3) of the Gaussian process with the equation $k_{x,X}\hat{\alpha} = k_{xX}(K + \lambda I)^{-1}Y$ obtained by multiplying the kernel ridge regression formula (4.6) by $k_{x,X}$ from the left, we observe that the former is a specific case of the latter with $\lambda = \sigma^2$.
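The following is a minimal numerical sketch of this correspondence (not from the text): it assumes the zero-mean setting above and checks that the Gaussian-process posterior mean with noise variance $\sigma^2$ coincides with the kernel ridge regression prediction with $\lambda = \sigma^2$. The variable names are introduced only for this illustration.

import numpy as np

np.random.seed(0)
sigma_2 = 0.2

def k(x, y):  # Gaussian kernel, as in the examples above
    return np.exp(-(x - y)**2 / 2 / sigma_2)

n = 50
x = np.random.uniform(size=n) * 6 - 3
y = np.sin(x / 2) + 0.1 * np.random.randn(n)
K = np.array([[k(x[i], x[j]) for j in range(n)] for i in range(n)])

x_new = 0.5
k_x = np.array([k(x_new, x[i]) for i in range(n)])

lam = sigma_2  # lambda = sigma^2
gp_mean = k_x @ np.linalg.solve(K + sigma_2 * np.identity(n), y)  # GP posterior mean (zero prior mean)
ridge = k_x @ np.linalg.solve(K + lam * np.identity(n), y)        # kernel ridge prediction
print(gp_mean, ridge)  # the two values agree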

6.2 Classification

We consider the classification problem next. We assume that the random variable $Y$ takes the values $Y = \pm 1$ and that its conditional probability given $x \in E$ is
$$P(Y = 1 \mid x) = \frac{1}{1 + \exp(-f(x))}\,, \tag{6.5}$$
where the Gaussian process $f: \Omega \times E \to \mathbb{R}$ is used. We wish to estimate $f$ from the actual $x_1, \ldots, x_N \in \mathbb{R}^p$ (row vectors) and $y_1, \ldots, y_N \in \{-1, 1\}$. To maximize the likelihood, we minimize the negative log-likelihood
$$\sum_{i=1}^N \log[1 + \exp\{-y_i f(x_i)\}]\,.$$
If we set $f_X = [f_1, \ldots, f_N]^\top = [f(x_1), \ldots, f(x_N)]^\top \in \mathbb{R}^N$, $v_i := e^{-y_i f_i}$, and $l(f_X) := \sum_{i=1}^N \log(1 + v_i)$, then we have


$$\frac{\partial v_i}{\partial f_i} = -y_i v_i\,, \quad \frac{\partial l(f_X)}{\partial f_i} = -\frac{y_i v_i}{1 + v_i}\,, \quad \frac{\partial^2 l(f_X)}{\partial f_i^2} = \frac{v_i}{(1 + v_i)^2}\,,$$
where we use $y_i^2 = 1$. Given an initial value, we wait for the Newton-Raphson update $f_X \leftarrow f_X - (\nabla^2 l(f_X))^{-1}\nabla l(f_X)$ to converge. The update formula is
$$f_X \leftarrow f_X + W^{-1} u\,,$$
where $u = \Bigl(\dfrac{y_i v_i}{1 + v_i}\Bigr)_{i=1,\ldots,N}$ and $W = \mathrm{diag}\Bigl(\dfrac{v_i}{(1 + v_i)^2}\Bigr)_{i=1,\ldots,N}$. In other words, for $v := [v_1, \ldots, v_N]^\top \in \mathbb{R}^N$, we repeat the following two steps:

1. Obtain $v$, $u$, and $W$ from $f_X$.
2. Calculate $f_X + W^{-1}u$ and substitute it into $f_X$.

Next, we consider maximizing the likelihood $\prod_{i=1}^N \dfrac{1}{1 + \exp\{-y_i f(x_i)\}}$ multiplied by the prior distribution of $f_X$, i.e., finding the solution with the maximum posterior probability. Here, the mean is often set to $0$ as the prior probability of $f$ in the formulation of (6.5). Suppose first that the prior probability of $f_X \in \mathbb{R}^N$ is
$$\frac{1}{\sqrt{(2\pi)^N \det k_{XX}}}\exp\Bigl\{-\frac{f_X^\top k_{XX}^{-1} f_X}{2}\Bigr\}\,,$$
where $k_{XX}$ is the Gram matrix $(k(x_i, x_j))_{i,j=1,\ldots,N}$. If we set
$$L(f_X) = l(f_X) + \frac{1}{2} f_X^\top k_{XX}^{-1} f_X + \frac{1}{2}\log\det k_{XX} + \frac{N}{2}\log 2\pi\,, \tag{6.6}$$
then we have
$$\nabla L(f_X) = \nabla l(f_X) + k_{XX}^{-1} f_X = -u + k_{XX}^{-1} f_X \tag{6.7}$$
and
$$\nabla^2 L(f_X) = \nabla^2 l(f_X) + k_{XX}^{-1} = W + k_{XX}^{-1}\,. \tag{6.8}$$
Thus, we may express the update formula as
$$f_X \leftarrow f_X + (W + k_{XX}^{-1})^{-1}(u - k_{XX}^{-1} f_X) = (W + k_{XX}^{-1})^{-1}\{(W + k_{XX}^{-1}) f_X - k_{XX}^{-1} f_X + u\} = (W + k_{XX}^{-1})^{-1}(W f_X + u)\,.$$
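Before the efficiency improvements discussed next, the update above can be sketched directly as follows. This is a naive $O(N^3)$-per-step illustration (not the book's implementation, which uses the Cholesky-based version in Example 86); the Gram matrix K and labels y are assumed to be given.

import numpy as np

def map_newton(K, y, n_iter=30):
    # Naive sketch of f_X <- (W + K^{-1})^{-1} (W f_X + u) for GP classification
    N = len(y)
    f = np.zeros(N)
    K_inv = np.linalg.inv(K + 1e-9 * np.identity(N))  # jitter for numerical stability
    for _ in range(n_iter):
        v = np.exp(-y * f)
        u = y * v / (1 + v)          # minus the gradient of l(f_X)
        W = np.diag(v / (1 + v)**2)  # Hessian of l(f_X)
        f = np.linalg.solve(W + K_inv, W @ f + u)
    return f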

However, since the size of $f_X$ is the number of samples $N$, it takes an enormous amount of time to calculate the inverse matrix. We try to improve the efficiency of this process as follows. Utilizing the Woodbury-Sherman-Morrison formula
$$(A + UWV^\top)^{-1} = A^{-1} - A^{-1}U(W^{-1} + V^\top A^{-1}U)^{-1}V^\top A^{-1} \tag{6.9}$$
with $A \in \mathbb{R}^{n \times n}$ (nonsingular), $W \in \mathbb{R}^{m \times m}$, and $U, V \in \mathbb{R}^{n \times m}$, if we set $A = k_{XX}^{-1}$ and $U = V = I$, we obtain
$$(W + k_{XX}^{-1})^{-1} = k_{XX} - k_{XX}(W^{-1} + k_{XX})^{-1}k_{XX} = k_{XX} - k_{XX}W^{1/2}(I + W^{1/2}k_{XX}W^{1/2})^{-1}W^{1/2}k_{XX}\,. \tag{6.10}$$
Thus, we can obtain an $L$ such that $I + W^{1/2}k_{XX}W^{1/2} = LL^\top$ (Cholesky decomposition) in $O(N^3/3)$ time. Letting $\gamma := W f_X + u$, we find a $\beta$ such that $L\beta = W^{1/2}k_{XX}\gamma$ and an $\alpha$ such that $L^\top W^{-1/2}\alpha = \beta$ in $O(N^2)$ time, and we substitute $k_{XX}(\gamma - \alpha)$ into $f_X$. We repeat this procedure until convergence is achieved. In fact, we have $LL^\top W^{-1/2}\alpha = L\beta = W^{1/2}k_{XX}\gamma$ and
$$k_{XX}(\gamma - \alpha) = k_{XX}\{\gamma - W^{1/2}(LL^\top)^{-1}W^{1/2}k_{XX}\gamma\} = \{k_{XX} - k_{XX}W^{1/2}(I + W^{1/2}k_{XX}W^{1/2})^{-1}W^{1/2}k_{XX}\}\gamma = (W + k_{XX}^{-1})^{-1}(W f_X + u)\,,$$

where the last equality is due to (6.10). Example 86 By using the first N = 100 of the 150 Iris data (the first 50 points and the next 50 points are Setosa and Versicolor data, respectively), we found the f X = [ f 1 , . . . , f N ] with the maximum posterior probability. The output showed that f 1 , . . . , f 50 and f 51 , . . . , f 100 were positive and negative, respectively. from s k l e a r n . d a t a s e t s import l o a d _ i r i s df = l o a d _ i r i s ( ) # # I r i s Data x = df . data [0:100 , 0:4] y = np . a r r a y ( [ 1 ] ∗ 5 0 + [ − 1 ] ∗ 5 0 ) n = len ( y ) # # Compute K e r n e l v a l u e s f o r t h e f o u r c o v a r i a t e s def k ( x , y ) : r e t u r n np . exp ( np . sum( − ( x − y ) ∗ ∗ 2 ) / 2 ) K = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : K[ i , j ] = k ( x [ i , : ] , x [ j , : ] ) eps = 0.00001 f = [0] ∗ n g = [0.1] ∗ n while i g f y v u

np . sum ( ( np . a r r a y ( f )−np . a r r a y ( g ) ) ∗ ∗ 2 ) > e p s : = i + 1 = f ## Save t h e data b e f o r e update f o r comparison = np . a r r a y ( f ) = np . a r r a y ( y ) = np . exp ( − y∗ f ) = y ∗ v / (1 + v )


w = ( v / (1 + v ) ∗∗2) W = np . d i a g (w) W_p = np . d i a g (w∗ ∗ 0 . 5 ) W_m = np . d i a g (w∗ ∗ ( − 0 . 5 ) ) L = np . l i n a l g . c h o l e s k y ( np . i d e n t i t y ( n ) +np . d o t ( np . d o t ( W_p , K) , W_p) ) gamma = W. d o t ( f ) + u b e t a = np . l i n a l g . s o l v e ( L , np . d o t ( np . d o t ( W_p , K) , gamma ) ) a l p h a = np . l i n a l g . s o l v e ( np . d o t ( L . T , W_m) , b e t a ) f = np . d o t (K, ( gamma− a l p h a ) ) print ( l i s t ( f ) )

[2.9017597728506903, 2.6661877410648125, 2.735999714975541, 2.5962146616446793, 2.8888259653434902, 2.4229904289075734, 2.7128370653298717, 2.8965829899125057, 2.263839959450692, 2.722794155018708, 2.6757868220665557, 2.80427691289934, 2.629916582197861, 2.129058875598969, 1.9947371858903622, 1.7255773341842824, 2.502403800007298, 2.894767948521167, 2.211715451090947, 2.7578887454424845, 2.5807025000167654, 2.7884335002993703, 2.4501472162360978, 2.598252566107158, 2.49363291477457, 2.5892721299617927, 2.7995603132602014, 2.854337885531593, 2.8580336326051525, 2.682198311711416, 2.6631803480277263, 2.6529515170091735, 2.409809417765029, 2.1570288906747956, 2.738196179682446, 2.777507355522734, 2.6054932709605585, 2.8486244905053546, 2.342636360704147, 2.8826825981318938, 2.887406385036485, 1.561916989035174, 2.4541693614670925, 2.64939855108404, 2.4071165717812315, 2.633906076076528, 2.7271240196093944, 2.6732162909902857, 2.749570997237667, 2.884288422112919, -1.870441763888594, -2.5373817133813237, -1.9327372577182746, -2.5318858849974895, -2.579366785986732, -2.7850146869568233, -2.381782737110541, -1.4675208896283152, -2.48651854962823, -2.356973904782517, -1.6007721537062138, -2.8113237365649777, -2.3813705315823492, -2.734406212419866, -2.3307697118699044, -2.3416642385980246, -2.6148174861101388, -2.748088114594368, -2.3729716167430452, -2.6288003128782993, -2.2519078422840106, -2.7892659188354783, -2.3456928166252453, -2.7042381425156226, -2.712488583609864, -2.502955837329986, -2.1624797840208307, -2.01571028937641, -2.8402609758000015, -2.19474914254942, -2.474090451450695, -2.3426425865837377, -2.755051287500358, -2.1890836679248933, -2.415174453933202, -2.3822860688193157, -2.275106022961632, -2.4967546817420585, -2.6958800533399927, -2.6643571619340682, -2.64863153643803, -2.7718459450746087, -2.8075095809378467, -1.5468264491591686, -2.81472767183598, -2.756340788852421, -2.8346439145882236, -2.8498009687645762, -1.1822836980388691, -2.8458627499007214]

To execute classification for a new value $x$ by using the estimated $\hat{f} \in \mathbb{R}^N$, we perform the following steps. Similar to the regression case, if we apply Proposition 61 to
$$\begin{bmatrix} f_X \\ f(x) \end{bmatrix} \sim N\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} k_{XX} & k_{Xx} \\ k_{xX} & k_{xx} \end{bmatrix}\right),$$
then we obtain
$$f(x) \mid f_X \sim N(m'(x), k'(x, x))\,,$$
where
$$m'(x) = k_{xX}k_{XX}^{-1}f_X \quad \text{and} \quad k'(x, x) = k_{xx} - k_{xX}k_{XX}^{-1}k_{Xx}\,. \tag{6.11}$$


If we observe $Y \in \{-1, 1\}^N$, we obtain the estimate $\hat{f}$ with the Newton-Raphson method and calculate $\hat{W}$. We consider the Laplace approximation of $f_X \mid Y$, i.e., we approximate it by the following Gaussian distribution (Rasmussen-Williams [25]):
$$f_X \mid Y \sim N(\hat{f},\ (\hat{W} + k_{XX}^{-1})^{-1})\,. \tag{6.12}$$
That is, the covariance matrix is the inverse of $\hat{W} + k_{XX}^{-1}$, which is the Hessian $\nabla^2 L(\hat{f})$ of (6.8). Then, the variations in (6.11) and (6.12) are independent, and $f(x) \mid Y \sim N(m_*, k_*)$. Note that we can calculate the posterior distribution for each $x \in E$:
$$m_* = k_{xX}k_{XX}^{-1}\hat{f} \tag{6.13}$$
$$k_* = k_{xx} - k_{xX}k_{XX}^{-1}k_{Xx} + k_{xX}k_{XX}^{-1}(\hat{W} + k_{XX}^{-1})^{-1}k_{XX}^{-1}k_{Xx} = k_{xx} - k_{xX}(\hat{W}^{-1} + k_{XX})^{-1}k_{Xx}\,,$$
where the last transformation is due to $A = k_{XX}$ and $W = \hat{W}^{-1}$ in (6.9), and $U, V$ are the unit matrices. Thus, we can compute the expectation (prediction value) w.r.t. $f(x) \mid Y \sim N(m_*, k_*)$ of the sigmoid function $P(Y = 1 \mid x) = \dfrac{1}{1 + \exp(-f(x))}$:
$$\int_{\mathbb{R}} \frac{1}{1 + \exp(-z)}\,\frac{1}{\sqrt{2\pi k_*}}\exp\Bigl[-\frac{1}{2k_*}\{z - m_*\}^2\Bigr]dz\,. \tag{6.14}$$
To implement this step, we only need to compute $\hat{u}$ from $\hat{f}$. Since (6.7) is zero when the updates converge, from (6.13), we have that $m_* = k_{xX}\hat{u}$ and
$$(k_{XX} + \hat{W}^{-1})^{-1} = \hat{W}^{1/2}\hat{W}^{-1/2}(k_{XX} + \hat{W}^{-1})^{-1}\hat{W}^{-1/2}\hat{W}^{1/2} = \hat{W}^{1/2}(I + \hat{W}^{1/2}k_{XX}\hat{W}^{1/2})^{-1}\hat{W}^{1/2}\,.$$
Hence, we compute $\alpha \in \mathbb{R}^N$ such that $I + \hat{W}^{1/2}k_{XX}\hat{W}^{1/2} = LL^\top$ (Cholesky decomposition) and $L\alpha = \hat{W}^{1/2}k_{Xx}$ in $O(N^3/3)$ time. Then, we have $k_* = k_{xx} - \alpha^\top\alpha$ since
$$k_{xX}\hat{W}^{1/2}(LL^\top)^{-1}\hat{W}^{1/2}k_{Xx} = k_{xX}\hat{W}^{1/2}(L^{-1})^\top L^{-1}\hat{W}^{1/2}k_{Xx} = \alpha^\top\alpha\,.$$


We can describe the procedure for finding the value of (6.14) in Python as follows. We assume that the procedure starts immediately after the procedure of Example 86 completes. 6-I def pred ( z ) : kk = np . z e r o s ( n ) f o r i i n range ( n ) : kk [ i ] = k ( z , x [ i , : ] ) mu = np . sum ( kk ∗ u ) # Mean a l p h a = np . l i n a l g . s o l v e ( L , np . d o t ( W_p , kk ) ) s i g m a 2 = k ( z , z ) − np . sum ( a l p h a ∗ ∗ 2 ) # Variance m = 1000 b = np . random . n o r m a l ( mu , sigma2 , s i z e = m) p i = np . sum ( ( 1 + np . exp ( − b ) ) ∗∗( − 1) ) / m # Prediction return pi

Example 87 Immediately after processing Example 86, we entered numerical values for the four covariates of Iris into the function pred and calculated the probability of them being Setosa values (1 minus the probability of them being Versicolor values). When we input the average values of the covariates for Setosa and Versicolor, we observed that the probabilities were close to 1 and 0, respectively. z = np . z e r o s ( 4 ) f o r j i n range ( 4 ) : z [ j ] = np . mean ( x [ : 5 0 , j ] ) pred ( z )

0.9466371383148896

f o r j i n range ( 4 ) : z [ j ] = np . mean ( x [ 5 0 : 1 0 0 , j ] ) pred ( z )

0.05301765489687672
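As a side note (not from the text), the one-dimensional expectation (6.14) can also be evaluated by numerical quadrature instead of the Monte Carlo average used in pred. The following minimal sketch assumes that $m_*$ and $k_*$ (mu and sigma2 in the code above) have already been computed.

import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def pred_quad(m_star, k_star):
    # Evaluate (6.14): the expectation of the sigmoid under N(m_*, k_*)
    integrand = lambda z: 1.0 / (1.0 + np.exp(-z)) * norm.pdf(z, loc=m_star, scale=np.sqrt(k_star))
    value, _ = quad(integrand, m_star - 10 * np.sqrt(k_star), m_star + 10 * np.sqrt(k_star))
    return value

# Example: pred_quad(mu, sigma2) should be close to the Monte Carlo estimate returned by pred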

6.3 Gaussian Processes with Inducing Variables

A Gaussian process generally involves $O(N^3)$ calculations. To avoid such an inconvenience, we choose $Z := [z_1, \cdots, z_M] \in E^M$ and approximate the generation process
$$f_X \sim N(m_X, k_{XX})$$
$$f(x) \mid f_X \sim N(m(x) + k_{xX}k_{XX}^{-1}(f_X - m_X),\ k(x, x) - k_{xX}k_{XX}^{-1}k_{Xx})$$

$$y \mid f(x) \sim N(f(x), \sigma^2)$$
by
$$f_Z \sim N(m_Z, k_{ZZ}) \tag{6.15}$$
$$f(x) \mid f_Z \sim N(m(x) + k_{xZ}k_{ZZ}^{-1}(f_Z - m_Z),\ k(x, x) - k_{xZ}k_{ZZ}^{-1}k_{Zx}) \tag{6.16}$$
$$y \mid f(x) \sim N(f(x), \sigma^2)\,, \tag{6.17}$$
where $m_Z = (m(z_1), \cdots, m(z_M))$, $k_{ZZ} = (k(z_i, z_j))_{i,j=1,\cdots,M}$, and $k_{xZ} = [k(x, z_1), \cdots, k(x, z_M)]$ (row vector). Under the following assumption, we obtain Proposition 63.

Assumption 1 Each occurrence of (6.16) is independent for $x = x_1, \cdots, x_N$.

Proposition 63 Let $\Lambda \in \mathbb{R}^{N \times N}$ be a diagonal matrix whose elements are $\lambda(x_i)$, $i = 1, \cdots, N$. Under the generation process outlined in (6.15), (6.16), (6.17) and Assumption 1, we have
$$f_Z \mid Y \sim N(\mu_{f_Z|Y},\ \Sigma_{f_Z|Y})\,,$$
where
$$\mu_{f_Z|Y} = m_Z + k_{ZZ}Q^{-1}k_{ZX}(\Lambda + \sigma^2 I_N)^{-1}(Y - m_X)\,, \tag{6.18}$$
$$\Sigma_{f_Z|Y} = k_{ZZ}Q^{-1}k_{ZZ}\,, \tag{6.19}$$
and
$$Q := k_{ZZ} + k_{ZX}(\Lambda + \sigma^2 I_N)^{-1}k_{XZ} \in \mathbb{R}^{M \times M} \tag{6.20}$$
with $\lambda(x) := k(x, x) - k_{xZ}k_{ZZ}^{-1}k_{Zx}$.

Proof: From (6.16) and Assumption 1, $f_X \mid f_Z \sim N(m_X + k_{XZ}k_{ZZ}^{-1}(f_Z - m_Z),\ \Lambda)$ for $f_X := [f(x_1), \cdots, f(x_N)]$. Moreover, the expectations of $Y$ and $f_X$ are equal, and only the variance $\sigma^2$ is different, so we have
$$Y \mid f_Z \sim N(m_X + k_{XZ}k_{ZZ}^{-1}(f_Z - m_Z),\ \Lambda + \sigma^2 I_N)\,.$$
Thus, the exponent of the joint distribution of $f_Z$ and $Y$ (the sum of the exponents of $p(f_Z)$ and $p(Y \mid f_Z)$) is
$$-\frac{1}{2}(f_Z - m_Z)^\top k_{ZZ}^{-1}(f_Z - m_Z) - \frac{1}{2}\{Y - (m_X + k_{XZ}k_{ZZ}^{-1}(f_Z - m_Z))\}^\top(\Lambda + \sigma^2 I_N)^{-1}\{Y - (m_X + k_{XZ}k_{ZZ}^{-1}(f_Z - m_Z))\}\,. \tag{6.21}$$

(6.21)

If we differentiate (6.21) by f Z , setting a = f Z − m Z and b = Y − m X yields −1 −1 2 −1 −k −1 Z Z a + k Z Z k Z X ( + σ I N ) (b − k Z X k Z Z a) −1 −1 2 −1 2 = k −1 Z Z k Z X ( + σ I N ) b − k Z Z {k Z Z + k Z X ( + σ I N ))k X Z }k Z Z a −1 −1 2 −1 = k −1 Z Z k Z X ( + σ I N ) b − k Z Z Qk Z Z a

−1 −1 2 −1 = k −1 Z Z Qk Z Z {k Z Z Q k Z X ( + σ I N ) b − a} = − −1 f Z |Y ( f Z − μ f Z |Y ) .

(6.22)

Therefore, the terms w.r.t. f Z in (6.21) are only 1 − ( f Z − μ f Z |Y )  −1 f Z |Y ( f Z − μ f Z |Y ), 2

(6.23) 

and we obtain the proposition.

Proposition 64 Under the generation process outlined in (6.15), (6.16), (6.17) and Assumption 1, we have Y ∼ N (μY , Y ) with

and

μY := m X

(6.24)

Y :=  + σ 2 I N + k X Z k −1 Z Z kX Z .

(6.25)

Proof: Since the expectation μY of Y is m X , we obtain the covariance matrix Y . Let a := f Z − m Z and b := Y − m X . Then, the exponents (6.21) and (6.23) of p(Y, f Z ) and p( f Z |Y ) are, respectively, 1 1 −1 −1  2 −1 − a  k −1 Z Z a − (b − k Z X k Z Z a) ( + σ I ) (b − k Z X k Z Z a) 2 2

(6.26)

and −

1 (a − k Z Z Q −1 k Z X ( + σ 2 I N )−1 b) (k Z Z Q −1 k Z Z )−1 (a − k Z Z Q −1 k Z X ( + σ 2 I N )−1 b) . 2

(6.27)

From (6.20), we have 1 1 1  −1 −1 −1 −1  2 −1 − a  k −1 Z Z a − (k Z X k Z Z a) ( + σ I ) k Z X k Z Z a = − a k Z Z Qk Z Z a . 2 2 2 From p(Y, f 2 ) = p( f 2 |Y ) p(Y ), the difference between (6.26) and (6.27) is the exponent of p(Y ), which is

6.3 Gaussian Processes with Inducing Variables

183

1 1 − b ( + σ 2 I )−1 b + b ( + σ 2 I N )−1 k X Z Q −1 k Z X ( + σ 2 I N )−1 b , 2 2 where we may set a = 0 because no terms will remain w.r.t. a. Furthermore, if we set A =  + σ 2 I N , U = k X Z , V = k Z X , and W = k −1 Z Z in the Woodbury-ShermanMorrison formula (6.9), then we have 1 −1 − b ( + σ 2 I N + k X Z k −1 Z Z kX Z ) b 2 

and obtain (6.25).

Proposition 65 Under the generation process outlined in (6.15), (6.16), (6.17) and Assumption 1, for each x ∈ E, we have f (x)|Y ∼ N (μ(x), σ 2 (x)) −1 2 −1 μ(x):=m(x)+k x Z k −1 Z Z (μ f Z |Y − m Z )=m(x)+k x Z Q k Z X (+σ I N ) (Y − m X ) (6.28) σ 2 (x) := k(x, x) − k x Z (K Z−1Z − Q −1 )k Z x .

Proof: First, we note that Y → f Z → f (x) forms a Markov chain in this order. In the following, we consider the distribution of f (x)|Y instead of f (x)| f Z , i.e., the distribution of f (x)| f Z and f Z |Y . In (6.16), the term with a mean of k x Z k −1 Z Z ( fZ − mZ ) becomes k x Z k −1 Z Z (μ f Z |Y − m Z ) when averaged over f Z |Y . Thus, we obtain (6.28). Moreover, if we take the variance of that term with respect to f Z |Y , we obtain the same value as the variance of k x Z k −1 Z Z ( f Z − μ f Z |Y ), so we have −1 −1  −1 −1 E[k x Z k −1 Z Z ( f Z − μ f Z |Y )( f Z − μ f Z |Y ) k Z Z k Z x ] = k x Z k Z Z  f Z |Y k Z Z k Z x = k x Z Q k Z x ,

(6.29) where f Z varies with the given Y . Furthermore, from (6.16), since the variance λ(x) = k(x, x) − k x Z k −1 Z Z k Z x of f (x)| f Z is independent of f Z , we can write the variance of f (x)|Y as the sum of the variance λ(x) of f (x)| f Z and (6.29). In other  words, we have σ 2 (x) = λ(x) + k x Z Q −1 k Z x . In a case when the inducing variable method is employed, the calculations of k Z Z , k x Z take O(M 2 ) and O(M), respectively, the calculation of  takes O(N ), and the calculations of Q and Q −1 take O(N M 2 ) and O(M 3 ), respectively. The multiplication process is also completed in O(N M 2 ). On the other hand, without the inducing variable method, it takes O(N 3 ) of computational time. In the inducing variable method, we do not use the matrix K X X ∈ R N ×N . We can randomly select the inducing points z 1 , · · · , z M from x1 , · · · , x N or via K-means clustering. Example 88 Based on the above discussion, we constructed the function gp_ind by using the inducing variable method and compared its performance with that of gp_1, which does not use the inducing variable method.


sigma_2 = 0 . 0 5 # s h o u l d be e s t i m a t e d def k ( x , y ) : # Covariance f u n c t i o n r e t u r n np . exp ( − ( x − y ) ∗∗2 / 2 / s i g m a _ 2 ) d e f mu ( x ) : # Mean f u n c t i o n return x # Data G e n e r a t i o n n = 200 x = np . random . u n i f o r m ( s i z e = n ) ∗ 6 − 3 y = np . s i n ( x / 2 ) + np . random . r a n d n ( n ) e p s = 10∗∗( − 6)

m = 100 K = np . z e r o s ( ( n , n ) ) f o r i i n range ( n ) : f o r j i n range ( n ) : K[ i , j ] = k ( x [ i ] , x [ j ] ) i n d e x = np . random . c h o i c e ( n , s i z e = m, r e p l a c e = F a l s e ) z = x [ index ] m_x = 0 m_z = 0 K_zz = np . z e r o s ( ( m, m) ) f o r i i n range (m) : f o r j i n range (m) : K_zz [ i , j ] = k ( z [ i ] , z [ j ] ) K_xz = np . z e r o s ( ( n , m) ) f o r i i n range ( n ) : f o r j i n range (m) : K_xz [ i , j ] = k ( x [ i ] , z [ j ] ) K _ z z _ i n v = np . l i n a l g . i n v ( K_zz + np . d i a g ( [ 1 0 ∗ ∗ e p s ] ∗m) ) lam = np . z e r o s ( n ) f o r i i n range ( n ) : lam [ i ] = k ( x [ i ] , x [ i ] ) − np . d o t ( np . d o t ( K_xz [ i , 0 :m] , K _ z z _ i n v ) , K_xz [ i , 0 : m] ) l a m _ 0 _ i n v = np . d i a g ( 1 / ( lam+ s i g m a _ 2 ) ) Q = K_zz + np . d o t ( np . d o t ( K_xz . T , l a m _ 0 _ i n v ) , K_xz ) ## Computation o f Q d o e s n o t r e q u i r e O( n ^ 3 ) Q_inv = np . l i n a l g . i n v (Q + np . d i a g ( [ e p s ] ∗ m) ) muu = np . d o t ( np . d o t ( np . d o t ( Q_inv , K_xz . T ) , l a m _ 0 _ i n v ) , y−m_x ) d i f = K _ z z _ i n v − Q_inv R = np . l i n a l g . i n v (K + s i g m a _ 2 ∗ np . i d e n t i t y ( n ) ) ## O( n ^ 3 ) o m p u t a t i o n is required

def gp_ind ( x_pred ) : ## I n d u c i n g V a r i a b l e Method h = np . z e r o s (m) f o r i i n range (m) : h [ i ] = k ( x_pred , z [ i ] ) mm = mu ( x _ p r e d ) + h . d o t ( muu ) s s = k ( x_pred , x_pred ) − h . dot ( d i f ) . dot ( h ) r e t u r n {"mm" : mm, "ss" : s s } d e f gp_1 ( x _ p r e d ) : # # W/ O I n d u c i n g V a r i a b l e Method h = np . z e r o s ( n ) f o r i i n range ( n ) : h [ i ] = k ( x_pred , x [ i ] ) mm = mu ( x _ p r e d ) + np . d o t ( np . d o t ( h . T , R ) , y−mu ( x ) ) s s = k ( x _ p r e d , x _ p r e d ) − np . d o t ( np . d o t ( h . T , R) , h ) r e t u r n {"mm" : mm, "ss" : s s } x _ s e q = np . a r a n g e ( − 2 , 2 . 1 , 0 . 1 ) mmv = [ ] ; s s v = [ ]


for u in x_seq : mmv . a p p e n d ( g p _ i n d ( u ) [ "mm" ] ) s s v . a p p e n d ( g p _ i n d ( u ) [ "ss" ] ) plt . figure () p l t . p l o t ( x_seq , mmv, c = "r" ) p l t . p l o t ( x_seq , np . a r r a y (mmv) + 3 ∗ np . s q r t ( np . a r r a y ( s s v ) ) , c = "r" , l i n e s t y l e = "−−" ) p l t . p l o t ( x_seq , np . a r r a y (mmv) − 3 ∗ np . s q r t ( np . a r r a y ( s s v ) ) , c = "r" , l i n e s t y l e = "−−" ) p l t . x l i m ( −2 , 2 ) p l t . p l o t ( np . min (mmv) , np . max (mmv) ) x _ s e q = np . a r a n g e ( − 2 , 2 . 1 , 0 . 1 ) mmv = [ ] ; s s v = [ ] for u in x_seq : mmv . a p p e n d ( gp_1 ( u ) [ "mm" ] ) s s v . a p p e n d ( gp_1 ( u ) [ "ss" ] ) mmv = np . a r r a y (mmv) s s v = np . a r r a y ( s s v ) plt plt plt plt

. . . .

p l o t ( x_seq p l o t ( x_seq p l o t ( x_seq scatter (x ,

, mmv, c = "b" ) , mmv + 3 ∗ np . s q r t ( s s v ) , c = "b" , l i n e s t y l e = "−−" ) , mmv − 3 ∗ np . s q r t ( s s v ) , c = "b" , l i n e s t y l e = "−−" ) y , f a c e c o l o r s =’none’ , e d g e c o l o r s = "k" , m a r k e r = "o" )

6.4 Karhunen-Loève Expansion

In this section, we continue to study the probability space $(\Omega, \mathcal{F}, P)$ and the map $f: \Omega \times E \ni (\omega, x) \mapsto f(\omega, x) \in H$. We assume that $H$ is a general separable Hilbert space. In the following, we continue to denote $f(\omega, x)$ by $f(x)$ as a random variable for each $x \in E$. In particular, we assume that $f$ is a mean-square continuous process (Fig. 6.2), which is defined by
$$\lim_{n\to\infty} E|f(\omega, x_n) - f(\omega, x)|^2 = 0 \tag{6.30}$$
for an arbitrary $\{x_n\}$ in $E$ that converges to $x \in E$. We do not assume a Gaussian process, and we give the expectation at $x \in E$ and the covariance at $x, y \in E$ by
$$m(x) = E f(\omega, x) \quad \text{and} \quad k(x, y) = \mathrm{Cov}(f(\omega, x), f(\omega, y))\,.$$
In Chap. 5, we obtained the expectation and covariance of $k(X, \cdot)$; in this section, however, $x, y \in E$ are not random, and the randomness of $m, k$ is due to that of $f(\omega, \cdot)$. In the following, we assume that $E$ is compact.


Fig. 6.2 The red and blue curves show the results obtained by the inducing variable method and the standard Gaussian process, respectively

Proposition 66 $f$ is a mean-square continuous process if and only if $m$ and $k$ are continuous.
Proof: See the Appendix at the end of this chapter.

In the following, we assume that $m \equiv 0$ to simplify the discussion. Since $E$ is compact, there exists a partition of $E$ into $M(n)$ sets ($\cup_{i=1}^{M(n)} E_i = E$, $E_i \cap E_j = \emptyset$, $i \neq j$) such that the diameter of each $E_i$ is at most $1/n$, where each $E_i$ is a metric space and we define the diameter as the maximum distance among the elements of $E_i$. Then, we define
$$I_f(g; \{(E_i, x_i)\}_{1 \leq i \leq M(n)}) := \sum_{i=1}^{M(n)} f(\omega, x_i)\int_{E_i} g(y)\, d\mu(y)$$
for a pair of interior points $\{(E_i, x_i)\}_{1 \leq i \leq M(n)}$ and $g \in L^2(E, \mathcal{B}(E), \mu)$. Hence, we have

{I f (g; {(E i , xi )}1≤i≤M(n) )}2 d P(ω) ≤ M(n)

d P = M(n)

M(n) i=1

M(n)  i=1

 k(xi , xi )

 

{ f (ω, xi )}2 {

g(u)dμ(u)}2 Ei

{g(u)}2 dμ(u) < ∞ Ei

and I f (g; {(E i , xi )}1≤i≤M(n) ) ∈ L 2 (, F, P). Although this value is different depending on the choices of the region decomposition and the points inside the regions, the difference in I f converges to zero as n goes to infinity. In fact, we have


E |I f (g; {(E i , xi )}1≤i≤M(n) ) − I f (g; {(E j , x j )}1≤ j≤M(n  ) )|2   M(n) M(n) k(xi , xi  ) g(u)dμ(u) g(v)dμ(v) = Ei

i=1 i  =1 

+

Ei 





M(n ) M(n )



k(xi , x j  )

g(v)dμ(v) .

g(u)dμ(u) Ei

j=1

g(v)dμ(v) E j



 M(n) M(n )

i=1

g(u)dμ(u) Ej

j=1 j  =1

−2



k(x j , x j  )

E j

Since k is uniformly continuous, each double sum on the right-hand side converges to   k(u, v)g(u)g(v)dμ(u)dμ(v) . E

E

Since the Cauchy sequence converges to zero, its convergence destination I f (ω, g) is contained in L 2 (ω, F, P) regardless of the choice of {(E i , xi )}1≤i≤M(n) . If the eigenvalues and eigenfunctions obtained from the integral operator Tk ∈ B(L 2 (E, B(E), μ)),  Tk g(·) = k(y, ·)g(y)dμ(y) , g ∈ L 2 (E, B(E), μ), E ∞ are {λ j }∞ j=1 and {e j (·)} j=1 , by Mercer’s theorem, we can express the covariance function k as ∞ λ j e j (x)e j (y) , (6.31) k(x, y) = j=1

where the sum absolutely and uniformly converges on that support. Then, we have the following claim. Proposition 67 If { f (ω, x)}x∈E is a mean-square continuous process with a mean of zero, we have 1. E[I f (ω, g)] = 0.  2. E[I f (ω, g) f (ω, x)] = E k(x,  y)g(y)dμ(y), x ∈ E. 3. E[I f (ω, g)I f (ω, h)] = E E k(x, y)g(x)h(y)dμ(x)dμ(y). These properties hold for each g, h ∈ L 2 (E, F, μ), and in particular, we have that E[I f (ω, ei )I f (ω, e j )] = δi, j λi .

(6.32)

Proof: For the proofs of the above three items, see the Appendix at the end of this chapter. We obtain (6.32) by substituting Mercer’s theorem (6.31), g = ei , and h = e j into the third item:


E[I f (ω, ei )I f (ω, e j )] =

  ∞ E

λr er (x)er (y)ei (x)e j (y)dμ(x)dμ(y).

E r =1

 Furthermore, we have the following theorem. Proposition 68 (Karhunen-Lóeve [17, 18])Suppose that { f (ω, x)}x∈E is a meansquare continuous process with a mean of zero. Then, we have lim sup E| f (ω, x) − f n (ω, x)|2 = 0

n→∞ x∈E

for f n (ω, x) :=

n j=1

I f (ω, e j )e j (x).

Proof: From (6.32), we have E[ f n (ω, x)2 ] = E[{

n

I f (ω, e j )e j (x)}2 ]

j=1

=

n n

E[I f (ω, ei )I f (ω, e j )]ei (x)e j (x) =

n

i=1 j=1

λ j e2j (x) .

j=1

Moreover, from (6.31) and the second item of Proposition 67, we have  n n E[ f n (ω, x) f (ω, x)] = E[ I f (ω, e j )e j (x) f (ω, x)] = e j (x) k(x, y)e j (y)dμ(y) =

n j=1

j=1

 λ j e2j (x)

E

e2j (y)dμ(y) =

j=1 n

E

λ j e2j (x) ,

j=1

which means that E| f n (ω, x) − f (ω, x)|2 = E[ f n (ω, x)2 ] − 2E[ f n (ω, x) f (ω, x)] + E[ f (ω, x)2 ] n n n λ j e2j (x) − 2 λ j e2j (x) + k(x, x) = k(x, x) − λ j e2j (x) . = j=1

j=1

j=1

 In a general mean-square continuous process (without assuming a Gaussian process), the series expansion provided by Karhunen-Lóeve’s theorem makes I f (ω, e j )/ λ j a random variable with a mean of 0 and a variance of 1. Instead, if we assume a Gaussian process such that f (x) (x ∈ E) follows a Gaussian distribution, then we can write n z j λ j e j (x) , (6.33) f n (x) = j=1


where z j follows an independent standard Gaussian distribution. Let E := [0, 1] and (, F, P) be a probability space. Then, we call the map  × E  (ω, x) → f (ω, x) ∈ R that satisfies the following conditions a Brownian motion. 1. f (ω, 0) = 0, f (ω, x) − f (ω, y) ∼ N (0, y − x), and 0 ≤ x < y. 2. f (ω, x2 ) − f (ω, x1 ), . . . , f (ω, xn−1 ) − f (ω, xn ) are independent for any n = 1, 2, . . . and 0 ≤ x1 < x2 < . . . < xn . 3. There exists an  ∈ F with a probability of 1, and E  x → f (ω, x) is continuous for each ω ∈ . In this case, we have the following proposition. Proposition 69 The map  × E  (ω, x) → f (ω, x) ∈ R is a Brownian motion if and only if the following three conditions are satisfied simultaneously. 1. It is a Gaussian process. 2. The covariance function of x, y ∈ E is given by k(x, y) = min(x, y). 3. f (ω, ·) is continuous with a probability of 1. Proof: The first two conditions in the definition imply the first condition in Proposition 69. Moreover, if x < y, then we have E[ f (ω, x) f (ω, y)] = E[ f (ω, x)2 ] + E[ f (ω, x){ f (ω, y) − f (ω, x)}] = x , which implies the second condition of Proposition 69. On the contrary, supposing that m ≡ 0 for simplicity, if we assume that the first two items of Proposition 69 hold, then because k(x, x) = x, when x ≤ y ≤ z, we have E[ f (ω, x){ f (ω, y) − f (ω, z)}] = k(x, y) − k(x, z) = x − x = 0 and E[ f (ω, y){ f (ω, y) − f (ω, z)}] = k(y, y) − k(y, z) = y − y = 0 , which implies that E[{ f (ω, x) − f (ω, y)}{ f (ω, y) − f (ω, z)}] = 0 . Moreover, from k(0, 0) = 0, the variance of f (ω, 0) is zero, so we have f (ω, 0) = 0. Furthermore, we have E[{ f (ω, x) − f (ω, y)}2 ] = k(x, x) − 2k(x, y) + k(y, y) = x − 2x + y = y − x .  Example 89 (Brownian Motion as a Gaussian Process) For the integral operator (Example 58) on the covariance function of a Brownian motion k(x, y) = min(x, y) (x, y ∈ E), its eigenvalues and eigenfunctions are (3.13) and (3.14), respectively.

190

6 Gaussian Processes and Functional Data Analyses

Utilizing these eigenvalues and eigenfunctions, we can expand f (ω, ·) as follows. In particular, from (6.33), we have f n (x) =

n

z j (ω) λ j e j (x)

j=1 n m times (Fig. 6.3; n = 10; for z j ∼ N (0, 1). We generate the series {z i (ω)}i=1 m = 7). We execute it with the following code. d e f lam ( j ) : return 4 / ( ( 2 ∗ def ee ( j , x ) : r e t u r n np . s q r t ( 2 )

## EigenValue j − 1 ) ∗ np . p i ) ∗∗2 ## D e f i n i t i o n o f E i g e n f u n c t i o n ∗ np . s i n ( ( 2 ∗ j − 1 ) ∗ np . p i / 2 ∗ x )

n = 10; m = 7 ## D e f i n i t i o n o f Gaussian Process def f ( z , x ) : n = len ( z ) S = 0 f o r i i n range ( n ) : S = S + z [ i ] ∗ e e ( i , x ) ∗ np . s q r t ( lam ( i ) ) return S plt . figure () p l t . xlim (0 , 1) p l t . x l a b e l ( "x" ) p l t . y l a b e l ( "f(omega,x)" ) c o l o r m a p = p l t . cm . g i s t _ n c a r # n i p y _ s p e c t r a l , S e t 1 , P a i r e d c o l o r s = [ c o l o r m a p ( i ) f o r i i n np . l i n s p a c e ( 0 , 0 . 8 , m) ] f o r j i n range (m) : z = np . random . r a n d n ( n ) x _ s e q = np . a r a n g e ( 0 , 3 . 0 0 1 , 0 . 0 0 1 ) y_seq = [ ] for x in x_seq : y_seq . append ( f ( z , x ) ) p l t . p l o t ( x_seq , y_seq , c = c o l o r s [ j ] ) p l t . t i t l e ( "BrownMotion" )
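As a quick, self-contained numerical check (not in the text) of the Mercer expansion used here: with the eigenvalues (3.13) and eigenfunctions (3.14) of the Brownian-motion covariance, the truncated series $\sum_j \lambda_j e_j(s)e_j(t)$ should approach $k(s, t) = \min(s, t)$ as the number of terms grows. The sketch below uses 1-indexed eigenpairs and hypothetical names lam_j, e_j.

import numpy as np

def lam_j(j):                      # eigenvalues (3.13), j = 1, 2, ...
    return 4 / ((2 * j - 1) * np.pi) ** 2

def e_j(j, x):                     # eigenfunctions (3.14)
    return np.sqrt(2) * np.sin((2 * j - 1) * np.pi * x / 2)

s, t, n_terms = 0.3, 0.7, 50
approx = sum(lam_j(j) * e_j(j, s) * e_j(j, t) for j in range(1, n_terms + 1))
print(approx, min(s, t))           # the truncated sum approaches min(s, t) as n_terms grows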

We introduce the Matérn class, which is a class of kernels used for stochastic processes rather than for the RKHSs of machine learning. Such a kernel is $k(x, y) = \varphi(z)$ ($z := x - y$ for $x, y \in E$), where
$$\varphi(z) := \frac{2^{1-\nu}}{\Gamma(\nu)}\Bigl(\frac{\sqrt{2\nu}\,z}{l}\Bigr)^{\nu} K_\nu\Bigl(\frac{\sqrt{2\nu}\,z}{l}\Bigr)\,, \tag{6.34}$$
$\nu, l > 0$ are the parameters of the kernel, and $K_\nu$ is the modified Bessel function of the second kind,
$$K_\alpha(x) := \frac{\pi}{2}\,\frac{I_{-\alpha}(x) - I_\alpha(x)}{\sin(\alpha\pi)}\,,$$

Fig. 6.3 We generated the sample paths of Brownian motion seven times. Each run involved a sum of up to 10 terms

$$I_\alpha(x) := \sum_{m=0}^{\infty}\frac{1}{m!\,\Gamma(m + \alpha + 1)}\Bigl(\frac{x}{2}\Bigr)^{2m+\alpha}\,.$$

In practice, we use (6.34) with $\nu = p + 1/2$ for a positive integer $p$. In the one-dimensional case, we have
$$\varphi_\nu(z) = \exp\Bigl(-\frac{\sqrt{2\nu}\,z}{l}\Bigr)\frac{\Gamma(p + 1)}{\Gamma(2p + 1)}\sum_{i=0}^{p}\frac{(p + i)!}{i!\,(p - i)!}\Bigl(\frac{\sqrt{8\nu}\,z}{l}\Bigr)^{p-i}\,. \tag{6.35}$$

For example, we can express $\varphi_\nu$ for $\nu = 5/2, 3/2, 1/2$ as follows. In particular, we call the stochastic process with $\nu = 1/2$ the Ornstein-Uhlenbeck process.
$$\varphi_{5/2}(z) = \Bigl(1 + \frac{\sqrt{5}\,z}{l} + \frac{5z^2}{3l^2}\Bigr)\exp\Bigl(-\frac{\sqrt{5}\,z}{l}\Bigr)$$
$$\varphi_{3/2}(z) = \Bigl(1 + \frac{\sqrt{3}\,z}{l}\Bigr)\exp\Bigl(-\frac{\sqrt{3}\,z}{l}\Bigr)$$
$$\varphi_{1/2}(z) = \exp(-z/l)\,.$$
For example, if we write this process in Python, we have the following code.

6 Gaussian Processes and Functional Data Analyses

Matern Kernel (l = 0.02) 10

= 1.5 = 2.5 = 3.5 = 4.5 = 5.5 = 6.5 = 7.5 = 8.5 = 9.5 = 10.5

ν ν ν ν ν ν ν ν ν ν

= 1.5 = 2.5 = 3.5 = 4.5 = 5.5 = 6.5 = 7.5 = 8.5 = 9.5 = 10.5

0

0

Kernel Values ϕ(z) 2 4 6 8

ν ν ν ν ν ν ν ν ν ν

Kernel Values ϕ(z) 4 6 8 2

10

Matern Kernel (l=0.1)

0.0

0.1

0.2

0.3

0.4

z

0.5

0.0

0.1

0.2

0.3

0.4

0.5

z

Fig. 6.4 The values of the Matérn kernel for ν = 1/2, 3/2, . . . , m + 1/2 (Example 90). l = 0.1 (left) and l = 0.02 (right)

from s c i p y . s p e c i a l import gamma d e f m a t e r n ( nu , l , r ) : p = nu − 1 / 2 S = 0 f o r i i n range ( i n t ( p + 1 ) ) : S = S + gamma ( p + i + 1 ) / gamma ( i + 1 ) / gamma ( p − i + 1 ) \ ∗ ( np . s q r t ( 8 ∗ nu ) ∗ r / l ) ∗ ∗ ( p − i ) S = S ∗ gamma ( p + 2 ) / gamma ( 2 ∗ p + 1 ) ∗ np . exp ( − np . s q r t ( 2 ∗ nu ) ∗ r / l) return S

Example 90 We present the Matérn kernel values for l = 0.1, 0.02 with ν = 1/2, 3/2, . . . , m + 1/2 (Fig. 6.4). m = 10 l = 0.1 c o l o r m a p = p l t . cm . g i s t _ n c a r # n i p y _ s p e c t r a l , S e t 1 , P a i r e d c o l o r = [ c o l o r m a p ( i ) f o r i i n np . l i n s p a c e ( 0 , 1 , l e n ( range (m) ) ) ] x = np . l i n s p a c e ( 0 , 0 . 5 , 2 0 0 ) p l t . p l o t ( x , m a t e r n ( 1 − 1 / 2 , l , x ) , c = c o l o r [ 0 ] , l a b e l = r"$\nu=%d$"%1) p l t . ylim (0 , 10) f o r i i n range ( 2 , m + 1 ) : p l t . p l o t ( x , m a t e r n ( i − 1 / 2 , l , x ) , c = c o l o r [ i − 1 ] , l a b e l = r"$\nu=%d$" %i ) p l t . legend ( loc =

"upperright" ,

f r a m e o n = True , p r o p ={’size’ : 1 4 } )

In the case of the Matérn kernel and in general, we cannot analytically obtain the eigenvalues and eigenfunctions, as in the cases involving Gaussian kernels and Brownian motion. Even in those cases, if we assume a Gaussian process, then we can find x1 , . . . , xn ∈ E to obtain its Gram matrix, which will be a covariance matrix.

6.4 Karhunen-Lóeve Expansion

193

0 -3

-2

-1

y

1

2

3

OU Process (ν = 1/2, l = 0.1)

0.0

0.2

0.4

0.6

0.8

1.0

x

0 -3

-2

-1

y

1

2

3

Matern Process (ν = 3/2, l = 0.1)

0.0

0.2

0.4

0.6

0.8

1.0

x Fig. 6.5 The Orstein-Uhlenbeck process (ν = 1.2, top) and the Matérn process (ν = 3/2, top) for l = 0.1

Thus, it is sufficient to generate n-variate random numbers that follow a Gaussian distribution. The above method is approximate, but it is very versatile. Example 91 We display the Orstein-Uhlenbeck process (ν = 1.2, top) and the Matérn process (ν = 3/2, top) with n = 100 and l = 0.1 (Fig. 6.5). c o l o r m a p = p l t . cm . g i s t _ n c a r # n i p y _ s p e c t r a l , S e t 1 , P a i r e d c o l o r s = [ c o l o r m a p ( i ) f o r i i n np . l i n s p a c e ( 0 , 0 . 8 , 5 ) ] d e f r a n d _ 1 0 0 ( Sigma ) : L = np . l i n a l g . c h o l e s k y ( Sigma ) ## Cholesky d e c o m p o s i t i o n o f c o v a r i a n c e matrix

194

6 Gaussian Processes and Functional Data Analyses

u = np . random . r a n d n ( 1 0 0 ) y = L . dot ( u ) # # G e n e r a t e random numbers w i t h z e r o −mean and t h e covariance matrix return y x = np . l i n s p a c e ( 0 , 1 , 1 0 0 ) z = np . abs ( np . s u b t r a c t . o u t e r ( x , x ) ) # c o m p u t e d i s t a n c e m a t r i x , d_ { i j } = | x _ i − x_j | l = 0.1 Sigma_OU = np . exp ( − z / l ) # # OU: matern ( 1 / 2 , l , z ) i s slow y = r a n d _ 1 0 0 ( Sigma_OU ) plt plt plt for

. figure () . plot (x , y) . ylim ( −3 ,3) i i n range ( 5 ) : y = r a n d _ 1 0 0 ( Sigma_OU ) plt . plot (x , y , c = colors [ i ]) p l t . t i t l e ( "OUprocess(nu=1/2,l=0.1)" )

Sigma_M = m a t e r n ( 3 / 2 , l , z ) # # Matern y = r a n d _ 1 0 0 ( Sigma_M ) plt . figure () plt . plot (x , y) p l t . y l i m ( −3 , 3 ) f o r i i n range ( 5 ) : y = r a n d _ 1 0 0 ( Sigma_M ) plt . plot (x , y , c = colors [ i ]) p l t . t i t l e ( "Maternprocess(nu=3/2,l=0.1)" )

6.5 Functional Data Analysis

Let (Ω, F, P) and H be a probability space and a separable Hilbert space, respectively. Let F : Ω → H be a measurable map, i.e., a map for which the inverse image of each open ball {h ∈ H : ‖g − h‖ < r} (g ∈ H, r ∈ (0, ∞)) in H is an element of F. We call such an F : Ω → H a random element of H. Intuitively, a random element is a random variable that takes values in H. Thus far, we have assumed that f : Ω × E → R is measurable at each x ∈ E (a stochastic process). This section addresses situations in which we do not assume such measurability. For simplicity, we write F(ω) as F, similar to the elements of H. Although we do not go into details in this book, the following relationship holds between stochastic processes and random elements. It is only necessary to understand the close relationship between the two.

Proposition 70 (Hsing-Eubank [14])
1. If f : Ω × E → R is measurable w.r.t. the product σ-field and f(ω, ·) ∈ H for ω ∈ Ω, then f(ω, ·) is a random element of H.
2. If f(·, x) : Ω → R is measurable for each x ∈ E and f(ω, ·) is continuous for each ω ∈ Ω, then f(ω, ·) is a random element.
3. If f : Ω × E → R is a (zero-mean) mean-square continuous process and its covariance function is k, then a random element of H exists whose covariance operator is H ∋ g ↦ ∫_E k(·, y)g(y)dμ(y) ∈ H.


4. A random element in an RKHS H(k) with a measurable reproducing kernel k is a stochastic process, and a stochastic process that takes values in the RKHS H(k) is a random element of H(k).

For the proof, see Chap. 7 in [14]. In this section, we learn the properties of random elements and apply them to functional data analysis [24].

First, we consider the average E[⟨F, g⟩] of ⟨F, g⟩ for each g ∈ H under E‖F‖ < ∞. Since g ↦ E[⟨F, g⟩] is a bounded linear functional, there exists a unique m ∈ H such that

E[⟨F, g⟩] = ⟨m, g⟩    (6.36)

from Proposition 22. We write this formally as m = E[F], which is the definition of the mean of a random element F.

Proposition 71 If E‖F‖² < ∞, then E‖F − m‖² = E‖F‖² − ‖m‖² holds.

Proof: If we substitute g = m into (6.36), we obtain

E‖F − m‖² = E‖F‖² − 2E⟨F, m⟩ + ‖m‖² = E‖F‖² − 2⟨m, m⟩ + ‖m‖² = E‖F‖² − ‖m‖². ∎

Since E‖F‖² < ∞ implies E‖F‖ < ∞, we proceed with our discussion by assuming the former condition.

Regarding covariance, if H = R^p, then the covariance matrix is E[(F − E[F])(F − E[F])^⊤] = E[(F − E[F]) ⊗ (F − E[F])] ∈ R^{p×p} for F ∈ R^p. For a general Hilbert space H, the correspondence H² ∋ (g, h) ↦ E[⟨F − m, g⟩⟨F − m, h⟩] ∈ R is linear in each of g and h. Moreover, if E‖F‖² < ∞, then it is bounded because

E[⟨F − m, g⟩⟨F − m, h⟩] ≤ E‖F − m‖² · ‖g‖ ‖h‖ ≤ E‖F‖² · ‖g‖ ‖h‖.

If we define u ⊗ v ∈ B(H) by H ∋ w ↦ (u ⊗ v)w = ⟨u, w⟩v ∈ H for u, v, w ∈ H, then a K ∈ B(H) exists such that

E[⟨F − m, g⟩⟨F − m, h⟩] = E[⟨{(F − m) ⊗ (F − m)}g, h⟩] = ⟨Kg, h⟩ = ⟨g, K*h⟩.

If we exchange g and h, we obtain the same value, so K and K* coincide; that is, K is self-adjoint. Such a K is called a covariance operator, and we formally write K = E[(F − m) ⊗ (F − m)].

Proposition 72 E[(F − m) ⊗ (F − m)] = E[F ⊗ F] − m ⊗ m.

Proof: From (6.36), for arbitrary g, h ∈ H, we have

E[⟨m, g⟩⟨F, h⟩] = ⟨m, g⟩⟨E[F], h⟩ = ⟨m, g⟩⟨m, h⟩ = ⟨(m ⊗ m)g, h⟩.

From

⟨F − m, g⟩⟨F − m, h⟩ = ⟨F, g⟩⟨F, h⟩ − ⟨F, g⟩⟨m, h⟩ − ⟨m, g⟩⟨F, h⟩ + ⟨m, g⟩⟨m, h⟩,

we have E[⟨{(F − m) ⊗ (F − m)}g, h⟩] = E[⟨{F ⊗ F − m ⊗ m}g, h⟩]. ∎

In the following, for simplicity, we proceed with our discussion by assuming that m = 0.

Proposition 73 If m = 0 and E‖F‖² < ∞, then
1. the covariance operator K is nonnegative definite and is a trace-class operator whose trace is ‖K‖_TR = E‖F‖²;
2. with probability 1, F lies in the closure of Im(K).

Proof: For g ∈ H,

⟨Kg, g⟩ = E[⟨F, g⟩⟨F, g⟩] ≥ 0

holds, which means that K is nonnegative definite. Moreover, if {e_j} is an orthonormal basis, we have

‖K‖_TR = Σ_{j=1}^∞ ⟨Ke_j, e_j⟩ = Σ_{j=1}^∞ ⟨E[F ⊗ F]e_j, e_j⟩ = E‖F‖² < ∞.

For the second item, note that in general, we have

(Im(K))^⊥ = Ker(K).    (6.37)


In fact, if g ∈ Ker(K), then, since K is self-adjoint (K = K*),

⟨g, Kh⟩ = ⟨Kg, h⟩ = 0, h ∈ H.

Therefore, g is orthogonal to any element of Im(K), and we have g ∈ (Im(K))^⊥. Conversely, if g ∈ (Im(K))^⊥, then KKg ∈ Im(K) and ‖Kg‖² = ⟨g, KKg⟩ = 0, i.e., we have g ∈ Ker(K). This means that for g ∈ (Im(K))^⊥, we have E[⟨F, g⟩²] = ⟨Kg, g⟩ = 0, and F is orthogonal to any g ∈ (Im(K))^⊥ with probability 1. Therefore, from Proposition 20, with probability 1, we have F ∈ (Im(K))^⊥⊥, the closure of Im(K). ∎

Additionally, from Propositions 27 and 31 and the first item of Proposition 73, the following holds.

Proposition 74 The eigenfunctions {e_j} of the covariance operator K form an orthonormal basis of the closure of Im(K); the corresponding eigenvalues {λ_j}_{j=1}^∞ are nonnegative, monotonically decreasing, and converge to 0. Furthermore, the multiplicity of each nonzero eigenvalue is finite.

Additionally, from Propositions 73 and 74, the following holds.

Proposition 75 If {f_j} is an orthonormal basis of H, then we have

E‖F − Σ_{j=1}^n ⟨F, f_j⟩ f_j‖² = E‖F‖² − Σ_{j=1}^n ⟨Kf_j, f_j⟩,    (6.38)

which is minimized when f_j = e_j (1 ≤ j ≤ n).

Proof: The following two equations imply (6.38):

E‖F − Σ_{j=1}^n ⟨F, f_j⟩ f_j‖² = E‖F‖² + E‖Σ_{j=1}^n ⟨F, f_j⟩ f_j‖² − 2E[⟨F, Σ_{j=1}^n ⟨F, f_j⟩ f_j⟩],

E‖Σ_{j=1}^n ⟨F, f_j⟩ f_j‖² = E[⟨F, Σ_{j=1}^n ⟨F, f_j⟩ f_j⟩] = E[Σ_{j=1}^n ⟨F, f_j⟩²] = Σ_{j=1}^n ⟨Kf_j, f_j⟩.

Then, from E‖F‖² = ‖K‖_TR = Σ_{j=1}^∞ λ_j (Proposition 73) and Proposition 28, we obtain Proposition 75. ∎

For example, from the independent realizations F_1, . . . , F_N of the random element F, via

m_N = (1/N) Σ_{i=1}^N F_i    (6.39)

K_N = (1/N) Σ_{i=1}^N (F_i − m_N) ⊗ (F_i − m_N),    (6.40)

we can estimate the mean m and covariance operator¹ K (a small numerical sketch of these estimates follows the procedure below). In the following, we examine how to perform principal component analysis (PCA) based on functional data analysis [24]. To obtain the eigenfunctions and eigenvalues, for x_1, . . . , x_n ∈ E, 1 ≤ n ≤ N, and F_i : E → R, we apply the ordinary (nonfunctional) PCA approach to X = (F_i(x_k)) (i = 1, . . . , N and k = 1, . . . , n).

1. Prepare the basis functions η = [η_1, . . . , η_m] : E → R^m.
2. Calculate W = (w_{i,j}) = ∫_E η(x)η(x)^⊤ dx, i.e., w_{i,j} = ∫_E η_i(x)η_j(x)dx.
3. Find C = (c_{i,j})_{i=1,...,N, j=1,...,m} ∈ R^{N×m} such that F_i(x) = Σ_{j=1}^m c_{i,j} η_j(x).
4. Find the coefficients d_1, . . . , d_m of the estimated mean function m_N(x) := (1/N) Σ_{i=1}^N F_i(x), i.e., m_N(x) = Σ_{j=1}^m d_j η_j(x).
5. Since the covariance function is

k(x, y) = (1/N) Σ_{i=1}^N {F_i(x) − m_N(x)}{F_i(y) − m_N(y)} = (1/N) η(x)^⊤ (C − d)^⊤ (C − d) η(y),

if we set the eigenfunctions as φ(x) = b^⊤ η(x) (b ∈ R^m), then the eigenvalue problem for the covariance operator

∫_E k(x, y)φ(y)dy = λφ(x)

under b^⊤ W b = 1 reduces to the problem of finding a b such that

(1/N) η(x)^⊤ (C − d)^⊤ (C − d) W b = λ η(x)^⊤ b,

which is equivalent to

(1/N) (C − d)^⊤ (C − d) W b = λ b.

In particular, if we set u := W^{1/2} b, it becomes the problem of finding a u ∈ R^m such that

(1/N) W^{1/2} (C − d)^⊤ (C − d) W^{1/2} u = λ u

under ‖u‖ = 1.

¹ The denominator of K_N may be N − 1.
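The following is a minimal sketch (not from the original text) of the estimates (6.39) and (6.40) when each realization F_i is observed on a common grid; the sample size, the grid, and the OU-type data below are hypothetical choices for illustration. Example 92 instead works with a basis expansion.

import numpy as np

np.random.seed(0)
N, n_grid = 50, 100                                  # number of realizations and grid size (hypothetical)
x = np.linspace(0, 1, n_grid)

# hypothetical data: zero-mean Gaussian sample paths with an OU covariance
Sigma = np.exp(-np.abs(np.subtract.outer(x, x)) / 0.1)
L = np.linalg.cholesky(Sigma + 1e-10 * np.identity(n_grid))
F = np.array([L.dot(np.random.randn(n_grid)) for _ in range(N)])   # shape (N, n_grid)

m_N = F.mean(axis=0)                                 # discretized version of (6.39)
R = F - m_N                                          # centered realizations F_i - m_N
K_N = R.T.dot(R) / N                                 # discretized version of (6.40)

# the eigenvalues of K_N, scaled by the grid spacing, approximate those of K
lam = np.linalg.eigvalsh(K_N)[::-1] / n_grid
print(lam[:5])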


Example 92 For E = [−π, π], if we set

η_j(x) = 1/√(2π) for j = 1, η_j(x) = (1/√π) cos kx for j = 2k, and η_j(x) = (1/√π) sin kx for j = 2k + 1,

we have

∫_{−π}^{π} η_i(x)η_j(x)dx = δ_{i,j},

and W is the unit matrix of size p. Therefore, the eigenequation becomes (1/n)(C − d)^⊤(C − d)u = λu, and we can apply C ∈ R^{n×p} instead of the design matrix in the PCA procedure (even if we set d = 0 in the above procedure, the centering step is completed automatically). In this example, we use the Canadian weather data from the fda package, which contain the temperature and precipitation for each day of the year in each Canadian city. We construct the programs as follows. We are not given the n functions directly; instead, from the N = 365 daily values, we represent the change in temperature by a linear combination of p basis functions (a Fourier expansion). Therefore, we can say that each function is discretized using a sufficiently large p.

import numpy as np                       # imports assumed earlier in the chapter
import skfda
from sklearn.decomposition import PCA

X, y = skfda.datasets.fetch_weather(return_X_y=True, as_frame=True)
df = X.iloc[:, 0].values

def g(j, x):  # basis consisting of p elements
    if j == 0:
        return 1 / np.sqrt(2 * np.pi)
    if j % 2 == 0:
        return np.cos((j // 2) * x) / np.sqrt(np.pi)
    else:
        return np.sin((j // 2) * x) / np.sqrt(np.pi)

def beta(x, y):  # coefficients in front of the p basis elements
    X = np.zeros((N, p))
    for i in range(N):
        for j in range(p):
            X[i, j] = g(j, x[i])
    beta = np.dot(np.dot(np.linalg.inv(np.dot(X.T, X) + 0.0001 * np.identity(p)), X.T), y)
    return np.squeeze(beta)

N = 365; n = 35; m = 5; p = 100
df = df.coordinates[0].data_matrix
C = np.zeros((n, p))
for i in range(n):
    x = np.arange(1, N + 1) * (2 * np.pi / N) - np.pi
    y = df[i]
    C[i, :] = beta(x, y)

pca = PCA()
pca.fit(C)
B = pca.components_.T
xx = C.dot(B)
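The beta() function above performs a ridge-regularized least-squares fit of the basis expansion. As a side note (ours, not part of the original program), the hypothetical variant below computes the same coefficients with np.linalg.solve instead of an explicit matrix inverse, which is numerically a little more stable:

def beta_solve(x, y, lam=1e-4):   # hypothetical alternative to beta()
    X = np.array([[g(j, xi) for j in range(p)] for xi in x])          # N x p design matrix
    return np.squeeze(np.linalg.solve(X.T.dot(X) + lam * np.identity(p), X.T.dot(y)))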

Each row of C ∈ R^{n×p} contains the p coefficients of one function. Then, B ∈ R^{p×m} (m ≤ p) consists of the principal component vectors, and xx contains the scores of each function. The m column vectors of B are the principal component vectors (the coefficients in front of η_j(x) for j = 1, . . . , p).
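As a quick check of these shapes (ours, not part of the original program):

print(C.shape, B.shape, xx.shape)   # (35, 100), (100, 35), (35, 35) with the settings used above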



Fig. 6.6 The output of approximating Toronto's annual temperature by using m = 2, 3, 4, 5, 6 principal components. As m increases, the original data are recovered more faithfully

We change m and the function z and run the following program to see whether we can recover the original function.

def z(i, m, x):  # the function approximated using only m components (rather than p)
    S = 0
    for j in range(p):
        for k in range(m):
            for r in range(p):
                S = S + C[i, j] * B[j, k] * B[r, k] * g(r, x)
    return S

x_seq = np.arange(-np.pi, np.pi, 2 * np.pi / 100)
plt.figure()
plt.xlim(-np.pi, np.pi)
# plt.ylim(-15, 25)
plt.xlabel("Days")
plt.ylabel("Temp (C)")
plt.title("Reconstruction for each m")
plt.plot(x, df[13], label="Original")
for m in range(2, 7):
    plt.plot(x_seq, z(13, m, x_seq), c=color[m], label="m = %d" % m)
plt.legend(loc="lower center", ncol=2)

Figure 6.6 shows the output of approximating the annual temperature in Toronto by using m = 2, 3, 4, 5, 6 principal components. Next, we list the principal components in order of decreasing eigenvalue and draw a graph of their contribution ratios (Fig. 6.7).


lam = pca.explained_variance_
ratio = lam / sum(lam)  # or use pca.explained_variance_ratio_
plt.plot(range(1, 6), ratio[:5])
plt.xlabel("PC1 through PC5")
plt.ylabel("Ratio")
plt.title("Ratio")
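As a quick consistency check (ours, not part of the original program), the ratio computed above agrees with scikit-learn's built-in attribute:

print(np.allclose(ratio, pca.explained_variance_ratio_))   # expected: True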

The principal component functions are the functions whose basis coefficients are the principal component vectors. The output differs from that of scikit-fda in two ways.
1. Because the dates of the year are normalized from January 1 through December 31 to [−π, π], the values of the principal component functions are multiplied by √(365/(2π)), and the scores are multiplied by √(2π/365).
2. Some principal component vectors are multiplied by −1, resulting in an upside-down function approximation (which is unavoidable when different packages are used).
The first, second, and third principal component functions appear as shown in Fig. 6.8. We use the following program. The first principal component captures the effect over the whole year, with winter temperatures driving the variations between cities.

def h(coef, x):  # define a function from its basis coefficients
    S = 0
    for j in range(p):
        S = S + coef[j] * g(j, x)
    return S

print(B)
plt.figure()
plt.xlim(-np.pi, np.pi)
plt.ylim(-1, 1)
for j in range(3):
    plt.plot(x_seq, h(B[:, j], x_seq), c=colors[j], label="PC%d" % (j + 1))
plt.legend(loc="best")

[[-5.17047156e-01 -2.43880782e-01  7.48988279e-02 ...  5.48412387e-04
  -1.22748578e-03  8.07866592e-01]
 [-7.31215100e-01 -3.44899509e-01  1.05922938e-01 ...  7.75572240e-04
  -1.73592707e-03 -5.71437836e-01]
 [ 3.13430279e-01 -6.12932605e-01  1.50738649e-01 ... -6.99018707e-03
   1.19586600e-03 -5.41866413e-02]
 ...
 [ 3.08129173e-05 -2.83373696e-03  8.28893867e-03 ...  1.35931675e-01
  -2.34867483e-01  1.23616273e-03]
 [ 1.47021532e-03  3.25749669e-03 -4.83933350e-03 ... -1.32596334e-01
   1.60270831e-01  8.94103792e-03]
 [ 1.47021531e-03  3.25749669e-03 -4.83933350e-03 ... -1.32596334e-01
   1.60270831e-01 -1.17997796e-02]]

Fig. 6.7 Contribution ratios of the first through fifth principal components for the Canadian weather temperature data. We can calculate the contribution rates as in the ordinary case where functional data analysis is not used

Fig. 6.8 The first, second, and third principal component functions for temperature in the Canadian weather data. Some of the principal component functions are multiplied by −1, which means that they are upside down compared to those of other packages. Additionally, because the horizontal axis is normalized to [−π, π], the value of each eigenfunction is multiplied by √(365/(2π))

place = X.iloc[:, 1]
index = [9, 11, 12, 13, 16, 23, 25, 26]
others = [x for x in range(34) if x not in index]
first = [place[i][0] for i in index]
print(first)
plt.figure()
plt.xlim(-15, 25)
plt.ylim(-25, -5)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Canadian Weather")
plt.scatter(xx[others, 0], xx[others, 1], marker="x", c="k")
for i in range(8):
    l = plt.text(xx[index[i], 0], xx[index[i], 1], s=first[i], c=color[i])

['Q', 'M', 'O', 'T', 'W', 'C', 'V', 'V']


Fig. 6.9 Canadian weather temperature scores (the first vs. the second principal component), with warmer regions such as Vancouver and Victoria appearing furthest to the left in the first principal component. The labels are Q: Quebec, M: Montreal, O: Ottawa, T: Toronto, W: Winnipeg, C: Calgary, V: Vancouver, V: Victoria

Appendix

Proof of Proposition 66

Proof: Since the expectation and variance of f(ω, x) − m(x) are 0 and k(x, x), respectively, and the covariance between f(ω, x) − m(x) and f(ω, y) − m(y) is k(x, y), we have

E[|f(ω, x) − f(ω, y)|²] = E[({f(ω, x) − m(x)} − {f(ω, y) − m(y)} + {m(x) − m(y)})²]
= k(x, x) + k(y, y) − 2k(x, y) + {m(x) − m(y)}².    (6.41)

The continuity of m and k implies (6.30). Conversely, if we assume (6.30), then the continuity of m is obtained from

|m(x) − m(y)| = |E[f(ω, x) − f(ω, y)]| ≤ {E[|f(ω, x) − f(ω, y)|²]}^{1/2}.

Without loss of generality, assuming that m ≡ 0, we have

k(x, y) − k(x′, y′) = {k(x, y) − k(x′, y)} + {k(x′, y) − k(x′, y′)},

and each of the right-hand-side terms is bounded via

|k(x, y) − k(x′, y)| = |E[f(ω, x)f(ω, y)] − E[f(ω, x′)f(ω, y)]| ≤ E[f(ω, y)²]^{1/2} E[{f(ω, x) − f(ω, x′)}²]^{1/2} = {k(y, y)}^{1/2}{E[|f(ω, x) − f(ω, x′)|²]}^{1/2}

and

|k(x′, y) − k(x′, y′)| ≤ {k(x′, x′)}^{1/2}{E[|f(ω, y) − f(ω, y′)|²]}^{1/2}.

Thus, we have established the continuity of k. ∎
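The identity (6.41) can also be checked numerically. The following Monte Carlo sketch (ours, not part of the original text) uses the zero-mean case with an OU kernel and arbitrarily chosen points x and y:

import numpy as np

k = lambda s, t: np.exp(-np.abs(s - t) / 0.1)        # OU kernel, l = 0.1 (an arbitrary choice)
x, y = 0.2, 0.35
K = np.array([[k(x, x), k(x, y)], [k(y, x), k(y, y)]])
f = np.random.multivariate_normal(np.zeros(2), K, size=200000)
lhs = np.mean((f[:, 0] - f[:, 1]) ** 2)              # estimate of E|f(x) - f(y)|^2
rhs = k(x, x) + k(y, y) - 2 * k(x, y)                # right-hand side of (6.41) with m = 0
print(lhs, rhs)                                      # the two values should be close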



Proof of Proposition 67

We define I_f^{(n)}(g) := I_f(g; {(E_i, x_i)}_{1≤i≤M(n)}). Then, we have E[I_f^{(n)}(g)] = 0. From E[f(ω, x)] = 0, x ∈ E, and the convergence proven thus far, we obtain the first claim:

|E[I_f(ω, g)]| = |E[I_f(ω, g) − I_f^{(n)}(g)]| ≤ {E[{I_f(ω, g) − I_f^{(n)}(g)}²]}^{1/2} → 0

as n → ∞. From the uniform continuity of k, we obtain the second claim:

|E[I_f(ω, g)f(ω, x)] − ∫_E k(x, y)g(y)dμ(y)|
≤ |E[{I_f(ω, g) − I_f^{(n)}(g)}f(ω, x)]| + |E[I_f^{(n)}(g)f(ω, x) − ∫_E k(x, y)g(y)dμ(y)]|
≤ {E[{I_f(ω, g) − I_f^{(n)}(g)}²]}^{1/2} {E[f(ω, x)²]}^{1/2} + |Σ_{i=1}^{M(n)} ∫_{E_i} |k(x, x_i) − k(x, y)| g(y)dμ(y)| → 0

as n → ∞. From

E[I_f^{(n)}(g)I_f^{(n)}(h)] = Σ_{i=1}^{M(n)} Σ_{j=1}^{M(n)} k(x_i, x_j) ∫_{E_i} g(x)dμ(x) ∫_{E_j} h(y)dμ(y) → ∫_E ∫_E k(x, y)g(x)h(y)dμ(x)dμ(y),

we obtain the third claim:

|E[I_f(ω, g)I_f(ω, h)] − ∫_E ∫_E k(x, y)g(x)h(y)dμ(x)dμ(y)|
≤ |E[{I_f(ω, g) − I_f^{(n)}(g)}{I_f(ω, h) − I_f^{(n)}(h)} + {I_f(ω, g) − I_f^{(n)}(g)}I_f^{(n)}(h) + {I_f(ω, h) − I_f^{(n)}(h)}I_f^{(n)}(g)]|
+ |E[I_f^{(n)}(g)I_f^{(n)}(h)] − ∫_E ∫_E k(x, y)g(x)h(y)dμ(x)dμ(y)|
≤ (E[{I_f(ω, g) − I_f^{(n)}(g)}²])^{1/2} (E[{I_f(ω, h) − I_f^{(n)}(h)}²])^{1/2}
+ (E[{I_f(ω, g) − I_f^{(n)}(g)}²])^{1/2} (E[I_f^{(n)}(h)²])^{1/2}
+ (E[{I_f(ω, h) − I_f^{(n)}(h)}²])^{1/2} (E[I_f^{(n)}(g)²])^{1/2}
+ Σ_i Σ_j ∫_{E_i} ∫_{E_j} |k(x, y) − k(x_i, x_j)| g(x)h(y)dμ(x)dμ(y) → 0. ∎


Exercises 83∼100

84. Construct a function gp_sample that generates random numbers f(ω, x_1), . . . , f(ω, x_N) from the mean function m, the covariance function k, and x_1, . . . , x_N ∈ E for a set E. Then, choose m and k, generate 100 random numbers, and examine whether the sample mean and covariance matrix match m and k.
85. Using Proposition 61, prove (6.3) and (6.4).
86. In the following program, other than the Cholesky decomposition, is there any step that requires a calculation with O(N³) complexity?

def gp_2(x_pred):
    h = np.zeros(n)
    for i in range(n):
        h[i] = k(x_pred, x[i])
    L = np.linalg.cholesky(K + sigma_2 * np.identity(n))
    alpha = np.linalg.solve(L, np.linalg.solve(L.T, (y - mu(x))))
    mm = mu(x_pred) + np.sum(np.dot(h.T, alpha))
    gamma = np.linalg.solve(L.T, h)
    ss = k(x_pred, x_pred) - np.sum(gamma ** 2)
    return {"mm": mm, "ss": ss}

87. Show from (6.5) that the negated log-likelihood of x_1, . . . , x_N ∈ R^p, y_1, . . . , y_N ∈ {−1, 1} is Σ_{i=1}^N log[1 + exp{−y_i f(x_i)}].

88. Explain that Lines 19 through 24 of the program in Example 86 are used to update f_X ← (W + k_{XX}^{-1})^{-1}(W f_X + u).
89. Replace the first 100 Iris data (50 Setosa, 50 Versicolor) with the 51st to 150th data (50 Versicolor, 50 Virginica) in Example 86 and execute the program.
90. In the proof of Proposition 65, why is it acceptable to replace f_Z in (6.16) of the generation process by μ_{f_Z|Y} to obtain μ(x)? In σ²(x), the variations due to f_Z|Y and f(x)|f_Z are treated as independent. Why can we assume that they are independent?
91. In Example 88, the function gp_ind that realizes the inducing variable method contains a step that avoids an O(N³) calculation. Where is this step?
92. Show that a stochastic process is a mean-square continuous process if and only if its mean and covariance functions are continuous.
93. From Mercer's theorem (6.31) and Proposition 67, derive the Karhunen-Loève theorem. Additionally, for n = 10, generate five sample paths of Brownian motion.
94. From the formula for the Matérn kernel (6.35), derive ϕ_{5/2} and ϕ_{3/2}. Additionally, illustrate the values of the Matérn kernel (ν = 1, . . . , 10) for l = 0.05, as in Fig. 6.4.
95. Illustrate a sample path of the Matérn kernel with ν = 5/2 and l = 0.1.
96. Give an example of a random element that is not a stochastic process and an example of a stochastic process that is not a random element.


97. Prepare basis functions η = [η_1, . . . , η_p] : E → R^p and construct a procedure to find m_N(x) in (6.39). Then, input the Canadian weather data for N = 35 and output the result. Additionally, construct a procedure to find K_N in (6.40) and output it as a matrix of size p × p.
98. Suppose that we prepare the p basis functions
{1/√(2π), cos x/√π, sin x/√π, cos 2x/√π, sin 2x/√π, · · · }
for E = [−π, π]. Why is W = (w_{i,j}) = ∫_E η(x)η(x)^⊤ dx a unit matrix?
99. Using the Canadian weather data (precipitation for each day of the year) instead of the temperature for each day of the year, find the principal component functions and eigenvalues and output graphs similar to those in Figs. 6.8 and 6.9.
100. Using scikit-fda, find the principal component functions and eigenvalues for both the temperature and the precipitation for each day of the year and output graphs similar to those in Figs. 6.8 and 6.9.

Bibliography

1. N. Aronszajn, Theory of reproducing kernels. Trans. Am. Math. Soc. 68, 337–404 (1950)
2. H. Avron, M. Kapralov, C. Musco, A. Velingker, A. Zandieh, Random Fourier features for kernel ridge regression: approximation bounds and statistical guarantees. arXiv:1804.09893 (2017)
3. C. Baker, The Numerical Treatment of Integral Equations (Clarendon Press, 1978)
4. P. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. (2001)
5. K.P. Chwialkowski, A. Gretton, A kernel independence test for random processes. In ICML (2014)
6. R. Dudley, Real Analysis and Probability (Cambridge Studies in Advanced Mathematics, 1989)
7. K. Fukumizu, Introduction to Kernel Methods (kaneru hou nyuumon) (Asakura, 2010). (In Japanese)
8. T. Gneiting, Compactly supported correlation functions. J. Multivar. Anal. 83, 493–508 (2002)
9. G.H. Golub, C.F. Van Loan, Matrix Computations, 3rd edn. (Johns Hopkins, Baltimore, 1996)
10. I.S. Gradshteyn, I.M. Ryzhik, R.H. Romer, Tables of integrals, series, and products. Am. J. Phys. 56, 958–958 (1988)
11. A. Gretton, K. Borgwardt, M. Rasch, B. Schölkopf, A. Smola, A kernel two-sample test. J. Mach. Learn. Res. 13, 723–773 (2012)
12. A. Gretton, R. Herbrich, A. Smola, O. Bousquet, B. Schölkopf, Kernel methods for measuring independence. J. Mach. Learn. Res. 6, 2075–2129 (2005)
13. D. Haussler, Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UCSC (1999)
14. T. Hsing, R. Eubank, Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators (Wiley, 2015)
15. K. Itô, An Introduction to Probability Theory (Cambridge University Press, 1984)
16. Y. Kano, S. Shimizu, Causal inference using nonnormality. In Proceedings of the Annual Meeting of the Behaviormetric Society of Japan 47 (2004)
17. K. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae. Ser. A. I. Math.-Phys. 37, 1–79 (1947)
18. K. Karhunen, Probability Theory, Vol. II (Springer-Verlag, 1978)
19. H. Kashima, K. Tsuda, A. Inokuchi, Marginalized kernels between labeled graphs. In ICML (2003)


20. S. Lauritzen, Graphical Models (Oxford Science Publications, 1996)
21. J. Mercer, Functions of positive and negative type and their connection with the theory of integral equations. Philos. Trans. R. Soc. A, 441–458 (1909)
22. J. Neveu, Processus aléatoires gaussiens. Séminaire Math. Sup., Les Presses de l'Université de Montréal (1968)
23. A. Rahimi, B. Recht, Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (2007)
24. J. Ramsay, B.W. Silverman, Functional Data Analysis (Springer Series in Statistics, 2005)
25. C. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning (MIT Press, 2006)
26. B. Schölkopf, A. Smola, K. Müller, Kernel principal component analysis. In ICANN (1997)
27. R. Serfling, Approximation Theorems of Mathematical Statistics (Wiley, 1980)
28. S. Shimizu, P. Hoyer, A. Hyvärinen, A.J. Kerminen, A linear non-Gaussian acyclic model for causal discovery. J. Mach. Learn. Res. 7, 2003–2030 (2006)
29. I. Steinwart, On the influence of the kernel on the consistency of support vector machines. J. Mach. Learn. Res. 2, 67–93 (2001)
30. M.H. Stone, Applications of the theory of Boolean rings to general topology. Trans. Am. Math. Soc. 41(3), 375–481 (1937)
31. M.H. Stone, The generalized Weierstrass approximation theorem. Math. Mag. 21(4), 167–184 (1948)
32. K. Tsuda, T. Kin, K. Asai, Marginalized kernels for biological sequences. Bioinformatics 18(Suppl 1), S268–S275 (2002)
33. J.-P. Vert, Aronszajn's theorem (2017). https://members.cbio.mines-paristech.fr/~jvert/svn/kernelcourse/notes/aronszajn.pdf
34. K. Weierstrass, Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen Veränderlichen. Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften zu Berlin, 633–639 (1885). Erste Mitteilung
35. K. Weierstrass, Über die analytische Darstellbarkeit sogenannter willkürlicher Functionen einer reellen Veränderlichen. Sitzungsberichte der Königlich Preußischen Akademie der Wissenschaften zu Berlin, 789–805 (1885). Zweite Mitteilung
36. H. Zhu, C.K.I. Williams, R. Rohwer, M. Morciniec, Gaussian regression and optimal finite dimensional linear models. In Neural Networks and Machine Learning (1997)